This repo contains the combined clinical data with follow up and outcome data for the TCGA PanCancer Atlas in a sinlge text file and RData file. All data in its original format can be found at https://gdc.cancer.gov/about-data/publications/pancanatlas. All original files had been previously downloaded in June of 2018.
The original flagship paper (Cell-of-Origin Patterns Dominate the Molecular Classification of 10,000 Tumors from 33 Types of Cancer by Hoadley et al) represented efforts to provide “comprehensive integrative molecular analyses of the complete set of tumors in TCGA”. This paper was accompanied by a paper from Liu et al (An Integrated TCGA Pan-Cancer Clinical Data Resource to Drive High-Quality Survival Outcome Analytics) who attempted to create a standardized dataset for the clinical data across the PanCancer Atlas called the TCGA Pan-Cancer Clinical Data Resource (TCGA-CDR). This was necessary as the original TCGA project consisted of two parts 1) the pilot study which focused on GBM, OV, and LUSC and 2) the full project which encompassed 33 cancer types. Due to the relatively short time frame of establishing TCGA (2006-2015) clinical data is often limited and follow-up times are short. Furthermore, the data fields available differ by cancer type as each type had a Disease Working Group who decided what data to collect.
Thus, the PanCancer Atlas provides two data resources for clinical
annotations: clinical_PANCAN_patient_with_followup.tsv
which is
available under Additional Resources/Supplemental Data and is not
assocaited with a specific publication and details on it’s creation are
scarce and TCGA-CDR-SupplementalTableS1.xlsx
which is described as a
“curated resource of the clinical annotations for TCGA data and
provides recommendations for use of clinical endpoints” and comed with
the recommendation that “this file be used for clinical elements and
survival outcome data first” and is associated with the Liu et al
publication.
Additional details on how the dataset was created can be found in the orginial article.
“For clinical outcome endpoints, we recommend the use of PFI for progression-free interval, and OS for overall survival. Both endpoints are relatively accurate. Given the relatively short follow-up time, PFI is preferred over OS. Detailed recommendations please refer to Table 3 in the accompanying paper.”
All survival times are measured in days.
A total of 209 patients are found in TCGA-CDR-SupplementalTableS1
but
not clinical_PANCAN_patient_with_followup
. IDs are available below in
the missing data section.
Survival Endpoints:
- OS - overall survival
- DSS - disease-specific survival
- DFI - disease-free interval
- PFI - progression-free interval
There is a strange occurence for a small subset of patients who have secondary endpoints which occurr after their overall survival endpoint (ie progression occurs after last followup). Patients with later DFI: NA, TCGA-15-1444, TCGA-BR-8380, TCGA-EO-A3AY. Patients with later PFI: NA, TCGA-UY-A9PE, TCGA-OL-A97C, TCGA-15-1444, TCGA-CW-6090, TCGA-HW-7491, TCGA-BW-A5NP, TCGA-20-0990, TCGA-BF-A1PU, TCGA-DA-A1HW, TCGA-BR-8380, TCGA-EO-A3AY. These numbers correspond with the data in cBioportal, so it is not specific to this dataset. These and a handful of other inconsistencies that have been noted are explained in further detail in the original paper.
Supplemental Endpoints:
- PFI.1 - progression-free interval
- PFI.2 - progression-free interval
- PFS - progression-free survival
Competing Risks Endpoints:
- 0 = censored, 1 = event of interest, 2 = competing risk death
- DSS.cr - disease-specific survival with competing risks
- DFI.cr - disease-free interval with competing risks
- PFI.cr - progression-free interval with competing risks
- PFI.1.cr - progression-free interval (PFI.1) with competing risks
- PFI.2.cr - progression-free interval (PFI.2) with competing risks
For a complete description of all survival endpoint please see the the
origianl data file TCGA-CDR-SupplementalTableS1
.
There are now two copies of vital_status
, residual_tumor
, and
margin_status
. The version from
clinical_PANCAN_patient_with_followup
is labeled as such while the
“updated” version from TCGA-CDR-SupplementalTableS1
(Liu et al) is
labeled as liu_vital_status
, liu_residual_tumor
, and
liu_margin_status
.
There were 10956 unique patint barcodes in
clinical_PANCAN_patient_with_followup
and 11160 unique patient
barcodes from TCGA-CDR-SupplementalTableS1
, 10951 of which overlap,
which may account for some missingness after the two were merged, as all
patients from wither set were retained. Individual barcodes for those
missing in each dataset can be found below.
Using vital_status
there are 7468 patients alive and 3483 who have
deceased; using liu_vital_status
there are 7528 patients alive and
3627 deceased; using OS
there are 7529 alive (0) and 3622 deceased
(1).
PanCancer Atlas data is also available for download at cBioPortal and github, however the data has been split into individual cancer types. The cBioPortl page points users back to the PanCancer Publications page, and the combined PanCancer Atlas data has data available for 33 cancer types (including LAML) and 10,953 pateints over 10,967 samples.
When using cBioPortal, to access the full PanCancer data use the combined types.
When looking at cBioPortal vs PanCancer Atlas data one thing to keep in mind is that cBioPortal reports times to events in terms of months but doesn’t use 30.5 days in a month but instead something closer to 30.4 days in a month. Below are the overall survival time (in months) for the full PanCancer cohort so you cna see the minor differences.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 12.46 23.34 34.03 44.85 368.92 65
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 12.49 23.41 34.12 44.97 369.92 65
There are also differences in completeness of the data between the two.
For example, looking within the prostate cancer cases (PRAD) at the
variable tumor type (Tumor Type
in cBioPortal and tumor_type
in
PanCancer), there are 479 cases of acinar type and 15 of other type in
cBioPortal’s data but is completely missing in PanCancer.
In survival but not clinical : TCGA-4G-AAZF, TCGA-4G-AAZG, TCGA-4G-AAZR, TCGA-W5-AA2J, TCGA-W5-AA2K, TCGA-W5-AA2M, TCGA-W6-AA0T, TCGA-ZH-A8Y3, TCGA-ZH-A8Y7, TCGA-AB-2802, TCGA-AB-2803, TCGA-AB-2804, TCGA-AB-2805, TCGA-AB-2806, TCGA-AB-2807, TCGA-AB-2808, TCGA-AB-2809, TCGA-AB-2810, TCGA-AB-2811, TCGA-AB-2812, TCGA-AB-2813, TCGA-AB-2814, TCGA-AB-2815, TCGA-AB-2816, TCGA-AB-2817, TCGA-AB-2818, TCGA-AB-2819, TCGA-AB-2820, TCGA-AB-2821, TCGA-AB-2822, TCGA-AB-2823, TCGA-AB-2824, TCGA-AB-2825, TCGA-AB-2826, TCGA-AB-2827, TCGA-AB-2828, TCGA-AB-2829, TCGA-AB-2830, TCGA-AB-2831, TCGA-AB-2832, TCGA-AB-2833, TCGA-AB-2834, TCGA-AB-2835, TCGA-AB-2836, TCGA-AB-2837, TCGA-AB-2838, TCGA-AB-2839, TCGA-AB-2840, TCGA-AB-2841, TCGA-AB-2842, TCGA-AB-2843, TCGA-AB-2844, TCGA-AB-2845, TCGA-AB-2846, TCGA-AB-2847, TCGA-AB-2848, TCGA-AB-2849, TCGA-AB-2850, TCGA-AB-2851, TCGA-AB-2853, TCGA-AB-2854, TCGA-AB-2855, TCGA-AB-2856, TCGA-AB-2857, TCGA-AB-2858, TCGA-AB-2859, TCGA-AB-2860, TCGA-AB-2861, TCGA-AB-2862, TCGA-AB-2863, TCGA-AB-2864, TCGA-AB-2865, TCGA-AB-2866, TCGA-AB-2867, TCGA-AB-2868, TCGA-AB-2869, TCGA-AB-2870, TCGA-AB-2871, TCGA-AB-2872, TCGA-AB-2873, TCGA-AB-2874, TCGA-AB-2875, TCGA-AB-2876, TCGA-AB-2877, TCGA-AB-2878, TCGA-AB-2879, TCGA-AB-2880, TCGA-AB-2881, TCGA-AB-2882, TCGA-AB-2883, TCGA-AB-2884, TCGA-AB-2885, TCGA-AB-2886, TCGA-AB-2887, TCGA-AB-2888, TCGA-AB-2889, TCGA-AB-2890, TCGA-AB-2891, TCGA-AB-2892, TCGA-AB-2893, TCGA-AB-2894, TCGA-AB-2895, TCGA-AB-2896, TCGA-AB-2897, TCGA-AB-2898, TCGA-AB-2899, TCGA-AB-2900, TCGA-AB-2901, TCGA-AB-2903, TCGA-AB-2904, TCGA-AB-2905, TCGA-AB-2906, TCGA-AB-2907, TCGA-AB-2908, TCGA-AB-2909, TCGA-AB-2910, TCGA-AB-2911, TCGA-AB-2912, TCGA-AB-2913, TCGA-AB-2914, TCGA-AB-2915, TCGA-AB-2916, TCGA-AB-2917, TCGA-AB-2918, TCGA-AB-2919, TCGA-AB-2920, TCGA-AB-2921, TCGA-AB-2922, TCGA-AB-2923, TCGA-AB-2924, TCGA-AB-2925, TCGA-AB-2926, TCGA-AB-2927, TCGA-AB-2928, TCGA-AB-2929, TCGA-AB-2930, TCGA-AB-2931, TCGA-AB-2932, TCGA-AB-2933, TCGA-AB-2934, TCGA-AB-2935, TCGA-AB-2936, TCGA-AB-2937, TCGA-AB-2938, TCGA-AB-2939, TCGA-AB-2940, TCGA-AB-2941, TCGA-AB-2942, TCGA-AB-2943, TCGA-AB-2944, TCGA-AB-2945, TCGA-AB-2946, TCGA-AB-2947, TCGA-AB-2948, TCGA-AB-2949, TCGA-AB-2950, TCGA-AB-2952, TCGA-AB-2954, TCGA-AB-2955, TCGA-AB-2956, TCGA-AB-2957, TCGA-AB-2959, TCGA-AB-2963, TCGA-AB-2964, TCGA-AB-2965, TCGA-AB-2966, TCGA-AB-2967, TCGA-AB-2968, TCGA-AB-2969, TCGA-AB-2970, TCGA-AB-2971, TCGA-AB-2972, TCGA-AB-2973, TCGA-AB-2974, TCGA-AB-2975, TCGA-AB-2976, TCGA-AB-2977, TCGA-AB-2978, TCGA-AB-2979, TCGA-AB-2980, TCGA-AB-2981, TCGA-AB-2982, TCGA-AB-2983, TCGA-AB-2984, TCGA-AB-2985, TCGA-AB-2986, TCGA-AB-2987, TCGA-AB-2988, TCGA-AB-2989, TCGA-AB-2990, TCGA-AB-2991, TCGA-AB-2992, TCGA-AB-2993, TCGA-AB-2994, TCGA-AB-2995, TCGA-AB-2996, TCGA-AB-2997, TCGA-AB-2998, TCGA-AB-2999, TCGA-AB-3000, TCGA-AB-3001, TCGA-AB-3002, TCGA-AB-3005, TCGA-AB-3006, TCGA-AB-3007, TCGA-AB-3008, TCGA-AB-3009, TCGA-AB-3011, TCGA-AB-3012
In clinical but not survival : TCGA-BH-A0B2, TCGA-E2-A1IP, TCGA-PN-A8M9, TCGA-F5-6810, TCGA-GN-A261
All missing : stage_other, history_of_radiation_metastatic_site,er_estimated_duration_response, er_disease_extent_prior_er_treatment,er_solid_tumor_response_documented_type, er_solid_tumor_response_documented_type_other,er_response_type, history_of_radiation_primary_site,history_prior_surgery_type,patient_progression_status, history_prior_surgery_indicator,history_prior_surgery_type_other,field, molecular_abnormality_results,molecular_abnormality_results_other,death_cause_text,hbv_test, on_haart_therapy_at_cancer_diagnosis,on_haart_therapy_prior_to_cancer_diagnosis, hcv_test,prior_aids_conditions,kshv_hhv8_test,days_to_hiv_diagnosis,hiv_status, hiv_rna_load_at_diagnosis,cdc_hiv_risk_group,history_of_other_malignancy,history_immunosuppresive_dx, nadir_cd4_counts,history_relevant_infectious_dx_other,history_relevant_infectious_dx, cd4_counts_at_diagnosis,history_immunological_disease_other,hpv_test, history_immunosuppressive_dx_other,lost_follow_up, pos_finding_metastatic_breast_carcinoma_estrogen_receptor_other_measuremenet_scale_text, metastatic_breast_carcinoma_pos_finding_progesterone_receptor_other_measure_scale_text, metastatic_breast_carcinoma_her2_erbb_pos_finding_fluorescence_in_situ_hybridization_calculation_method_text, metastatic_breast_carcinoma_her2_erbb_method_calculation_method_text, metastatic_breast_carcinoma_pos_finding_other_scale_measurement_text
All missing or single value : project_code, disease_code, informed_consent_verified, metastatic_breast_carcinoma_progesterone_receptor_level_cell_percent_category, metastatic_breast_carcinoma_immunohistochemistry_pr_pos_cell_score, metastatic_breast_carcinoma_immunohistochemistry_er_pos_cell_score, metastatic_breast_carcinoma_her2_erbb_pos_finding_cell_percent_category