A collection of pan-cancer datasets covering several modalities, including medical and clinical records, radiology (CT, MRI, PET), pathology (H&E and IHC), and omics data (genomics and proteomics), is compiled below. The collection is non-exhaustive and is updated periodically. Its purpose is to give the cancer research community a unified view of the resources available for studying various cancer sites, organs, and modalities. We aim to use these resources in our ongoing research and in the fight against cancer.
We have compiled the list primarily from data portals operated under the NIH National Cancer Institute (NCI): The Cancer Imaging Archive (TCIA), the Genomic Data Commons (GDC) portal of The Cancer Genome Atlas (TCGA), and the Proteomic Data Commons (PDC) portal of the Clinical Proteomic Tumor Analysis Consortium (CPTAC). The datasets available at these portals are summarized below.
- TCGA molecularly characterized over 20,000 primary cancer and matched normal samples spanning 33 cancer types.
- Joint effort between NCI and the National Human Genome Research Institute.
- Over 2.5 petabytes of genomic, epigenomic, transcriptomic, and proteomic data.
- Publicly available for research use.
- Genomics data available at the Genomic Data Commons (GDC) portal, open access.
- Imaging data available at The Cancer Imaging Archive (TCIA), open access. The radiology and histopathology data of TCIA can be accessed and downloaded through the following portals:
  - TCIA Radiology data portal.
  - TCIA Histopathology data portal.
- Proteomics data is available through the Proteomic Data Commons (PDC) under the Clinical Proteomic Tumor Analysis Consortium (CPTAC) program.
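
As a rough illustration of how the open-access GDC holdings listed above might be queried programmatically, the sketch below builds a request URL for the public GDC REST API `files` endpoint. The helper name `build_gdc_files_url`, the chosen field names (`cases.project.project_id`, `files.data_category`, `files.access`), and the example project `TCGA-BRCA` are assumptions made for illustration; check them against the GDC API documentation before relying on them. The code only constructs the URL and does not contact the server.

```python
import json
from urllib.parse import urlencode

# Public GDC API endpoint for file metadata.
GDC_FILES_ENDPOINT = "https://api.gdc.cancer.gov/files"


def build_gdc_files_url(project_id, data_category, size=10):
    """Return a GDC `files` query URL for open-access files of one project.

    The filter field names below are assumptions based on the GDC API's
    documented filter syntax; adjust them if the API rejects the query.
    """
    filters = {
        "op": "and",
        "content": [
            {"op": "in", "content": {"field": "cases.project.project_id",
                                     "value": [project_id]}},
            {"op": "in", "content": {"field": "files.data_category",
                                     "value": [data_category]}},
            {"op": "in", "content": {"field": "files.access",
                                     "value": ["open"]}},
        ],
    }
    params = {
        "filters": json.dumps(filters),          # filters are passed as a JSON string
        "fields": "file_id,file_name,data_format",
        "format": "JSON",
        "size": str(size),                       # maximum number of records returned
    }
    return f"{GDC_FILES_ENDPOINT}?{urlencode(params)}"


# Hypothetical example: open-access clinical files for the TCGA breast cancer project.
url = build_gdc_files_url("TCGA-BRCA", "Clinical", size=5)
print(url)
```

The resulting URL can be fetched with any HTTP client; the response should list matching file identifiers, which can then be passed to the GDC data download tooling.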
Below we first present the National Cancer Institute (NCI) data modalities, followed by the 32 cancer types and their corresponding datasets, primary publications, number of cases, and modalities. The list is organized by cancer type and then by data modality. The data modalities include clinical, copy number, DNA, imaging, and expression (miRNA, mRNA, and protein). The second table below presents the non-NCI dataset resources available for public access. Lastly, we present the list of abbreviations for the cancer study names used in this compilation.
- Clinical
  - Clinical data
    - Available for all cancer types
    - May include demographic information, treatment information, survival data, etc.
    - XML (per patient), tab-delimited TXT.
    - Additional information in the Clinical Data Elements (CDE) Browser.
  - Biospecimen data
    - Available for all cancer types
    - Information on how samples were processed by the Biospecimen Core Resource Center
    - XML (per patient), tab-delimited TXT.
    - Additional information in the Clinical Data Elements (CDE) Browser.
  - Pathology Reports
    - Available for all cancer types
    - Pathology reports (for select cases)
    - PDF format
- Copy Number
  - SNP microarray
    - Available for all cancer types
    - Tab-delimited TXT (normalized values and purity/ploidy)
    - Probe information contained in array design files for each platform
  - Copy number microarray
    - Available for GBM, OV, LUSC
    - Tab-delimited TXT (raw signals per probe), tab-delimited TSV (normalized values per aggregated region), MAT.
    - Probe information contained in array design files for each platform
  - DNA Sequencing
    - Available for some tumor types
    - Low-pass whole genome sequencing of tumor and matched normal samples, with analysis of differences in read counts between tumor and normal
    - Tab-delimited TSV (normal vs. tumor cells)
- DNA
  - Whole exome
    - Available for all cancer types
    - Whole exome sequencing of tumor and matched normal samples
    - VCF, MAF (mutation calls)
  - Whole genome
    - Available for all cancer types
    - Whole genome sequencing of tumor and matched normal samples (for select cases)
    - VCF, MAF (mutation calls)
  - SNP microarray
    - Available for all cancer types
    - Tab-delimited TXT (genotypes per SNP)
- Imaging
  - Diagnostic image
    - Available for all cancer types
    - Whole slide images of tissue used to diagnose participant
    - SVS
    - Available at the GDC, open access
  - Tissue image
    - Available for all cancer types
    - Whole slide images of tissue samples from each participant that were used for TCGA analyses
    - SVS
    - Available at the GDC, open access
  - Radiological image
    - Available for some cancer types
    - Pre-surgical radiological imaging (e.g., MRI, CT, PET) (for select cases)
    - DCM or DICOM format.
    - Available at The Cancer Imaging Archive, open access
- miRNA, mRNA, and Protein Expression
  - miRNA Sequencing
    - Available for all cancer types except GBM
    - miRNA sequencing of tumor samples
    - Tab-delimited TXT (normalized expression values per miRNA or isoform)
  - Array-based
    - Available for GBM, OV cancer types
    - TXT (raw signals per probe, normalized expression values per probe, gene, or exon)
    - Probe information contained in array design files for each platform
  - mRNA Sequencing
    - Available for all cancer types
    - mRNA sequencing of tumor samples using a poly(A) enrichment RNA preparation
    - TXT (normalized expression values per gene, isoform, exon, or splice junction)
    - Labeled as RNASeqV1 and RNASeqV2
  - Total RNA Sequencing
    - Available for some cancer types
    - mRNA sequencing of tumor samples using a ribosomal depletion RNA preparation
    - TXT (normalized expression values per gene, isoform, exon, or splice junction)
    - Labeled as TotalRNASeqV2
  - Microarray
    - Available for BRCA, COAD, GBM, KIRC, KIRP, LAML, LGG, LUAD, LUSC, OV, READ, UCEC cancer types
    - TXT (raw signals per probe, normalized expression values per probe, gene, or exon)
    - Probe information contained in array design files for each platform
  - Reverse-Phase Protein Array
    - Available for all cancer types
    - High resolution images of protein array slides (up to 1000 participant tumor samples per slide) and raw signals per slide
    - TIFF, tab-delimited TXT (signal values, dilution curves, normalized expression values)
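
Several of the formats listed above are plain tab-delimited text, so they can be read with standard tools. As a minimal sketch, assuming a MAF file with the standard `Hugo_Symbol` column and optional `#`-prefixed header comments, the following counts mutation records per gene. The function name `count_mutations_per_gene` and the toy records are made up for illustration and are not real TCGA data.

```python
import csv
import io
from collections import Counter


def count_mutations_per_gene(maf_handle):
    """Count MAF records per gene symbol from an open text handle."""
    # MAF files may start with comment lines such as "#version 2.4"; skip them.
    rows = (line for line in maf_handle if not line.startswith("#"))
    # QUOTE_NONE: treat quote characters inside fields as ordinary text.
    reader = csv.DictReader(rows, delimiter="\t", quoting=csv.QUOTE_NONE)
    counts = Counter()
    for record in reader:
        counts[record["Hugo_Symbol"]] += 1
    return counts


# Toy, made-up records used only to exercise the function.
TOY_MAF = (
    "#version 2.4\n"
    "Hugo_Symbol\tVariant_Classification\tTumor_Sample_Barcode\n"
    "TP53\tMissense_Mutation\tSAMPLE-A\n"
    "TP53\tNonsense_Mutation\tSAMPLE-B\n"
    "KRAS\tMissense_Mutation\tSAMPLE-A\n"
)

counts = count_mutations_per_gene(io.StringIO(TOY_MAF))
print(counts.most_common())  # [('TP53', 2), ('KRAS', 1)]
```

For a file on disk, the same function can be called on `open(path, newline="")`; real MAF files carry many more columns than the three shown here.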
Organ | Disease | Name | Access | Images | Reference |
---|---|---|---|---|---|
Multiple | Multi | UKBiobank | RC | MRI, DXA | https://www.ukbiobank.ac.uk/ |
Multiple | Multi | Grand-Challenges | OA | Multi-domain | https://grand-challenge.org |
Multiple | Multi | Kaggle | OA | Multi-domain | https://www.kaggle.com |
Multiple | Multi | VISCERAL: Visual Concept Extraction Challenge in Radiology | RC | Multi-domain | http://www.visceral.eu/benchmarks |
Multiple | Multi | Medical Segmentation Decathlon | OA/RC | CT, MRI | http://medicaldecathlon.com |
Brain | Multi | OpenNeuro | OA/RC | Multi-domain | https://openneuro.org |
Brain | Multi | Image and Data Archive (IDA) | OA/RC | s/f/dMRI, CT/PET/SPECT | https://ida.loni.usc.edu |
Brain | Normal, dementia, Alzheimer’s | OASIS Brains Dataset | OA | MRI | https://www.oasis-brains.org |
Brain | Multi | NITRC: NeuroImaging Tools and Resources Collaboratory | OA | s/fMRI | https://nitrc.org |
Brain | TBI | The Federal Interagency TBI Research (FITBIR) | RC | MRI, PET, Contrast | https://fitbir.nih.gov |
Brain | TBI, Stroke | CQ500 | OA/RC | CT | http://headctstudy.qure.ai/dataset |
Brain | Multi | NDA | RC | MRI | https://nda.nih.gov |
Brain | Multi | Connectome | RC | sMRI, fMRI | https://www.humanconnectome.org |
Breast | Cancer screening | MIAS mini-database | OA | MG, US | http://peipa.essex.ac.uk/info/mias.html |
Breast | Cancer screening | BCDR | RC | MG, US | https://bcdr.eu |
Breast | Cancer | DDSM | OA | MG | http://www.eng.usf.edu/cvprg/Mammography/Database.html |
Breast | Cancer | OMI-DB | RC | MG | https://medphys.royalsurrey.nhs.uk/omidb |
Breast | Cancer | INbreast | OA/RC | MG | http://medicalresearch.inescporto.pt/breastresearch/index.php/Get_INbreast_Database |
Cardiac | Clinical routine care | EchoNet-Dynamic | OA/RC | Echocardiogram videos | https://echonet.github.io/dynamic |
Cardiac | Multi-abnormal | CAMUS project | OA/RC | Echocardiogram | https://www.creatis.insa-lyon.fr/Challenge/camus |
Cardiac | Multi | EuCanShare | RC | MRI | http://www.eucanshare.eu |
Cardiac | Multi | Cardiac Atlas Project | OA/RC | MRI | http://www.cardiacatlas.org |
Full body | Healthy, unknown | Visible Human Project (VHP) | OA | CT, MRI | https://www.nlm.nih.gov/research/visible |
Lung | Thorax | NIH Chest X-ray (NIHCC) | OA | X-ray | https://nihcc.app.box.com/v/ChestXray-NIHCC |
Lung | Multi | Cornell Engineering: Vision and Image Analysis lab | OA | CT | http://www.via.cornell.edu/databases |
Lung | COVID19 | MosMedData | OA | CT | https://mosmed.ai/en |
Lung | COVID19 | COVID-19 CT segmentation | OA | CT | http://medicalsegmentation.com/covid19 |
Lung | COVID19 | BIMCV COVID-19 | OA | CT, CXR | https://github.com/BIMCV-CSUSP/BIMCV-COVID-19 |
Lung | COVID19 | COVID-19 Image Data Collection | OA | CT, CXR | https://github.com/ieee8023/covid-chestxray-dataset https://josephpcohen.com/w/public-covid19-dataset/ |
Lung | COVID19 | COVID-19 Chest X-ray Dataset Initiative | OA | CXR | https://github.com/agchung/Figure1-COVID-chestxray-dataset |
Retina | Multi | STARE: Structured Analysis of the Retina | OA | Retinal fundus | http://cecas.clemson.edu/~ahoover/stare |
Retina | Diabetes | CHASE_DB1 | OA | Retinal fundus | https://blogs.kingston.ac.uk/retinal/chasedb1 |
Retina | Diabetes | High-Resolution Fundus (HRF) Image Database | OA | Retinal fundus | https://www5.cs.fau.de/research/data/fundus-images |
Skin | Lesion | International Skin Imaging Collaboration (ISIC) | OA | Digital images | https://www.isic-archive.com |
No. | Abbreviation | Full name |
---|---|---|
1 | NM | Nuclear medicine |
2 | CT | Computed Tomography |
3 | CR | Computed Radiography |
4 | PET, PT | Positron Emission Tomography |
5 | MR | Magnetic Resonance |
6 | MG | Mammography |
7 | DX | Digital Radiography |
8 | RF | Radio Fluoroscopy |
9 | US | Ultrasound |
10 | XA | X-Ray Angiography |
11 | RTDOSE | Radiotherapy Dose |
12 | RTSTRUCT | Radiotherapy Structure Set |
13 | RTPLAN | Radiotherapy Plan |
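
When working with imaging metadata, the abbreviations in the table above can be kept in a small lookup so that modality codes are expanded consistently. The sketch below is one possible way to do this; the names `MODALITY_NAMES` and `expand_modalities` are hypothetical, and the mapping covers only the codes listed in this compilation.

```python
# Lookup built from the abbreviation table; "PET" and "PT" share one meaning.
MODALITY_NAMES = {
    "NM": "Nuclear Medicine",
    "CT": "Computed Tomography",
    "CR": "Computed Radiography",
    "PET": "Positron Emission Tomography",
    "PT": "Positron Emission Tomography",
    "MR": "Magnetic Resonance",
    "MG": "Mammography",
    "DX": "Digital Radiography",
    "RF": "Radio Fluoroscopy",
    "US": "Ultrasound",
    "XA": "X-Ray Angiography",
    "RTDOSE": "Radiotherapy Dose",
    "RTSTRUCT": "Radiotherapy Structure Set",
    "RTPLAN": "Radiotherapy Plan",
}


def expand_modalities(cell):
    """Expand a comma-separated string of modality codes into full names."""
    names = []
    for code in cell.split(","):
        code = code.strip().upper()
        # Codes not in the table are flagged rather than silently dropped.
        names.append(MODALITY_NAMES.get(code, f"unknown ({code})"))
    return names


print(expand_modalities("CT, MR, PT"))
```

Unknown codes are returned as `unknown (CODE)` so that gaps in the table are visible when new datasets are added.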