Version 1.3

@mlist mlist released this 11 Dec 22:01

At this time, this repository is only for sample metadata, not experiment metadata.
For more information about experiment metadata check out the IHEC Data Portal and EpiATLAS.
You can also find this metadata on EpiRR.
The current version of the sample metadata contains 2279 EpiRR entries. This is the number of entries for which reprocessed data is available.
The CSV for the core sample metadata can be found at openrefine/v1.3/IHEC_sample_metadata_harmonization.v1.3.csv and the extended version at openrefine/v1.3/IHEC_sample_metadata_harmonization.v1.3_extended.csv
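The "# Not Null (%)" figures in the column descriptions further down can be checked against either CSV. The following is a minimal sketch of such a check, assuming the files are comma-separated with a header row and that an empty cell means "null"; neither assumption is stated in these notes. The function name `count_not_null` and the second sample row are made up for illustration.

```python
import csv
import io


def count_not_null(csv_file):
    """Return {column: (count, percent)} of non-empty cells per column."""
    reader = csv.DictReader(csv_file)
    columns = reader.fieldnames or []
    counts = {name: 0 for name in columns}
    total = 0
    for row in reader:
        total += 1
        for name in columns:
            # Assumption: an empty or whitespace-only cell counts as null.
            if (row.get(name) or "").strip():
                counts[name] += 1
    return {
        name: (n, round(100.0 * n / total, 1) if total else 0.0)
        for name, n in counts.items()
    }


# Illustrative rows only; the second identifier is hypothetical.
sample = io.StringIO(
    "EpiRR,harmonized_cell_line\n"
    "IHECRE00000001.4,MCF 10A\n"
    "IHECRE99999999.1,\n"
)
stats = count_not_null(sample)
print(stats)
```

To run it on the real table, pass an open file instead of the in-memory sample, for example `open("openrefine/v1.3/IHEC_sample_metadata_harmonization.v1.3.csv", newline="")`.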

News

  • There is now a core version of the metadata table containing updated column names that are designed to be easier to interpret. The core version contains a selected set of columns from the extended version. The core column names are mapped to the names in the extended version, as shown in the table below.
  • Added column Curated Biospecimen Label/harmonized_sample_label: The most detailed sample label is derived through manual curation of the sample ontology, disease, and cell surface markers, utilizing widely accepted terms. In some cases, curated biospecimen labels reference multiple intermediate biospecimen labels to maintain alignment with the ontologies reported by the production centers. Assigned by Martin Hirst with the help of others.
  • Added column EpiRR Ordering/EpiRR_ordering, giving the index of the EpiRR entry in the table.
    • Ordering of rows is now based on the following columns in this order: harmonized_sample_ontology_term_high_order_fig1, harmonized_sample_ontology_intermediate, harmonized_sample_label, harmonized_sample_disease_high, harmonized_sample_disease_intermediate, harmonized_donor_sex, automated_harmonized_donor_age_in_years, and EpiRR (harmonized_sample_ontology_term_high_order_fig1 and harmonized_sample_ontology_intermediate ordered manually; age sorted as double; other columns sorted ignoring case).
  • Some changes in harmonized_sample_ontology_intermediate.
  • Fixed harmonized_donor_life_stage for 5 entries.
  • For 12 entries: reassignment of harmonized_biomaterial_type and consequently changes in harmonized_cell_type, harmonized_sample_ontology_intermediate, and harmonized_sample_ontology_curie.
  • harmonized_sample_ontology_term_high_order_fig1_color contains a coloring for each value in harmonized_sample_ontology_term_high_order_fig1.
  • harmonized_sample_ontology_intermediate_color contains a coloring for each value in harmonized_sample_ontology_intermediate.
  • The columns harmonized_donor_sex and harmonized_donor_life_stage have been complemented and corrected based on high-confidence predictions of the EpiClass tool. The extended version now also contains the values without these corrections in ${column}_uncorrected.
  • The columns containing information about whether data is available have been renamed to contain the assay name, e.g., automated_experiments_ChIP-Seq_H3K27ac. WGBS and RNA-Seq columns have been separated by PBAT vs. standard and mRNA-Seq vs. total-RNA-Seq.
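The row ordering described above (two manually ordered columns, age sorted as a double, remaining columns sorted ignoring case) can be sketched as a composite sort key. This is only an illustration, not the procedure actually used for the release: the manual orderings are passed in as hypothetical lists, rows without a numeric age are assumed to sort last, and the names `make_sort_key` and `add_ordering` are invented here.

```python
import math

# Column names as given in the release notes.
HIGH = "harmonized_sample_ontology_term_high_order_fig1"
INTERMEDIATE = "harmonized_sample_ontology_intermediate"
AGE = "automated_harmonized_donor_age_in_years"
TEXT_COLUMNS = [
    "harmonized_sample_label",
    "harmonized_sample_disease_high",
    "harmonized_sample_disease_intermediate",
    "harmonized_donor_sex",
]


def make_sort_key(high_order, intermediate_order):
    """Build a sort key; the two lists supply the manual orderings."""
    high_rank = {value: i for i, value in enumerate(high_order)}
    inter_rank = {value: i for i, value in enumerate(intermediate_order)}

    def key(row):
        # Age is compared numerically ("sorted as double").
        # Assumption: rows without a numeric age sort after all others.
        try:
            age = float(row.get(AGE, ""))
        except (TypeError, ValueError):
            age = math.inf
        return (
            # Values missing from a manual ordering are placed last.
            high_rank.get(row[HIGH], len(high_rank)),
            inter_rank.get(row[INTERMEDIATE], len(inter_rank)),
            *[row[name].lower() for name in TEXT_COLUMNS],
            age,
            row["EpiRR"].lower(),
        )

    return key


def add_ordering(rows, high_order, intermediate_order):
    """Sort rows and attach a 1-based EpiRR_ordering index."""
    ordered = sorted(rows, key=make_sort_key(high_order, intermediate_order))
    return [dict(row, EpiRR_ordering=i) for i, row in enumerate(ordered, start=1)]


# Hypothetical demo rows; group names and identifiers are placeholders.
base = {HIGH: "group A", INTERMEDIATE: "x", AGE: "", "EpiRR": ""}
base.update({name: "n/a" for name in TEXT_COLUMNS})
demo_rows = [
    {**base, "EpiRR": "E3", HIGH: "group B", AGE: "10"},
    {**base, "EpiRR": "E2", AGE: ""},
    {**base, "EpiRR": "E1", AGE: "32.5"},
    {**base, "EpiRR": "E4", AGE: "4"},
]
ordered = add_ordering(demo_rows, ["group A", "group B"], ["x"])
print([(row["EpiRR_ordering"], row["EpiRR"]) for row in ordered])
```

In the demo, the age "4" sorts before "32.5" because ages are compared as numbers; a plain string sort would reverse them.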

Raw Files

The raw files on which the harmonization process was based can be found at raw/EpiAtlas_EpiRR_metadata_all.csv.
Note that their column names differ from those used here, because the names changed during the harmonization process.

Diff

The overall diff between v1.2 and v1.3 can be found at openrefine/v1.3/diff_v1.2_v1.3.json

Metadata Standard

Please keep in mind that we try to stay as close to the IHEC Metadata Standard as possible.

Column descriptions:

The table below describes the columns in the core metadata table.

| Core Column | Corresponding Extended Column | Examples | Explanation | # Not Null (%) |
|---|---|---|---|---|
| EpiRR Ordering | EpiRR_ordering | 1 2279 | Index in the table. Ordering of rows is now based on the following columns (harmonized_sample_ontology_term_high_order_fig1 and harmonized_sample_ontology_intermediate ordered manually; age sorted as double; other columns sorted ignoring case) in this order: harmonized_sample_ontology_term_high_order_fig1, harmonized_sample_ontology_intermediate, harmonized_sample_label, harmonized_sample_disease_high, harmonized_sample_disease_intermediate, harmonized_donor_sex, automated_harmonized_donor_age_in_years, and EpiRR. | 2279 (100.0%) |
| EpiRR | EpiRR | IHECRE00000001.4 | EpiRR identifier. The number behind the dot (.) is the version. | 2279 (100.0%) |
| Biospecimen Disease | harmonized_sample_disease_high | Healthy/None Cancer Disease | A manually refined higher level annotation describing the disease using only three categories: Healthy/None, Cancer, Disease. | 2279 (100.0%) |
| Broad Biospecimen Label | harmonized_sample_ontology_term_high_order_fig1 | T lymphocyte epithelial stem cell | Semi-manual merging of values from harmonized_sample_ontology_intermediate by Jonathan Steif. Had been applied to a preliminary v1.2. | 2279 (100.0%) |
| Broad Colour | harmonized_sample_ontology_term_high_order_fig1_color | "168,90,36" "143,81,121" | A color mapping for the entries in harmonized_sample_ontology_term_high_order_fig1. | 2279 (100.0%) |
| Intermediate Biospecimen Label | harmonized_sample_ontology_intermediate | T cell epithelial cell derived cell line | A manually refined higher level annotation describing the samples using ancestors in the ontology. | 2279 (100.0%) |
| Intermediate Colour | harmonized_sample_ontology_intermediate_color | "143,81,121" | A unique color for each unique entry in harmonized_sample_ontology_intermediate. | 2279 (100.0%) |
| Curated Biospecimen Label | harmonized_sample_label | B Lymphocyte Acute Lymphoblastic Leukemia | Sample label based on sample ontology and sample disease using common terms that might connect multiple ontologies or columns by Martin Hirst. | 2279 (100.0%) |
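The two colour columns hold values such as "168,90,36". A minimal sketch for turning such a value into something a plotting library can use follows; it assumes the three numbers are red, green and blue components in the range 0 to 255, which the example values suggest but these notes do not state. The helper names `parse_color` and `to_hex` are hypothetical.

```python
def parse_color(value):
    """Turn a string such as '"168,90,36"' into a tuple of three ints."""
    # Surrounding quotes may or may not survive CSV parsing; strip them.
    parts = value.strip().strip('"').split(",")
    if len(parts) != 3:
        raise ValueError(f"expected three components, got {value!r}")
    rgb = tuple(int(part) for part in parts)
    # Assumption: components are 8-bit RGB values.
    if any(not 0 <= component <= 255 for component in rgb):
        raise ValueError(f"component out of range in {value!r}")
    return rgb


def to_hex(rgb):
    """Format an (r, g, b) tuple as a '#rrggbb' string."""
    return "#{:02x}{:02x}{:02x}".format(*rgb)


broad = parse_color('"168,90,36"')
print(broad, to_hex(broad))
```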

The table below describes the columns included in the extended metadata table.

| Extended Column | Corresponding Core Column | Examples | Explanation | # Not Null (%) |
|---|---|---|---|---|
| EpiRR | EpiRR | IHECRE00000001.4 | EpiRR identifier. The number behind the dot (.) is the version. | 2279 (100.0%) |
| project | | CEEHRC BLUEPRINT | The project from which the epigenome originated. | 2279 (100.0%) |
| harmonized_biomaterial_type | | cell line primary cell primary cell culture primary tissue | One of primary cell, primary cell culture, cell line, primary tissue. | 2279 (100.0%) |
| harmonized_sample_label | Curated Biospecimen Label | B Lymphocyte Acute Lymphoblastic Leukemia | Sample label based on sample ontology and sample disease using common terms that might connect multiple ontologies or columns by Martin Hirst. | 2279 (100.0%) |
| harmonized_sample_ontology_intermediate | Intermediate Biospecimen Label | T cell epithelial cell derived cell line | A manually refined higher level annotation describing the samples using ancestors in the ontology. | 2279 (100.0%) |
| harmonized_sample_ontology_intermediate_color | Intermediate Colour | "143,81,121" | A unique color for each unique entry in harmonized_sample_ontology_intermediate. | 2279 (100.0%) |
| harmonized_sample_disease_high | Biospecimen Disease | Healthy/None Cancer Disease | A manually refined higher level annotation describing the disease using only three categories: Healthy/None, Cancer, Disease. | 2279 (100.0%) |
| harmonized_sample_disease_intermediate | | Carcinoma Leukemia | A manually refined higher level annotation describing the disease for this particular sample using ancestors in the NCIT ontology. NCIM CURIEs were mapped to NCIT CURIEs, see version 0.9 for explanation. | 2279 (100.0%) |
| harmonized_EpiRR_status | | Complete Partial | Whether this epigenome is Complete or Partial. | 2279 (100.0%) |
| epiATLAS_status | | Complete Partial Complete_imputed | Equivalent to harmonized_EpiRR_status but referring to the reprocessed data rather than original submitted data, describing the status of the reference epigenome with the additional information of full epigenomes when using imputed data. | 2279 (100.0%) |
| harmonized_cell_type | | myeloid cell effector memory CD8-positive, alpha-beta T cell | The cell type and main sample ontology classification for entries where biomaterial_type is primary cell or primary cell culture. | 1562 (68.5%) |
| harmonized_cell_line | | MCF 10A | The cell line and main sample ontology classification for entries where biomaterial_type is cell line. | 73 (3.2%) |
| harmonized_tissue_type | | skeletal muscle tissue amygdala | The tissue type and main sample ontology classification for entries where biomaterial_type is primary tissue. | 2008 (88.1%) |
| harmonized_sample_ontology_curie | | CL:0000990 UBERON:0001876 EFO:0001200 | The CURIE identifying the sample ontology term. Different ontologies are used, depending on the biomaterial_type: 'CL' for primary cell or primary cell culture, 'EFO' for cell line and 'UBERON' for primary tissue. | 2279 (100.0%) |
| harmonized_cell_markers | | CD3+ CD4+ CD45RA+ CD3- CD19- CD56- | Markers used to isolate and identify the cell type, when applicable. | 1144 (50.2%) |
| automated_harmonized_sample_ontology | | CL UBERON EFO | Automatic extraction from harmonized_sample_ontology_curie. The ontology corresponding to the CURIE, mostly used for other automatic extractions. | 2279 (100.0%) |
| automated_harmonized_sample_ontology_term | | myeloid cell MCF 10A amygdala | Automatic extraction from harmonized_sample_ontology_curie. The term corresponding to the CURIE, mostly used for detecting inconsistencies. | 2279 (100.0%) |
| harmonized_sample_ontology_term_high_order_fig1 | Broad Biospecimen Label | T lymphocyte epithelial stem cell | Semi-manual merging of values from harmonized_sample_ontology_intermediate by Jonathan Steif. Had been applied to a preliminary v1.2. | 2279 (100.0%) |
| harmonized_sample_ontology_term_high_order_fig1_color | Broad Colour | "168,90,36" "143,81,121" | A color mapping for the entries in harmonized_sample_ontology_term_high_order_fig1. | 2279 (100.0%) |
| harmonized_sample_organ_system_order_AnetaMikulasova | | Immune System Nervous | Annotation of organ system by Aneta Mikulasova. Had been applied to a preliminary v1.2. | 2279 (100.0%) |
| harmonized_sample_organ_order_AnetaMikulasova | | blood-venous brain x | Annotation of organ by Aneta Mikulasova. Had been applied to a preliminary v1.2. x if not applicable. | 2279 (100.0%) |
| harmonized_sample_organ_part_or_lineage_order_AnetaMikulasova | | Myeloid Lymphoid x frontal-lobe-brodmann-area-9 | Annotation of organ part or lineage by Aneta Mikulasova. Had been applied to a preliminary v1.2. x if not applicable. | 2279 (100.0%) |
| harmonized_sample_cell_order_AnetaMikulasova | | Tcell Bcell x | Annotation of cell type by Aneta Mikulasova. Had been applied to a preliminary v1.2. x if not applicable. | 2279 (100.0%) |
| harmonized_sample_cell_2_order_AnetaMikulasova | | CD4 mature x | Annotation of cell subtype by Aneta Mikulasova. Had been applied to a preliminary v1.2. x if not applicable. | 2279 (100.0%) |
| harmonized_sample_cell_3_order_AnetaMikulasova | | alpha-beta helper x | Annotation of cell subsubtype by Aneta Mikulasova. Had been applied to a preliminary v1.2. x if not applicable. | 2279 (100.0%) |
| harmonized_sample_cancer_type_order_AnetaMikulasova | | CLL AML x | Annotation of cancer type by Aneta Mikulasova. Had been applied to a preliminary v1.2. x if not applicable. | 2279 (100.0%) |
| harmonized_sample_cancer_subtype_order_AnetaMikulasova | | hepatocellular anaplastic x | Annotation of cancer subtype by Aneta Mikulasova. Had been applied to a preliminary v1.2. x if not applicable. | 2279 (100.0%) |
| harmonized_sample_disease | | Breast Carcinoma Acute Promyelocytic Leukemia with PML-RARA | This attribute reflects the disease for this particular sample, not the donor health condition. | 2142 (94.0%) |
| harmonized_sample_disease_ontology_curie | | NCIM:C0678222 NCIM:C0023487 | The CURIE identifying the NCIM disease ontology term. | 2142 (94.0%) |
| automated_harmonized_sample_disease_ontology_curie_ncit | | NCIT:C41132 NCIT:C4872 | Automatic extraction from harmonized_sample_disease_ontology_curie, mostly used for other automatic extractions. | 2134 (93.6%) |
| harmonized_donor_type | | Single donor Composite Pooled samples | Composite is a reference generated from analysis objects generated from multiple individuals, e.g., H3K27ac ChIP-seq is from subject A and RNA-seq is from subject B. Pooled samples are references generated from a biological pool, for example cord blood from 134 individual cords pooled together. | 2279 (100.0%) |
| harmonized_donor_id | | CEMT0007 C07015 | Identifier for donors within their projects. | 2116 (92.8%) |
| harmonized_donor_age | | 60-65 unknown 46 | Age of donor. Can be an interval. | 2279 (100.0%) |
| harmonized_donor_age_unit | | year day week unknown | Age unit of donor. | 2279 (100.0%) |
| automated_harmonized_donor_age_in_years | | 32.5 67.5 | Age of donor converted to years (mean for intervals). | 1678 (73.6%) |
| harmonized_donor_life_stage | | adult child embryonic fetal newborn postnatal unknown | Life stage of donor. Corrected and complemented using EpiClass. | 2279 (100.0%) |
| harmonized_donor_life_stage_uncorrected | | adult child embryonic fetal newborn postnatal unknown | Life stage of donor. Uncorrected and uncomplemented. | 2279 (100.0%) |
| harmonized_donor_sex | | female male mixed unknown | Sex of donor. Corrected and complemented using EpiClass. | 2279 (100.0%) |
| harmonized_donor_sex_uncorrected | | female male mixed unknown | Sex of donor. Uncorrected and uncomplemented. | 2279 (100.0%) |
| harmonized_donor_health_status | | Breast Carcinoma Acute Promyelocytic Leukemia with PML-RARA | The health status of the donor that provided the sample. Does not describe the disease for this particular sample. | 982 (43.1%) |
| harmonized_donor_health_status_ontology_curie | | NCIM:C0023487 NCIM:C0678222 | The CURIE identifying the NCIM donor health status ontology term. | 982 (43.1%) |
| automated_harmonized_donor_health_status_ontology_curie_ncit | | NCIT:C3167 | Automatic extraction from harmonized_donor_health_status_ontology_curie, mostly used for other automatic extractions. | 961 (42.2%) |
| automated_experiments_ChIP-Seq_H3K27ac | | f71ea030-5c25-4b10-8d23-afc537e49870 imputed | Contains the uuid for observed data, or imputed if only imputed data is available. | 1698 (74.5%) |
| automated_experiments_ChIP-Seq_H3K27me3 | | " | Contains the uuid for observed data, or imputed if only imputed data is available. | 1698 (74.5%) |
| automated_experiments_ChIP-Seq_H3K36me3 | | " | Contains the uuid for observed data, or imputed if only imputed data is available. | 1698 (74.5%) |
| automated_experiments_ChIP-Seq_H3K4me1 | | " | Contains the uuid for observed data, or imputed if only imputed data is available. | 1698 (74.5%) |
| automated_experiments_ChIP-Seq_H3K4me3 | | " | Contains the uuid for observed data, or imputed if only imputed data is available. | 1698 (74.5%) |
| automated_experiments_ChIP-Seq_H3K9me3 | | " | Contains the uuid for observed data, or imputed if only imputed data is available. | 1698 (74.5%) |
| automated_experiments_WGBS_standard | | " | Contains the uuid for observed data, or imputed if only imputed data is available. | 1859 (81.6%) |
| automated_experiments_WGBS_PBAT | | " | Contains the uuid for observed data. Imputed WGBS can be found in column automated_experiments_WGBS_standard. | 132 (5.8%) |
| automated_experiments_RNA-Seq_mRNA-Seq | | " | Contains the uuid for observed data. RNA-Seq data not imputed at this point. | 396 (17.4%) |
| automated_experiments_RNA-Seq_total-RNA-Seq | | " | Contains the uuid for observed data. RNA-Seq data not imputed at this point. | 1159 (50.9%) |
| epirr_id_without_version | | IHECRE00000001 | EpiRR identifier without version. | 2279 (100.0%) |
| EpiRR_ordering | EpiRR Ordering | 1 2279 | Index in the table. Ordering of rows is now based on the following columns (harmonized_sample_ontology_term_high_order_fig1 and harmonized_sample_ontology_intermediate ordered manually; age sorted as double; other columns sorted ignoring case) in this order: harmonized_sample_ontology_term_high_order_fig1, harmonized_sample_ontology_intermediate, harmonized_sample_label, harmonized_sample_disease_high, harmonized_sample_disease_intermediate, harmonized_donor_sex, automated_harmonized_donor_age_in_years, and EpiRR. | 2279 (100.0%) |
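Several of the automated columns above are described as simple derivations from other columns. The following sketch shows how three of them could be recomputed: epirr_id_without_version from EpiRR, automated_harmonized_sample_ontology from harmonized_sample_ontology_curie, and automated_harmonized_donor_age_in_years from harmonized_donor_age and harmonized_donor_age_unit. It is not the code used to build the table. The function names are hypothetical, and the conversion factors for days and weeks (a year of 365.25 days) are an assumption, since these notes only say that ages are converted to years and that intervals use the mean.

```python
def split_epirr(epirr):
    """Split 'IHECRE00000001.4' into ('IHECRE00000001', 4)."""
    base, dot, version = epirr.rpartition(".")
    if not dot or not base:
        raise ValueError(f"no version suffix in {epirr!r}")
    return base, int(version)


def curie_prefix(curie):
    """Return the ontology part of a CURIE, e.g. 'CL' for 'CL:0000990'."""
    prefix, colon, _ = curie.partition(":")
    if not colon:
        raise ValueError(f"not a CURIE: {curie!r}")
    return prefix


# Assumption: a year is taken as 365.25 days for day and week units.
YEARS_PER_UNIT = {"year": 1.0, "week": 7 / 365.25, "day": 1 / 365.25}


def age_in_years(age, unit):
    """Convert an age or an interval such as '60-65' to years, or None."""
    factor = YEARS_PER_UNIT.get(unit)
    if factor is None:
        return None  # e.g. unit is 'unknown'
    try:
        bounds = [float(part) for part in age.split("-")]
    except ValueError:
        return None  # e.g. age is 'unknown'
    if len(bounds) not in (1, 2):
        return None
    # For an interval, use the mean of its two bounds.
    return sum(bounds) / len(bounds) * factor


print(split_epirr("IHECRE00000001.4"))
print(curie_prefix("UBERON:0001876"))
print(age_in_years("60-65", "year"))
```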