Version 1.3
At this time, this repository is only for sample metadata, not experiment metadata.
For more information about experiment metadata, see the IHEC Data Portal and EpiATLAS.
You can also find this metadata on EpiRR.
The current version of the sample metadata contains 2279 EpiRR entries. This is the number of entries for which reprocessed data is available.
The CSV for the core sample metadata can be found at openrefine/v1.3/IHEC_sample_metadata_harmonization.v1.3.csv and the extended version at openrefine/v1.3/IHEC_sample_metadata_harmonization.v1.3_extended.csv
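A minimal sketch of reading one of these files with Python's standard `csv` module. It assumes the files are comma-separated UTF-8 text with a header row. The helper names `read_metadata` and `load_metadata` are hypothetical, and the in-memory sample holds illustrative values, not real rows.

```python
import csv
import io

# Path as given in this README; adjust it if your checkout differs.
CORE_CSV = "openrefine/v1.3/IHEC_sample_metadata_harmonization.v1.3.csv"


def read_metadata(handle):
    """Read a metadata CSV from an open text handle into a list of dict rows."""
    return list(csv.DictReader(handle))


def load_metadata(path=CORE_CSV):
    """Open the CSV at `path` and return its rows (one dict per EpiRR entry)."""
    with open(path, newline="", encoding="utf-8") as handle:
        return read_metadata(handle)


# Tiny in-memory stand-in for the real file (illustrative values only).
sample = io.StringIO(
    "EpiRR,Biospecimen Disease,Broad Colour\n"
    'IHECRE00000001.4,Cancer,"168,90,36"\n'
)
rows = read_metadata(sample)
print(rows[0]["EpiRR"], rows[0]["Broad Colour"])
```

Quoted fields such as the colour values keep their embedded commas, so a colour like `"168,90,36"` arrives as a single string.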
News
- There is now a core version of the metadata table containing updated column names that are designed to be easier to interpret. The core version contains a selected set of columns from the extended version. The core column names are mapped to the names in the extended version, as shown in the table below.
- Added column Curated Biospecimen Label/harmonized_sample_label: The most detailed sample label is derived through manual curation of the sample ontology, disease, and cell surface markers, utilizing widely accepted terms. In some cases, curated biospecimen labels reference multiple intermediate biospecimen labels to maintain alignment with the ontologies reported by the production centers. Assigned by Martin Hirst with the help of others.
- Added column EpiRR Ordering/EpiRR_ordering, giving the index of the EpiRR entry in the table.
- Ordering of rows is now based on the following columns in this order: harmonized_sample_ontology_term_high_order_fig1, harmonized_sample_ontology_intermediate, harmonized_sample_label, harmonized_sample_disease_high, harmonized_sample_disease_intermediate, harmonized_donor_sex, automated_harmonized_donor_age_in_years, and EpiRR (harmonized_sample_ontology_term_high_order_fig1 and harmonized_sample_ontology_intermediate ordered manually; age sorted as double; other columns sorted ignoring case).
- Some changes in harmonized_sample_ontology_intermediate.
- Fixed harmonized_donor_life_stage for 5 entries.
- For 12 entries: reassignment of harmonized_biomaterial_type and consequently changes in harmonized_cell_type, harmonized_sample_ontology_intermediate, and harmonized_sample_ontology_curie.
- harmonized_sample_ontology_term_high_order_fig1_color contains a coloring for each value in harmonized_sample_ontology_term_high_order_fig1.
- harmonized_sample_ontology_intermediate_color contains a coloring for each value in harmonized_sample_ontology_intermediate.
- In addition to the columns harmonized_donor_sex and harmonized_donor_life_stage that have been complemented and corrected, based on the high confidence predictions of the EpiClass tool, the extended version now contains the columns without these corrections, i.e., ${column}_uncorrected.
- The columns containing information about whether data is available have been renamed to contain the assay name, e.g., automated_experiments_ChIP-Seq_H3K27ac. WGBS and RNA-Seq columns have been separated by PBAT vs. standard and mRNA-Seq vs. total-RNA-Seq.
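The row ordering described in the list above can be sketched as a sort key. This is an illustration only: the manual orderings of the two ontology columns are not given in this README, so `MANUAL_HIGH_ORDER` and `MANUAL_INTERMEDIATE_ORDER` are hypothetical placeholders, the example rows are made up, and placing rows without an age last is an assumption.

```python
# Hypothetical manual orderings; the real ones are not listed in this README.
MANUAL_HIGH_ORDER = ["T lymphocyte", "stem cell"]
MANUAL_INTERMEDIATE_ORDER = ["T cell", "stem cell"]


def make_sort_key(high_order, intermediate_order):
    """Build a key: manual rank, manual rank, case-insensitive text, numeric age, EpiRR."""
    high_rank = {value: i for i, value in enumerate(high_order)}
    inter_rank = {value: i for i, value in enumerate(intermediate_order)}

    def key(row):
        age_text = row["automated_harmonized_donor_age_in_years"]
        # Assumption: rows without an age sort after rows with one.
        age = float(age_text) if age_text else float("inf")
        return (
            high_rank[row["harmonized_sample_ontology_term_high_order_fig1"]],
            inter_rank[row["harmonized_sample_ontology_intermediate"]],
            row["harmonized_sample_label"].lower(),
            row["harmonized_sample_disease_high"].lower(),
            row["harmonized_sample_disease_intermediate"].lower(),
            row["harmonized_donor_sex"].lower(),
            age,  # compared as a float, so 9 sorts before 32.5
            row["EpiRR"].lower(),
        )

    return key


def example_row(epirr, high, intermediate, age):
    """Made-up row that varies only in the fields relevant to this demo."""
    return {
        "EpiRR": epirr,
        "harmonized_sample_ontology_term_high_order_fig1": high,
        "harmonized_sample_ontology_intermediate": intermediate,
        "harmonized_sample_label": "example label",
        "harmonized_sample_disease_high": "Healthy/None",
        "harmonized_sample_disease_intermediate": "None",
        "harmonized_donor_sex": "female",
        "automated_harmonized_donor_age_in_years": age,
    }


rows = [
    example_row("IHECRE00000001.1", "stem cell", "stem cell", "46"),
    example_row("IHECRE00000002.1", "T lymphocyte", "T cell", ""),
    example_row("IHECRE00000003.1", "T lymphocyte", "T cell", "32.5"),
    example_row("IHECRE00000004.1", "T lymphocyte", "T cell", "9"),
]
ordered = sorted(rows, key=make_sort_key(MANUAL_HIGH_ORDER, MANUAL_INTERMEDIATE_ORDER))
ordered_ids = [row["EpiRR"] for row in ordered]
print(ordered_ids)
```

Sorting the age as a number rather than as text matters here: as strings, "32.5" would sort before "9".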
Raw Files
In case you are interested in the raw files that the harmonization process was based on, those can be found at raw/EpiAtlas_EpiRR_metadata_all.csv.
Note that they contain different column names, as they changed during the harmonization process.
Diff
The overall diff between v1.2 and v1.3 can be found at openrefine/v1.3/diff_v1.2_v1.3.json
Metadata Standard
Please keep in mind that we try to stay as close to the IHEC Metadata Standard as possible.
Column descriptions:
The table below describes the columns in the core metadata table.
Core Column | Corresponding Extended Column | Examples | Explanation | # Not Null (%) |
---|---|---|---|---|
EpiRR Ordering | EpiRR_ordering | 1 2279 | Index in the table. Ordering of rows is now based on the following columns (harmonized_sample_ontology_term_high_order_fig1 and harmonized_sample_ontology_intermediate ordered manually; age sorted as double; other columns sorted ignoring case) in this order: harmonized_sample_ontology_term_high_order_fig1, harmonized_sample_ontology_intermediate, harmonized_sample_label, harmonized_sample_disease_high, harmonized_sample_disease_intermediate, harmonized_donor_sex, automated_harmonized_donor_age_in_years, and EpiRR. | 2279 (100.0%) |
EpiRR | EpiRR | IHECRE00000001.4 | EpiRR identifier. The number behind the dot (.) is the version. | 2279 (100.0%) |
Biospecimen Disease | harmonized_sample_disease_high | Healthy/None Cancer Disease | A manually refined higher level annotation describing the disease using only three categories: Healthy/None, Cancer, Disease. | 2279 (100.0%) |
Broad Biospecimen Label | harmonized_sample_ontology_term_high_order_fig1 | T lymphocyte epithelial stem cell | Semi-manual merging of values from harmonized_sample_ontology_intermediate by Jonathan Steif. Had been applied to a preliminary v1.2. | 2279 (100.0%) |
Broad Colour | harmonized_sample_ontology_term_high_order_fig1_color | "168,90,36" "143,81,121" | A color mapping for the entries in harmonized_sample_ontology_term_high_order_fig1. | 2279 (100.0%) |
Intermediate Biospecimen Label | harmonized_sample_ontology_intermediate | T cell epithelial cell derived cell line | A manually refined higher level annotation describing the samples using ancestors in the ontology. | 2279 (100.0%) |
Intermediate Colour | harmonized_sample_ontology_intermediate_color | "143,81,121" | A unique color for each unique entry in harmonized_sample_ontology_intermediate. | 2279 (100.0%) |
Curated Biospecimen Label | harmonized_sample_label | B Lymphocyte Acute Lymphoblastic Leukemia | Sample label based on sample ontology and sample disease using common terms that might connect multiple ontologies or columns by Martin Hirst. | 2279 (100.0%) |
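The core-to-extended mapping in the table above can be written down as a dictionary, for example to derive a core-style row from an extended row. This is a sketch covering only the eight pairs listed in the table; `CORE_TO_EXTENDED` and `to_core` are hypothetical names, and the example row uses illustrative values.

```python
# Core column name -> extended column name, as listed in the core table.
CORE_TO_EXTENDED = {
    "EpiRR Ordering": "EpiRR_ordering",
    "EpiRR": "EpiRR",
    "Biospecimen Disease": "harmonized_sample_disease_high",
    "Broad Biospecimen Label": "harmonized_sample_ontology_term_high_order_fig1",
    "Broad Colour": "harmonized_sample_ontology_term_high_order_fig1_color",
    "Intermediate Biospecimen Label": "harmonized_sample_ontology_intermediate",
    "Intermediate Colour": "harmonized_sample_ontology_intermediate_color",
    "Curated Biospecimen Label": "harmonized_sample_label",
}


def to_core(extended_row):
    """Select and rename the extended columns that have a core counterpart."""
    return {core: extended_row[ext] for core, ext in CORE_TO_EXTENDED.items()}


# Illustrative extended row (values are not taken from the real table).
extended_row = {
    "EpiRR_ordering": "1",
    "EpiRR": "IHECRE00000001.4",
    "harmonized_sample_disease_high": "Cancer",
    "harmonized_sample_ontology_term_high_order_fig1": "T lymphocyte",
    "harmonized_sample_ontology_term_high_order_fig1_color": "168,90,36",
    "harmonized_sample_ontology_intermediate": "T cell",
    "harmonized_sample_ontology_intermediate_color": "143,81,121",
    "harmonized_sample_label": "example label",
    "project": "BLUEPRINT",  # extended-only column, dropped by to_core
}
core_row = to_core(extended_row)
print(core_row["Broad Biospecimen Label"])
```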
The table below describes the columns included in the extended metadata table.
Extended Column | Corresponding Core Column | Examples | Explanation | # Not Null (%) |
---|---|---|---|---|
EpiRR | EpiRR | IHECRE00000001.4 | EpiRR identifier. The number behind the dot (.) is the version. | 2279 (100.0%) |
project | | CEEHRC BLUEPRINT | The project from which the epigenome originated. | 2279 (100.0%) |
harmonized_biomaterial_type | | cell line primary cell primary cell culture primary tissue | One of primary cell, primary cell culture, cell line, primary tissue. | 2279 (100.0%) |
harmonized_sample_label | Curated Biospecimen Label | B Lymphocyte Acute Lymphoblastic Leukemia | Sample label based on sample ontology and sample disease using common terms that might connect multiple ontologies or columns by Martin Hirst. | 2279 (100.0%) |
harmonized_sample_ontology_intermediate | Intermediate Biospecimen Label | T cell epithelial cell derived cell line | A manually refined higher level annotation describing the samples using ancestors in the ontology. | 2279 (100.0%) |
harmonized_sample_ontology_intermediate_color | Intermediate Colour | "143,81,121" | A unique color for each unique entry in harmonized_sample_ontology_intermediate. | 2279 (100.0%) |
harmonized_sample_disease_high | Biospecimen Disease | Healthy/None Cancer Disease | A manually refined higher level annotation describing the disease using only three categories: Healthy/None, Cancer, Disease. | 2279 (100.0%) |
harmonized_sample_disease_intermediate | | Carcinoma Leukemia | A manually refined higher level annotation describing the disease for this particular sample using ancestors in the NCIT ontology. NCIM CURIEs were mapped to NCIT CURIEs, see version 0.9 for explanation. | 2279 (100.0%) |
harmonized_EpiRR_status | | Complete Partial | Whether this epigenome is Complete or Partial. | 2279 (100.0%) |
epiATLAS_status | | Complete Partial Complete_imputed | Equivalent to harmonized_EpiRR_status but referring to the reprocessed data rather than original submitted data, describing the status of the reference epigenome with the additional information of full epigenomes when using imputed data. | 2279 (100.0%) |
harmonized_cell_type | | myeloid cell effector memory CD8-positive, alpha-beta T cell | The cell type and main sample ontology classification for entries where biomaterial_type is primary cell or primary cell culture. | 1562 (68.5%) |
harmonized_cell_line | | MCF 10A | The cell line and main sample ontology classification for entries where biomaterial_type is cell line. | 73 (3.2%) |
harmonized_tissue_type | | skeletal muscle tissue amygdala | The tissue type and main sample ontology classification for entries where biomaterial_type is primary tissue. | 2008 (88.1%) |
harmonized_sample_ontology_curie | | CL:0000990 UBERON:0001876 EFO:0001200 | The CURIE identifying the sample ontology term. Different ontologies are used, depending on the biomaterial_type: 'CL' for primary cell or primary cell culture, 'EFO' for cell line and 'UBERON' for primary tissue. | 2279 (100.0%) |
harmonized_cell_markers | | CD3+ CD4+ CD45RA+ CD3- CD19- CD56- | Markers used to isolate and identify the cell type, when applicable. | 1144 (50.2%) |
automated_harmonized_sample_ontology | | CL UBERON EFO | Automatic extraction from harmonized_sample_ontology_curie. The ontology corresponding to the CURIE, mostly used for other automatic extractions. | 2279 (100.0%) |
automated_harmonized_sample_ontology_term | | myeloid cell MCF 10A amygdala | Automatic extraction from harmonized_sample_ontology_curie. The term corresponding to the CURIE, mostly used for detecting inconsistencies. | 2279 (100.0%) |
harmonized_sample_ontology_term_high_order_fig1 | Broad Biospecimen Label | T lymphocyte epithelial stem cell | Semi-manual merging of values from harmonized_sample_ontology_intermediate by Jonathan Steif. Had been applied to a preliminary v1.2. | 2279 (100.0%) |
harmonized_sample_ontology_term_high_order_fig1_color | Broad Colour | "168,90,36" "143,81,121" | A color mapping for the entries in harmonized_sample_ontology_term_high_order_fig1. | 2279 (100.0%) |
harmonized_sample_organ_system_order_AnetaMikulasova | | Immune System Nervous | Annotation of organ system by Aneta Mikulasova. Had been applied to a preliminary v1.2. | 2279 (100.0%) |
harmonized_sample_organ_order_AnetaMikulasova | | blood-venous brain x | Annotation of organ by Aneta Mikulasova. Had been applied to a preliminary v1.2. x if not applicable. | 2279 (100.0%) |
harmonized_sample_organ_part_or_lineage_order_AnetaMikulasova | | Myeloid Lymphoid x frontal-lobe-brodmann-area-9 | Annotation of organ part or lineage by Aneta Mikulasova. Had been applied to a preliminary v1.2. x if not applicable. | 2279 (100.0%) |
harmonized_sample_cell_order_AnetaMikulasova | | Tcell Bcell x | Annotation of cell type by Aneta Mikulasova. Had been applied to a preliminary v1.2. x if not applicable. | 2279 (100.0%) |
harmonized_sample_cell_2_order_AnetaMikulasova | | CD4 mature x | Annotation of cell subtype by Aneta Mikulasova. Had been applied to a preliminary v1.2. x if not applicable. | 2279 (100.0%) |
harmonized_sample_cell_3_order_AnetaMikulasova | | alpha-beta helper x | Annotation of cell subsubtype by Aneta Mikulasova. Had been applied to a preliminary v1.2. x if not applicable. | 2279 (100.0%) |
harmonized_sample_cancer_type_order_AnetaMikulasova | | CLL AML x | Annotation of cancer type by Aneta Mikulasova. Had been applied to a preliminary v1.2. x if not applicable. | 2279 (100.0%) |
harmonized_sample_cancer_subtype_order_AnetaMikulasova | | hepatocellular anaplastic x | Annotation of cancer subtype by Aneta Mikulasova. Had been applied to a preliminary v1.2. x if not applicable. | 2279 (100.0%) |
harmonized_sample_disease | | Breast Carcinoma Acute Promyelocytic Leukemia with PML-RARA | This attribute reflects the disease for this particular sample, not the donor health condition. | 2142 (94.0%) |
harmonized_sample_disease_ontology_curie | | NCIM:C0678222 NCIM:C0023487 | The CURIE identifying the NCIM disease ontology term. | 2142 (94.0%) |
automated_harmonized_sample_disease_ontology_curie_ncit | | NCIT:C41132 NCIT:C4872 | Automatic extraction from harmonized_sample_disease_ontology_curie, mostly used for other automatic extractions. | 2134 (93.6%) |
harmonized_donor_type | | Single donor Composite Pooled samples | Composite is a reference generated from analysis objects that come from multiple individuals, e.g., H3K27ac ChIP-seq is from subject A and RNA-seq is from subject B. Pooled samples are references generated from a biological pool, for example cord blood from 134 individual cords pooled together. | 2279 (100.0%) |
harmonized_donor_id | | CEMT0007 C07015 | Identifier for donors within their projects. | 2116 (92.8%) |
harmonized_donor_age | | 60-65 unknown 46 | Age of donor. Can be an interval. | 2279 (100.0%) |
harmonized_donor_age_unit | | year day week unknown | Age unit of donor. | 2279 (100.0%) |
automated_harmonized_donor_age_in_years | | 32.5 67.5 | Age of donor converted to years (mean for intervals). | 1678 (73.6%) |
harmonized_donor_life_stage | | adult child embryonic fetal newborn postnatal unknown | Life stage of donor. Corrected and complemented using EpiClass. | 2279 (100.0%) |
harmonized_donor_life_stage_uncorrected | | adult child embryonic fetal newborn postnatal unknown | Life stage of donor. Uncorrected and uncomplemented. | 2279 (100.0%) |
harmonized_donor_sex | | female male mixed unknown | Sex of donor. Corrected and complemented using EpiClass. | 2279 (100.0%) |
harmonized_donor_sex_uncorrected | | female male mixed unknown | Sex of donor. Uncorrected and uncomplemented. | 2279 (100.0%) |
harmonized_donor_health_status | | Breast Carcinoma Acute Promyelocytic Leukemia with PML-RARA | The health status of the donor that provided the sample. Does not describe the disease for this particular sample. | 982 (43.1%) |
harmonized_donor_health_status_ontology_curie | | NCIM:C0023487 NCIM:C0678222 | The CURIE identifying the NCIM donor health status ontology term. | 982 (43.1%) |
automated_harmonized_donor_health_status_ontology_curie_ncit | | NCIT:C3167 | Automatic extraction from harmonized_donor_health_status_ontology_curie, mostly used for other automatic extractions. | 961 (42.2%) |
automated_experiments_ChIP-Seq_H3K27ac | | f71ea030-5c25-4b10-8d23-afc537e49870 imputed | Contains the uuid for observed data, or imputed if only imputed data is available. | 1698 (74.5%) |
automated_experiments_ChIP-Seq_H3K27me3 | | " | Contains the uuid for observed data, or imputed if only imputed data is available. | 1698 (74.5%) |
automated_experiments_ChIP-Seq_H3K36me3 | | " | Contains the uuid for observed data, or imputed if only imputed data is available. | 1698 (74.5%) |
automated_experiments_ChIP-Seq_H3K4me1 | | " | Contains the uuid for observed data, or imputed if only imputed data is available. | 1698 (74.5%) |
automated_experiments_ChIP-Seq_H3K4me3 | | " | Contains the uuid for observed data, or imputed if only imputed data is available. | 1698 (74.5%) |
automated_experiments_ChIP-Seq_H3K9me3 | | " | Contains the uuid for observed data, or imputed if only imputed data is available. | 1698 (74.5%) |
automated_experiments_WGBS_standard | | " | Contains the uuid for observed data, or imputed if only imputed data is available. | 1859 (81.6%) |
automated_experiments_WGBS_PBAT | | " | Contains the uuid for observed data. Imputed WGBS can be found in column automated_experiments_WGBS_standard. | 132 (5.8%) |
automated_experiments_RNA-Seq_mRNA-Seq | | " | Contains the uuid for observed data. RNA-Seq data not imputed at this point. | 396 (17.4%) |
automated_experiments_RNA-Seq_total-RNA-Seq | | " | Contains the uuid for observed data. RNA-Seq data not imputed at this point. | 1159 (50.9%) |
epirr_id_without_version | | IHECRE00000001 | EpiRR identifier without version. | 2279 (100.0%) |
EpiRR_ordering | EpiRR Ordering | 1 2279 | Index in the table. Ordering of rows is now based on the following columns (harmonized_sample_ontology_term_high_order_fig1 and harmonized_sample_ontology_intermediate ordered manually; age sorted as double; other columns sorted ignoring case) in this order: harmonized_sample_ontology_term_high_order_fig1, harmonized_sample_ontology_intermediate, harmonized_sample_label, harmonized_sample_disease_high, harmonized_sample_disease_intermediate, harmonized_donor_sex, automated_harmonized_donor_age_in_years, and EpiRR. | 2279 (100.0%) |
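The automated_experiments_* columns described above hold either a uuid or the literal value imputed. A minimal sketch of classifying them per row follows. It assumes that a null entry is read as an empty string, as Python's `csv` module would return for an empty cell. The function names `availability` and `summarize_experiments` are hypothetical, and the example row is illustrative rather than taken from the table.

```python
PREFIX = "automated_experiments_"


def availability(value):
    """Classify one automated_experiments_* cell as observed, imputed, or missing."""
    if value is None or not value.strip():
        return "missing"  # assumption: null entries arrive as empty strings
    if value.strip() == "imputed":
        return "imputed"
    return "observed"  # any other non-empty value is taken to be a uuid


def summarize_experiments(row):
    """Map each assay name (column name without the prefix) to its availability."""
    return {
        name[len(PREFIX):]: availability(value)
        for name, value in row.items()
        if name.startswith(PREFIX)
    }


# Illustrative row; the combination of values is made up.
row = {
    "EpiRR": "IHECRE00000001.4",
    "automated_experiments_ChIP-Seq_H3K27ac": "f71ea030-5c25-4b10-8d23-afc537e49870",
    "automated_experiments_WGBS_standard": "imputed",
    "automated_experiments_RNA-Seq_mRNA-Seq": "",
}
summary = summarize_experiments(row)
print(summary)
```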