Input & Output
This page describes the parameters, input formats and output formats for CELLECT.
The sections below cover selected parameters in the config file (config.yml).
The ANALYSIS_TYPE parameter determines which analysis types will be run. You can run multiple analysis types in the same workflow.
- prioritization: Performs cell-type prioritization on all cell-types in the specificity input.
- conditional: Performs cell-type prioritization on all cell-types in the specificity input, conditioned on each cell-type listed in the CONDITIONAL_INPUT argument (see config file for details).
- heritability (CELLECT-LDSC only): Estimates the SNP heritability of the cell-types listed in the HERITABILITY_INPUT argument (see config file for details). Note that your specificity values should be constrained to 0-1 for this estimation to be valid. (Specificity input generated by CELLEX is constrained to 0-1.)
- heritability_intervals (CELLECT-LDSC only): Estimates the 'interval heritability' of the cell-types listed in the HERITABILITY_INPUT argument. The heritability is estimated for five equally spaced intervals of the cell-types' expression specificity values: (0-0.2], (0.2-0.4], (0.4-0.6], (0.6-0.8], (0.8-1], as well as the interval including zero values only ([0-0]). Specificity values must be constrained to 0-1.
Path to the expression specificity input matrix containing the cell-types to analyze.
- Tabular file (comma-delimited) with genes in the first column and cell-type annotations in the subsequent columns.
- Cell-type annotation header names must not contain special characters, spaces or double underscores.
- Gene names must be in Ensembl human format.
- The gene column header (i.e. the first column) should be named gene, as seen in the example below.
- The file can be uncompressed or compressed (gz/bz2 formats are supported).
| gene | Bladder.bladder_cell | ... | Trachea.mesenchymal_cell |
|---|---|---|---|
| ENSG00000081791 | 0.43 | ... | 0.11 |
| ... | ... | ... | ... |
| ENSG00000101440 | 0.21 | ... | 0.89 |
NB: The specificity values should be between 0 and 1 if using CELLECT-LDSC to estimate cell-type annotation heritability.
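
The format requirements above can be checked before starting a workflow. The following is a minimal sketch, not part of CELLECT: the function name `check_specificity_matrix` is hypothetical, and the definition of "special characters" (anything other than letters, digits, dots, hyphens and underscores) is an assumption. It also assumes unversioned Ensembl gene IDs and applies the 0-1 range check unconditionally, although that constraint only matters for the heritability analyses.

```python
import csv
import io
import re

# Assumption: "special characters" means anything other than letters,
# digits, dots, hyphens and underscores. Adjust to your needs.
ALLOWED_HEADER = re.compile(r"^[A-Za-z0-9._-]+$")
# Unversioned Ensembl human gene IDs: "ENSG" followed by 11 digits.
ENSEMBL_HUMAN_GENE = re.compile(r"^ENSG\d{11}$")


def check_specificity_matrix(handle):
    """Return a list of human-readable problems found in an open CSV handle."""
    problems = []
    reader = csv.reader(handle)
    header = next(reader)
    if header[0] != "gene":
        problems.append(f"first column is named {header[0]!r}, expected 'gene'")
    for name in header[1:]:
        if "__" in name or not ALLOWED_HEADER.match(name):
            problems.append(f"invalid annotation name: {name!r}")
    for row in reader:
        gene, values = row[0], row[1:]
        if not ENSEMBL_HUMAN_GENE.match(gene):
            problems.append(f"not an Ensembl human gene ID: {gene!r}")
        if any(not 0.0 <= float(v) <= 1.0 for v in values):
            problems.append(f"{gene}: specificity value outside 0-1")
    return problems


example = io.StringIO(
    "gene,Bladder.bladder_cell,Trachea.mesenchymal_cell\n"
    "ENSG00000081791,0.43,0.11\n"
    "ENSG00000101440,0.21,0.89\n"
)
print(check_specificity_matrix(example))  # []
```

For a compressed file, the handle can be opened with `gzip.open(path, "rt")` or `bz2.open(path, "rt")` from the Python standard library.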
We recommend using the Python package CELLEX to generate the specificity input matrix. A specificity input can be generated in a few lines:
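```python
import cellex

# sc_rnaseq_data and celltype_labels are placeholders for your own
# expression data and per-cell cell-type labels.
eso = cellex.ESObject(df=sc_rnaseq_data, annotation=celltype_labels)
eso.compute()  # compute expression specificity
eso.save('example/specificity_input_matrix.csv.gz')
```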
File paths to 'munged' GWAS summary statistic files. We recommend using ldsc/mtag_munge.py to munge GWAS summary statistics, as it handles the common pitfalls when processing these files. Internally, CELLECT uses the SNPs' rsIDs and not chromosomal coordinates, so it does not matter which genome build the GWAS used.
See the CELLECT-LDSC tutorial for an example of munging summary statistics. If you use mtag_munge.py, your munged GWAS file will comply with the required file format.
Files must be tab-separated with a header (see below). The files can be uncompressed or compressed (gz/bz2 formats are supported).
For CELLECT-LDSC, the required columns are:
- SNP: the unique SNP identifier (e.g. rsID number)
- N: sample size (which may vary from SNP to SNP)
- Z: the Z-score associated with the SNP effect sizes for the GWAS

Additional columns are allowed but will be ignored.
For CELLECT-MAGMA, the required columns are:
- SNP: the unique SNP identifier (e.g. rsID number)
- N: sample size (which may vary from SNP to SNP)
- PVAL: the P-value associated with the SNP effect sizes for the GWAS

Additional columns are allowed but will be ignored.
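
The required columns can be verified from the file header alone. This is a small sketch under the assumption that the file is tab-separated with a header line, as described above; the helper `missing_columns` and the toy summary statistics are hypothetical and not part of CELLECT.

```python
import csv
import io

# Required columns as listed above for each CELLECT method.
REQUIRED_COLUMNS = {
    "CELLECT-LDSC": {"SNP", "N", "Z"},
    "CELLECT-MAGMA": {"SNP", "N", "PVAL"},
}


def missing_columns(handle, method):
    """Return the required columns absent from a tab-separated file header."""
    header = next(csv.reader(handle, delimiter="\t"))
    return sorted(REQUIRED_COLUMNS[method] - set(header))


# Toy munged file with extra columns (A1, A2), which are allowed.
sumstats = io.StringIO("SNP\tA1\tA2\tN\tZ\nrs123\tA\tG\t10000\t1.2\n")
print(missing_columns(sumstats, "CELLECT-LDSC"))   # []
sumstats.seek(0)
print(missing_columns(sumstats, "CELLECT-MAGMA"))  # ['PVAL']
```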
Note on input GWAS: The ancestry of the population for the input GWAS should be European, as both LDSC and MAGMA use LD information from 1000 Genomes Project individuals with European ancestry. In addition, we advise against using GWAS summary statistics from studies performed on custom SNP arrays (e.g. Metabochip). This also applies to GWAS meta-analyses that include custom SNP arrays. See the supplementary materials of the S-LDSC paper (Finucane, Nature Genetics 2015) for details.
Bonus: For a collection of GWAS data repositories, see our GWAS-datasets wiki page.
For the curious and technically minded: CELLECT uses gene coordinates from genome build GRCh37/hg19 when mapping SNPs to genes. Only genes in the specificity input that also exist in the gene coordinate file, i.e. protein-coding autosomal genes, will contribute to the cell-type prioritization. (Gene coordinate file: data/shared/gene_coordinates.GRCh37.ensembl_v91.txt).
The WINDOW_DEFINITION parameters are used for mapping gene specificity values to SNPs.
- WINDOW_SIZE_KB (default): genes’ specificity values are assigned to SNPs using a WINDOW_SIZE_KB kilobase (kb) window around the genes’ transcribed regions.
- WINDOW_LD_BASED (CELLECT-LDSC only): genes’ specificity values are assigned to SNPs using LD-based loci (r=0.5).

For both window definitions, if a SNP overlaps with multiple genes within the window/locus, CELLECT assigns the maximum specificity value. CELLECT results are generally robust to changes in window definition and size.
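
The window-based assignment and the maximum rule can be illustrated with a toy example. This is a simplified sketch, not CELLECT's implementation: the `Gene` class and `snp_specificity` function are hypothetical, the window is assumed to extend symmetrically on both sides of the transcribed region, and SNPs outside every window are assumed to receive 0.

```python
from dataclasses import dataclass


@dataclass
class Gene:
    chrom: str
    start: int          # transcribed region start (bp)
    end: int            # transcribed region end (bp)
    specificity: float  # expression specificity for one cell-type


def snp_specificity(snp_chrom, snp_pos, genes, window_size_kb):
    """Maximum specificity among genes whose extended window covers the SNP."""
    flank = window_size_kb * 1000
    hits = [
        g.specificity
        for g in genes
        if g.chrom == snp_chrom and g.start - flank <= snp_pos <= g.end + flank
    ]
    # Assumption: a SNP that falls outside every window gets 0.
    return max(hits, default=0.0)


genes = [
    Gene("1", 100_000, 120_000, 0.3),
    Gene("1", 150_000, 170_000, 0.8),
]
# With 20 kb windows, a SNP at 135,000 lies in both windows: the maximum wins.
print(snp_specificity("1", 135_000, genes, window_size_kb=20))  # 0.8
# A SNP at 90,000 only lies in the first gene's window.
print(snp_specificity("1", 90_000, genes, window_size_kb=20))   # 0.3
```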
BASE_OUTPUT_DIR is the path to the directory where your output will be saved. Ideally, use a path on a fast storage drive to speed up computation. Each specificity input usually needs 0.4-3 GB of space; additional GWAS summary statistics will not take up much more storage.
The workflow generates the following output file structure:
```
<BASE_OUTPUT_DIR>
CELLECT-LDSC/
├── results
│   ├── prioritization.csv
│   ├── conditional.csv
│   ├── heritability.csv
│   └── heritability_intervals.csv
├── out
│   ├── prioritization
│   │   ├── <specificity_input_id>-__<gwas_id>.cell_type_results.txt
│   │   └── ....
│   ├── conditional
│   │   ├── <specificity_input_id>-__<gwas_id>__CONDITIONAL__<conditional_celltype_annotation>.cell_type_results.txt
│   │   └── ....
│   └── heritability
│       ├── <specificity_input_id>-__<gwas_id>__<celltype_annotation>.results.txt
│       └── ....
└── precomputation
    └── ....

CELLECT-MAGMA/
├── results
│   ├── prioritization.csv
│   └── conditional.csv
├── out
│   ├── prioritization
│   │   ├── <specificity_input_id>-__<gwas_id>.cell_type_results.txt
│   │   └── ....
│   └── conditional
│       ├── <specificity_input_id>-__<gwas_id>__CONDITIONAL__<conditional_celltype_annotation>.cell_type_results.txt
│       └── ....
└── precomputation
    └── ....
```
Note that the generated output directories and files depend on which analysis types you run (e.g. prioritization, conditional or heritability). The subdirectories BASE_OUTPUT_DIR/CELLECT-LDSC/results and BASE_OUTPUT_DIR/CELLECT-MAGMA/results contain combined result files across all specificity inputs and GWAS traits analyzed. These files are the most convenient to use for downstream analysis.
The file prioritization.csv is generated if prioritization: True is specified in the config file under ANALYSIS_TYPE.
Columns:
- gwas: GWAS study ID (specified in the config file)
- specificity_id: expression specificity dataset ID (specified in the config file)
- annotation: annotation ID (cell-type or tissue)
- beta: regression effect size estimate for the given annotation. This represents the change in per-SNP heritability due to the given cell-type annotation, beyond what is explained by the set of all genes and the baseline model
- beta_se: standard error of the regression coefficient
- pvalue: p-value for the positive association between trait heritability and cell-type annotation values. More formally, it is the p-value from a one-sided test that beta > 0

For a CELLECT-LDSC example output file, see prioritization.csv. For an example plot of the output file, see fig_celltypepriori.pdf.
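
Because prioritization.csv combines results across GWAS traits and specificity datasets, a common first step downstream is to rank annotations by p-value within each combination. The sketch below uses only the column names documented above; the rows are made-up values for illustration, not real CELLECT results, and the variable names are hypothetical.

```python
import csv
import io

# Made-up rows using the documented column names; not real results.
toy_results = io.StringIO(
    "gwas,specificity_id,annotation,beta,beta_se,pvalue\n"
    "BMI,toy_dataset,celltype_A,2.0e-09,1.0e-09,0.02\n"
    "BMI,toy_dataset,celltype_B,5.0e-09,1.0e-09,0.0000003\n"
    "BMI,toy_dataset,celltype_C,-1.0e-09,1.0e-09,0.84\n"
)

rows = list(csv.DictReader(toy_results))
# Sort within each (gwas, specificity_id) pair by ascending p-value.
ranked = sorted(
    rows, key=lambda r: (r["gwas"], r["specificity_id"], float(r["pvalue"]))
)
for r in ranked:
    print(r["gwas"], r["specificity_id"], r["annotation"], r["pvalue"])

top_annotation = ranked[0]["annotation"]
```

To read a real result file, replace `toy_results` with `open(path, newline="")`, where `path` points to the prioritization.csv in your results directory.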
The file conditional.csv is generated if conditional: True is specified in the config file under ANALYSIS_TYPE.
Columns:
The file contains the same columns as prioritization.csv, with an extra column conditional_annotation listing the annotation conditioned on.
For a CELLECT-LDSC example output file, see conditional.csv.
The file heritability.csv is generated if heritability: True is specified in the config file under ANALYSIS_TYPE.
Columns:
- gwas: GWAS study ID (specified in the config file)
- specificity_id: expression specificity dataset ID (specified in the config file)
- annotation: annotation ID (cell-type or tissue)
- Prop._SNPs: 'annotation size', i.e. the proportion of SNPs covered by the cell-type/tissue annotation (0 means no SNPs were covered by the annotation; 1 means all SNPs were covered)
- Prop._h2: proportion of trait SNP heritability explained by the annotation
- Prop._h2_std_error: standard error of the annotation's heritability estimate
- h2_enrichment: annotation heritability enrichment (Prop._h2/Prop._SNPs)
- h2_enrichment_se: standard error of the heritability enrichment
- h2_enrichment_pvalue: p-value for the heritability enrichment

For an example output file, see heritability.csv. For an example plot of the output file, see fig_conditional.pdf.
For details on the math and implementation of heritability estimation using S-LDSC, please see LDSC Partitioned Heritability wiki, Finucane, Nature Genetics 2015 and Gazal, Nature Genetics 2017.
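
As a quick illustration of how the h2_enrichment column relates to Prop._h2 and Prop._SNPs, the ratio can be computed directly. The function name and the numbers below are hypothetical and chosen only to show the arithmetic: an annotation covering 10% of SNPs that explains 30% of the SNP heritability has an enrichment of roughly 3.

```python
def h2_enrichment(prop_h2, prop_snps):
    """Heritability enrichment, defined as Prop._h2 / Prop._SNPs."""
    if prop_snps <= 0:
        raise ValueError("Prop._SNPs must be positive to compute an enrichment")
    return prop_h2 / prop_snps


# Made-up numbers: the annotation covers 10% of SNPs and explains 30% of h2.
print(h2_enrichment(0.30, 0.10))
```

An enrichment above 1 means the annotation explains more heritability than expected from the proportion of SNPs it covers.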
The file heritability_intervals.csv is generated if heritability_intervals: True is specified in the config file under ANALYSIS_TYPE.
Columns:
- gwas: GWAS study ID (specified in the config file)
- specificity_id: expression specificity dataset ID (specified in the config file)
- annotation: annotation ID (cell-type or tissue)
- q: the expression specificity interval. The six values and corresponding intervals are: 0=[0-0], 1=(0-0.2], 2=(0.2-0.4], 3=(0.4-0.6], 4=(0.6-0.8], 5=(0.8-1]
- h2g: heritability estimate
- h2g_se: standard error of the heritability estimate
- prop_h2g: same as Prop._h2 in heritability.csv
- prop_h2g_se: same as Prop._h2_std_error in heritability.csv
- enr: same as h2_enrichment in heritability.csv
- enr_se: same as h2_enrichment_se in heritability.csv
- enr_pval: same as h2_enrichment_pvalue in heritability.csv

For an example file, see heritability_intervals.csv. For an example plot of the output file, see fig_h2_annotation_intervals.pdf.
For details on the math and implementation of heritability estimation for 'annotation intervals', please see LDSC Partitioned Heritability from Continuous Annotations wiki and Gazal, Nature Genetics 2017.
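
To make the interval coding in the q column concrete, the sketch below maps a single specificity value to its interval index: 0 for exactly zero, and 1 to 5 for the five half-open intervals up to 1. The function name `specificity_interval` is hypothetical; this only illustrates the documented binning and is not how CELLECT computes the estimates.

```python
# Upper bounds of the intervals (0-0.2], (0.2-0.4], ..., (0.8-1].
UPPER_BOUNDS = [0.2, 0.4, 0.6, 0.8, 1.0]


def specificity_interval(value):
    """Return the q index (0-5) for a specificity value in [0, 1]."""
    if not 0.0 <= value <= 1.0:
        raise ValueError("specificity values must be constrained to 0-1")
    if value == 0.0:
        return 0  # the zero-only interval [0-0]
    # Intervals are open on the left and closed on the right.
    return next(q for q, upper in enumerate(UPPER_BOUNDS, start=1) if value <= upper)


for v in (0.0, 0.2, 0.21, 0.6, 1.0):
    print(v, specificity_interval(v))
```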