Input & Output

This page describes the parameters, input formats, and output formats for CELLECT.

Input

Here we describe selected parameters in the config file (config.yml).

`ANALYSIS_TYPE`

This parameter determines which analysis types will be run. You can run multiple analysis types in the same workflow.

prioritization: Performs cell-type prioritization on all cell-types in the specificity input.
conditional: Performs cell-type prioritization on all cell-types in the specificity input conditioned on each cell-type listed in the CONDITIONAL_INPUT argument (see config file for details).
heritability (CELLECT-LDSC only): Estimate the SNP heritability of cell-types listed in the HERITABILITY_INPUT argument (see config file for details). Note that your specificity values should be constrained to 0-1 for this estimation to be valid. (Specificity input generated from CELLEX is constrained to 0-1.)
heritability_intervals (CELLECT-LDSC only): Estimate the 'interval heritability' of the cell-types listed in the HERITABILITY_INPUT. The heritability is estimated for five equally spaced intervals of the cell-types' expression specificity values: (0-0.2], (0.2-0.4], (0.4-0.6], (0.6-0.8], (0.8-1], as well as the interval including zero values only ([0-0]). Specificity values must be constrained to 0-1.

`SPECIFICITY_INPUT`

Path to the expression specificity input matrix containing the cell-types to analyze.

tabular file (comma-delimited), with genes in the first column and cell-type annotations in the subsequent columns.
cell-type annotation header names must not contain special characters, spaces or double underscores.
gene names must be in Ensembl human format.
the gene column header (i.e. the first column) should be named gene, as seen in the example below.
the file can be uncompressed or compressed (gz/bz2 formats are supported).

gene	Bladder.bladder_cell	...	Trachea.mesenchymal_cell
ENSG00000081791	0.43	...	0.11
...	...	...	...
ENSG00000101440	0.21	...	0.89

NB: The specificity values should be between 0 and 1 if using CELLECT-LDSC to estimate cell-type annotation heritability.
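
The format requirements above can be checked before starting a workflow. The following is a minimal sketch, not part of CELLECT: the function name check_specificity_matrix is hypothetical, and "special characters" is assumed here to mean anything other than letters, digits, dots and underscores. The 0-1 range check only matters for the heritability analyses.

```python
import csv
import io
import re

# Assumption: "special characters" means anything other than letters, digits,
# dots and underscores. CELLECT's own check may differ.
VALID_HEADER = re.compile(r"^[A-Za-z0-9_.]+$")
# Ensembl human gene IDs: "ENSG" followed by 11 digits.
ENSEMBL_HUMAN_GENE = re.compile(r"^ENSG\d{11}$")


def check_specificity_matrix(handle):
    """Return a list of human-readable problems found in a CSV specificity matrix."""
    problems = []
    reader = csv.reader(handle)
    header = next(reader)
    if header[0] != "gene":
        problems.append(f"first column is named {header[0]!r}, expected 'gene'")
    for name in header[1:]:
        if not VALID_HEADER.match(name) or "__" in name:
            problems.append(f"invalid annotation name: {name!r}")
    for row in reader:
        gene, values = row[0], row[1:]
        if not ENSEMBL_HUMAN_GENE.match(gene):
            problems.append(f"not an Ensembl human gene ID: {gene!r}")
        for value in values:
            # Only required when estimating heritability with CELLECT-LDSC.
            if not 0.0 <= float(value) <= 1.0:
                problems.append(f"{gene}: value {value} outside 0-1")
    return problems


example = io.StringIO(
    "gene,Bladder.bladder_cell,Trachea.mesenchymal_cell\n"
    "ENSG00000081791,0.43,0.11\n"
    "ENSG00000101440,0.21,0.89\n"
)
print(check_specificity_matrix(example))  # prints [] when no problems are found
```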

We recommend using the Python package CELLEX to generate the specificity input matrix. A specificity input can be generated in a few lines:

import cellex
eso = cellex.ESObject(df=sc_rnaseq_data, annotation=celltype_labels)
eso.compute()
eso.save('example/specificity_input_matrix.csv.gz')

`GWAS_SUMSTATS`: munged GWAS summary statistics

File paths to 'munged' GWAS summary statistic files. We recommend using ldsc/mtag_munge.py to munge GWAS summary statistics, as it checks for the common pitfalls when processing the file. Internally, CELLECT uses SNP rsIDs rather than chromosomal coordinates, so the genome build used by the GWAS does not matter.

See the CELLECT-LDSC tutorial for an example of munging summary statistics. If you use mtag_munge.py, your munged GWAS file will comply with the required file format.

Files must be tab-separated with header (see below). The files can be uncompressed or compressed (gz/bz2 formats are supported).

For CELLECT-LDSC, the required columns are:

SNP: the unique SNP identifier (e.g. rsID number)
N: sample size (which may vary from SNP to SNP)
Z: the Z-score associated with the SNP effect sizes for the GWAS

Additional columns are allowed but will be ignored.

For CELLECT-MAGMA, the required columns are:

SNP: the unique SNP identifier (e.g. rsID number)
N: sample size (which may vary from SNP to SNP)
PVAL: the P-value associated with the SNP effect sizes for the GWAS

Additional columns are allowed but will be ignored.
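
As an illustration of the column requirements for the two methods, the sketch below checks the header line of a tab-separated summary statistics file. It is not CELLECT code: the names open_text and missing_columns are hypothetical, and the A1/A2 columns in the example header are only stand-ins for arbitrary additional columns.

```python
import bz2
import gzip

# Required columns per method, as listed in this section.
REQUIRED_COLUMNS = {
    "CELLECT-LDSC": {"SNP", "N", "Z"},
    "CELLECT-MAGMA": {"SNP", "N", "PVAL"},
}


def open_text(path):
    """Open a plain, gzip- or bzip2-compressed text file for reading."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    if path.endswith(".bz2"):
        return bz2.open(path, "rt")
    return open(path, "rt")


def missing_columns(header_line, method):
    """Return the required columns absent from a tab-separated header line."""
    present = set(header_line.rstrip("\r\n").split("\t"))
    return sorted(REQUIRED_COLUMNS[method] - present)


# Hypothetical header; A1 and A2 stand in for extra columns, which are ignored.
header = "SNP\tA1\tA2\tN\tZ\n"
print(missing_columns(header, "CELLECT-LDSC"))   # []
print(missing_columns(header, "CELLECT-MAGMA"))  # ['PVAL']
```

For a file on disk, the header can be read with open_text(path).readline() and passed to missing_columns.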

Note on input GWAS: The input GWAS should be based on a population of European ancestry, as both LDSC and MAGMA use LD information from 1000 Genomes Project individuals with European ancestry. In addition, we advise against using GWAS summary statistics generated from custom SNP arrays (e.g. Metabochip). This includes GWAS meta-analyses that contain studies genotyped on custom SNP arrays. See the supplementary materials of the S-LDSC paper (Finucane, Nature Genetics 2015) for details.

Bonus: For a collection of GWAS data repositories, see our GWAS-datasets wikipage.

For the curious and technically minded: CELLECT uses gene coordinates from genome build GRCh37/hg19 when mapping SNPs to genes. Only genes in the specificity input that exist in the gene coordinate file (protein-coding autosomal genes) will contribute to the cell-type prioritization. (Gene coordinate file: data/shared/gene_coordinates.GRCh37.ensembl_v91.txt).

`WINDOW_DEFINITION`

The WINDOW_DEFINITION parameters are used for mapping gene specificity values to SNPs.

WINDOW_SIZE_KB (default): genes’ specificity values are assigned to SNPs using a window of WINDOW_SIZE_KB kilobases (kb) around the genes’ transcribed regions.
WINDOW_LD_BASED (CELLECT-LDSC only): genes’ specificity values are assigned to SNPs using LD-based loci (r=0.5).

For both window definitions, if a SNP falls within the window/locus of multiple genes, CELLECT assigns the maximum specificity value among those genes. CELLECT results are generally robust to changes in window definition and size.
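
The maximum-specificity rule for a fixed-size window can be sketched as follows. This is an illustration under stated assumptions, not CELLECT's implementation: the function name snp_specificity, the gene coordinates, the 100 kb window and the choice to return 0.0 for SNPs outside every window are all made up for the example.

```python
def snp_specificity(snp_pos, genes, window_kb=100):
    """Maximum specificity among genes whose extended region covers the SNP.

    genes: iterable of (start, end, specificity) tuples on the SNP's chromosome.
    Returns 0.0 when no gene window covers the SNP (an assumption of this sketch).
    """
    flank = window_kb * 1000
    overlapping = [
        spec for start, end, spec in genes
        if start - flank <= snp_pos <= end + flank
    ]
    return max(overlapping, default=0.0)


genes = [
    (1_000_000, 1_050_000, 0.43),  # hypothetical gene A
    (1_120_000, 1_180_000, 0.89),  # hypothetical gene B
]
print(snp_specificity(1_060_000, genes))  # covered by both windows -> 0.89
print(snp_specificity(900_000, genes))    # only gene A's window -> 0.43
print(snp_specificity(5_000_000, genes))  # no window -> 0.0
```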

Output

`BASE_OUTPUT_DIR`: Output directory

Path to the directory where the output will be saved. Ideally, use a path on a fast storage drive to speed up computation. Each specificity input usually requires 0.4-3 GB of space; additional GWAS summary statistics will not take up much more storage.

Output directory structure

The workflow generates the following output file structure:

<BASE_OUTPUT_DIR>
CELLECT-LDSC/
├── results
│   ├── prioritization.csv
│   ├── conditional.csv
│   ├── heritability.csv
│   └── heritability_intervals.csv
├── out
│   ├── prioritization
│   │   ├── <specificity_input_id>-__<gwas_id>.cell_type_results.txt
│   │   └── ....
│   ├── conditional
│   │   ├── <specificity_input_id>-__<gwas_id>__CONDITIONAL__<conditional_celltype_annotation>.cell_type_results.txt
│   │   └── ....
│   └── heritability
│       ├── <specificity_input_id>-__<gwas_id>__<celltype_annotation>.results.txt
│       └── ....
└── precomputation
    └── ....

CELLECT-MAGMA/
├── results
│   ├── prioritization.csv
│   └── conditional.csv
├── out
│   ├── prioritization
│   │   ├── <specificity_input_id>-__<gwas_id>.cell_type_results.txt
│   │   └── ....
│   └── conditional
│       ├── <specificity_input_id>-__<gwas_id>__CONDITIONAL__<conditional_celltype_annotation>.cell_type_results.txt
│       └── ....
└── precomputation
    └── ....

Note that the generated output directories and files depend on which analysis types you run (e.g. prioritization, conditional or heritability). The subdirectories BASE_OUTPUT_DIR/CELLECT-LDSC/results and BASE_OUTPUT_DIR/CELLECT-MAGMA/results contain combined result files across all specificity inputs and GWAS traits analyzed. These files are the most convenient to use for downstream analysis.

`prioritization.csv`

This file is generated if prioritization: True is specified in the config file under ANALYSIS_TYPE.

Columns:

gwas: GWAS study ID (specified in the config file)
specificity_id: expression specificity dataset ID (specified in the config file)
annotation: annotation ID (cell-type or tissue)
beta: regression effect size estimate for given annotation. This represents the change in per-SNP heritability due to the given cell type annotation, beyond what is explained by the set of all genes and baseline model
beta_se: standard error for the regression coefficient
pvalue: p-value for the positive association between trait heritability and cell-type annotation values. More formally, it is the p-value from a one-sided test that beta > 0
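
For downstream analysis, the columns above can be read with any CSV reader. The sketch below ranks annotations for one GWAS by p-value. It assumes the file is comma-separated with a header row named as in the list above; the function name rank_annotations and the rows shown are made up for illustration and are not real results.

```python
import csv
import io


def rank_annotations(handle, gwas):
    """Return (annotation, pvalue) pairs for one GWAS, most significant first."""
    rows = [row for row in csv.DictReader(handle) if row["gwas"] == gwas]
    rows.sort(key=lambda row: float(row["pvalue"]))
    return [(row["annotation"], float(row["pvalue"])) for row in rows]


# Made-up rows in the column layout described above; not real results.
example = io.StringIO(
    "gwas,specificity_id,annotation,beta,beta_se,pvalue\n"
    "trait_x,dataset_1,celltype_a,1.2e-9,8.0e-10,0.07\n"
    "trait_x,dataset_1,celltype_b,4.5e-9,9.0e-10,0.0001\n"
    "trait_y,dataset_1,celltype_a,-2.0e-10,7.0e-10,0.61\n"
)
for annotation, pvalue in rank_annotations(example, "trait_x"):
    print(annotation, pvalue)
```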

For a CELLECT-LDSC example output file, see prioritization.csv. For an example plot of the output file, see fig_celltypepriori.pdf.

`conditional.csv`

This file is generated if conditional: True is specified in the config file under ANALYSIS_TYPE.

Columns: The file contains the same columns as prioritization.csv, plus an extra column, conditional_annotation, listing the annotation conditioned on.

For a CELLECT-LDSC example output file, see conditional.csv.

`heritability.csv` (CELLECT-LDSC only)

This file is generated if heritability: True is specified in the config file under ANALYSIS_TYPE.

Columns:

gwas: GWAS study ID (specified in the config file)
specificity_id: expression specificity dataset ID (specified in the config file)
annotation: annotation ID (cell-type or tissue)
Prop._SNPs: 'annotation size' that measures the proportion of SNPs covered by the cell-type/tissue annotation (0 means no SNPs were covered by the annotation; 1 means all SNPs were covered)
Prop._h2: proportion of trait SNP heritability explained by the annotation
Prop._h2_std_error: standard error for the annotation's heritability estimate
h2_enrichment: annotation heritability enrichment (Prop._h2/Prop._SNPs)
h2_enrichment_se: standard error for the heritability enrichment
h2_enrichment_pvalue: p-value for the heritability enrichment
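
As a small worked example of the enrichment definition above (Prop._h2/Prop._SNPs), the sketch below computes the ratio for made-up values. The function name h2_enrichment is hypothetical and the numbers are not taken from any analysis.

```python
def h2_enrichment(prop_h2, prop_snps):
    """Heritability enrichment: share of h2 explained divided by share of SNPs covered."""
    if prop_snps <= 0:
        raise ValueError("Prop._SNPs must be positive to compute an enrichment")
    return prop_h2 / prop_snps


# Hypothetical annotation covering 5% of SNPs and explaining 20% of SNP heritability.
print(h2_enrichment(0.20, 0.05))  # 4.0
```

An enrichment above 1 means the annotation explains more heritability than expected from the proportion of SNPs it covers.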

For an example output file, see heritability.csv. For an example plot of the output file, see fig_conditional.pdf.

For details on the math and implementation of heritability estimation using S-LDSC, please see LDSC Partitioned Heritability wiki, Finucane, Nature Genetics 2015 and Gazal, Nature Genetics 2017.

`heritability_intervals.csv` (CELLECT-LDSC only)

This file is generated if heritability_intervals: True is specified in the config file under ANALYSIS_TYPE.

Columns:

gwas: GWAS study ID (specified in the config file)
specificity_id: expression specificity dataset ID (specified in the config file)
annotation: annotation ID (cell-type or tissue)
q: this column lists the expression specificity interval. The six values and corresponding intervals are: 0=[0-0], 1=(0-0.2], 2=(0.2-0.4], 3=(0.4-0.6], 4=(0.6-0.8], 5=(0.8-1]
h2g: heritability estimate
h2g_se: standard error of the heritability estimate
prop_h2g: same as Prop._h2 in heritability.csv
prop_h2g_se: same as Prop._h2_std_error in heritability.csv
enr: same as h2_enrichment in heritability.csv
enr_se: same as h2_enrichment_se in heritability.csv
enr_pval: same as h2_enrichment_pvalue in heritability.csv
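
The mapping from a specificity value to the interval index q can be sketched as below. This only illustrates the interval boundaries listed for the q column; the function name specificity_interval is hypothetical and this is not how CELLECT or LDSC compute the intervals internally.

```python
from bisect import bisect_left

# Upper bounds of the intervals q=1..5; each interval is open on the left
# and closed on the right, e.g. q=1 is (0-0.2].
UPPER_BOUNDS = [0.2, 0.4, 0.6, 0.8, 1.0]


def specificity_interval(value):
    """Map a specificity value in the range 0-1 to its interval index q (0-5)."""
    if not 0.0 <= value <= 1.0:
        raise ValueError("specificity values must lie in the range 0-1")
    if value == 0.0:
        return 0  # q=0 contains zero values only
    # bisect_left places a value equal to an upper bound in the lower interval.
    return 1 + bisect_left(UPPER_BOUNDS, value)


for value in (0.0, 0.11, 0.2, 0.43, 0.89, 1.0):
    print(value, specificity_interval(value))
```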

For an example file, see heritability_intervals.csv. For an example plot of the output file, see fig_h2_annotation_intervals.pdf.

For details on the math and implementation of heritability estimation for 'annotation intervals', please see LDSC Partitioned Heritability from Continuous Annotations wiki and Gazal, Nature Genetics 2017.
