
Input & Output

JonThom edited this page Aug 24, 2020 · 14 revisions

This page describes the parameters and the input and output formats for CELLECT.

Table of contents

  • Input: config file settings.
  • Output: config file settings.

Input

Here we describe selected parameters in the config file (config.yml).

ANALYSIS_TYPE

This parameter determines which analysis types will be run. You can run multiple analysis types in the same workflow.

  • prioritization: Performs cell-type prioritization on all cell-types in the specificity input.
  • conditional: Performs cell-type prioritization on all cell-types in the specificity input conditioned on each cell-type listed in the CONDITIONAL_INPUT argument (see config file for details).
  • heritability (CELLECT-LDSC only): Estimates the SNP heritability of cell-types listed in the HERITABILITY_INPUT argument (see config file for details). Note that your specificity values should be constrained to 0-1 for this estimation to be valid. (Specificity input generated from CELLEX is constrained to 0-1.)
  • heritability_intervals (CELLECT-LDSC only): Estimates the 'interval heritability' of the cell-types listed in the HERITABILITY_INPUT argument. The heritability is estimated for five equally spaced intervals of the cell-types' expression specificity values: (0-0.2], (0.2-0.4], (0.4-0.6], (0.6-0.8], (0.8-1], as well as the interval including zero values only ([0-0]). Specificity values must be constrained to 0-1.

SPECIFICITY_INPUT

Path to the expression specificity input matrix containing the cell-types to analyze.

  • the file must be a comma-separated (csv) tabular file, with genes in the first column and cell-type annotations in the subsequent columns.
  • cell-type annotation header names must not contain special characters, spaces or double underscores.
  • gene names must be in Ensembl human format.
  • the header of the gene column (i.e. the first column) must be gene, as seen in the example below.
  • the file can be uncompressed or compressed (gz/bz2 formats are supported).
gene Bladder.bladder_cell ... Trachea.mesenchymal_cell
ENSG00000081791 0.43 ... 0.11
... ... ... ...
ENSG00000101440 0.21 ... 0.89

NB: The specificity values should be between 0 and 1 if using CELLECT-LDSC to estimate cell-type annotation heritability.
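
The format requirements above can be checked before starting a workflow. The following is a minimal sketch, not part of CELLECT: the function names, the regular expressions and the reading of "special characters" (anything other than letters, digits, dots, hyphens and underscores) are assumptions made for illustration. The 0-1 range check only matters when estimating heritability with CELLECT-LDSC.

```python
import bz2
import csv
import gzip
import io
import re

# Assumption: "special characters" means anything other than letters,
# digits, dots, hyphens and underscores.
HEADER_OK = re.compile(r"^[A-Za-z0-9._-]+$")
# Human Ensembl gene IDs: "ENSG" followed by 11 digits.
ENSEMBL_HUMAN = re.compile(r"^ENSG\d{11}$")


def open_text(path):
    """Open a plain, gz- or bz2-compressed file in text mode."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt", newline="")
    if path.endswith(".bz2"):
        return bz2.open(path, "rt", newline="")
    return open(path, "rt", newline="")


def check_specificity(handle):
    """Return a list of problems found in a specificity matrix (empty if none)."""
    problems = []
    reader = csv.reader(handle)
    header = next(reader)
    if header[0] != "gene":
        problems.append("first column must be named 'gene'")
    for name in header[1:]:
        if "__" in name or not HEADER_OK.match(name):
            problems.append(f"bad annotation name: {name!r}")
    for row in reader:
        if not row:  # skip blank lines
            continue
        if not ENSEMBL_HUMAN.match(row[0]):
            problems.append(f"not a human Ensembl gene ID: {row[0]!r}")
        for value in row[1:]:
            if not 0.0 <= float(value) <= 1.0:
                problems.append(f"{row[0]}: value {value} outside 0-1")
    return problems


sample = io.StringIO(
    "gene,Bladder.bladder_cell,Trachea.mesenchymal_cell\n"
    "ENSG00000081791,0.43,0.11\n"
)
print(check_specificity(sample))  # []
```

For a file on disk, pass `open_text(path)` instead of the in-memory sample.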

We recommend using the Python package CELLEX to generate the specificity input matrix. Specificity inputs can be generated as follows:

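import cellex

# sc_rnaseq_data and celltype_labels are placeholders for your own
# expression data and the cell-type annotation of each cell.
eso = cellex.ESObject(df=sc_rnaseq_data, annotation=celltype_labels)
eso.compute()  # compute expression specificity
eso.save('example/specificity_input_matrix.csv.gz')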

GWAS_SUMSTATS: munged GWAS summary statistics

File paths to 'munged' GWAS summary statistics files. We recommend munging GWAS summary statistics with ldsc/mtag_munge.py, as it checks for the common pitfalls when processing the file. Internally, CELLECT uses SNP rsIDs rather than chromosomal coordinates, so the genome build used by the GWAS does not matter.

See the CELLECT-LDSC tutorial for an example of munging summary statistics. If you use mtag_munge.py, your munged GWAS file will comply with the required file format.

Files must be tab-separated with a header (see below). The files can be uncompressed or compressed (gz/bz2 formats are supported).

For CELLECT-LDSC, the required columns are:

  • SNP: the unique SNP identifier (e.g. rsID number)
  • N: sample size (which may vary from SNP to SNP)
  • Z: the Z-score associated with the SNP effect sizes for the GWAS

Additional columns are allowed but will be ignored.

For CELLECT-MAGMA, the required columns are:

  • SNP: the unique SNP identifier (e.g. rsID number)
  • N: sample size (which may vary from SNP to SNP)
  • PVAL: the P-value associated with the SNP effect sizes for the GWAS

Additional columns are allowed but will be ignored.
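
The column requirements for the two methods can be checked with a few lines of Python. This is a minimal sketch, not part of CELLECT: `REQUIRED` and `missing_columns` are hypothetical names, and the example header and SNP row (including the extra A1 and A2 columns) are made up for illustration.

```python
import csv
import io

# Required columns as listed above for each CELLECT method.
REQUIRED = {
    "ldsc": {"SNP", "N", "Z"},
    "magma": {"SNP", "N", "PVAL"},
}


def missing_columns(handle, method):
    """Return the required columns absent from a tab-separated sumstats header."""
    header = next(csv.reader(handle, delimiter="\t"))
    return sorted(REQUIRED[method] - set(header))


# Illustrative munged file with two additional (ignored) columns.
example = io.StringIO("SNP\tA1\tA2\tN\tZ\nrs123\tA\tG\t50000\t1.96\n")
print(missing_columns(example, "ldsc"))   # []
example.seek(0)
print(missing_columns(example, "magma"))  # ['PVAL']
```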

Note on input GWAS: The ancestry of the population for the input GWAS should be European, as both LDSC and MAGMA use LD information from 1000 Genomes Project individuals with European ancestry. In addition, we advise against using summary statistics from GWAS performed on custom SNP arrays (e.g. Metabochip). This includes GWAS meta-analyses that include custom SNP arrays. See the supplementary materials of the S-LDSC paper (Finucane, Nature Genetics 2015) for details.

Bonus: For a collection of GWAS data repositories, see our GWAS-datasets wikipage.

For the curious and technically minded: CELLECT uses gene coordinates from genome build GRCh37/hg19 when mapping SNPs to genes. Only genes in the specificity input that also exist in the gene coordinate file (protein-coding autosomal genes) contribute to the cell-type prioritization. (Gene coordinate file: data/shared/gene_coordinates.GRCh37.ensembl_v91.txt).

WINDOW_DEFINITION

The WINDOW_DEFINITION parameters are used for mapping gene specificity values to SNPs.

  • WINDOW_SIZE_KB (default): genes’ specificity values are assigned to SNPs using a window of WINDOW_SIZE_KB kilobases (kb) around each gene’s transcribed region.
  • WINDOW_LD_BASED (CELLECT-LDSC only): genes’ specificity values are assigned to SNPs using LD-based loci (r=0.5).

For both window definitions, if a SNP overlaps with multiple genes within the window/locus, CELLECT assigns the maximum specificity value. CELLECT results are generally robust to changes in window definition and size.
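
The fixed-size window rule can be illustrated as follows. This is a minimal sketch under stated assumptions, not CELLECT's implementation: the `Gene` class and `snp_specificity` function are hypothetical, the window is assumed to extend symmetrically on both sides of the transcribed region, the 100 kb value is only an illustrative default, and SNPs covered by no window are assumed to receive 0.0.

```python
from dataclasses import dataclass


@dataclass
class Gene:
    chrom: str
    start: int          # transcribed region start (bp)
    end: int            # transcribed region end (bp)
    specificity: float  # expression specificity of the gene in one cell-type


def snp_specificity(chrom, pos, genes, window_size_kb=100):
    """Maximum specificity over genes whose window contains the SNP (0.0 if none)."""
    flank = window_size_kb * 1000
    hits = [
        g.specificity
        for g in genes
        if g.chrom == chrom and g.start - flank <= pos <= g.end + flank
    ]
    return max(hits, default=0.0)


genes = [
    Gene("1", 1_000_000, 1_050_000, 0.43),
    Gene("1", 1_120_000, 1_200_000, 0.89),
]
print(snp_specificity("1", 1_100_000, genes))  # 0.89: both windows cover the SNP
print(snp_specificity("1", 2_000_000, genes))  # 0.0: no window covers the SNP
```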

Output

BASE_OUTPUT_DIR: Output directory

This is the path to the directory where your output will be saved. Ideally, use a path on a fast storage drive to speed up computation. Each specificity input usually needs 0.4-3 GB of space; additional GWAS summary statistics will not take up much more storage.

Output directory structure

The workflow generates the following output file structure:

<BASE_OUTPUT_DIR>
CELLECT-LDSC/
├── results
│   ├── prioritization.csv
│   ├── conditional.csv
│   ├── heritability.csv
│   └── heritability_intervals.csv
├── out
│   ├── prioritization
│   │   ├── <specificity_input_id>-__<gwas_id>.cell_type_results.txt
│   │   └── ....
│   ├── conditional
│   │   ├── <specificity_input_id>-__<gwas_id>__CONDITIONAL__<conditional_celltype_annotation>.cell_type_results.txt
│   │   └── ....
│   └── heritability
│       ├── <specificity_input_id>-__<gwas_id>__<celltype_annotation>.results.txt
│       └── ....
└── precomputation
    └── ....

CELLECT-MAGMA/
├── results
│   ├── prioritization.csv
│   └── conditional.csv
├── out
│   ├── prioritization
│   │   ├── <specificity_input_id>-__<gwas_id>.cell_type_results.txt
│   │   └── ....
│   └── conditional
│       ├── <specificity_input_id>-__<gwas_id>__CONDITIONAL__<conditional_celltype_annotation>.cell_type_results.txt
│       └── ....
└── precomputation
    └── ....

Note that the generated output directories and files depend on which analysis types you run (e.g. prioritization, conditional or heritability). The subdirectories BASE_OUTPUT_DIR/CELLECT-LDSC/results and BASE_OUTPUT_DIR/CELLECT-MAGMA/results contain combined result files across all specificity inputs and GWAS traits analyzed. These files are the most convenient to use for downstream analysis.
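
For downstream scripts, the location of a combined result file follows directly from the directory structure above. A minimal sketch, where `ANALYSES` and `results_file` are hypothetical helper names and "my_output" is a placeholder for your BASE_OUTPUT_DIR:

```python
from pathlib import Path

# Combined result files available for each method, per the tree above.
ANALYSES = {
    "CELLECT-LDSC": ["prioritization", "conditional", "heritability", "heritability_intervals"],
    "CELLECT-MAGMA": ["prioritization", "conditional"],
}


def results_file(base_output_dir, method, analysis):
    """Build the path to a combined result file under <BASE_OUTPUT_DIR>."""
    if analysis not in ANALYSES[method]:
        raise ValueError(f"{analysis} is not produced by {method}")
    return Path(base_output_dir) / method / "results" / f"{analysis}.csv"


print(results_file("my_output", "CELLECT-LDSC", "prioritization"))
```

The file only exists if the corresponding analysis type was enabled in the config file.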

prioritization.csv

This file is generated if prioritization: True is specified in the config file under ANALYSIS_TYPE.

Columns:

  • gwas: GWAS study ID (specified in the config file)
  • specificity_id: expression specificity dataset ID (specified in the config file)
  • annotation: annotation ID (cell-type or tissue)
  • beta: regression effect size estimate for given annotation. This represents the change in per-SNP heritability due to the given cell type annotation, beyond what is explained by the set of all genes and baseline model
  • beta_se: standard error for the regression coefficient
  • pvalue: p-value for the positive association between trait heritability and cell-type annotation values. More formally, it is the p-value from a one-sided test that beta > 0
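
To make the one-sided test concrete, the sketch below derives a p-value for beta > 0 from beta and beta_se. It assumes a standard normal reference distribution for beta/beta_se, which is an illustrative simplification: the underlying tools may use a different reference distribution, so the result need not match the pvalue column exactly. The function name is hypothetical.

```python
import math


def one_sided_pvalue(beta, beta_se):
    """P(Z >= beta / beta_se) for a standard normal Z: a one-sided test of beta > 0."""
    z = beta / beta_se
    # Upper-tail probability of the standard normal, via the complementary error function.
    return 0.5 * math.erfc(z / math.sqrt(2))


print(one_sided_pvalue(0.0, 1.0))   # 0.5: no evidence in either direction
print(one_sided_pvalue(-2.0, 1.0))  # above 0.5: negative estimates are never significant
```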

For a CELLECT-LDSC example output file, see prioritization.csv. For an example plot of the output file, see fig_celltypepriori.pdf.

conditional.csv

This file is generated if conditional: True is specified in the config file under ANALYSIS_TYPE.

Columns: The file contains the same columns as prioritization.csv, plus an extra column, conditional_annotation, listing the annotation conditioned on.

For a CELLECT-LDSC output example, see conditional.csv.

heritability.csv (CELLECT-LDSC only)

This file is generated if heritability: True is specified in the config file under ANALYSIS_TYPE.

Columns:

  • gwas: GWAS study ID (specified in the config file)
  • specificity_id: expression specificity dataset ID (specified in the config file)
  • annotation: annotation ID (cell-type or tissue)
  • Prop._SNPs: 'annotation size' that measures the proportion of SNPs covered by the cell-type/tissue annotation (0 means no SNPs were covered by the annotation; 1 means all SNPs were covered)
  • Prop._h2: proportion of trait SNP heritability explained by the annotation
  • Prop._h2_std_error: standard error for the annotation's heritability estimate
  • h2_enrichment: annotation heritability enrichment (Prop._h2/Prop._SNPs)
  • h2_enrichment_se: standard error for the heritability enrichment
  • h2_enrichment_pvalue: p-value for the heritability enrichment
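
The relationship between the Prop._h2, Prop._SNPs and h2_enrichment columns can be written out as a small worked example. The function name and the input values are made up for illustration:

```python
def h2_enrichment(prop_h2, prop_snps):
    """Heritability enrichment: Prop._h2 divided by Prop._SNPs."""
    if prop_snps == 0:
        raise ValueError("annotation covers no SNPs; enrichment is undefined")
    return prop_h2 / prop_snps


# An annotation covering 12.5% of SNPs that explains 50% of SNP heritability
# is enriched 4-fold.
print(h2_enrichment(0.5, 0.125))  # 4.0
```

An enrichment above 1 means the annotation explains more heritability than expected from the proportion of SNPs it covers.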

For an example output file, see heritability.csv. For an example plot of the output file, see fig_conditional.pdf.

For details on the math and implementation of heritability estimation using S-LDSC, please see LDSC Partitioned Heritability wiki, Finucane, Nature Genetics 2015 and Gazal, Nature Genetics 2017.

heritability_intervals.csv (CELLECT-LDSC only)

This file is generated if heritability_intervals: True is specified in the config file under ANALYSIS_TYPE.

Columns:

  • gwas: GWAS study ID (specified in the config file)
  • specificity_id: expression specificity dataset ID (specified in the config file)
  • annotation: annotation ID (cell-type or tissue)
  • q: this column lists the expression specificity interval. The six values and corresponding intervals are: 0=[0-0], 1=(0-0.2], 2=(0.2-0.4], 3=(0.4-0.6], 4=(0.6-0.8], 5=(0.8-1]
  • h2g: heritability estimate
  • h2g_se: standard error of the heritability estimate
  • prop_h2g: same as Prop._h2 in heritability.csv
  • prop_h2g_se: same as Prop._h2_std_error in heritability.csv
  • enr: same as h2_enrichment in heritability.csv
  • enr_se: same as h2_enrichment_se in heritability.csv
  • enr_pval: same as h2_enrichment_pvalue in heritability.csv
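
The mapping from a specificity value to its q interval can be sketched as below. This only illustrates the interval definitions listed for the q column; the function name is hypothetical and the code is not taken from CELLECT.

```python
# Upper bounds of the intervals q=1..5; q=0 is reserved for exact zeros.
UPPER_BOUNDS = [0.2, 0.4, 0.6, 0.8, 1.0]


def specificity_interval(value):
    """Map a specificity value in [0, 1] to its interval index q (0-5)."""
    if not 0.0 <= value <= 1.0:
        raise ValueError("specificity must be between 0 and 1")
    if value == 0.0:
        return 0
    # Intervals are open on the left and closed on the right, so a value on a
    # boundary such as 0.2 belongs to the lower interval (q=1).
    for q, upper in enumerate(UPPER_BOUNDS, start=1):
        if value <= upper:
            return q


print([specificity_interval(v) for v in (0.0, 0.2, 0.21, 0.5, 1.0)])  # [0, 1, 2, 3, 5]
```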

For an example file, see heritability_intervals.csv. For an example plot of the output file, see fig_h2_annotation_intervals.pdf.

For details on the math and implementation of heritability estimation for 'annotation intervals', please see LDSC Partitioned Heritability from Continuous Annotations wiki and Gazal, Nature Genetics 2017.
