CELLECT GENES Tutorial

JonThom edited this page Sep 1, 2020 · 39 revisions

About CELLECT-GENES

CELLECT-GENES is a workflow to identify genes 'driving' the prioritization of cell types. These genes are found by intersecting the top specifically expressed genes with genes enriched for genetic signal.

Example usage: You've used CELLECT-LDSC or CELLECT-MAGMA to prioritize cell types for a GWAS of some horrible disease you've studied for years, but still don't quite understand. CELLECT helped you find relevant cell types to study, but now you want to identify the genes that drive the genetic enrichment of the prioritized cell types, so you can run back to the lab and do more experiments. If this is you, CELLECT-GENES will be your friend.

CELLECT-GENES tutorial

This tutorial will take you through running CELLECT-GENES on two GWAS summary stats and two example expression specificity inputs.

NB: CELLECT-GENES requires specificity input from CELLEX. For information on expression specificity and ESmu, please see the CELLEX documentation and Timshel (bioRxiv, 2020).

Tl;dr

1) download and munge GWAS summary stats (see CELLECT-LDSC Tutorial)
2) conda activate <env_with_snakemake>
3) snakemake --use-conda -j -s cellect-genes.snakefile --configfile config.yml
4) see your results in effector_genes.csv

1. Preparation

Download and munge GWAS summary stats (see steps 0 and 1 in the CELLECT-LDSC Tutorial).

The first time you run the workflow, snakemake will download and install local conda environments in ./.snakemake. These environments ensure that all dependencies are correctly installed. CELLECT-GENES is unlikely to work without the --use-conda flag.

2. Run CELLECT-GENES

Run the following command:

snakemake --use-conda -j -s cellect-genes.snakefile --configfile config.yml

The above command is configured to output results in ./CELLECT-EXAMPLE. To change the output directory, open the config.yml file and edit BASE_OUTPUT_DIR. The config file is preconfigured to run the two CELLEX specificity inputs against each of the two GWAS datasets you just downloaded.

Running the workflow should take 5-15 minutes, depending on the number of cores available on your system. Here we run the workflow using all available cores on the computer (-j). If you wish to use only 4 cores, pass -j 4 instead.

NB: Running CELLECT-GENES requires snakemake to be available in your environment, so make sure you activate an environment with snakemake installed before running the command.

Bonus info: CELLECT-GENES uses the CELLECT-MAGMA workflow to generate gene-level p-values. If the MAGMA workflow has already been run, CELLECT-GENES will detect and use existing up-to-date outputs.

3. Output

For each cell type, CELLECT-GENES outputs the set of genes that are both among the 1000 genes with the lowest MAGMA p-values and above the 90th percentile of expression specificity (ESmu) among the genes specifically expressed in that cell type. Note that these threshold values have been set heuristically, and may in the future be determined using more sophisticated methods. The cutoffs can be changed by editing the N_GENES_MAGMA and PERCENTILE_CUTOFF_ESMU parameters in the config.yml file.
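
The intersection described above can be sketched in a few lines of Python. This is a minimal illustration, not the code CELLECT-GENES runs: the function name, its arguments, and the percentile formula (100 * rank / number of genes with non-zero ESmu) are assumptions made for the example, and the two defaults simply mirror the N_GENES_MAGMA and PERCENTILE_CUTOFF_ESMU values mentioned in this tutorial.

```python
def select_effector_genes(magma_pvals, esmu, n_genes_magma=1000, percentile_cutoff_esmu=90):
    """Return the genes passing both cutoffs for one cell type.

    magma_pvals: dict mapping gene ID -> MAGMA gene-level p-value.
    esmu: dict mapping gene ID -> ESmu for a single cell type.
    (Hypothetical helper; names and percentile formula are assumptions.)
    """
    # Genes with the n lowest MAGMA p-values (ties are broken arbitrarily).
    top_magma = set(sorted(magma_pvals, key=magma_pvals.get)[:n_genes_magma])

    # Rank genes with non-zero ESmu from least to most specific.
    expressed = sorted((g for g, v in esmu.items() if v > 0), key=esmu.get)

    # Assumed percentile definition: 100 * rank / number of non-zero genes.
    specific = {
        gene
        for rank, gene in enumerate(expressed, start=1)
        if 100.0 * rank / len(expressed) > percentile_cutoff_esmu
    }

    # Effector genes are those present in both sets.
    return top_magma & specific
```

Lowering n_genes_magma or raising percentile_cutoff_esmu in this sketch shrinks the returned set, which is the same direction of effect you should expect when tightening the corresponding config parameters.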

In ./CELLECT-EXAMPLE/CELLECT-GENES/results/effector_genes.csv you will see the following output:

gwas           specificity_id   annotation  gene_ensembl     gene_symbol  esmu_percentile  magma_gene_percentile  esmu  magma_gene_pval
BMI_Yengo2018  mousebrain-test  ABC         ENSG00000163435  ELF3         99.15            98.23                  0.98  9.6317e-17
BMI_Yengo2018  mousebrain-test  ABC         ENSG00000162366  PDZK1IP1     98.72            98.38                  0.97  2.8349999999999995e-17
BMI_Yengo2018  mousebrain-test  ABC         ENSG00000173281  PPP1R3B      97.65            99.33                  0.95  3.153099999999999e-27

To pinpoint the genes driving the genetic signal in specific cell types, open effector_genes.csv in a spreadsheet editor and filter the table by the annotation column.
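
If you prefer a script to a spreadsheet, the same filter can be sketched with Python's standard csv module. The helper name genes_for_annotation is hypothetical, and the sketch assumes the file is comma-separated with the column names listed in this tutorial.

```python
import csv


def genes_for_annotation(lines, annotation, gwas=None):
    """Return rows for one annotation, lowest MAGMA p-value first.

    lines: an open file handle or any iterable of CSV lines.
    gwas: optional GWAS ID used to narrow the result further.
    (Hypothetical helper; assumes a comma-separated file.)
    """
    rows = [
        row
        for row in csv.DictReader(lines)
        if row["annotation"] == annotation and (gwas is None or row["gwas"] == gwas)
    ]
    # p-values are read as text, so convert before sorting.
    rows.sort(key=lambda row: float(row["magma_gene_pval"]))
    return rows


# Example usage; the path follows this tutorial's default output directory.
# with open("CELLECT-EXAMPLE/CELLECT-GENES/results/effector_genes.csv", newline="") as handle:
#     for row in genes_for_annotation(handle, "ABC", gwas="BMI_Yengo2018"):
#         print(row["gene_symbol"], row["magma_gene_pval"])
```

Passing gwas is useful once the file holds results for more than one GWAS, since the same annotation then appears once per study.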

Alternatively, if you have found gene "Alpha" to be a top effector gene for cell type "ABC", you may want to check whether it is also highly specific to any other cell types. To do so, look up the gene in the gene_ensembl or gene_symbol column.
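
The gene lookup can be scripted in the same way. The sketch below uses a hypothetical helper, annotations_for_gene, and again assumes a comma-separated file with the columns described in this tutorial. Keep in mind that effector_genes.csv lists only genes that passed both cutoffs, so a cell type missing from the result means the gene did not pass there, not necessarily that its ESmu is zero.

```python
import csv


def annotations_for_gene(lines, gene):
    """List every row in which a gene appears, matched by Ensembl ID or symbol.

    lines: an open file handle or any iterable of CSV lines.
    Returns (gwas, specificity_id, annotation, esmu_percentile) tuples.
    (Hypothetical helper; assumes a comma-separated file.)
    """
    hits = []
    for row in csv.DictReader(lines):
        # Accept either identifier so the caller need not know which one they hold.
        if gene in (row["gene_ensembl"], row["gene_symbol"]):
            hits.append(
                (
                    row["gwas"],
                    row["specificity_id"],
                    row["annotation"],
                    float(row["esmu_percentile"]),
                )
            )
    return hits
```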

effector_genes.csv columns:

  • gwas: GWAS study ID (specified in the config file).
  • specificity_id: expression specificity dataset ID (specified in the config file).
  • annotation: annotation ID (cell type or tissue).
  • gene_ensembl: Ensembl gene ID.
  • gene_symbol: gene symbol.
  • esmu_percentile: gene expression specificity percentile among genes specifically expressed (non-zero ESmu) in the annotation cell type.
  • magma_gene_percentile: gene p-value percentile for genetic enrichment with respect to gwas (does not depend on cell type).
  • esmu: gene expression specificity with respect to annotation cell type.
  • magma_gene_pval: p-value for the positive association between a trait and a gene (does not depend on cell type). Put simply, MAGMA aggregates SNP-level statistics to gene-level statistics while accounting for linkage disequilibrium. For details on how the gene-level p-value is calculated, see the CELLECT MAGMA Docs as well as the MAGMA website.

See Input & Output for a description of all CELLECT output files.
