CELLECT GENES Tutorial
CELLECT-GENES is a workflow to identify genes 'driving' the prioritization of cell types. These genes are found by intersecting the top specifically expressed genes with genes enriched for genetic signal.
Example usage: You've used CELLECT-LDSC or CELLECT-MAGMA to prioritize cell types for a GWAS of some horrible disease you've studied for years but still don't quite understand. CELLECT helped you find relevant cell types to study, but now you want to identify the genes that drive the genetic enrichment of the prioritized cell types, so you can run back to the lab and do more experiments. If this is you, CELLECT-GENES will be your friend.
This tutorial will take you through running CELLECT-GENES on two GWAS summary stats and two example expression specificity inputs.
NB: CELLECT-GENES requires specificity input from CELLEX. For information on expression specificity and ESmu, please see the CELLEX documentation and Timshel (bioRxiv, 2020).
1) download and munge GWAS summary stats (see CELLECT-LDSC Tutorial)
2) conda activate <env_with_snakemake>
3) snakemake --use-conda -j -s cellect-genes.snakefile --configfile config.yml
4) see your results in effector_genes.csv
Download and munge GWAS summary stats (see step 0 and 1 in CELLECT-LDSC Tutorial)
The first time you run the workflow, snakemake will download and install local conda environments in ./.snakemake. These environments ensure that all dependencies are correctly installed. CELLECT-GENES is unlikely to work without the --use-conda flag.
Run the following command:
snakemake --use-conda -j -s cellect-genes.snakefile --configfile config.yml
The above command is configured to output results in ./CELLECT-EXAMPLE. To change this, open the config.yml file and edit BASE_OUTPUT_DIR to specify the output directory. The config file is preconfigured to prioritize the two CELLEX specificity inputs for each of the two GWAS datasets you just downloaded.
Running the workflow should take 5-15 minutes depending on the number of cores available on your system. Here we run the workflow using all available cores on the computer (-j). If you wish to use only 4 cores, pass the -j 4 flag instead.
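If you launch the workflow from a script rather than typing the command, the invocation above can be assembled as shown below. This is only a convenience sketch: the helper name cellect_genes_command is hypothetical and not part of CELLECT, and the only snakemake flags used are the ones that appear in the command above.

```python
import shlex


def cellect_genes_command(cores=None,
                          snakefile="cellect-genes.snakefile",
                          configfile="config.yml"):
    """Build the snakemake argument list used in this tutorial.

    cores=None reproduces the bare -j flag (use all available cores);
    an integer reproduces "-j <cores>".
    This helper is hypothetical and only wraps the tutorial command.
    """
    cmd = ["snakemake", "--use-conda", "-j"]
    if cores is not None:
        cores = int(cores)
        if cores < 1:
            raise ValueError("cores must be a positive integer")
        cmd.append(str(cores))
    cmd += ["-s", snakefile, "--configfile", configfile]
    return cmd


# Show the command that would be run with 4 cores.
print(shlex.join(cellect_genes_command(4)))

# To actually run it (from the CELLECT directory, with snakemake on PATH):
#   import subprocess
#   subprocess.run(cellect_genes_command(4), check=True)
```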
NB: Running CELLECT-GENES requires snakemake to be available in your environment, so make sure you activate an environment with snakemake installed before running the command.
Bonus info: CELLECT-GENES uses the CELLECT-MAGMA workflow to generate gene-level p-values. If the MAGMA workflow has already been run, CELLECT-GENES will detect and use existing up-to-date outputs.
For each cell type, CELLECT-GENES outputs the set of genes that are both among the 1000 genes with the lowest MAGMA p-values and above the 90th percentile of cell type specific genes. Note that these threshold values have been set heuristically and may in the future be determined using more sophisticated methods. The cutoffs can be changed by editing the N_GENES_MAGMA and PERCENTILE_CUTOFF_ESMU parameters in the config.yml file.
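The selection rule described above can be sketched for a single cell type as follows. This is an illustration, not CELLECT's actual implementation: the function names and the toy gene IDs are made up, and the way percentiles are computed here (share of non-zero ESmu values less than or equal to the gene's value) is an assumption that may differ from CELLECT in how ties and boundaries are handled.

```python
from bisect import bisect_right


def percentile_ranks(values):
    """Map each key to the percentage of entries whose value is <= its own."""
    ordered = sorted(values.values())
    n = len(ordered)
    return {key: 100.0 * bisect_right(ordered, val) / n
            for key, val in values.items()}


def effector_genes(magma_pvals, esmu, n_genes_magma=1000,
                   percentile_cutoff_esmu=90):
    """Intersect top MAGMA genes with highly specific genes for one cell type.

    magma_pvals: dict gene -> MAGMA gene-level p-value (same for all cell types)
    esmu:        dict gene -> ESmu in the cell type of interest
    The two keyword arguments mirror N_GENES_MAGMA and PERCENTILE_CUTOFF_ESMU.
    """
    # Genes with the n_genes_magma smallest p-values.
    top_magma = set(sorted(magma_pvals, key=magma_pvals.get)[:n_genes_magma])

    # Percentiles are taken among genes with non-zero ESmu only.
    nonzero = {gene: val for gene, val in esmu.items() if val > 0}
    pct = percentile_ranks(nonzero)
    specific = {gene for gene, p in pct.items() if p > percentile_cutoff_esmu}

    return sorted(top_magma & specific)


# Toy input with made-up gene IDs and values, for illustration only.
toy_pvals = {"geneA": 1e-10, "geneB": 1e-8, "geneC": 0.5, "geneD": 1e-3}
toy_esmu = {"geneA": 0.9, "geneB": 0.0, "geneC": 0.95, "geneD": 0.1}
print(effector_genes(toy_pvals, toy_esmu, n_genes_magma=2,
                     percentile_cutoff_esmu=50))
```

In the toy input, geneB has a strong genetic signal but zero ESmu, and geneC is highly specific but has a weak p-value, so only geneA passes both filters.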
In ./CELLECT-EXAMPLE/CELLECT-GENES/results/effector_genes.csv you will see the following output:
gwas | specificity_id | annotation | gene_ensembl | gene_symbol | esmu_percentile | magma_gene_percentile | esmu | magma_gene_pval |
---|---|---|---|---|---|---|---|---|
BMI_Yengo2018 | mousebrain-test | ABC | ENSG00000163435 | ELF3 | 99.15 | 98.23 | 0.98 | 9.6317e-17 |
BMI_Yengo2018 | mousebrain-test | ABC | ENSG00000162366 | PDZK1IP1 | 98.72 | 98.38 | 0.97 | 2.8349999999999995e-17 |
BMI_Yengo2018 | mousebrain-test | ABC | ENSG00000173281 | PPP1R3B | 97.65 | 99.33 | 0.95 | 3.153099999999999e-27 |
To pinpoint the genes driving the genetic signal in specific cell types, open the table in a spreadsheet editor and filter the effector_genes.csv table by annotation.
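If you prefer a script to a spreadsheet, the same filter can be done with Python's standard csv module. The sketch below assumes that effector_genes.csv is comma-separated with a header row matching the column names shown in the table above; the function names are hypothetical, and the sample text simply reproduces the three example rows so that the snippet runs without the real file.

```python
import csv
import io

# The three example rows from the table above, written as comma-separated text.
SAMPLE = """gwas,specificity_id,annotation,gene_ensembl,gene_symbol,esmu_percentile,magma_gene_percentile,esmu,magma_gene_pval
BMI_Yengo2018,mousebrain-test,ABC,ENSG00000163435,ELF3,99.15,98.23,0.98,9.6317e-17
BMI_Yengo2018,mousebrain-test,ABC,ENSG00000162366,PDZK1IP1,98.72,98.38,0.97,2.8349999999999995e-17
BMI_Yengo2018,mousebrain-test,ABC,ENSG00000173281,PPP1R3B,97.65,99.33,0.95,3.153099999999999e-27
"""


def read_effector_genes(handle):
    """Read the table into a list of dicts keyed by column name."""
    return list(csv.DictReader(handle))


def genes_for_annotation(rows, annotation):
    """Keep rows for one annotation, strongest MAGMA signal first."""
    hits = [row for row in rows if row["annotation"] == annotation]
    return sorted(hits, key=lambda row: float(row["magma_gene_pval"]))


rows = read_effector_genes(io.StringIO(SAMPLE))
for row in genes_for_annotation(rows, "ABC"):
    print(row["gene_symbol"], row["esmu_percentile"], row["magma_gene_pval"])

# With the real output file, replace io.StringIO(SAMPLE) with something like:
#   open("CELLECT-EXAMPLE/CELLECT-GENES/results/effector_genes.csv", newline="")
```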
Alternatively, if you have found gene "Alpha" to be a top effector gene for cell type "ABC", you may want to check whether it is also highly specific to any other cell types. This can be done by looking up the gene in the gene_ensembl or gene_symbol column.
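The gene lookup described above could be scripted along these lines. As before, this is a sketch under assumptions: the file is taken to be comma-separated with the column names shown in the table, the function name annotations_for_gene is hypothetical, and the sample text reuses example rows from the table so that the snippet runs on its own.

```python
import csv
import io

# Two example rows from the table above, written as comma-separated text.
SAMPLE_ROWS = """gwas,specificity_id,annotation,gene_ensembl,gene_symbol,esmu_percentile,magma_gene_percentile,esmu,magma_gene_pval
BMI_Yengo2018,mousebrain-test,ABC,ENSG00000163435,ELF3,99.15,98.23,0.98,9.6317e-17
BMI_Yengo2018,mousebrain-test,ABC,ENSG00000173281,PPP1R3B,97.65,99.33,0.95,3.153099999999999e-27
"""


def annotations_for_gene(rows, gene):
    """List every (gwas, specificity_id, annotation, esmu_percentile) for a gene.

    The gene may be given as an Ensembl ID or as a gene symbol.
    """
    return [
        (row["gwas"], row["specificity_id"], row["annotation"],
         float(row["esmu_percentile"]))
        for row in rows
        if gene in (row["gene_ensembl"], row["gene_symbol"])
    ]


table = list(csv.DictReader(io.StringIO(SAMPLE_ROWS)))
print(annotations_for_gene(table, "ELF3"))
print(annotations_for_gene(table, "ENSG00000173281"))
```

A gene that appears under several annotations will return one tuple per annotation, which makes it easy to see whether it is specific to one cell type or shared across several.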
effector_genes.csv columns:
- gwas: GWAS study ID (specified in the config file).
- specificity_id: expression specificity dataset ID (specified in the config file).
- annotation: annotation ID (cell type or tissue).
- gene_ensembl: Ensembl gene ID.
- gene_symbol: gene symbol.
- esmu_percentile: gene expression specificity percentile among genes specifically expressed (non-zero ESmu) in the annotation cell type.
- magma_gene_percentile: gene p-value percentile for genetic enrichment with respect to gwas (does not depend on cell type).
- esmu: gene expression specificity with respect to the annotation cell type.
- magma_gene_pval: p-value for the positive association between a trait and a gene (does not depend on cell type). Put simply, MAGMA aggregates SNP-level statistics to gene-level statistics while accounting for linkage disequilibrium. For details on how the gene-level p-value is calculated, see the CELLECT MAGMA Docs as well as the MAGMA website.
See Input & Output for a description of all CELLECT output files.