CELLECT LDSC Tutorial
This tutorial will take you through running CELLECT-LDSC on two GWAS summary stats and two example expression specificity inputs.
cd ~/CELLECT
# ---------------------- STEP 1: prep GWAS data ----------------- #
wget https://portals.broadinstitute.org/collaboration/giant/images/c/c8/Meta-analysis_Locke_et_al%2BUKBiobank_2018_UPDATED.txt.gz -P example/
wget https://www.dropbox.com/s/ho58e9jmytmpaf8/GWAS_EA_excl23andMe.txt -P example/
conda env create -f ldsc/environment_munge_ldsc.yml
conda activate munge_ldsc
python ldsc/mtag_munge.py \
--sumstats example/GWAS_EA_excl23andMe.txt \
--merge-alleles data/ldsc/w_hm3.snplist \
--n-value 766345 \
--keep-pval \
--p PVAL \
--out example/EA3_Lee2018
python ldsc/mtag_munge.py \
--sumstats example/Meta-analysis_Locke_et_al+UKBiobank_2018_UPDATED.txt.gz \
--a1 Tested_Allele \
--a2 Other_Allele \
--merge-alleles data/ldsc/w_hm3.snplist \
--keep-pval \
--p PVAL \
--out example/BMI_Yengo2018
# --- STEP 2: Generate cell-type specificity input using CELLEX --- #
# (CELLEX specificity files have been pre-generated for this tutorial)
# ---------------------- STEP 3: run CELLECT-LDSC ----------------- #
conda activate <env_with_snakemake>
snakemake --use-conda -j -s cellect-ldsc.snakefile --configfile config.yml
Change to the CELLECT git repository root, e.g.:
cd ~/CELLECT
Create a new conda environment that has the correct package versions for GWAS munging:
conda env create -f ldsc/environment_munge_ldsc.yml # creates 'munge_ldsc' environment
conda activate munge_ldsc
Download GWAS summary stats. Here we download the BMI GWAS from Yengo et al. (HMG, 2018) and the Educational Attainment (EA) GWAS from Lee et al. (Nat. Genet., 2018).
wget https://portals.broadinstitute.org/collaboration/giant/images/c/c8/Meta-analysis_Locke_et_al%2BUKBiobank_2018_UPDATED.txt.gz -P example/
wget https://www.dropbox.com/s/ho58e9jmytmpaf8/GWAS_EA_excl23andMe.txt -P example/
Munge the GWAS data. This step ensures that (1) the GWAS summary stats are correctly formatted and (2) only HapMap3 SNPs are analyzed. We suggest using the script ldsc/mtag_munge.py for munging, as it has more options. Munge EA GWAS:
python ldsc/mtag_munge.py \
--sumstats example/GWAS_EA_excl23andMe.txt \
--merge-alleles data/ldsc/w_hm3.snplist \
--n-value 766345 \
--keep-pval \
--p PVAL \
--out example/EA3_Lee2018
Munge BMI GWAS:
python ldsc/mtag_munge.py \
--sumstats example/Meta-analysis_Locke_et_al+UKBiobank_2018_UPDATED.txt.gz \
--a1 Tested_Allele \
--a2 Other_Allele \
--merge-alleles data/ldsc/w_hm3.snplist \
--keep-pval \
--p PVAL \
--out example/BMI_Yengo2018
Bonus info 1: the --keep-pval --p PVAL arguments are only needed to make the munged summary stats compatible with CELLECT-MAGMA. Since you might want to run both CELLECT-LDSC and CELLECT-MAGMA, we recommend making it a habit to add these arguments.
Bonus info 2: you will get essentially the same results using ldsc/munge_sumstats.py. It takes --N instead of --n-value, and it has no --keep-pval option, so its output is not compatible with CELLECT-MAGMA:
python ldsc/munge_sumstats.py \
--sumstats example/GWAS_EA_excl23andMe.txt \
--merge-alleles data/ldsc/w_hm3.snplist \
--N 766345 \
--p PVAL \
--out example/EA3_Lee2018
Run python ldsc/munge_sumstats.py -h and python ldsc/mtag_munge.py -h to see how the options of the two programs differ.
Bonus info 3: On the wiki page GWAS-datasets we list several repositories containing publicly available GWAS summary statistics. Check it out if you want to 'bulk download' GWAS summary stats.
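Before moving on, it can be worth checking that the munged files contain the columns you expect. The following is a minimal sketch, not part of CELLECT. It assumes that the munged output is a gzipped, tab-delimited file named after the --out prefix with a .sumstats.gz suffix (e.g. example/EA3_Lee2018.sumstats.gz), that it follows the usual LDSC column layout (SNP, A1, A2, N, Z), and that --keep-pval adds a PVAL column. The function names are hypothetical.

```python
import csv
import gzip
import io

# Assumed column layout of a munged LDSC file; verify against your own output.
EXPECTED_COLUMNS = {"SNP", "A1", "A2", "N", "Z"}


def check_munged(handle, need_pval=True):
    """Return the sorted list of expected columns missing from the header."""
    header = next(csv.reader(handle, delimiter="\t"))
    required = set(EXPECTED_COLUMNS)
    if need_pval:
        # PVAL is assumed to be present only when --keep-pval was used.
        required.add("PVAL")
    return sorted(required - set(header))


def check_munged_file(path, need_pval=True):
    """Same check for a gzipped file on disk."""
    with gzip.open(path, "rt", newline="") as handle:
        return check_munged(handle, need_pval=need_pval)


# Demo on an in-memory stand-in (made-up row, not real GWAS data).
demo = io.StringIO("SNP\tA1\tA2\tN\tZ\tPVAL\nrs_placeholder\tA\tG\t1000\t0.5\t0.6\n")
print(check_munged(demo))  # prints []
```

To check a real file, call e.g. check_munged_file("example/EA3_Lee2018.sumstats.gz"); an empty list means no expected column is missing.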
CELLECT uses cell-type specificity matrix files as input. We recommend generating these files with CELLEX, which takes only a few lines of Python, e.g.:
import cellex
eso = cellex.ESObject(df=mousebrain_sc_rnaseq_data, annotation=celltype_labels)
eso.compute()
eso.results["esmu"].to_csv("mousebrain-test.csv.gz")
Here we have pre-generated two example CELLEX specificity files for you: example/mousebrain-test.csv and example/tabula_muris-test.csv, containing two cell types from the Mousebrain dataset and five cell types from the Tabula Muris dataset, respectively.
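To get a feel for what a specificity file looks like, you can summarize it with a few lines of standard-library Python. This is only a sketch under assumptions: it assumes the CSV has gene identifiers in the first column and one column per cell-type annotation, with specificity values between 0 and 1. The function name and the demo values are hypothetical; only the annotation names ABC and ACBG are taken from the example output of this tutorial.

```python
import csv
import io


def summarize_specificity(handle):
    """Return (annotation names, number of genes, count of values outside [0, 1]).

    Assumes the first column holds gene IDs and every other column is one
    cell-type annotation.
    """
    reader = csv.reader(handle)
    header = next(reader)
    annotations = header[1:]
    n_genes = 0
    out_of_range = 0
    for row in reader:
        n_genes += 1
        out_of_range += sum(not (0.0 <= float(value) <= 1.0) for value in row[1:])
    return annotations, n_genes, out_of_range


# Made-up stand-in for a specificity file (placeholder gene IDs and values).
demo = io.StringIO("gene,ABC,ACBG\ngene_1,0.0,0.9\ngene_2,0.4,0.1\n")
print(summarize_specificity(demo))  # prints (['ABC', 'ACBG'], 2, 0)
```

For a file on disk, pass an open handle instead, e.g. open("example/mousebrain-test.csv", newline="").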
Now we will run the workflow. Remember that running CELLECT-LDSC requires snakemake to be available, e.g. by activating an environment with snakemake installed:
conda activate <env_with_snakemake>
Then run the workflow:
snakemake --use-conda -j -s cellect-ldsc.snakefile --configfile config.yml
The first time you run the workflow, snakemake will download and install local conda environments in ./.snakemake. These environments ensure that all dependencies are correctly installed. CELLECT-LDSC is unlikely to work without the --use-conda flag.
The above command is configured to output results in ./CELLECT-EXAMPLE. To change this, open the config.yml file and edit BASE_OUTPUT_DIR to specify the output directory.
The config file is preconfigured to prioritize the two CELLEX specificity inputs for each of the two GWAS datasets we just downloaded.
Running the workflow should take 5-15 minutes, depending on the number of cores available on your system. Here we run the workflow using all available cores on the computer (-j). If you wish to use only 4 cores, pass the -j 4 flag instead.
In ./CELLECT-EXAMPLE/CELLECT-LDSC/results/prioritization.csv you will see the following prioritization output:
gwas,specificity_id,annotation,beta,beta_se,pvalue
BMI_Yengo2018,tabula_muris-test,Brain_Non-Myeloid.Bergmann_glial_cell,9.231153572792368e-09,4.019151323331188e-09,0.010815326414746898
BMI_Yengo2018,tabula_muris-test,Bladder.bladder_cell,-5.004712418748802e-10,3.2673818983229307e-09,0.5608686595651585
BMI_Yengo2018,tabula_muris-test,Brain_Myeloid.microglial_cell,-2.6351066098370078e-09,3.5627583125482456e-09,0.7702363432351039
BMI_Yengo2018,tabula_muris-test,Brain_Myeloid.macrophage,-4.20188757005791e-09,5.126292004971455e-09,0.7937989730524113
BMI_Yengo2018,tabula_muris-test,Bladder.bladder_urothelial_cell,-9.444074983296222e-09,3.099358846619276e-09,0.9988447189868584
EA3_Lee2018,tabula_muris-test,Brain_Non-Myeloid.Bergmann_glial_cell,8.49739420095679e-09,1.911986590915025e-09,4.409437303659968e-06
EA3_Lee2018,tabula_muris-test,Brain_Myeloid.macrophage,2.214934890235212e-09,2.716854415111798e-09,0.20746257567581028
EA3_Lee2018,tabula_muris-test,Brain_Myeloid.microglial_cell,-6.441126119804287e-10,1.753163408008973e-09,0.6433397441548794
EA3_Lee2018,tabula_muris-test,Bladder.bladder_urothelial_cell,-4.589906843053639e-09,1.6159150917314255e-09,0.9977474194521412
EA3_Lee2018,tabula_muris-test,Bladder.bladder_cell,-4.421016110716137e-09,1.3525251453635627e-09,0.9994598102836466
BMI_Yengo2018,mousebrain-test,ACBG,-7.391698263466479e-09,3.5090024712828697e-09,0.9824193375098678
BMI_Yengo2018,mousebrain-test,ABC,-9.899458888243979e-09,3.763677148602214e-09,0.9957340520312591
EA3_Lee2018,mousebrain-test,ACBG,1.3369407818499012e-09,1.9809168750306665e-09,0.2498664455262536
EA3_Lee2018,mousebrain-test,ABC,-7.091136899549197e-09,1.749536904646236e-09,0.9999747337967729
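The pvalue column in this table appears to be consistent with a one-sided z-test of beta > 0, using z = beta / beta_se. This is an inference from the numbers shown here, not something the tutorial states, so treat the following as a sanity-check sketch rather than a description of how CELLECT computes its p-values. It recomputes the p-value for two rows copied from the output above; the function names are hypothetical.

```python
import csv
import io
import math


def one_sided_p(beta, beta_se):
    """Upper-tail p-value of z = beta / beta_se under a standard normal null."""
    z = beta / beta_se
    return 0.5 * math.erfc(z / math.sqrt(2))


def recompute(handle):
    """Yield (gwas, annotation, reported p, recomputed p) for each row."""
    for row in csv.DictReader(handle):
        p = one_sided_p(float(row["beta"]), float(row["beta_se"]))
        yield row["gwas"], row["annotation"], float(row["pvalue"]), p


# Header plus two rows copied verbatim from the prioritization output above.
SAMPLE_CSV = (
    "gwas,specificity_id,annotation,beta,beta_se,pvalue\n"
    "BMI_Yengo2018,tabula_muris-test,Brain_Non-Myeloid.Bergmann_glial_cell,"
    "9.231153572792368e-09,4.019151323331188e-09,0.010815326414746898\n"
    "EA3_Lee2018,tabula_muris-test,Brain_Non-Myeloid.Bergmann_glial_cell,"
    "8.49739420095679e-09,1.911986590915025e-09,4.409437303659968e-06\n"
)

for gwas, annotation, reported, recomputed in recompute(io.StringIO(SAMPLE_CSV)):
    print(f"{gwas} {annotation}: reported={reported:.3g} recomputed={recomputed:.3g}")
```

To run the same check on your own results, pass open("CELLECT-EXAMPLE/CELLECT-LDSC/results/prioritization.csv", newline="") to recompute. Because the test is one-sided, annotations with a negative beta get p-values close to 1, as seen in the table.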
See Input & Output for a full description of the output files.