Fine-mapping pipeline for Open Targets Genetics. In brief, the method is:
- Detect independent loci across the summary stat file using either (i) GCTA-cojo with a given plink file as the LD reference, or (ii) distance-based clumping. The method is selected with the `--method` argument.
- If `--method conditional`, condition each independent locus on all other surrounding loci (window configurable with `cojo_wind`).
- Perform approximate Bayes factor credible set analysis for each independent locus (a conceptual sketch follows this list).
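The credible set step is based on Wakefield-style approximate Bayes factors computed from each variant's effect size and standard error. The block below is only a minimal sketch of that idea, assuming a single causal variant per locus and an illustrative effect-size prior SD of 0.15; it is not the pipeline's exact implementation.

```python
import numpy as np

def abf_credible_set(beta, se, prior_sd=0.15, coverage=0.95):
    """Wakefield-style approximate Bayes factors and a credible set (sketch).

    beta, se : effect sizes and standard errors for all SNPs at one
               independent locus (conditional estimates, if applicable).
    prior_sd : assumed prior SD of the true effect size (illustrative value).
    """
    beta = np.asarray(beta, dtype=float)
    se = np.asarray(se, dtype=float)
    z = beta / se
    V = se ** 2          # variance of the effect estimate
    W = prior_sd ** 2    # prior variance of the true effect
    # Log approximate Bayes factor in favour of association (Wakefield 2009)
    log_abf = 0.5 * np.log(V / (V + W)) + (W * z ** 2) / (2 * (V + W))
    # Posterior probability that each SNP is causal, assuming exactly one
    # causal variant and a flat prior across SNPs at the locus
    pp = np.exp(log_abf - np.logaddexp.reduce(log_abf))
    # Smallest set of SNPs whose cumulative posterior reaches the coverage
    order = np.argsort(pp)[::-1]
    n_keep = int(np.searchsorted(np.cumsum(pp[order]), coverage)) + 1
    return pp, order[:n_keep]
```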
For FinnGen, we incorporate each new release by directly taking the SuSiE fine-mapping outputs from FinnGen to determine top loci. Each time this is done, we have to be careful NOT to run GCTA fine-mapping on the FinnGen sumstats, since their SuSiE fine-mapping is superior (and we don't have a good LD reference for FinnGen).
- Spark v2.4.0
- GCTA (>= v1.91.3) must be available in `$PATH`
- conda
- GNU parallel
git clone https://github.com/opentargets/genetics-finemapping.git
cd ~/genetics-finemapping
bash setup.sh
. ~/.profile # Reload profile so that conda works
conda env create -n finemap --file environment.yaml
Many of the pipeline parameters must first be specified in the analysis config file: configs/analysis.config.yaml
A single study can be fine-mapped using the single-study wrapper:
# Edit config file (this needs selecting with --config_file arg)
nano configs/analysis.config.yaml
# View args
$ python finemapping/single_study.wrapper.py --help
usage: single_study.wrapper.py [-h] --pq <file> --ld <file> --config_file
<file> --type <str> --study_id <str> --chrom
<str> [--phenotype_id <str>]
[--bio_feature <str>] --method
[conditional|distance] --pval_threshold <float>
--toploci <file> --credset <file> --log <file>
--tmpdir <file> [--delete_tmpdir]
optional arguments:
-h, --help show this help message and exit
--pq <file> Input: parquet file containing summary stats
--ld <file> Input: plink file to estimate LD from
--config_file <file> Input: analysis config file
--type <str> type to extract from pq
--study_id <str> study_id to extract from pq
--chrom <str> chrom to extract from pq
--phenotype_id <str> phenotype_id to extract from pq
--bio_feature <str> bio_feature to extract from pq
--method [conditional|distance]
                     Which method to run, either conditional analysis
                     (gcta-cojo) or distance-based clumping
--pval_threshold <float>
P-value threshold to be considered "significant"
--run_finemap If True, then run FINEMAP
--toploci <file> Output: top loci json file
--credset <file> Output: credible set json file
--finemap <file> Output: finemap snp probabilities file
--log <file> Output: log file
--tmpdir <file> Output: temp dir
--delete_tmpdir Remove temp dir when complete
Note: The capability of running FINEMAP has been used but not extensively tested.
Prepare summary statistic files, including "significant" window extraction to reduce input size.
Prepare LD references in plink `bed|bim|fam` format, currently using UK Biobank downsampled to 10K individuals and lifted over to GRCh38.
Download to local machine.
mkdir -p $HOME/data/ukb_v3_downsampled10k
gsutil -m rsync gs://open-targets-ukbb/genotypes/ukb_v3_downsampled10k/ $HOME/data/ukb_v3_downsampled10k/
# To optimise the GCTA conditioning step, it helps to split the UKB reference panel
# into smaller, overlapping "sub-panels". This is because (I believe) GCTA loads
# in the index for the whole chromosome (*.bim file) to determine which SNPs match.
time python 0_split_ld_reference.py --path $HOME/data/ukb_v3_downsampled10k/ukb_v3_chr{chrom}.downsampled10k
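As a rough illustration of what "smaller, overlapping sub-panels" means here: each chromosome is covered by fixed-size windows that overlap, so any locus lies fully within at least one sub-panel. The sizes below are made up for illustration; the actual splitting (and its parameters) is handled by `0_split_ld_reference.py`.

```python
# Illustration only: tile a chromosome with overlapping windows.
# Window and overlap sizes here are invented; 0_split_ld_reference.py
# defines the real sub-panel boundaries.
def subpanel_windows(chrom_length_bp, window_bp=10_000_000, overlap_bp=2_000_000):
    """Yield (start, end) sub-panel windows covering a chromosome with overlap."""
    step = window_bp - overlap_bp
    start = 0
    while start < chrom_length_bp:
        yield start, min(start + window_bp, chrom_length_bp)
        start += step

for start, end in subpanel_windows(50_000_000):
    print(f"{start:,} - {end:,}")
```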
To process only new data, we download the "significant window" files only for new studies; the pipeline runs fine-mapping for all significant windows it finds.
cd $HOME/genetics-finemapping
mkdir -p $HOME/genetics-finemapping/data/filtered/significant_window_2mb/gwas/
mkdir -p $HOME/genetics-finemapping/data/filtered/significant_window_2mb/molecular_trait/
# Identify new GWAS studies: list all studies, order by date
gsutil ls -l gs://genetics-portal-dev-sumstats/filtered/significant_window_2mb/gwas/*/_SUCCESS | sort -k 2 > significant_window_gwas.ls.txt
# Manually copy the list of new file paths into "gwas_to_download.txt"
# Copy the .parquet folder names, not the _SUCCESS file names
# Download only new files into the destination folder
cat gwas_to_download.txt | gsutil -m cp -r -I $HOME/genetics-finemapping/data/filtered/significant_window_2mb/gwas/
# Identify new molecular trait studies: list studies, order by date
gsutil ls -l gs://genetics-portal-dev-sumstats/filtered/significant_window_2mb/molecular_trait/*/_SUCCESS | sort -k 2 > significant_window_moltrait.ls.txt
# Manually copy the list of new file paths into "moltrait_to_download.txt"
# Copy the .parquet folder names, not the _SUCCESS file names
# Download only new files into the destination folder
cat moltrait_to_download.txt | gsutil -m cp -r -I $HOME/genetics-finemapping/data/filtered/significant_window_2mb/molecular_trait/
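# Alternative: rsync the entire folders instead of copying only new studies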
#gsutil -m rsync -r gs://genetics-portal-dev-sumstats/filtered/significant_window_2mb/gwas/ $HOME/genetics-finemapping/data/filtered/significant_window_2mb/gwas/
#gsutil -m rsync -r gs://genetics-portal-dev-sumstats/filtered/significant_window_2mb/molecular_trait/ $HOME/genetics-finemapping/data/filtered/significant_window_2mb/molecular_trait/
#gsutil -m rsync -r gs://genetics-portal-dev-sumstats/filtered/significant_window_2mb/molecular_trait_new/ $HOME/genetics-finemapping/data/filtered/significant_window_2mb/molecular_trait/
# Remove FinnGen GWAS (if any), since we don't run fine-mapping for them
rm -r $HOME/genetics-finemapping/data/filtered/significant_window_2mb/gwas/FINNGEN*
# Remove extra files that can screw up data loading later
find /home/js29/genetics-finemapping/data/filtered/significant_window_2mb -name "*_SUCCESS" | wc -l
find /home/js29/genetics-finemapping/data/filtered/significant_window_2mb -name "*_SUCCESS" -delete
# Activate environment
source activate finemap
# Set spark paths
export PYSPARK_SUBMIT_ARGS="--driver-memory 80g pyspark-shell"
#export SPARK_HOME=/home/ubuntu/software/spark-2.4.0-bin-hadoop2.7
#export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-2.4.0-src.zip:$PYTHONPATH
The manifest file specifies all analyses to be run. The manifest is a JSON lines file with each line containing the following fields:
{
"type": "gwas",
"study_id": "NEALE2_50_raw",
"phenotype_id": null,
"bio_feature": null,
"chrom": "6",
"in_pq": "/home/ubuntu/data/sumstats/filtered/significant_window_2mb/gwas/NEALE2_50_raw.parquet",
"in_ld": "/home/ubuntu/data/genotypes/ukb_v3_downsampled10k_plink/ukb_v3_chr{chrom}.downsampled10k",
"out_top_loci": "/home/ubuntu/results/finemapping/output/study_id=NEALE2_50_raw/phenotype_id=None/bio_feature=None/chrom=6/top_loci.json.gz",
"out_credset": "/home/ubuntu/results/finemapping/output/study_id=NEALE2_50_raw/phenotype_id=None/bio_feature=None/chrom=6/credible_set.json.gz",
"out_log": "/home/ubuntu/results/finemapping/logs/study_id=NEALE2_50_raw/phenotype_id=None/bio_feature=None/chrom=6/logfile.txt",
"tmpdir": "/home/ubuntu/results/finemapping/tmp/study_id=NEALE2_50_raw/phenotype_id=None/bio_feature=None/chrom=6",
"method": "conditional",
"pval_threshold": 5e-08
}
Note that the wildcard `{chrom}` can be used in the `in_ld` field.
The manifest file can be automatically generated using:
cd ~/genetics-finemapping
# Edit the Args and Paths in `1_scan_input_parquets.py`
nano 1_scan_input_parquets.py
# Reads variants filtered for p value, and outputs a single json record in
# tmp/filtered_input for each study/chromosome combination with at least one
# significant variant. Takes a couple of minutes for 200 GWAS.
time python 1_scan_input_parquets.py
# Creates manifest file, one job per study/chromosome. Output path `configs/manifest.json.gz`
python 2_make_manifest.py
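To sanity-check the generated manifest, and to see how the `{chrom}` wildcard in `in_ld` resolves, something like the following can be used (a sketch, not part of the pipeline):

```python
# Sketch: read the JSON-lines manifest and resolve the {chrom} wildcard
# in the in_ld field for each entry.
import gzip
import json

with gzip.open("configs/manifest.json.gz", "rt") as fh:
    for line in fh:
        entry = json.loads(line)
        ld_path = entry["in_ld"].format(chrom=entry["chrom"])
        print(entry["study_id"], entry["chrom"], ld_path)
```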
mkdir logs
tmux # So run continues if connection is lost
# Set number of cores based on machine size used, then run all commands
NCORES=63
time bash 4_run_commands.sh $NCORES | tee logs/run_pipeline.out.txt
# Commands can be regenerated and run separately if needed
#python 3_make_commands.py --quiet
#time zcat commands_todo.txt.gz | shuf | parallel -j $NCORES --bar --joblog logs/parallel.jobs.log | tee logs/run_pipeline.out2.txt 2>&1
# Exit tmux with Ctrl+b then d
The above command will run all analyses specified in the manifest using GNU parallel. It will create two files, `commands_todo.txt.gz` and `commands_done.txt.gz`, showing which analyses remain to be done and which have already completed. The pipeline can be stopped at any time and restarted without repeating completed analyses. You can safely regenerate the `commands_*.txt.gz` files while the pipeline is running using `python 3_make_commands.py --quiet`.
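One plausible way the todo/done split can work is by checking whether each manifest entry's outputs already exist; the sketch below only illustrates that idea and is not necessarily what `3_make_commands.py` actually does.

```python
# Illustration only: split manifest entries into "done" vs "todo" by checking
# whether the expected top-loci output already exists.
import gzip
import json
import os

done, todo = [], []
with gzip.open("configs/manifest.json.gz", "rt") as fh:
    for line in fh:
        entry = json.loads(line)
        (done if os.path.exists(entry["out_top_loci"]) else todo).append(entry)

print(f"{len(done)} analyses already have outputs, {len(todo)} still to run")
```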
If you get the error `ModuleNotFoundError: No module named 'dask'`, it can usually be fixed by deactivating and reactivating the conda environment. This seems to happen especially when using tmux; I'm not sure why.
rm -r /home/js29/genetics-finemapping/tmp/*
# Combine the results of all the individual analyses
# This step can be slow/inefficient due to Hadoop many small files problem
# You are likely to get out of memory errors if you don't increase the java
# heap space available to Spark with PYSPARK_SUBMIT_ARGS.
export PYSPARK_SUBMIT_ARGS="--driver-memory 80g pyspark-shell"
time python 5_combine_results.py
# Make a note as to what this finemapping run contained. E.g.:
echo "Run with new GTEx sQTL dataset, and updated GWAS catalog studies. Re-ran the 8 studies that had flipped betas previously (IBD and lipids)." > results/README.txt
# Copy the results to GCS
version_date=`date +%y%m%d`
#version_date=220228
bash 6_copy_results_to_gcs.sh $version_date
Number of top_loci raw: 1,623,534
Number of top_loci after dups removed: 1,541,938
Number of top_loci in previous version: 689,726
Number of credset rows raw: 44,180,640
Number of credset rows after dups removed: 40,910,064
Steps like those below are needed if we are adding to previous fine-mapping results rather than recomputing everything. We assume that the studies just fine-mapped are not present in the previous fine-mapping results we are merging onto; otherwise we may get duplicates.
# Copy down previous fine-mapping results into temp folder
mkdir -p finemapping_to_merge/220113_merged
gsutil -m rsync -r gs://genetics-portal-dev-staging/finemapping/220113_merged finemapping_to_merge/220113_merged
mkdir -p finemapping_to_merge/$version_date/
cp -r results/* finemapping_to_merge/$version_date/
#
# Merge all old finemapping results with new
#
mkdir -p finemapping_merged
time python 7_merge_finemap_results_fix.py --prev_results finemapping_to_merge/220113_merged --new_results finemapping_to_merge/$version_date/ --output finemapping_merged | tee finemapping_merged/merge_results.log
# NOTE:
# If adding new FinnGen results, then skip this step
echo "Merged fine-mapping results from 220113_merged + 220224" > finemapping_merged/README.txt
gsutil -m rsync -r $HOME/genetics-finemapping/finemapping_merged/ gs://genetics-portal-dev-staging/finemapping/${version_date}_merged
# If there are all-new FinnGen results, then pass the --remove_previous_finngen flag.
mkdir -p finemapping_merged_w_finngen
time python 7_merge_finemap_results.py --prev_results finemapping_merged --new_results finngen/results/ --output finemapping_merged_w_finngen --remove_previous_finngen | tee finemapping_merged_w_finngen/merge_finngen.log
echo "Merged fine-mapping results from 220113_merged + 220224 + FinnGen R6. Removed 3 studies with bad data: GCST007236, GCST007799, GCST007800." > finemapping_merged_w_finngen/README.txt
gsutil -m rsync -r $HOME/genetics-finemapping/finemapping_merged_w_finngen/ gs://genetics-portal-dev-staging/finemapping/${version_date}_merged
I did a test run on two different VM instances, fine-mapping 15 GWAS on each. One VM had a balanced persistent disk (200 GB), the other an SSD (200 GB); otherwise both were n2-standard-8 configurations. The SSD version took about 4% longer than the balanced disk. I did not try a local SSD, but I suspect the disk makes no difference, since the pipeline is CPU-bound.
# Parse time taken for each run
grep "Time taken" logs/study_id=*/phenotype_id=*/bio_feature=*/chrom=*/logfile.txt
ls -rt logs/study_id=*/phenotype_id=*/bio_feature=*/chrom=*/logfile.txt | xargs grep "Time taken"
# List all
ls logs/study_id=*/phenotype_id=*/bio_feature=*/chrom=*/logfile.txt
- Currently fails for sex chromosomes:
  - Need to replace X with 23 in the plink files or when specifying the GCTA command
  - Need to impute sex in the plink files for chromosome X for COJO to work
- Manifest NAs must be represented with "None"
- The p-value threshold is specified in `1_scan_input_parquets.py`: 5e-8 for GWAS and (0.05 / num_tests) for molecular traits (a small sketch follows below)
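A minimal sketch of those thresholds (the real values are set inside `1_scan_input_parquets.py`; `num_tests` is just an assumed name here for the per-study number of tests):

```python
def significance_threshold(study_type, num_tests=None):
    """P-value threshold used to declare a window significant (sketch)."""
    if study_type == "gwas":
        return 5e-8                  # genome-wide significance
    return 0.05 / num_tests          # Bonferroni-style threshold for molecular traits

print(significance_threshold("gwas"))                    # 5e-08
print(significance_threshold("molecular_trait", 20000))  # 2.5e-06
```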
# The below steps were used when we found duplicate top_loci, which was due
# to duplicated lines in the eQTL catalogue ingest. This has since been fixed,
# so the below should not be needed.
# Concatenate together all top_loci and credset files
time find output -name "top_loci.json.gz" | while read -r file; do zcat -f "$file"; done | gzip > top_loci.concat.json.gz &
time find output -name "credible_set.json.gz" | while read -r file; do zcat -f "$file"; done | gzip > credible_set.concat.json.gz
# Remove duplicates
# This should only be necessary because when we last ingested eQTL catalogue
# I failed to remove duplicate rows first.
time zcat top_loci.concat.json.gz | sort | uniq | gzip > top_loci.dedup.json.gz &
time zcat credible_set.concat.json.gz | sort | uniq | gzip > credible_set.dedup.json.gz
time python 5_combine_results_rmdup.py