Skip to content

Reproducible code for running the analyses in the paper "Fast and Accurate Bayesian Polygenic Risk Modeling with Variational Inference" (2022)

Notifications You must be signed in to change notification settings

shz9/viprs-paper

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Fast and Accurate Bayesian Polygenic Risk Modeling with Variational Inference (2022)

By: Shadi Zabad, Simon Gravel, and Yue Li
McGill University

This repository contains reproducible code to replicate the analyses in our upcoming publication describing the Variational Inference of Polygenic Risk Scores (VIPRS) method. The repository includes scripts for all the steps in the analyses, from data pre-processing and QC, all the way to figure generation and plotting.

If you would like to run this method on your dataset of choice, please consult the pages for the main software packages powering these analyses:

In what follows, we will describe the steps and the workflow to replicate the analyses described in the paper.

Required data

In terms of the data that you need to reproduce the analyses, there are three main data sources:

  1. The 1000 Genomes Project genotype data: This data source is used to compute reference LD matrices for analyses with summary statistics. You can download 1000G data from here: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/

  2. The 1000G Phase III genetic map: This is used to extract SNP positions along the chromosome in Centi Morgan (cM). This is important to define LD boundaries for each SNP (e.g. by default, each SNP has a window of 3cM around it). You can download the genetic map into the desired folder as follows:

wget https://mathgen.stats.ox.ac.uk/impute/1000GP_Phase3.tgz
tar -xvzf 1000GP_Phase3.tgz
rm 1000GP_Phase3/*.gz
  1. The UK Biobank (UKB) genotype and phenotype data: You will need to apply for access via the AMS portal to gain access to this dataset.

Getting started

In order to replicate these analyses, we assume that the user has access to a compute cluster with:

  • SLURM workload manager system.
  • python 3.7

You can replicate some of these analyses on your personal machine, but you would have to significantly modify (some) the scripts provided.

To get started, navigate to a folder of choice on your cluster and execute the following commands:

git clone https://github.com/shz9/viprs-paper.git
cd viprs-paper
source starter_script.sh

This will clone the current repository from GitHub, and execute the starter script, which installs all the required python dependencies in a new virtual environment called pyenv. If you wish to change the name of the virtual environment or its location, be mindful that you may need to propagate this change to some of the downstream scripts. The starter script also compiles the cython code in the magenpy and viprs packages.

Once the previous steps executed successfully, you will need to inspect the global_config.sh script to set the relevant configurations and paths on your system. In particular, make sure to change the paths for:

  • The UK Biobank genotype data (the imputed .bgen files)
  • The UK Biobank phenotype data
  • The directory for the files with the genetic map.

The UKB files structure on your system may differ from what we assume in our scripts, so it is also a good idea to inspect some of the downstream scripts to make sure that the specified paths are valid.

Data pre-processing and QC

To prepare the raw genotype data for the downstream analyses, you will need to run the scripts in the data_preparation folder. The following order of operations is recommended, though some of these steps can in principle be run in parallel:

  1. Ensure that the pyenv environment is activated:
source ~/pyenv/bin/activate
  1. Run the sample and SNP QC filtering scripts:
python data_preparation/generate_qc_filters.py
  1. Submit the batch jobs to perform QC and filtering on the genotype data and transform it into .bed format:
source data_preparation/batch_qc.sh
  1. Harmonize the genotype data across chromosomes:
source data_preparation/harmonize_ukbb_genotypes.sh
  1. Submit batch job to compute the LD matrices (you may modify this script to only compute matrices for the windowed estimator with 50k sample size):
source data_preparation/batch_ld.sh 
  1. Extract and transform real phenotypes for the UKB participants:
python data_preparation/prepare_real_phenotypes.py
  1. Split the samples into 5 training/validation/testing folds:
python data_preparation/generate_cv_folds.py
  1. Generate a table of the effect sample size for each phenotype:
python data_preparation/generate_train_sample_size.py
  1. Download and transform external GWAS summary statistics:
source data_preparation/download_external_sumstats.sh
python data_preparation/transform_external_sumstats.py

Simulations

To generate simulated quantitative and binary traits, you can submit genome-wide simulation jobs by invoking the command:

source simulation/batch_simulations.sh

GWAS summary statistics

To generate GWAS summary statistics, we use the plink2 software. To submit jobs that generate GWAS summary statistics to all simulated and real phenotypes for the UK Biobank participants, invoke the following script:

python gwas/batch_gwas.py

Model fitting

Once the GWAS summary statistics have been generated, the next step is to perform model fitting. For the viprs software, you can perform model fitting by following the analysis commands in model_fit/analysis_commands.sh. For instance, to run the standard VIPRS model shown in the main figures of the paper, you can submit jobs as follows:

python model_fit/batch_model_fit.py -m VIPRS -l ukbb_50k_windowed

Whereas to run the grid search version of the model VIPRS-GS, you need to modify the command above to the following:

python model_fit/batch_model_fit.py -m VIPRS -l ukbb_50k_windowed --strategy GS --opt-params pi --grid-metric validation

The script fit_prs.py shows how to use the viprs software for model fitting, if you need to tweak it for your particular dataset.


To run external PRS methods on the GWAS summary statistics, you need to execute the setup script for each method separately and then submit the jobs. The setup scripts download the software, create virtual environments, and install dependencies for each PRS method. For example, to run Lassosum, all you need is to execute the setup file and then submit the jobs to perform model fitting:

source external/Lassosum/setup_lassosum_environment.sh
python external/Lassosum/batch_lassosum.py

For some methods, such as SBayesR, you may need to transform the GWAS summary statistics into, e.g. the COJO format. In that case, you may need to execute the transform_sumstats.py script before you run the method:

source external/SBayesR/setup_sbayesr_environment.sh
source external/SBayesR/transform_sumstats_job.sh
python external/SBayesR/batch_sbayesr.py

Individual scoring (Generating PGS)

To generate PRS for individuals in the test sets, you need to run the scoring scripts for each method. Some of these commands are listed in the score/analysis_commands.sh script. For example, to generate PGS from the standard VIPRS model, you can submit jobs as follows:

python score/batch_score.py -m VIPRS -l ukbb_50k_windowed

To generate PGS for an external method, such as SBayesR, you can also submit jobs as follows:

python score/batch_score.py -m SBayesR -l external

Evaluation

Once the polygenic scores for individuals in the test set have been generated, the next step is to evaluate the predictive performance of the various PRS models. To do this, you can simply execute the following python script:

python evaluation/predictive_performance.py

10m SNPs analysis

To replicate the analysis with the 10 million SNPs, you need to execute the scripts in the analysis_10m_snps directory. The order of operations is similar to what we described above. You need to start with executing the scripts in the data preprocessing folder (analysis_10m_snps/data_preparation), then moving onto the GWAS step (analysis_10m_snps/gwas), model fitting (analysis_10m_snps/model_fit), scoring (analysis_10m_snps/score), and evaluation (analysis_10m_snps/evaluation).

Multi-population analysis

Similarly, to replicate the multi-population analysis, you need to execute the scripts in multi_pop_analysis, starting with data preprocessing (multi_pop_analysis/data_preparation), scoring (multi_pop_analysis/score), and evaluation (multi_pop_analysis/evaluation).

Generating figures and plots

Finally, to generate the figures and plots you see in the manuscript, you can execute the bash script in the plotting directory:

source plotting/batch_plot.sh

New analyses in v2 of the manuscript

In version 2 of the manuscript, we added new simulation scenarios and included MegaPRS in the list of baseline models that we compare against.

The new simulation scripts can be found in simulation_supplementary directory.

In addition to the simulations, I also added new scripts for performing model fit and computing LD matrices using the latest versions of the viprs and `magenpy software implementations.

Questions and inquiries

If you have any questions or inquiries about any of the steps in this analysis, feel free to open an issue under this GitHub repo or email the corresponding authors on the manuscript.

About

Reproducible code for running the analyses in the paper "Fast and Accurate Bayesian Polygenic Risk Modeling with Variational Inference" (2022)

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published