Merge pull request #119 from opain/dev
Dev
opain authored Sep 10, 2024
2 parents b7ce68d + 55c67d6 commit 87646a5
Showing 83 changed files with 9,774 additions and 912 deletions.
11 changes: 11 additions & 0 deletions Scripts/target_scoring/target_scoring_pipeline.R
Original file line number Diff line number Diff line change
@@ -69,6 +69,17 @@ if(!is.na(opt$test)){
# Identify score files to be combined
score_files<-list_score_files(opt$config)

if(is.null(score_files)){
  # No score files are configured: record this in the log and exit cleanly
  log_add(log_file = log_file, message = 'No score files specified.')
  end.time <- Sys.time()
  time.taken <- end.time - start.time
  sink(file = paste(opt$output,'.log',sep=''), append = TRUE)
  cat('Analysis finished at',as.character(end.time),'\n')
  cat('Analysis duration was',as.character(round(time.taken,2)),attr(time.taken, 'units'),'\n')
  sink()
  quit(save = "no", status = 0)
}

# Subset score files
if(!is.null(opt$score)){
if(all(score_files$name != opt$score)){
4,583 changes: 4,583 additions & 0 deletions docs/example_plink1-1_EUR.1_EUR-report.html

Large diffs are not rendered by default.

4,128 changes: 4,128 additions & 0 deletions docs/example_plink1-report.html

Large diffs are not rendered by default.

8 changes: 4 additions & 4 deletions docs/pipeline_technical.Rmd
@@ -179,7 +179,7 @@ MegaPRS uses a range of priors (lasso, ridge, bolt, BayesR) for SNP effects, run

#### PRS-CS

PRS-CS, a Bayesian method using a continuous shrinkage prior, specifies a range of global shrinkage parameters (phi), generating multiple sets of genetic effects for polygenic scoring. Its 'auto' model estimates the optimal parameter directly from GWAS summary statistics, negating the need for an external dataset. In GenoPred, PRS-CS is run using the script [pgs_methods/prscs.R](https://github.com/opain/GenoPred/blob/master/Scripts/pgs_methods/prscs.R). GenoPred specifies four phi parameters (1e-6, 1e-4, 1e-2, 1) and the auto model. By default, GenoPred uses the PRS-CS provided 1KG-derived LD matrix data, matching the population of the GWAS sample. The user can select the UKB-derived LD matrix data to be used using the `prscs_ldref` parameter in the `configfile`. 1KG is used by default as PGS based on Yengo et al. sumstats performed significantly better in the OpenSNP target sample, when using the 1KG reference data (this may differ for other GWAS).
PRS-CS, a Bayesian method using a continuous shrinkage prior, specifies a range of global shrinkage parameters (phi), generating multiple sets of genetic effects for polygenic scoring. Its 'auto' model estimates the optimal parameter directly from GWAS summary statistics, negating the need for an external dataset. In GenoPred, PRS-CS is run using the script [pgs_methods/prscs.R](https://github.com/opain/GenoPred/blob/master/Scripts/pgs_methods/prscs.R). By default, GenoPred specifies four phi parameters (1e-6, 1e-4, 1e-2, 1) and the auto model, but the user can modify this behaviour using the `prscs_phi` parameter in the `configfile`. By default, GenoPred uses the 1KG-derived LD matrix data provided by PRS-CS, matching the population of the GWAS sample. The user can instead select the UKB-derived LD matrix data using the `prscs_ldref` parameter in the `configfile`. 1KG is the default because PGS based on the Yengo et al. summary statistics performed significantly better in the OpenSNP target sample when using the 1KG reference data (this may differ for other GWAS).

***

@@ -229,7 +229,7 @@ Target genotype QC is performed using the [format_target.R](https://github.com/o

## Ancestry Inference

Target samples then undergo ancestry inference, using the [Ancestry_identifier.R](https://github.com/opain/GenoPred/blob/master/Scripts/Ancestry_identifier/Ancestry_identifier.R) script, estimating the probability that each target individual matches each reference population (AFR = African, AMR = Admixed American, EAS = East Asian, EUR = European, CSA = Central and South Asian, MID = Middle Eastern). Population membership was predicted using a reference trained elastic net model consisting of the first six reference-projected genetic principal components. Principal components were defined in the reference dataset using variants present in the target dataset with a minor allele frequency >0.05, missingness <0.02 and Hardy-Weinberg p-value >1×10-6 (if target sample size <100, then only missingness threshold is applied in the target). LD pruning for independent variants is then performed in PLINK after removal of long-range LD regions (ref), using a window size of 1000, step size of 5, and r2 threshold of 0.2. The A multinomial elastic net model predicting super population membership in the reference is derived in using the glmnet R package, with model performance assessed using 5-fold cross validation. The reference-derived principal components are then projected into the target dataset, and the reference-derived elastic net model is used to predict population membership in target. By default, target individuals are assigned to a population if the predicted probability was >0.95, but the user can modify this threshold using the `ancestry_prob_thresh` parameter in the config file.
Target samples then undergo ancestry inference using the [Ancestry_identifier.R](https://github.com/opain/GenoPred/blob/master/Scripts/Ancestry_identifier/Ancestry_identifier.R) script, which estimates the probability that each target individual matches each reference population (AFR = African, AMR = Admixed American, EAS = East Asian, EUR = European, CSA = Central and South Asian, MID = Middle Eastern). Population membership is predicted using a reference-trained elastic net model based on the first six reference-projected genetic principal components. Principal components are defined in the reference dataset using variants present in the target dataset with a minor allele frequency >0.05, missingness <0.02 and Hardy-Weinberg p-value >1×10⁻⁶ (if the target sample size is <100, only the missingness threshold is applied in the target). LD pruning for independent variants is then performed in PLINK after removal of long-range LD regions (ref), using a window size of 1000, step size of 5, and r² threshold of 0.2. A multinomial elastic net model predicting super population membership in the reference is derived using the glmnet R package, with model performance assessed using 5-fold cross validation. The reference-derived principal components are then projected into the target dataset, and the reference-derived elastic net model is used to predict population membership in the target. By default, target individuals are assigned to a population if the predicted probability is >0.95, but the user can modify this threshold using the `ancestry_prob_thresh` parameter in the config file. If no population has a predicted probability greater than `ancestry_prob_thresh` for an individual, that individual is excluded from downstream polygenic scoring. If `ancestry_prob_thresh` is low (below 0.5, given that the predicted probabilities sum to 1), an individual may be assigned to multiple reference populations, and will have polygenic scores standardised according to each assigned reference population.
In this case, the individual-level report created by GenoPred presents polygenic scores standardised according to the reference population with the highest predicted probability.
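
The assignment rule described above can be sketched as follows. This is a minimal illustration in Python, not GenoPred's own code (the pipeline implements this step in R). The function name `assign_populations` and the example probabilities are hypothetical. It assumes that an individual is assigned to every population whose predicted probability exceeds the threshold, is left unassigned if none does, and is reported under the assigned population with the highest probability.

```python
from typing import Dict, List, Optional, Tuple


def assign_populations(
    probs: Dict[str, float], prob_thresh: float = 0.95
) -> Tuple[List[str], Optional[str]]:
    """Return (assigned populations, population used in the individual report).

    probs maps each reference population to its predicted probability.
    prob_thresh plays the role of the ancestry_prob_thresh parameter.
    """
    # Assign to every population whose probability exceeds the threshold
    assigned = [pop for pop, p in probs.items() if p > prob_thresh]
    if not assigned:
        # Unassigned individuals are excluded from downstream scoring
        return [], None
    # The report uses the assigned population with the highest probability
    report_pop = max(assigned, key=lambda pop: probs[pop])
    return assigned, report_pop


# Hypothetical predicted probabilities for one individual (they sum to 1)
example = {"AFR": 0.01, "AMR": 0.02, "EAS": 0.0, "EUR": 0.62, "CSA": 0.30, "MID": 0.05}

print(assign_populations(example))        # default threshold of 0.95
print(assign_populations(example, 0.25))  # lowered threshold
```

With the default threshold this individual is unassigned. With a threshold of 0.25 they are assigned to both EUR and CSA, and EUR would be the population used in the individual-level report.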

***

@@ -253,13 +253,13 @@ This step calculates scores in the target sample, based on scoring files from th

### Individual-level

This step creates an .html report summarising the pipeline outputs for each individual in the target sample. It simply reads in pipeline outputs, and then tabulates and plots them. The only analysis it performs is the conversion of polygenic scores onto the absolute scale. It uses a [previously published method](https://pubmed.ncbi.nlm.nih.gov/34983942/). The estimate of the PGS R2 come from the lassosum pseudovalidation analysis, and the distribution in the general population is provided by the user in the prev, mean and sd columns of the gwas_list. Note: It does not convert PGS from externally derived polygenic scores onto the absolutes scale.
This step creates an .html report summarising the pipeline outputs for each individual in the target sample. It simply reads in pipeline outputs, and then tabulates and plots them. The only analysis it performs is the conversion of polygenic scores onto the absolute scale, using a [previously published method](https://pubmed.ncbi.nlm.nih.gov/34983942/). The estimate of the PGS R2 comes from the lassosum pseudovalidation analysis, and the distribution in the general population is provided by the user in the `prev`, `mean` and `sd` columns of the `gwas_list`. Note: PGS from externally derived score files are not converted onto the absolute scale. An example of the individual-level report derived using the test data can be found <a href="example_plink1-1_EUR.1_EUR-report.html" target="_blank">here</a>.
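
To give an intuition for what converting a polygenic score onto the absolute scale means, here is a minimal Python sketch for a continuous trait. It is not the implementation of the linked method or of GenoPred. It assumes the standardised PGS and the trait are bivariate normal with squared correlation equal to the PGS R2, and it uses the user-supplied population `mean` and `sd`. Binary outcomes, which would use `prev`, are not covered. The function name `pgs_to_absolute` and the example values are hypothetical.

```python
import math
from typing import Tuple


def pgs_to_absolute(
    z: float, r2: float, trait_mean: float, trait_sd: float
) -> Tuple[float, float]:
    """Expected trait value and residual SD given a standardised PGS.

    Assumes a bivariate normal relationship between PGS and trait,
    so corr(PGS, trait) = sqrt(r2).
    """
    if not 0.0 <= r2 <= 1.0:
        raise ValueError("r2 must be between 0 and 1")
    rho = math.sqrt(r2)
    # Conditional mean shifts with the PGS z-score in proportion to rho
    cond_mean = trait_mean + trait_sd * rho * z
    # Conditional SD shrinks as the PGS explains more variance
    cond_sd = trait_sd * math.sqrt(1.0 - r2)
    return cond_mean, cond_sd


# Hypothetical values: PGS one SD above the mean, R2 = 0.25,
# trait with population mean 170 and SD 10
m, s = pgs_to_absolute(z=1.0, r2=0.25, trait_mean=170.0, trait_sd=10.0)
print(m, s)
```

Under these assumptions a higher R2 both moves the expected value further from the population mean and narrows the range of plausible trait values around it.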

***

### Sample-level

This step creates an .html report summarising the pipeline outputs for each target sample. It simply reads in pipeline outputs, and then tabulates and plots them.
This step creates an .html report summarising the pipeline outputs for each target sample. It simply reads in pipeline outputs, and then tabulates and plots them. An example of the sample-level report derived using the test data can be found <a href="example_plink1-report.html" target="_blank">here</a>.

***

38 changes: 26 additions & 12 deletions docs/pipeline_technical.html
@@ -936,14 +936,16 @@ <h4>PRS-CS</h4>
negating the need for an external dataset. In GenoPred, PRS-CS is run
using the script <a
href="https://github.com/opain/GenoPred/blob/master/Scripts/pgs_methods/prscs.R">pgs_methods/prscs.R</a>.
GenoPred specifies four phi parameters (1e-6, 1e-4, 1e-2, 1) and the
auto model. By default, GenoPred uses the PRS-CS provided 1KG-derived LD
matrix data, matching the population of the GWAS sample. The user can
select the UKB-derived LD matrix data to be used using the
<code>prscs_ldref</code> parameter in the <code>configfile</code>. 1KG
is used by default as PGS based on Yengo et al. sumstats performed
significantly better in the OpenSNP target sample, when using the 1KG
reference data (this may differ for other GWAS).</p>
By default, GenoPred specifies four phi parameters (1e-6, 1e-4, 1e-2, 1)
and the auto model, but the user can modify this behaviour using the
<code>prscs_phi</code> parameter in the <code>configfile</code>. By
default, GenoPred uses the 1KG-derived LD matrix data provided by PRS-CS,
matching the population of the GWAS sample. The user can instead select
the UKB-derived LD matrix data using the <code>prscs_ldref</code>
parameter in the <code>configfile</code>. 1KG is the default because PGS
based on the Yengo et al. summary statistics performed significantly
better in the OpenSNP target sample when using the 1KG reference data
(this may differ for other GWAS).</p>
<hr />
</div>
<div id="ptclump" class="section level4">
@@ -1081,8 +1083,16 @@ <h2>Ancestry Inference</h2>
target dataset, and the reference-derived elastic net model is used to
predict population membership in target. By default, target individuals
are assigned to a population if the predicted probability was &gt;0.95,
but the user can modify this threshold using the
<code>ancestry_prob_thresh</code> parameter in the config file.</p>
but the user can modify this threshold using the
<code>ancestry_prob_thresh</code> parameter in the config file. If no
population has a predicted probability greater than
<code>ancestry_prob_thresh</code> for an individual, that individual is
excluded from downstream polygenic scoring. If
<code>ancestry_prob_thresh</code> is low (below 0.5, given that the
predicted probabilities sum to 1), an individual may be assigned to
multiple reference populations, and will have polygenic scores
standardised according to each assigned reference population. In this
case, the individual-level report created by GenoPred presents polygenic
scores standardised according to the reference population with the
highest predicted probability.</p>
<hr />
</div>
<div id="within-target-qc" class="section level2">
@@ -1144,14 +1154,18 @@ <h3>Individual-level</h3>
pseudovalidation analysis, and the distribution in the general
population is provided by the user in the prev, mean and sd columns of
the gwas_list. Note: It does not convert PGS from externally derived
polygenic scores onto the absolutes scale.</p>
polygenic scores onto the absolute scale. An example of the
individual-level report derived using the test data can be found
<a href="example_plink1-1_EUR.1_EUR-report.html" target="_blank">here</a>.</p>
<hr />
</div>
<div id="sample-level" class="section level3">
<h3>Sample-level</h3>
<p>This step creates an .html report summarising the pipeline outputs
for each target sample. It simply reads in pipeline outputs, and then
tabulates and plots them.</p>
tabulates and plots them. An example of the sample-level report derived
using the test data can be found
<a href="example_plink1-report.html" target="_blank">here</a>.</p>
<hr />
</div>
</div>
2 changes: 1 addition & 1 deletion pipeline/envs/pgscatalog_utils.yaml
@@ -5,4 +5,4 @@ dependencies:
- python=3.10
- pip
- pip:
- poetry
- pgscatalog-core==0.2.2
Original file line number Diff line number Diff line change
@@ -3,7 +3,7 @@
# For questions contact Oliver Pain ([email protected])
#################################################################
# Repository: GenoPred
# Version (tag): v2.2.2-102-g0438efa
# Version (tag): v2.2.2-110-gb4e52b5
---------------
Parameter Value
target_plink_chr misc/dev/test_data/output/example_plink2/geno/example_plink2.ref.chr
@@ -23,7 +23,7 @@
help FALSE
out_dir misc/dev/test_data/output/example_plink2/ancestry/
---------------
Analysis started at 2024-07-25 15:07:05
Analysis started at 2024-07-25 17:38:49
Lowering prob_thresh parameter to 0.5 for testing.
Target sample size is <100 so only checking genotype missingness.
587 variants match between target and reference after QC.
@@ -47,5 +47,5 @@ N per group based on model:
MID 0
Unassigned 2
----------
Analysis finished at 2024-07-25 15:07:19
Analysis duration was 14.27 secs
Analysis finished at 2024-07-25 17:39:05
Analysis duration was 15.96 secs
Original file line number Diff line number Diff line change
@@ -3,7 +3,7 @@
# For questions contact Oliver Pain ([email protected])
#################################################################
# Repository: GenoPred
# Version (tag): v2.2.2-102-g0438efa
# Version (tag): v2.2.2-110-gb4e52b5
---------------
Parameter Value
target misc/dev/test_data/target/example.chr22
@@ -15,7 +15,7 @@
help FALSE
out_dir misc/dev/test_data/output/example_plink2/geno/
---------------
Analysis started at 2024-07-25 15:07:00
Analysis started at 2024-07-25 17:38:45
Reading in reference SNP data.
Reference data contains 1000 variants.
Reading in target SNP data.
@@ -26,5 +26,5 @@ GRCh38 match: 0%
Target contains 1000 reference variants.
Removing 0 duplicate variants - May have IUPAC NA.
Inserting missing reference variants.
Analysis finished at 2024-07-25 15:07:01
Analysis duration was 0.7 secs
Analysis finished at 2024-07-25 17:38:46
Analysis duration was 0.58 secs
Original file line number Diff line number Diff line change
@@ -1,24 +1,24 @@
PLINK v2.00a5.12LM 64-bit Intel (25 Jun 2024)
Options in effect:
--bfile /scratch/prj/oliverpainfel/tmp/RtmpqFvaS6/ref_targ
--bfile /scratch/prj/oliverpainfel/tmp/Rtmpjx5MIY/ref_targ
--make-pgen
--memory 5000
--out misc/dev/test_data/output/example_plink2/geno/example_plink2.ref.chr22
--remove /scratch/prj/oliverpainfel/tmp/RtmpqFvaS6/REF.psam
--remove /scratch/prj/oliverpainfel/tmp/Rtmpjx5MIY/REF.psam
--threads 1

Hostname: erc-hpc-comp179
Working directory: /tools/GenoPred/pipeline
Start time: Thu Jul 25 15:07:01 2024
Start time: Thu Jul 25 17:38:46 2024

Random number seed: 1721916421
1031702 MiB RAM detected, ~1018552 available; reserving 5000 MiB for main
Random number seed: 1721925526
1031702 MiB RAM detected, ~1018519 available; reserving 5000 MiB for main
workspace.
Using 1 compute thread.
3325 samples (1573 females, 1752 males; 3325 founders) loaded from
/scratch/prj/oliverpainfel/tmp/RtmpqFvaS6/ref_targ.fam.
/scratch/prj/oliverpainfel/tmp/Rtmpjx5MIY/ref_targ.fam.
1000 variants loaded from
/scratch/prj/oliverpainfel/tmp/RtmpqFvaS6/ref_targ.bim.
/scratch/prj/oliverpainfel/tmp/Rtmpjx5MIY/ref_targ.bim.
Note: No phenotype data present.
--remove: 12 samples remaining.
12 samples (5 females, 7 males; 12 founders) remaining after main filters.
@@ -32,4 +32,4 @@ Writing
misc/dev/test_data/output/example_plink2/geno/example_plink2.ref.chr22.pgen ...
done.

End time: Thu Jul 25 15:07:01 2024
End time: Thu Jul 25 17:38:46 2024
Original file line number Diff line number Diff line change
@@ -3,7 +3,7 @@
# For questions contact Oliver Pain ([email protected])
#################################################################
# Repository: GenoPred
# Version (tag): v2.2.2-102-g0438efa
# Version (tag): v2.2.2-110-gb4e52b5
---------------
Parameter Value
target_plink_chr misc/dev/test_data/output/example_plink2/geno/example_plink2.ref.chr
@@ -19,9 +19,9 @@
help FALSE
output_dir misc/dev/test_data/output/example_plink2/pcs/projected/AFR/
---------------
Analysis started at 2024-07-25 15:07:20
Analysis started at 2024-07-25 17:39:07
Calculating polygenic scores in the target sample.
Scaling target polygenic scores to the reference.
Saved polygenic scores to: misc/dev/test_data/output/example_plink2/pcs/projected/AFR/example_plink2-AFR.profiles.
Analysis finished at 2024-07-25 15:07:20
Analysis duration was 0.22 secs
Analysis finished at 2024-07-25 17:39:07
Analysis duration was 0.27 secs
Original file line number Diff line number Diff line change
@@ -3,7 +3,7 @@
# For questions contact Oliver Pain ([email protected])
#################################################################
# Repository: GenoPred
# Version (tag): v2.2.2-102-g0438efa
# Version (tag): v2.2.2-110-gb4e52b5
---------------
Parameter Value
target_plink_chr misc/dev/test_data/output/example_plink2/geno/example_plink2.ref.chr
@@ -19,9 +19,9 @@
help FALSE
output_dir misc/dev/test_data/output/example_plink2/pcs/projected/CSA/
---------------
Analysis started at 2024-07-25 15:07:22
Analysis started at 2024-07-25 17:39:26
Calculating polygenic scores in the target sample.
Scaling target polygenic scores to the reference.
Saved polygenic scores to: misc/dev/test_data/output/example_plink2/pcs/projected/CSA/example_plink2-CSA.profiles.
Analysis finished at 2024-07-25 15:07:22
Analysis duration was 0.22 secs
Analysis finished at 2024-07-25 17:39:26
Analysis duration was 0.27 secs
Original file line number Diff line number Diff line change
@@ -3,7 +3,7 @@
# For questions contact Oliver Pain ([email protected])
#################################################################
# Repository: GenoPred
# Version (tag): v2.2.2-102-g0438efa
# Version (tag): v2.2.2-110-gb4e52b5
---------------
Parameter Value
target_plink_chr misc/dev/test_data/output/example_plink2/geno/example_plink2.ref.chr
@@ -19,9 +19,9 @@
help FALSE
output_dir misc/dev/test_data/output/example_plink2/pcs/projected/EAS/
---------------
Analysis started at 2024-07-25 15:07:21
Analysis started at 2024-07-25 17:39:10
Calculating polygenic scores in the target sample.
Scaling target polygenic scores to the reference.
Saved polygenic scores to: misc/dev/test_data/output/example_plink2/pcs/projected/EAS/example_plink2-EAS.profiles.
Analysis finished at 2024-07-25 15:07:22
Analysis finished at 2024-07-25 17:39:10
Analysis duration was 0.24 secs