Merge pull request #119 from opain/dev

Dev
opain · Sep 10, 2024 · 87646a5
2 parents b7ce68d + 55c67d6
commit 87646a5
Showing 83 changed files with 9,774 additions and 912 deletions.
diff --git a/Scripts/target_scoring/target_scoring_pipeline.R b/Scripts/target_scoring/target_scoring_pipeline.R
@@ -69,6 +69,17 @@ if(!is.na(opt$test)){
 # Identify score files to be combined
 score_files<-list_score_files(opt$config)
 
+if(is.null(score_files)){
+  log_add(log_file = log_file, message = paste0('No score files specified.'))
+  end.time <- Sys.time()
+  time.taken <- end.time - start.time
+  sink(file = paste(opt$output,'.log',sep=''), append = T)
+  cat('Analysis finished at',as.character(end.time),'\n')
+  cat('Analysis duration was',as.character(round(time.taken,2)),attr(time.taken, 'units'),'\n')
+  sink()
+  quit(save = "no", status = 0)
+}
+
 # Subset score files
 if(!is.null(opt$score)){
   if(all(score_files$name != opt$score)){
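
Note on the hunk above: the added guard is an early-exit pattern. When no score files are configured, it logs a message, records the finish time and duration, and quits with status 0 so the run does not count as a failure. The sketch below restates that control flow in Python purely for illustration; it is not the pipeline's R code, and `finish_if_no_scores` and its arguments are hypothetical names.

```python
import time
from typing import Optional, Sequence


def finish_if_no_scores(
    score_files: Optional[Sequence[str]],
    log_path: str,
    start_time: float,
) -> Optional[int]:
    """Return exit status 0 when there is nothing to score, else None.

    Hypothetical helper mirroring the guard in the hunk above: only a
    missing list (None, like NULL in R) triggers the early exit.
    """
    if score_files is not None:
        return None  # score files were specified, so carry on as normal

    end_time = time.time()
    finished = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(end_time))
    # Append to the log, as the R code does with sink(..., append = T).
    with open(log_path, "a") as log:
        log.write("No score files specified.\n")
        log.write(f"Analysis finished at {finished}\n")
        log.write(f"Analysis duration was {round(end_time - start_time, 2)} secs\n")
    return 0
```

A caller would check the return value and, if it is not `None`, pass it to `sys.exit` before any scoring work begins.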

diff --git a/docs/example_plink1-1_EUR.1_EUR-report.html b/docs/example_plink1-1_EUR.1_EUR-report.html
diff --git a/docs/example_plink1-report.html b/docs/example_plink1-report.html
diff --git a/docs/pipeline_technical.Rmd b/docs/pipeline_technical.Rmd
@@ -179,7 +179,7 @@ MegaPRS uses a range of priors (lasso, ridge, bolt, BayesR) for SNP effects, run
 
 #### PRS-CS
 
-PRS-CS, a Bayesian method using a continuous shrinkage prior, specifies a range of global shrinkage parameters (phi), generating multiple sets of genetic effects for polygenic scoring. Its 'auto' model estimates the optimal parameter directly from GWAS summary statistics, negating the need for an external dataset. In GenoPred, PRS-CS is run using the script [pgs_methods/prscs.R](https://github.com/opain/GenoPred/blob/master/Scripts/pgs_methods/prscs.R). GenoPred specifies four phi parameters (1e-6, 1e-4, 1e-2, 1) and the auto model. By default, GenoPred uses the PRS-CS provided 1KG-derived LD matrix data, matching the population of the GWAS sample. The user can select the UKB-derived LD matrix data to be used using the `prscs_ldref` parameter in the `configfile`. 1KG is used by default as PGS based on Yengo et al. sumstats performed significantly better in the OpenSNP target sample, when using the 1KG reference data (this may differ for other GWAS).
+PRS-CS, a Bayesian method using a continuous shrinkage prior, specifies a range of global shrinkage parameters (phi), generating multiple sets of genetic effects for polygenic scoring. Its 'auto' model estimates the optimal parameter directly from GWAS summary statistics, negating the need for an external dataset. In GenoPred, PRS-CS is run using the script [pgs_methods/prscs.R](https://github.com/opain/GenoPred/blob/master/Scripts/pgs_methods/prscs.R). By default, GenoPred specifies four phi parameters (1e-6, 1e-4, 1e-2, 1) and the auto model, but the user can modify this behaviour using the prscs_phi parameter in the configfile. By default, GenoPred uses the PRS-CS provided 1KG-derived LD matrix data, matching the population of the GWAS sample. The user can select the UKB-derived LD matrix data to be used using the `prscs_ldref` parameter in the `configfile`. 1KG is used by default as PGS based on Yengo et al. sumstats performed significantly better in the OpenSNP target sample, when using the 1KG reference data (this may differ for other GWAS).
 
 ***
 
@@ -229,7 +229,7 @@ Target genotype QC is performed using the [format_target.R](https://github.com/o
 
 ## Ancestry Inference
 
-Target samples then undergo ancestry inference, using the [Ancestry_identifier.R](https://github.com/opain/GenoPred/blob/master/Scripts/Ancestry_identifier/Ancestry_identifier.R) script, estimating the probability that each target individual matches each reference population (AFR = African, AMR = Admixed American, EAS = East Asian, EUR = European, CSA = Central and South Asian, MID = Middle Eastern). Population membership was predicted using a reference trained elastic net model consisting of the first six reference-projected genetic principal components. Principal components were defined in the reference dataset using variants present in the target dataset with a minor allele frequency >0.05, missingness <0.02 and Hardy-Weinberg p-value >1×10-6 (if target sample size <100, then only missingness threshold is applied in the target). LD pruning for independent variants is then performed in PLINK after removal of long-range LD regions (ref), using a window size of 1000, step size of 5, and r2 threshold of 0.2. The A multinomial elastic net model predicting super population membership in the reference is derived in using the glmnet R package, with model performance assessed using 5-fold cross validation. The reference-derived principal components are then projected into the target dataset, and the reference-derived elastic net model is used to predict population membership in target. By default, target individuals are assigned to a population if the predicted probability was >0.95, but the user can modify this threshold using the `ancestry_prob_thresh` parameter in the config file.
+Target samples then undergo ancestry inference, using the [Ancestry_identifier.R](https://github.com/opain/GenoPred/blob/master/Scripts/Ancestry_identifier/Ancestry_identifier.R) script, estimating the probability that each target individual matches each reference population (AFR = African, AMR = Admixed American, EAS = East Asian, EUR = European, CSA = Central and South Asian, MID = Middle Eastern). Population membership was predicted using a reference trained elastic net model consisting of the first six reference-projected genetic principal components. Principal components were defined in the reference dataset using variants present in the target dataset with a minor allele frequency >0.05, missingness <0.02 and Hardy-Weinberg p-value >1×10-6 (if target sample size <100, then only missingness threshold is applied in the target). LD pruning for independent variants is then performed in PLINK after removal of long-range LD regions (ref), using a window size of 1000, step size of 5, and r2 threshold of 0.2. The A multinomial elastic net model predicting super population membership in the reference is derived in using the glmnet R package, with model performance assessed using 5-fold cross validation. The reference-derived principal components are then projected into the target dataset, and the reference-derived elastic net model is used to predict population membership in target. By default, target individuals are assigned to a population if the predicted probability was >0.95, but the user can modify this threshold using the ancestry_prob_thresh parameter in the config file. If an individual does not have a predicted probability greater than the ancestry_prob_thresh parameter, then they will be excluded from downstream polygenic scoring. If the ancestry_prob_thresh parameter is low, then an individual may be assigned to multiple reference populations, and they will have polygenic scores that have been standardised according to each assigned reference population. In this case, the individual-level report created by GenoPred will present polygenic scores standardised according to the reference population with the highest predicted probability.
 
 ***
 
@@ -253,13 +253,13 @@ This step calculates scores in the target sample, based on scoring files from th
 
 ### Individual-level
 
-This step creates an .html report summarising the pipeline outputs for each individual in the target sample. It simply reads in pipeline outputs, and then tabulates and plots them. The only analysis it performs is the conversion of polygenic scores onto the absolute scale. It uses a [previously published method](https://pubmed.ncbi.nlm.nih.gov/34983942/). The estimate of the PGS R2 come from the lassosum pseudovalidation analysis, and the distribution in the general population is provided by the user in the prev, mean and sd columns of the gwas_list. Note: It does not convert PGS from externally derived polygenic scores onto the absolutes scale. 
+This step creates an .html report summarising the pipeline outputs for each individual in the target sample. It simply reads in pipeline outputs, and then tabulates and plots them. The only analysis it performs is the conversion of polygenic scores onto the absolute scale. It uses a [previously published method](https://pubmed.ncbi.nlm.nih.gov/34983942/). The estimate of the PGS R2 come from the lassosum pseudovalidation analysis, and the distribution in the general population is provided by the user in the prev, mean and sd columns of the gwas_list. Note: It does not convert PGS from externally derived polygenic scores onto the absolutes scale. An example of the individual-level report derived using the test data can be found <a href="example_plink1-1_EUR.1_EUR-report.html" target="_blank">here</a>.
 
 ***
 
 ### Sample-level
 
-This step creates an .html report summarising the pipeline outputs for each target sample. It simply reads in pipeline outputs, and then tabulates and plots them.
+This step creates an .html report summarising the pipeline outputs for each target sample. It simply reads in pipeline outputs, and then tabulates and plots them. An example of the sample-level report derived using the test data can be found <a href="example_plink1-report.html" target="_blank">here</a>.
 
 ***
 

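Note on the Ancestry Inference change in docs/pipeline_technical.Rmd above: the new text describes three outcomes of applying `ancestry_prob_thresh`. An individual may get no assignment (and be excluded from scoring), a single assignment, or several assignments when the threshold is low, in which case the individual-level report uses the most probable population. The Python sketch below illustrates that rule under stated assumptions. The function names and the example probabilities are hypothetical, and the pipeline itself implements this in R.

```python
from typing import Dict, List, Optional


def assign_populations(probs: Dict[str, float], thresh: float = 0.95) -> List[str]:
    """Return every reference population whose predicted probability
    is strictly greater than the threshold (may be empty or have several)."""
    return [pop for pop, p in probs.items() if p > thresh]


def report_population(probs: Dict[str, float], thresh: float = 0.95) -> Optional[str]:
    """Return the population used for the individual-level report.

    None means no population passed the threshold, so the individual
    would be excluded from downstream polygenic scoring.
    """
    assigned = assign_populations(probs, thresh)
    if not assigned:
        return None
    # With several assignments, pick the one with the highest probability.
    return max(assigned, key=lambda pop: probs[pop])


# Made-up probabilities for one admixed individual (illustration only).
example = {"EUR": 0.55, "AMR": 0.40, "AFR": 0.05}
strict = assign_populations(example)              # default threshold of 0.95
relaxed = assign_populations(example, thresh=0.3)
```

With the default threshold this example individual is assigned to no population, while a threshold of 0.3 assigns them to both EUR and AMR, and `report_population` would then select EUR.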
diff --git a/docs/pipeline_technical.html b/docs/pipeline_technical.html
@@ -936,14 +936,16 @@ <h4>PRS-CS</h4>
 negating the need for an external dataset. In GenoPred, PRS-CS is run
 using the script <a
 href="https://github.com/opain/GenoPred/blob/master/Scripts/pgs_methods/prscs.R">pgs_methods/prscs.R</a>.
-GenoPred specifies four phi parameters (1e-6, 1e-4, 1e-2, 1) and the
-auto model. By default, GenoPred uses the PRS-CS provided 1KG-derived LD
-matrix data, matching the population of the GWAS sample. The user can
-select the UKB-derived LD matrix data to be used using the
-<code>prscs_ldref</code> parameter in the <code>configfile</code>. 1KG
-is used by default as PGS based on Yengo et al. sumstats performed
-significantly better in the OpenSNP target sample, when using the 1KG
-reference data (this may differ for other GWAS).</p>
+By default, GenoPred specifies four phi parameters (1e-6, 1e-4, 1e-2, 1)
+and the auto model, but the user can modify this behaviour using the
+prscs_phi parameter in the configfile. By default, GenoPred uses the
+PRS-CS provided 1KG-derived LD matrix data, matching the population of
+the GWAS sample. The user can select the UKB-derived LD matrix data to
+be used using the <code>prscs_ldref</code> parameter in the
+<code>configfile</code>. 1KG is used by default as PGS based on Yengo et
+al. sumstats performed significantly better in the OpenSNP target
+sample, when using the 1KG reference data (this may differ for other
+GWAS).</p>
 <hr />
 </div>
 <div id="ptclump" class="section level4">
@@ -1081,8 +1083,16 @@ <h2>Ancestry Inference</h2>
 target dataset, and the reference-derived elastic net model is used to
 predict population membership in target. By default, target individuals
 are assigned to a population if the predicted probability was &gt;0.95,
-but the user can modify this threshold using the
-<code>ancestry_prob_thresh</code> parameter in the config file.</p>
+but the user can modify this threshold using the ancestry_prob_thresh
+parameter in the config file. If an individual does not have a predicted
+probability greater than the ancestry_prob_thresh parameter, then they
+will be excluded from downstream polygenic scoring. If the
+ancestry_prob_thresh parameter is low, then an individual may be
+assigned to multiple reference populations, and they will have polygenic
+scores that have been standardised according to each assigned reference
+population. In this case, the individual-level report created by
+GenoPred will present polygenic scores standardised according to the
+reference population with the highest predicted probability.</p>
 <hr />
 </div>
 <div id="within-target-qc" class="section level2">
@@ -1144,14 +1154,18 @@ <h3>Individual-level</h3>
 pseudovalidation analysis, and the distribution in the general
 population is provided by the user in the prev, mean and sd columns of
 the gwas_list. Note: It does not convert PGS from externally derived
-polygenic scores onto the absolutes scale.</p>
+polygenic scores onto the absolutes scale. An example of the
+individual-level report derived using the test data can be found
+<a href="example_plink1-1_EUR.1_EUR-report.html" target="_blank">here</a>.</p>
 <hr />
 </div>
 <div id="sample-level" class="section level3">
 <h3>Sample-level</h3>
 <p>This step creates an .html report summarising the pipeline outputs
 for each target sample. It simply reads in pipeline outputs, and then
-tabulates and plots them.</p>
+tabulates and plots them. An example of the sample-level report derived
+using the test data can be found
+<a href="example_plink1-report.html" target="_blank">here</a>.</p>
 <hr />
 </div>
 </div>

diff --git a/pipeline/envs/pgscatalog_utils.yaml b/pipeline/envs/pgscatalog_utils.yaml
@@ -5,4 +5,4 @@ dependencies:
   - python=3.10
   - pip
   - pip:
-      - poetry
+      - pgscatalog-core==0.2.2
diff --git a/pipeline/misc/dev/test_data/output/example_plink2/ancestry/example_plink2.Ancestry.log b/pipeline/misc/dev/test_data/output/example_plink2/ancestry/example_plink2.Ancestry.log
@@ -3,7 +3,7 @@
 # For questions contact Oliver Pain ([email protected])
 #################################################################
 # Repository: GenoPred
-# Version (tag): v2.2.2-102-g0438efa
+# Version (tag): v2.2.2-110-gb4e52b5
 ---------------
  Parameter        Value                                                                    
  target_plink_chr misc/dev/test_data/output/example_plink2/geno/example_plink2.ref.chr     
@@ -23,7 +23,7 @@
  help             FALSE                                                                    
  out_dir          misc/dev/test_data/output/example_plink2/ancestry/                       
 ---------------
-Analysis started at 2024-07-25 15:07:05
+Analysis started at 2024-07-25 17:38:49
 Lowering prob_thresh parameter to 0.5 for testing.
 Target sample size is <100 so only checking genotype missingness.
 587 variants match between target and reference after QC.
@@ -47,5 +47,5 @@ N per group based on model:
  MID        0
  Unassigned 2
 ----------
-Analysis finished at 2024-07-25 15:07:19 
-Analysis duration was 14.27 secs 
+Analysis finished at 2024-07-25 17:39:05 
+Analysis duration was 15.96 secs 
diff --git a/.../misc/dev/test_data/output/example_plink2/geno/example_plink2.ref.chr22.format_target.log b/.../misc/dev/test_data/output/example_plink2/geno/example_plink2.ref.chr22.format_target.log
@@ -3,7 +3,7 @@
 # For questions contact Oliver Pain ([email protected])
 #################################################################
 # Repository: GenoPred
-# Version (tag): v2.2.2-102-g0438efa
+# Version (tag): v2.2.2-110-gb4e52b5
 ---------------
  Parameter Value                                                                 
  target    misc/dev/test_data/target/example.chr22                               
@@ -15,7 +15,7 @@
  help      FALSE                                                                 
  out_dir   misc/dev/test_data/output/example_plink2/geno/                        
 ---------------
-Analysis started at 2024-07-25 15:07:00
+Analysis started at 2024-07-25 17:38:45
 Reading in reference SNP data.
 Reference data contains 1000 variants.
 Reading in target SNP data.
@@ -26,5 +26,5 @@ GRCh38 match: 0%
 Target contains 1000 reference variants.
 Removing 0 duplicate variants - May have IUPAC NA.
 Inserting missing reference variants.
-Analysis finished at 2024-07-25 15:07:01 
-Analysis duration was 0.7 secs 
+Analysis finished at 2024-07-25 17:38:46 
+Analysis duration was 0.58 secs 
diff --git a/pipeline/misc/dev/test_data/output/example_plink2/geno/example_plink2.ref.chr22.log b/pipeline/misc/dev/test_data/output/example_plink2/geno/example_plink2.ref.chr22.log
@@ -1,24 +1,24 @@
 PLINK v2.00a5.12LM 64-bit Intel (25 Jun 2024)
 Options in effect:
-  --bfile /scratch/prj/oliverpainfel/tmp/RtmpqFvaS6/ref_targ
+  --bfile /scratch/prj/oliverpainfel/tmp/Rtmpjx5MIY/ref_targ
   --make-pgen
   --memory 5000
   --out misc/dev/test_data/output/example_plink2/geno/example_plink2.ref.chr22
-  --remove /scratch/prj/oliverpainfel/tmp/RtmpqFvaS6/REF.psam
+  --remove /scratch/prj/oliverpainfel/tmp/Rtmpjx5MIY/REF.psam
   --threads 1
 
 Hostname: erc-hpc-comp179
 Working directory: /tools/GenoPred/pipeline
-Start time: Thu Jul 25 15:07:01 2024
+Start time: Thu Jul 25 17:38:46 2024
 
-Random number seed: 1721916421
-1031702 MiB RAM detected, ~1018552 available; reserving 5000 MiB for main
+Random number seed: 1721925526
+1031702 MiB RAM detected, ~1018519 available; reserving 5000 MiB for main
 workspace.
 Using 1 compute thread.
 3325 samples (1573 females, 1752 males; 3325 founders) loaded from
-/scratch/prj/oliverpainfel/tmp/RtmpqFvaS6/ref_targ.fam.
+/scratch/prj/oliverpainfel/tmp/Rtmpjx5MIY/ref_targ.fam.
 1000 variants loaded from
-/scratch/prj/oliverpainfel/tmp/RtmpqFvaS6/ref_targ.bim.
+/scratch/prj/oliverpainfel/tmp/Rtmpjx5MIY/ref_targ.bim.
 Note: No phenotype data present.
 --remove: 12 samples remaining.
 12 samples (5 females, 7 males; 12 founders) remaining after main filters.
@@ -32,4 +32,4 @@ Writing
 misc/dev/test_data/output/example_plink2/geno/example_plink2.ref.chr22.pgen ...
 done.
 
-End time: Thu Jul 25 15:07:01 2024
+End time: Thu Jul 25 17:38:46 2024
diff --git a/pipeline/misc/dev/test_data/output/example_plink2/pcs/projected/AFR/example_plink2-AFR.log b/pipeline/misc/dev/test_data/output/example_plink2/pcs/projected/AFR/example_plink2-AFR.log
@@ -3,7 +3,7 @@
 # For questions contact Oliver Pain ([email protected])
 #################################################################
 # Repository: GenoPred
-# Version (tag): v2.2.2-102-g0438efa
+# Version (tag): v2.2.2-110-gb4e52b5
 ---------------
  Parameter        Value                                                                                      
  target_plink_chr misc/dev/test_data/output/example_plink2/geno/example_plink2.ref.chr                       
@@ -19,9 +19,9 @@
  help             FALSE                                                                                      
  output_dir       misc/dev/test_data/output/example_plink2/pcs/projected/AFR/                                
 ---------------
-Analysis started at 2024-07-25 15:07:20
+Analysis started at 2024-07-25 17:39:07
 Calculating polygenic scores in the target sample.
 Scaling target polygenic scores to the reference.
 Saved polygenic scores to: misc/dev/test_data/output/example_plink2/pcs/projected/AFR/example_plink2-AFR.profiles.
-Analysis finished at 2024-07-25 15:07:20 
-Analysis duration was 0.22 secs 
+Analysis finished at 2024-07-25 17:39:07 
+Analysis duration was 0.27 secs 
diff --git a/pipeline/misc/dev/test_data/output/example_plink2/pcs/projected/CSA/example_plink2-CSA.log b/pipeline/misc/dev/test_data/output/example_plink2/pcs/projected/CSA/example_plink2-CSA.log
@@ -3,7 +3,7 @@
 # For questions contact Oliver Pain ([email protected])
 #################################################################
 # Repository: GenoPred
-# Version (tag): v2.2.2-102-g0438efa
+# Version (tag): v2.2.2-110-gb4e52b5
 ---------------
  Parameter        Value                                                                                      
  target_plink_chr misc/dev/test_data/output/example_plink2/geno/example_plink2.ref.chr                       
@@ -19,9 +19,9 @@
  help             FALSE                                                                                      
  output_dir       misc/dev/test_data/output/example_plink2/pcs/projected/CSA/                                
 ---------------
-Analysis started at 2024-07-25 15:07:22
+Analysis started at 2024-07-25 17:39:26
 Calculating polygenic scores in the target sample.
 Scaling target polygenic scores to the reference.
 Saved polygenic scores to: misc/dev/test_data/output/example_plink2/pcs/projected/CSA/example_plink2-CSA.profiles.
-Analysis finished at 2024-07-25 15:07:22 
-Analysis duration was 0.22 secs 
+Analysis finished at 2024-07-25 17:39:26 
+Analysis duration was 0.27 secs 
diff --git a/pipeline/misc/dev/test_data/output/example_plink2/pcs/projected/EAS/example_plink2-EAS.log b/pipeline/misc/dev/test_data/output/example_plink2/pcs/projected/EAS/example_plink2-EAS.log
@@ -3,7 +3,7 @@
 # For questions contact Oliver Pain ([email protected])
 #################################################################
 # Repository: GenoPred
-# Version (tag): v2.2.2-102-g0438efa
+# Version (tag): v2.2.2-110-gb4e52b5
 ---------------
  Parameter        Value                                                                                      
  target_plink_chr misc/dev/test_data/output/example_plink2/geno/example_plink2.ref.chr                       
@@ -19,9 +19,9 @@
  help             FALSE                                                                                      
  output_dir       misc/dev/test_data/output/example_plink2/pcs/projected/EAS/                                
 ---------------
-Analysis started at 2024-07-25 15:07:21
+Analysis started at 2024-07-25 17:39:10
 Calculating polygenic scores in the target sample.
 Scaling target polygenic scores to the reference.
 Saved polygenic scores to: misc/dev/test_data/output/example_plink2/pcs/projected/EAS/example_plink2-EAS.profiles.
-Analysis finished at 2024-07-25 15:07:22 
+Analysis finished at 2024-07-25 17:39:10 
 Analysis duration was 0.24 secs