Merge pull request #20 from uclahs-cds/nkwang-update-Rscripts
Update Rscripts
nwiltsie authored Aug 30, 2024
2 parents 5c2bc28 + 9418d9c commit dda927b
Showing 15 changed files with 446 additions and 500 deletions.
2 changes: 2 additions & 0 deletions CHANGELOG.md
@@ -15,6 +15,8 @@ This project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.htm
- Add workflow for SV caller (Delly2)
- Add pipeline diagram
- Add reverse liftover (GRCh38 -> GRCh37) for SNV branch
- Add reverse liftover (GRCh38 -> GRCh37) for SV branch
- Add optional `target_threshold` and `target_specificity` parameters

### Changed

11 changes: 7 additions & 4 deletions README.md
@@ -51,10 +51,11 @@ If you are using the UCLA Azure cluster, please use the [submission script](http
- For SNVs, convert variant coordinates using the `BCFtools` LiftOver plugin with UCSC chain files.
- For SVs, convert variant breakpoint coordinates using custom R script with UCSC chain files and `rtracklayer` and `GenomicRanges` R packages.

-### 2. Variant annotation
+### 2. Variant annotation*

- For SNVs, add dbSNP, GENCODE, and HGNC annotations using GATK's Funcotator. Add trinucleotide context and RepeatMasker intervals with `bedtools`.
- For SVs, annotate variants with population allele frequency from the gnomAD-SV v4 database.
- *Variant annotation occurs prior to LiftOver when converting from GRCh38 -> GRCh37

### 3. Predict variant stability

@@ -98,6 +99,8 @@ input:

| Optional Parameter | Type | Default | Description |
| --------------------------- | ----------------------------------------------------------------------------------------- | ---------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `target_threshold` | numeric | `""` | Target Stability Score threshold for variant filtering: [0, 1] |
| `target_specificity` | numeric | `""` | Target specificity based on whole genome validation set for variant filtering: [0, 1] |
| `work_dir` | path | `/scratch/$SLURM_JOB_ID` | Path of working directory for Nextflow. When included in the sample config file, Nextflow intermediate files and logs will be saved to this directory. With `ucla_cds`, the default is `/scratch` and should only be changed for testing/development. Changing this directory to `/hot` or `/tmp` can lead to high server latency and potential disk space limitations, respectively. |
| `save_intermediate_files` | boolean | false | If set, save output files from intermediate pipeline processes. |
| `min_cpus` | int | 1 | Minimum number of CPUs that can be assigned to each process. |
@@ -117,13 +120,13 @@ The docker images in the following table are generally defined like `docker_imag

* Change `params.docker_container_registry`. This will affect all of the images (except for GATK).
* Change `params.<tool>_version`. This will pull a different version of the same image from the registry.
-* Change `params.docker_image_<tool>`. This will explicitly set the image to use, ignoring `docker_container_registry` and `<tool>_version`, and thus requires that the docker tag be explicitly set (e.g. `broadinstitute/gatk:4.2.4.1`).
+* Change `params.docker_image_<tool>`. This will explicitly set the image to use, ignoring `docker_container_registry` and `<tool>_version`, and thus requires that the docker tag be explicitly set (e.g. `broadinstitute/gatk:4.4.0.0`).

| Tool Parameter | Version Parameter | Default | Notes |
| ------------------------ | -------------------- | ------------------------------------------------------------ | ------------------------------------------------------------------- |
| `docker_image_bcftools` | `bcftools_version` | `ghcr.io/uclahs-cds/bcftools-score:1.20_score-1.20-20240505` | This image must have both BCFtools and the score plugins available. |
| `docker_image_bedtools` | `bedtools_version` | `ghcr.io/uclahs-cds/bedtools:2.31.0` | |
-| `docker_image_gatk` | `gatk_version` | `broadinstitute/gatk:4.2.4.1` | |
+| `docker_image_gatk` | `gatk_version` | `broadinstitute/gatk:4.4.0.0` | |
| `docker_image_pipeval` | `pipeval_version` | `ghcr.io/uclahs-cds/pipeval:5.0.0-rc.3` | |
| `docker_image_samtools` | `samtools_version` | `ghcr.io/uclahs-cds/samtools:1.20` | |
| `docker_image_stablelift` | `stablelift_version` | `ghcr.io/uclahs-cds/stablelift:FIXME` | This image is built and maintained via this repository. |
@@ -191,7 +194,7 @@ Please see list of [Contributors](https://github.com/uclahs-cds/pipeline-StableL

pipeline-StableLift is licensed under the GNU General Public License version 2. See the file LICENSE for the terms of the GNU GPL license.

-StableLift is a machine learning approach designed to predict variant stability across reference genome builds, supplementing LiftOver coordinate conversion and increasing portability of variant calls.
+StableLift is a machine learning approach designed to predict variant stability across reference genome builds, supplementing LiftOver coordinate conversion to increase the portability of variant calls.

Copyright (C) 2024 University of California Los Angeles ("Boutros Lab") All rights reserved.

24 changes: 23 additions & 1 deletion config/custom_schema_types.config
@@ -43,7 +43,29 @@ custom_schema_types {
}
}

/**
* Check that the input is numeric in the appropriate range.
*/
ranged_number = { Map options, String name, Map properties ->
if (!(properties.containsKey('min') && properties['min'] in Number)) {
throw new Exception('`min` parameter misconfigured - must be a Number.')
}

if (!(properties.containsKey('max') && properties['max'] in Number)) {
throw new Exception('`max` parameter misconfigured - must be a Number.')
}

if (!(options[name] in Number)) {
throw new Exception("${name} must be a Number, not ${options[name].getClass()}")
}

if (options[name] < properties.min || properties.max < options[name]) {
throw new Exception("${name}=${options[name]} is not in range [${properties.min}, ${properties.max}]")
}
}

types = [
-'FuncotatorDataSource': custom_schema_types.check_funcotator_data_source
+'FuncotatorDataSource': custom_schema_types.check_funcotator_data_source,
+'RangedNumber': custom_schema_types.ranged_number
]
}
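
The `ranged_number` check added above can be restated outside of Groovy as a minimal sketch. The Python below is an illustration only, not part of the pipeline: the function name `check_ranged_number` and the use of `ValueError` are assumptions, and `numbers.Real` stands in for Groovy's `in Number` test.

```python
from numbers import Real


def _is_real_number(value):
    # bool is a subclass of int in Python, so it is excluded explicitly;
    # a Groovy Boolean would also fail the `in Number` test.
    return isinstance(value, Real) and not isinstance(value, bool)


def check_ranged_number(options, name, properties):
    """Raise ValueError unless options[name] is a number in [min, max].

    Hypothetical Python counterpart of the `ranged_number` closure.
    """
    # The schema entry itself must declare numeric bounds.
    for bound in ("min", "max"):
        if not _is_real_number(properties.get(bound)):
            raise ValueError(f"`{bound}` parameter misconfigured - must be a Number.")

    value = options[name]
    if not _is_real_number(value):
        raise ValueError(f"{name} must be a Number, not {type(value).__name__}")

    # Both bounds are inclusive, matching the closed interval [min, max].
    if value < properties["min"] or properties["max"] < value:
        raise ValueError(
            f"{name}={value} is not in range [{properties['min']}, {properties['max']}]"
        )
```

As in the closure, the bounds are inclusive, so `0` and `1` are both accepted for a `[0, 1]` range, while a quoted value such as `"0.5"` is rejected as a non-number.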
20 changes: 20 additions & 0 deletions config/schema.yaml
@@ -99,3 +99,23 @@ input:
mode: 'r'
required: true
help: 'Input dataset supplied by input yaml'

target_threshold:
type: 'RangedNumber'
required: false
min: 0
max: 1
help: >-
Optional parameter specifying target Stability Score threshold for variant
filtering. Default behavior without `target_threshold` or
`target_specificity` specified uses threshold maximizing F1-score in whole
genome validation set.
target_specificity:
type: 'RangedNumber'
required: false
min: 0
max: 1
help: >-
Optional parameter specifying target specificity for variant filtering
based on whole genome validation set. Overrides `target_threshold`.
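
The precedence described in the two help strings above can be sketched as follows. This is an illustration of the documented behavior only: the function `resolve_filter_mode` and the mode labels are hypothetical, and the actual selection logic lives in the R script, which is not shown in this diff.

```python
def resolve_filter_mode(target_threshold=None, target_specificity=None):
    """Return (mode, value) for the filtering rule that applies.

    Hypothetical helper mirroring the schema help text:
    `target_specificity` overrides `target_threshold`, and with neither
    set the F1-maximizing threshold from the validation set is used.
    """
    if target_specificity is not None:
        return ("specificity", target_specificity)
    if target_threshold is not None:
        return ("threshold", target_threshold)
    return ("max_f1", None)
```

The comparisons use `is not None` rather than truthiness so that an explicit value of `0` is still treated as set.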
17 changes: 14 additions & 3 deletions config/template.config
@@ -5,17 +5,28 @@ includeConfig "${projectDir}/config/default.config"
includeConfig "${projectDir}/config/methods.config"
includeConfig "${projectDir}/nextflow.config"


// Inputs/parameters of the pipeline
params {
// input/output locations
-output_dir = 'where/to/save/outputs/'
+output_dir = "where/to/save/outputs/"

-// Choices: ["Mutect2", "HaplotypeCaller"]
+// Choices: ["HaplotypeCaller", "Mutect2", "Strelka2", "SomaticSniper", "Muse2", "Delly2"]
variant_caller = "Mutect2"

// Path to pre-trained random forest model
rf_model = ""

// Optional parameter specifying target Stability Score threshold for
// variant filtering. Default behavior without `target_threshold` or
// `target_specificity` specified uses threshold maximizing F1-score in
// whole genome validation set. Must be in the range [0.0, 1.0].
// target_threshold = 0.5

// Optional parameter specifying target specificity for variant filtering
// based on whole genome validation set. Overrides `target_threshold`. Must
// be in the range [0.0, 1.0].
// target_specificity = 0.5

// Reference files
funcotator_data {
data_source = "/hot/ref/tool-specific-input/Funcotator/somatic/funcotator_dataSources.v1.7.20200521s"
4 changes: 2 additions & 2 deletions docs/pipeline.mmd
@@ -64,7 +64,7 @@ flowchart TD
--> bcftools_annotate2["`bcftools annotate*Trinucleotide*`"]:::bcftools
end

-blocknote["`**Note:** Annotation is performed before Liftover when lifting backward`"]
+blocknote["`**Note:** Annotation is performed prior to LiftOver when converting from GRCh38 -> GRCh37`"]

bcftools_liftover ---> gatk_func
bcftools_annotate2 --> r_extract_snv[extract-VCF-features.R]:::R
@@ -79,7 +79,7 @@ flowchart TD
joinpaths ---> r_predict_stability

subgraph Predict Stability ["`&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**Predict Stability**`"]
-r_predict_stability[predict-liftover-stability.R]:::R
+r_predict_stability[predict-variant-stability.R]:::R
--> bcftools_annotate3["`bcftools annotate*Stability*`"]:::bcftools

rf_model([rf_model]):::input .-> r_predict_stability
6 changes: 4 additions & 2 deletions main.nf
@@ -42,6 +42,8 @@ log.info """\
chain_file: ${params.chain_file}
repeat_bed: ${params.repeat_bed}
header_contigs: ${params.getOrDefault('header_contigs', null)}
funcotator_data:
data_source: ${params.funcotator_data.data_source}
src_reference_id: ${params.funcotator_data.src_reference_id}
@@ -149,10 +151,10 @@ workflow {
// Take the SV branch
workflow_extract_sv_annotations(
validated_vcf_tuple,
input_ch_src_sequence,
Channel.value(params.header_contigs),
Channel.value(params.gnomad_rds),
-Channel.value(params.chain_file),
-Channel.value(params.variant_caller)
+Channel.value(params.chain_file)
)

workflow_extract_sv_annotations.out.liftover_vcf.set { liftover_vcf }
12 changes: 8 additions & 4 deletions module/predict_stability.nf
@@ -18,12 +18,17 @@ process predict_stability_StableLift {
tuple val(sample_id), path("stability.tsv"), emit: stability_tsv

script:
spec_arg = (params.getOrDefault('target_specificity', null) != null) ? "--specificity \"${params.get('target_specificity')}\"" : ""
thresh_arg = (params.getOrDefault('target_threshold', null) != null) ? "--threshold \"${params.get('target_threshold')}\"" : ""

"""
-Rscript "${moduleDir}/scripts/predict-liftover-stability.R" \
+Rscript "${moduleDir}/scripts/predict-variant-stability.R" \
+--variant-caller "${variant_caller}" \
--features-dt "${features_rds}" \
--rf-model "${rf_model}" \
---variant-caller "${variant_caller}" \
---output-tsv "stability.tsv"
+--output-tsv "stability.tsv" \
+${spec_arg} \
+${thresh_arg}
"""

stub:
@@ -43,7 +48,6 @@ process run_apply_stability_annotations {
input:
tuple val(sample_id),
path(annotated_vcf, stageAs: 'inputs/*'),
// FIXME Should there be an annotated_vcf_tbi?
path(stability_tsv, stageAs: 'inputs/*'),
path(stability_tsv_tbi, stageAs: 'inputs/*')
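
The way `predict_stability_StableLift` appends `--specificity` and `--threshold` only when the corresponding parameter is set can be sketched in Python. This is a rough illustration rather than the pipeline's own code: `build_predict_command` and its `script` default are hypothetical, while the flag names are taken from the process script in this diff.

```python
def build_predict_command(features_rds, rf_model, variant_caller,
                          target_specificity=None, target_threshold=None,
                          script="scripts/predict-variant-stability.R"):
    """Assemble the Rscript argument list, adding optional flags last.

    Hypothetical helper; `script` is a placeholder path.
    """
    command = [
        "Rscript", script,
        "--variant-caller", variant_caller,
        "--features-dt", features_rds,
        "--rf-model", rf_model,
        "--output-tsv", "stability.tsv",
    ]
    # Test against None, not truthiness, so an explicit 0 is still passed,
    # mirroring the `!= null` checks in the Nextflow script block.
    if target_specificity is not None:
        command += ["--specificity", str(target_specificity)]
    if target_threshold is not None:
        command += ["--threshold", str(target_threshold)]
    return command
```

Building a list of arguments instead of one interpolated string also avoids a dangling trailing backslash when both optional values are absent.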
