Organizational and cleanup tweaks #82

Merged
merged 7 commits into from
Jun 28, 2024
5 changes: 5 additions & 0 deletions CHANGELOG.md
@@ -9,6 +9,11 @@ This project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.htm

## [Unreleased]

### [Changed]
- Use `run_validate_PipeVal_with_metadata` to gate on validation
- Move index/dictionary file discovery to configuration stage
- Include all parameters from `default.config` in the README

---

## [1.0.1] - 2024-05-29
27 changes: 23 additions & 4 deletions README.md
@@ -150,12 +150,31 @@ For normal-only or tumour-only samples, exclude the fields for the other state.
| `bundle_v0_dbsnp138_vcf_gz` | Yes | path | Absolute path to dbsnp file, e.g., `/hot/ref/tool-specific-input/GATK/GRCh38/resources_broad_hg38_v0_Homo_sapiens_assembly38.dbsnp138.vcf.gz` |
| `bundle_contest_hapmap_3p3_vcf_gz` | Yes | path | Absolute path to HapMap 3.3 biallelic sites file, e.g., `/hot/ref/tool-specific-input/GATK/GRCh38/Biallelic/hapmap_3.3.hg38.BIALLELIC.PASS.2021-09-01.vcf.gz` |
| `work_dir` | optional | path | Path of working directory for Nextflow. When included in the sample config file, Nextflow intermediate files and logs will be saved to this directory. With ucla_cds, the default is `/scratch` and should only be changed for testing/development. Changing this directory to `/hot` or `/tmp` can lead to high server latency and potential disk space limitations, respectively. |
| `docker_container_registry` | optional | string | Registry containing tool Docker images. Default: `ghcr.io/uclahs-cds` |
| `metapipeline_delete_input_bams` | optional | boolean | Set to true to delete the input BAM files once the initial processing step is complete. **WARNING**: This option should NOT be used for individual runs of recalibrate-BAM; it's intended for metapipeline-DNA to optimize disk space usage by removing files that are no longer needed from the `workDir`. |
| `metapipeline_final_output_dir` | optional | string | Absolute path for the final output directory of metapipeline-DNA that's expected to contain the output BAM from align-DNA. **WARNING**: This option should not be used for individual runs of recalibrate-BAM; it's intended for metapipeline-DNA to optimize disk space usage. |
| `metapipeline_states_to_delete` | optional | list | List of states for which to delete input BAMs. **WARNING**: This option should not be used for individual runs of recalibrate-BAM; it's intended for metapipeline-DNA to optimize disk space usage. |
| `base_resource_update` | optional | namespace | Namespace of parameters to update base resource allocations in the pipeline. Usage and structure are detailed in `template.config` and below. |


The below parameters have default values defined in [`default.config`](./config/default.config) and generally do not need to be set by the user.

| Optional Parameter | Type | Description |
| :------------------| :----| :-----------|
| `metapipeline_delete_input_bams` | boolean | Set to true to delete the input BAM files once the initial processing step is complete. **WARNING**: This option should NOT be used for individual runs of recalibrate-BAM; it's intended for metapipeline-DNA to optimize disk space usage by removing files that are no longer needed from the `workDir`. |
| `metapipeline_final_output_dir` | string | Absolute path for the final output directory of metapipeline-DNA that's expected to contain the output BAM from align-DNA. **WARNING**: This option should not be used for individual runs of recalibrate-BAM; it's intended for metapipeline-DNA to optimize disk space usage. |
| `metapipeline_states_to_delete` | list | List of states for which to delete input BAMs. **WARNING**: This option should not be used for individual runs of recalibrate-BAM; it's intended for metapipeline-DNA to optimize disk space usage. |
| `cache_intermediate_pipeline_steps` | boolean | Enable process caching from Nextflow. |
| `ucla_cds` | boolean | Override default memory and CPU values with cluster-specific configs. |
| `docker_container_registry` | string | Registry containing tool Docker images. |
| `docker_image_gatk`, `gatk_version` | string | Docker image name and version for GATK. |
| `docker_image_pipeval`, `pipeval_version` | string | Docker image name and version for PipeVal. |
| `docker_image_gatk3`, `gatk3_version` | string | Docker image name and version for GATK3. |
| `docker_image_picard`, `picard_version` | string | Docker image name and version for Picard. |
| `docker_image_samtools`, `samtools_version` | string | Docker image name and version for SAMtools. |
| `reference_fasta_fai`, `reference_fasta_dict` | path | Index and dictionary files for the required input. Default: Matching `.fai` and `.dict` files in the same directory. |
| `bundle_v0_dbsnp138_vcf_gz_tbi` | path | Index file for the required input. Default: Matching `.tbi` file in the same directory. |
| `bundle_known_indels_vcf_gz_tbi` | path | Index file for the required input. Default: Matching `.tbi` file in the same directory. |
| `bundle_contest_hapmap_3p3_vcf_gz_tbi`| path | Index file for the required input. Default: Matching `.tbi` file in the same directory. |
| `bundle_mills_and_1000g_gold_standard_indels_vcf_gz_tbi` | path | Index file for the required input. Default: Matching `.tbi` file in the same directory. |


#### Base resource allocation updaters
To update the base resource (cpus or memory) allocations for processes, use the following structure and add the necessary parts. The default allocations can be found in the [node-specific config files](./config/).
```Nextflow
10 changes: 10 additions & 0 deletions config/default.config
@@ -1,6 +1,8 @@
import nextflow.util.SysHelper
import nextflow.Nextflow

// Default inputs/parameters of the pipeline

params {
min_cpus = 1
min_memory = 1.MB
@@ -30,6 +32,14 @@ params {
docker_image_samtools = "${-> params.docker_container_registry}/samtools:${params.samtools_version}"

gatk_ir_compression = 1

// These parameters are inferred from the input files. The user can override them in the config file if required.
reference_fasta_fai = "${-> params.reference_fasta}.fai"
reference_fasta_dict = "${-> Nextflow.file(params.reference_fasta).resolveSibling(Nextflow.file(params.reference_fasta).getBaseName() + '.dict')}"
bundle_known_indels_vcf_gz_tbi = "${-> params.bundle_known_indels_vcf_gz}.tbi"
bundle_contest_hapmap_3p3_vcf_gz_tbi = "${-> params.bundle_contest_hapmap_3p3_vcf_gz}.tbi"
bundle_mills_and_1000g_gold_standard_indels_vcf_gz_tbi = "${-> params.bundle_mills_and_1000g_gold_standard_indels_vcf_gz}.tbi"
bundle_v0_dbsnp138_vcf_gz_tbi = "${-> params.bundle_v0_dbsnp138_vcf_gz}.tbi"
}

// Process specific scope
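The inferred defaults added to `config/default.config` follow a simple naming rule: `.fai` and `.tbi` are appended to the full file name, while `.dict` replaces the last extension of the reference FASTA. The rule can be sketched in plain Python as below. This is an illustration only, not the pipeline's code (which uses lazy Groovy strings); `infer_index_paths` and `VCF_KEYS` are hypothetical names introduced for the example.

```python
from pathlib import PurePosixPath

# Parameter names taken from the diff; the tuple itself is only for this sketch.
VCF_KEYS = (
    "bundle_known_indels_vcf_gz",
    "bundle_contest_hapmap_3p3_vcf_gz",
    "bundle_mills_and_1000g_gold_standard_indels_vcf_gz",
    "bundle_v0_dbsnp138_vcf_gz",
)


def infer_index_paths(params: dict) -> dict:
    """Return a copy of params with index/dictionary paths filled in.

    Values the user already set are left untouched, mirroring the
    "user can override" note in the config comment.
    """
    inferred = dict(params)
    fasta = PurePosixPath(params["reference_fasta"])

    # The FASTA index is the full file name plus ".fai".
    inferred.setdefault("reference_fasta_fai", f"{fasta}.fai")
    # The sequence dictionary replaces the last extension with ".dict".
    inferred.setdefault("reference_fasta_dict", str(fasta.with_suffix(".dict")))

    # Each VCF index is the full file name plus ".tbi".
    for key in VCF_KEYS:
        if key in params:
            inferred.setdefault(f"{key}_tbi", f"{params[key]}.tbi")
    return inferred
```

For example, a `reference_fasta` of `/ref/genome.fa` would yield `/ref/genome.fa.fai` and `/ref/genome.dict`, unless either key is already present in the input.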
30 changes: 30 additions & 0 deletions config/schema.yaml
@@ -56,26 +56,56 @@ reference_fasta:
mode: 'r'
required: true
help: 'Absolute path to reference genome fasta'
reference_fasta_fai:
type: 'Path'
mode: 'r'
required: true
help: 'Absolute path to reference genome fasta index file'
reference_fasta_dict:
type: 'Path'
mode: 'r'
required: true
help: 'Absolute path to reference genome fasta dictionary'
bundle_mills_and_1000g_gold_standard_indels_vcf_gz:
type: 'Path'
mode: 'r'
required: true
help: 'Absolute path to Mills and 1000g gold standard INDELs VCF'
bundle_mills_and_1000g_gold_standard_indels_vcf_gz_tbi:
type: 'Path'
mode: 'r'
required: true
help: 'Absolute path to Mills and 1000g gold standard INDELs VCF index file'
bundle_known_indels_vcf_gz:
type: 'Path'
mode: 'r'
required: true
help: 'Absolute path to known INDELs VCF'
bundle_known_indels_vcf_gz_tbi:
type: 'Path'
mode: 'r'
required: true
help: 'Absolute path to known INDELs VCF index file'
bundle_v0_dbsnp138_vcf_gz:
type: 'Path'
mode: 'r'
required: true
help: 'Absolute path to v0 dbSNP 138 VCF'
bundle_v0_dbsnp138_vcf_gz_tbi:
type: 'Path'
mode: 'r'
required: true
help: 'Absolute path to v0 dbSNP 138 VCF index file'
bundle_contest_hapmap_3p3_vcf_gz:
type: 'Path'
mode: 'r'
required: true
help: 'Absolute path to ContEst HapMap 3p3 VCF'
bundle_contest_hapmap_3p3_vcf_gz_tbi:
type: 'Path'
mode: 'r'
required: true
help: 'Absolute path to ContEst HapMap 3p3 VCF index file'
metapipeline_delete_input_bams:
type: 'Bool'
required: true
79 changes: 39 additions & 40 deletions main.nf
@@ -25,7 +25,7 @@ Current Configuration:
intervals: ${(params.is_targeted) ?: 'WGS'}
Recalibration tables: ${params.input.recalibration_table}

- output:
- output:
output: ${params.output_dir}
output_dir_base: ${params.output_dir_base}
log_output_dir: ${params.log_output_dir}
@@ -43,7 +43,7 @@ Starting workflow...
------------------------------------
"""

include { run_validate_PipeVal } from './external/pipeline-Nextflow-module/modules/PipeVal/validate/main.nf' addParams(
include { run_validate_PipeVal_with_metadata } from './external/pipeline-Nextflow-module/modules/PipeVal/validate/main.nf' addParams(
options: [
docker_image_version: params.pipeval_version,
main_process: "./" //Save logs in <log_dir>/process-log/run_validate_PipeVal
@@ -82,51 +82,46 @@ workflow {
/**
* Input channel processing
*/
Channel.from(params.samples_to_process)
.map{ sample -> ['index': indexFile(sample.path)] + sample }
.set{ input_ch_samples_with_index }

input_ch_samples_with_index
.map{ sample -> [sample.path, sample.index] }
.flatten()
.set{ input_ch_validate }

input_ch_samples_with_index
.map{ sample -> sample.id }
.flatten()
.set{ input_ch_sample_ids }

input_ch_samples_with_index
.reduce( ['bams': [], 'indices': []] ){ a, b ->
a.bams.add(b.path);
a.indices.add(b.index);
return a
}
.set{ input_ch_collected_files }


/**
* Input validation
*/
run_validate_PipeVal(input_ch_validate)
Channel.from(params.samples_to_process)
.flatMap { sample ->
def all_metadata = sample.findAll { it.key != "path" }
return [
[sample.path, [all_metadata, "path"]],
[indexFile(sample.path), [[id: sample.id], "index"]]
]
} | run_validate_PipeVal_with_metadata

run_validate_PipeVal.out.validation_result
run_validate_PipeVal_with_metadata.out.validation_result
.collectFile(
name: 'input_validation.txt',
storeDir: "${params.output_dir_base}/validation"
)

run_validate_PipeVal_with_metadata.out.validated_file
.map { filename, metadata -> [metadata[0].id, metadata[0] + [(metadata[1]): filename]] }
.groupTuple()
.map { it[1].inject([:]) { result, i -> result + i } }
.set { validated_samples_with_index }

// The elements of validated_samples_with_index are the same as
// params.samples_to_process, with the following changes:
// * sample.path is the validated BAM file
// * sample.index is the validated BAI file (new key)

/**
* Interval extraction and splitting
*/
extract_GenomeIntervals("${file(params.reference_fasta).parent}/${file(params.reference_fasta).baseName}.dict")
extract_GenomeIntervals(params.reference_fasta_dict)

run_SplitIntervals_GATK(
extract_GenomeIntervals.out.genomic_intervals,
params.reference_fasta,
"${params.reference_fasta}.fai",
"${file(params.reference_fasta).parent}/${file(params.reference_fasta).baseName}.dict"
params.reference_fasta_fai,
params.reference_fasta_dict
)

run_SplitIntervals_GATK.out.interval_list
@@ -143,18 +138,22 @@ workflow {
/**
* Indel realignment
*/
input_ch_collected_files
validated_samples_with_index
.reduce( ['bams': [], 'indices': []] ){ a, b ->
a.bams.add(b.path);
a.indices.add(b.index);
return a
}
.combine(input_ch_intervals)
.map{ it -> it[0] + it[1] }
.set{ input_ch_indel_realignment }

realign_indels(input_ch_indel_realignment)


/**
* Input file deletion
*/
input_ch_samples_with_index
validated_samples_with_index
.filter{ params.metapipeline_states_to_delete.contains(it.sample_type) }
.map{ sample -> sample.path }
.flatten()
@@ -179,7 +178,7 @@ workflow {
*/
recalibrate_base(
realign_indels.out.output_ch_realign_indels,
input_ch_sample_ids
validated_samples_with_index.map{ sample -> sample.id }.flatten()
)


@@ -206,21 +205,21 @@ workflow {

run_GetPileupSummaries_GATK(
params.reference_fasta,
"${params.reference_fasta}.fai",
"${file(params.reference_fasta).parent}/${file(params.reference_fasta).baseName}.dict",
params.reference_fasta_fai,
params.reference_fasta_dict,
params.bundle_contest_hapmap_3p3_vcf_gz,
"${params.bundle_contest_hapmap_3p3_vcf_gz}.tbi",
params.bundle_contest_hapmap_3p3_vcf_gz_tbi,
input_ch_summary_intervals,
input_ch_merged_bams
)

input_ch_samples_with_index
validated_samples_with_index
.filter{ it.sample_type == 'normal' }
.map{ it -> [sanitize_string(it.id)] }
.join(run_GetPileupSummaries_GATK.out.pileupsummaries)
.set{ normal_pileupsummaries }

input_ch_samples_with_index
validated_samples_with_index
.filter{ it.sample_type == 'tumor' }
.map{ it -> [sanitize_string(it.id)] }
.join(run_GetPileupSummaries_GATK.out.pileupsummaries)
@@ -254,8 +253,8 @@ workflow {

run_DepthOfCoverage_GATK(
params.reference_fasta,
"${params.reference_fasta}.fai",
"${file(params.reference_fasta).parent}/${file(params.reference_fasta).baseName}.dict",
params.reference_fasta_fai,
params.reference_fasta_dict,
input_ch_summary_intervals,
input_ch_merged_bams
)
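The reworked input handling in `main.nf` sends every BAM and its index through validation as a separate item tagged with metadata, then merges the validated files back into one record per sample. A rough plain-Python sketch of that data flow follows. It is an illustration under assumptions, not the pipeline's implementation: `index_file` is a hypothetical stand-in for the pipeline's `indexFile` helper (assumed here to return `<bam>.bai`), validation itself is skipped, and the other function names are invented for the example.

```python
from collections import defaultdict


def index_file(bam_path: str) -> str:
    """Hypothetical stand-in for indexFile; assumes the index is `<bam>.bai`."""
    return f"{bam_path}.bai"


def expand_for_validation(samples):
    """Emit one (file, [metadata, key]) pair per file, like the flatMap step."""
    pairs = []
    for sample in samples:
        # All sample fields except the BAM path travel along as metadata.
        metadata = {k: v for k, v in sample.items() if k != "path"}
        pairs.append((sample["path"], [metadata, "path"]))
        # The index only carries the sample id, enough to regroup it later.
        pairs.append((index_file(sample["path"]), [{"id": sample["id"]}, "index"]))
    return pairs


def regroup(validated):
    """Merge validated files into one record per sample id (groupTuple + inject)."""
    grouped = defaultdict(dict)
    for filename, (metadata, key) in validated:
        record = grouped[metadata["id"]]
        record.update(metadata)
        record[key] = filename
    return list(grouped.values())


def collect_files(records):
    """Gather all BAMs and indices into two lists, like the reduce step."""
    return {
        "bams": [r["path"] for r in records],
        "indices": [r["index"] for r in records],
    }
```

Each regrouped record carries the original sample fields plus a new `index` key, which is what lets the later steps filter on `sample_type` and read `id` from the same channel rather than from separate pre-validation channels.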
24 changes: 12 additions & 12 deletions module/base-recalibration.nf
@@ -15,8 +15,8 @@ include {
reference_fasta: path to reference genome fasta file
reference_fasta_fai: path to index for reference fasta
reference_fasta_dict: path to dictionary for reference fasta
bundle_mills_and_1000g_gold_standards_vcf_gz: path to standard Mills and 1000 genomes variants
bundle_mills_and_1000g_gold_standards_vcf_gz_tbi: path to index file for Mills and 1000g variants
bundle_mills_and_1000g_gold_standard_indels_vcf_gz: path to standard Mills and 1000 genomes variants
bundle_mills_and_1000g_gold_standard_indels_vcf_gz_tbi: path to index file for Mills and 1000g variants
bundle_known_indels_vcf_gz: path to set of known indels
bundle_known_indels_vcf_gz_tbi: path to index of known indels VCF
bundle_v0_dbsnp138_vcf_gz: path to dbSNP variants
@@ -46,8 +46,8 @@ process run_BaseRecalibrator_GATK {
path(reference_fasta)
path(reference_fasta_fai)
path(reference_fasta_dict)
path(bundle_mills_and_1000g_gold_standards_vcf_gz)
path(bundle_mills_and_1000g_gold_standards_vcf_gz_tbi)
path(bundle_mills_and_1000g_gold_standard_indels_vcf_gz)
path(bundle_mills_and_1000g_gold_standard_indels_vcf_gz_tbi)
path(bundle_known_indels_vcf_gz)
path(bundle_known_indels_vcf_gz_tbi)
path(bundle_v0_dbsnp138_vcf_gz)
@@ -71,7 +71,7 @@ process run_BaseRecalibrator_GATK {
${all_ir_bams} \
--reference ${reference_fasta} \
--verbosity INFO \
--known-sites ${bundle_mills_and_1000g_gold_standards_vcf_gz} \
--known-sites ${bundle_mills_and_1000g_gold_standard_indels_vcf_gz} \
--known-sites ${bundle_known_indels_vcf_gz} \
--known-sites ${bundle_v0_dbsnp138_vcf_gz} \
--output ${sample_id}_recalibration_table.grp \
@@ -182,14 +182,14 @@ workflow recalibrate_base {

run_BaseRecalibrator_GATK(
params.reference_fasta,
"${params.reference_fasta}.fai",
"${file(params.reference_fasta).parent}/${file(params.reference_fasta).baseName}.dict",
params.reference_fasta_fai,
params.reference_fasta_dict,
params.bundle_mills_and_1000g_gold_standard_indels_vcf_gz,
"${params.bundle_mills_and_1000g_gold_standard_indels_vcf_gz}.tbi",
params.bundle_mills_and_1000g_gold_standard_indels_vcf_gz_tbi,
params.bundle_known_indels_vcf_gz,
"${params.bundle_known_indels_vcf_gz}.tbi",
params.bundle_known_indels_vcf_gz_tbi,
params.bundle_v0_dbsnp138_vcf_gz,
"${params.bundle_v0_dbsnp138_vcf_gz}.tbi",
params.bundle_v0_dbsnp138_vcf_gz_tbi,
base_recalibrator_intervals,
params.input.recalibration_table,
input_ch_base_recalibrator
@@ -220,8 +220,8 @@ workflow recalibrate_base {

run_ApplyBQSR_GATK(
params.reference_fasta,
"${params.reference_fasta}.fai",
"${file(params.reference_fasta).parent}/${file(params.reference_fasta).baseName}.dict",
params.reference_fasta_fai,
params.reference_fasta_dict,
input_ch_apply_bqsr
)
