diff --git a/.github/workflows/render-puml.yaml b/.github/workflows/render-puml.yaml new file mode 100644 index 00000000..23e86213 --- /dev/null +++ b/.github/workflows/render-puml.yaml @@ -0,0 +1,20 @@ +--- +name: PlantUML Generation + +on: + push: + paths: + - '**.puml' + workflow_dispatch: + +jobs: + plantuml: + runs-on: ubuntu-latest + + steps: + - name: Generate PUML diagrams + uses: uclahs-cds/tool-PlantUML-action@v1.0.0 + with: + github-token: ${{ secrets.GITHUB_TOKEN }} + ghcr-username: ${{ github.actor }} + ghcr-password: ${{ secrets.GITHUB_TOKEN }} diff --git a/CHANGELOG.md b/CHANGELOG.md index 1588b318..ae327323 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -11,6 +11,8 @@ This project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.htm ### [Added] - Sort BAMs before merging for consistent order of output BAM @PG header lines - Add NFTest case +- Add new flow diagram to README +- Add additional details to Pipeline Steps section of README --- diff --git a/README.md b/README.md index c5332ab9..c58a74aa 100644 --- a/README.md +++ b/README.md @@ -22,7 +22,7 @@ This pipeline takes BAMs and corresponding indices from [pipeline-align-DNA](htt 1. Update the params section of the .config file ([Example config](config/template.config)). -2. Update the YAML. +2. Update the YAML ([Template YAMLs](input/)). 3. Download the submission script (submit_nextflow_pipeline.py) from [here](https://github.com/uclahs-cds/tool-submit-nf), and submit your pipeline below. @@ -42,41 +42,44 @@ python submit_nextflow_pipeline.py \ ## Flow Diagram -![call-gSNP flow diagram](call-gSNP-DSL2.png) +![recalibrate-BAM flow diagram](docs/recalibrate-bam-flow.svg) --- ## Pipeline Steps -### 1. Split genome or target intervals into sub-intervals (either scattered or by chromosome) for parallelization -Use the input target intervals or the whole genome intervals and split them into sub-intervals for parallel processing. +### 1. 
Split genome into sub-intervals for parallelization +Split the reference genome into [intervals for parallel processing](https://gatk.broadinstitute.org/hc/en-us/articles/4414602449435-SplitIntervals). If `params.parallelize_by_chromosome` is set then the genome will be split by chromosome, otherwise it will be split into up to `params.scatter_count` intervals. -### 2. Realign Indels -Generate indel realignment targets and realign indels. +### 2. Realign indels +Generate [indel realignment targets](https://rawcdn.githack.com/broadinstitute/gatk-docs/8fcf44bb0686f2f7d442aade181ff6ed508a97de/gatk3-tooldocs/3.7-0/org_broadinstitute_gatk_tools_walkers_indels_RealignerTargetCreator.html) and [realign indels](https://rawcdn.githack.com/broadinstitute/gatk-docs/8fcf44bb0686f2f7d442aade181ff6ed508a97de/gatk3-tooldocs/3.7-0/org_broadinstitute_gatk_tools_walkers_indels_IndelRealigner.html) per interval. ### 3. Generate BQSR (Base Quality Score Recalibration) -Assess how sequencing errors correlate with four covariates (assigned quality score, read group the read belongs, machine cycle producing this base, and current and immediately upstream base), and output base quality score recalibration table. +Assess how sequencing errors correlate with four covariates (assigned quality score, read group, machine cycle producing this base, and current and immediately upstream base) and output [base quality score recalibration table](https://gatk.broadinstitute.org/hc/en-us/articles/4414594385563-BaseRecalibrator). ### 4. Apply BQSR per split interval in parallel -Apply the recalibration per sample and reheader output as necessary. +[Apply the base quality score recalibration](https://gatk.broadinstitute.org/hc/en-us/articles/4414594339611-ApplyBQSR) to each interval and reheader output as necessary. ### 5. Merge interval-level BAMs -Merge BAMs from each interval to generate whole sample BAM. 
+[Merge BAMs](https://gatk.broadinstitute.org/hc/en-us/articles/4414594413083-MergeSamFiles-Picard-) from each interval to generate a whole sample BAM. -### 6. Deduplicate BAM -In the case of scattered intervals, run a deduplication process to remove reads duplicated dur to overlap on interval splitting sites. +#### 5a. Deduplicate BAM +If `params.parallelize_by_chromosome` is not set, run a deduplication process to remove reads duplicated due to overlap on interval splitting sites. + +### 6. Index BAM file +Generate a [BAI index file](http://www.htslib.org/doc/1.17/samtools-index.html) for fast random access of the whole sample BAM. ### 7. Get pileup summaries -Summarizes counts of reads that support reference, alternate and other alleles for given sites. Results will be used in the next Calculate Contamination step. +Tabulate [pileup metrics](https://gatk.broadinstitute.org/hc/en-us/articles/4414586785947-GetPileupSummaries) for inferring contamination. Summarize counts of reads that support reference, alternate and other alleles for given sites. ### 8. Calculate contamination -Calculates the fraction of reads coming from cross-sample contamination, given results from Step 8. Generates a tumor segmentation file. +Calculates the fraction of reads coming from [cross-sample contamination](https://gatk.broadinstitute.org/hc/en-us/articles/4414586751771-CalculateContamination), given results from Step 7. For paired samples, generates an additional output table containing segmentation of the tumor by minor allele fraction. ### 9. DepthOfCoverage -Calculate depth of coverage using the whole sample BAM from step 7. +If `params.is_DOC_run` is set, generate [coverage summary information](https://gatk.broadinstitute.org/hc/en-us/articles/4414586842523-DepthOfCoverage-BETA-) for the whole sample BAM from step 5, partitioned by sample, read group, and library. ### 10. Generate sha512 checksum -Generate sha512 checksum for final BAMs. 
+Generate sha512 checksum for final BAM and BAI files. --- diff --git a/docs/recalibrate-bam-flow.puml b/docs/recalibrate-bam-flow.puml new file mode 100644 index 00000000..3b877f32 --- /dev/null +++ b/docs/recalibrate-bam-flow.puml @@ -0,0 +1,114 @@ +@startuml + +skinparam SwimlaneTitleFontStyle bold + + +|s| Parallelized by sample +|i| Parallelized by interval + + +|s| + +start + +partition "**Unparallelized Setup**\nThis block is run once\nregardless of sample count" { + if (Parallelize by\nchromosome?) is (Yes) then + :==run_SplitIntervals_GATK + ---- + Split reference genome into + interval lists by chromosome + (1-22, X, Y, M, nonassembled); + else (No) + :==run_SplitIntervals_GATK + ---- + Split reference genome into + **scatter_count** interval lists; + endif +} + +:==run_validate_PipeVal +---- +Validate the input BAM and index file; + + +|i| + +:==run_RealignerTargetCreator_GATK +---- +Split input BAMs by interval and identify +potentially misaligned sub-intervals to +target across all input samples; + +:==run_IndelRealigner_GATK +---- +Realign indels across all input +samples simultaneously; + +|s| + +:==run_BaseRecalibrator_GATK +---- +Generate base quality score recalibration +(BQSR) table based on read group, reported +quality score, machine cycle, and nucleotide +context; + +|i| + +:==run_ApplyBQSR_GATK +---- +Apply the recalibration to each input +sample sequentially; + +|s| + +:==run_MergeSamFiles_Picard +---- +Merge interval BAMs into recalibrated BAM; + +if (Parallelize by\nchromosome?) 
is (No) then + :==deduplicate_records_SAMtools + ---- + Remove duplicate reads due to + overlap on interval splitting sites; +else (Yes) +endif + +:==run_index_SAMtools +---- +Create index file for recalibrated BAM; + +:==calculate_sha512 +---- +Generate sha512 checksum for +recalibrated BAM and index file; + +split + :==run_GetPileupSummaries_GATK + ---- + Summarize counts of reads that support + reference, alternate, and other alleles + for given sites; + + :==run_CalculateContamination_GATK + ---- + Calculate the fraction of reads coming + from cross-sample contamination. + + If the input is a paired sample, run + again in matched normal mode; +split again + if (Compute depth\nof coverage?) is (Yes) then + :==run_DepthOfCoverage_GATK + ---- + Assess sequence coverage by a wide array + of metrics, partitioned by sample, read + group, and library; + endif +end split + +stop + + +@enduml + diff --git a/docs/recalibrate-bam-flow.svg b/docs/recalibrate-bam-flow.svg new file mode 100644 index 00000000..f35b534a --- /dev/null +++ b/docs/recalibrate-bam-flow.svg @@ -0,0 +1,123 @@ +Unparallelized Setup This block is run once regardless of sample count Parallelize by chromosome? Yes No run_SplitIntervals_GATK Split reference genome into interval lists by chromosome (1-22, X, Y, M, nonassembled) run_SplitIntervals_GATK Split reference genome into scatter_count interval lists run_validate_PipeVal Validate the input BAM and index file run_BaseRecalibrator_GATK Generate base quality score recalibration (BQSR) table based on read group, reported quality score, machine cycle, and nucleotide context run_MergeSamFiles_Picard Merge interval BAMs into recalibrated BAM deduplicate_records_SAMtools Remove duplicate reads due to overlap on interval splitting sites No Parallelize by chromosome? Yes run_index_SAMtools Create index file for recalibrated BAM calculate_sha512 Generate sha512 checksum for recalibrated BAM and index file run_GetPileupSummaries_GATK Summarize counts of reads that support reference, alternate, and 
other alleles for given sites run_CalculateContamination_GATK Calculate the fraction of reads coming from cross-sample contamination. If the input is a paired sample, run again in matched normal mode run_DepthOfCoverage_GATK Assess sequence coverage by a wide array of metrics, partitioned by sample, read group, and library Yes Compute depth of coverage? run_RealignerTargetCreator_GATK Split input BAMs by interval and identify potentially misaligned sub-intervals to target across all input samples run_IndelRealigner_GATK Realign indels across all input samples simultaneously run_ApplyBQSR_GATK Apply the recalibration to each input sample sequentially Parallelized by sample Parallelized by interval \ No newline at end of file diff --git a/input/recalibrate-BAM-paired-input.yaml b/input/recalibrate-BAM-paired-input.yaml index 0e2743ab..cafe288b 100644 --- a/input/recalibrate-BAM-paired-input.yaml +++ b/input/recalibrate-BAM-paired-input.yaml @@ -5,4 +5,4 @@ input: normal: - "/absolute/path/to/BAM" tumor: - - "/abosolute/path/to/BAM" + - "/absolute/path/to/BAM"