Absorb refactoring for flexible interval creation (#8)
* Absorbs latest nfcore and cgpu fork changes (#6)

* nf-core bump-version . 2.5.1dev

* Remove PublishDirMode from test profile (nf-core#40)

* remove PublishDirMode from test profile

* update all tools

* minor updates + typo fix (nf-core#42)

* minor updates + typo fix

* fix VEP automated builds

* add location for abstracts

* remove reference to old buil.nf script

* update CHANGELOG

* Update docs/reference.md

Co-Authored-By: Szilveszter Juhos <[email protected]>

* Update docs/reference.md

* Update docs/reference.md

* Workflow (nf-core#45)

* Add workflow figure
* Include workflow figure in readme
* Update CHANGELOG

* add minimal genome and update some processes

* Start adding mouse data

* Update iGenomes.config

* Add tbi

* Drop ASCAT files

* apply changes from 2.5.1 to dev

* bump version to 2.5.2dev

* update CHANGELOG

* update tiddit to 2.8.1

* Use Version 98 of Mouse

* Add for grcm38

* Adjust mus musculus DB

* Annotation

* add smallerGRCh37 and minimalGRCh37

* use bwa aln when no knownIndels, otherwise use bwa mem; noIntervals currently being added everywhere

* don't use bwa aln

* add automatic generation of intervals file based on fastaFai file

* Adjusted genomes.config

* Should be list

* Set genomes_base to something

* Revert back

* enable CreateIntervalsBed for intervals_list from GATK Bundle

* Add proper calling list

* Use the bed file

* remove temp file

* update CHANGELOG

* Fix genome fa.fai

* Add in mgpv5

* Try short track

* Add in species handling

* Document new parameter species

* Add changelog

* Fix iGenomes stuff

* Add in note about GRCm38

* Fix small fai index issue

* Adjusted quotes in genomes.config

* And the same for igenomes

* Better folder structure for Mouse Genome Project data

* Minor adjustment to proper paths

* Apply suggestions from code review

Add changes by Maxime

Co-Authored-By: Maxime Garcia <[email protected]>

* Remove space

* Move it up

* Update CHANGELOG.md

Co-Authored-By: Maxime Garcia <[email protected]>

* add minimal tests

* fix processes with no intervals

* add comments

* params noIntervals -> no_intervals

* sort genomes + add news

* code polishing

* update CHANGELOG

* add split_fastq params to split the FASTQ files with the splitFastq() Nextflow method

* add tests

* temporarily remove TIDDIT tests

* add sentieon for bwa mem

* disable docker and singularity

* disable container

* add fastaFai for bwamem

* remove module samtools from label sentieon

* fix output from bwa mem

* fix output channel BamMapped from MapReads

* set params.sentieon to null by default

* add SentieonDedup process

* fix typo

* add fastaFai to SentieonDedup process

* fix bam indexing

* fix bam indexing

* fix bam indexing

* add SentieonBQSR

* add label sentieon to SentieonBQSR

* fix metrics output for SentieonBQSR

* increase cpus for Sentieon BQSR

* remove indexing

* add index for dedup

* bwa mem sentieon specific process

* TSV file for sentieon Dedup

* TSV for every step for Sentieon

* recal -> deduped

* fix input for TSV recalibrate

* enable restart from recalibrate with TSV with Sentieon

* fix sentieon variant calling from mapping and recalibrate

* code polishing

* add dump tag for input sample

* add dump tag for bamDedupedSentieon

* code polishing

* code polishing

* code polishing

* code polishing

* remove when statement

* fix typo

* remove tsv for recalibrate with sentieon

* add dnascope dnaseq

* fix dnascope

* add TNscope process

* fix TNscope output

* add pon for TNscope

* add params.pon_index

* add annotation for sentieon DNAseq, DNAscope, TNscope

* add default pon_index

* typo

* fix typo

* improve automatic annotation

* typo

* typo

* add condition on when statement on TNscope

* clean up

* code polish

* add CODEOWNERS file

* add when statement on all sentieon processes with params.sentieon

* remove munin sentieon specific configs from config

* load sarek specific config

* update path to specific config

* update docs

* remove Freebayes

* update workflow image

* remove old logo

* fix tests

* add docs about params split_fastq

* update CHANGELOG

* improve docs

* more tests but less NF versions

* actually run the tests

* typo

* simplify configs

* add test for mpileup

* go crazy with tests

* fix tests

* include test.config

* restore FreeBayes

* remove label memory_max from BaseRecalibrator process to fix nf-core#72

* add --skipQC all and --tools Manta,mpileup,Strelka to minimal genome tests

* update Nextflow version

* update Nextflow version

* update Nextflow

* add --step annotation to profile

* don't need to specify step here

* move params initialization

* add docs

* fix markdownlint

* more complete docs + sort genomes

* improve tests

* update docs

* update CHANGELOG

* improve script

* fix tests

* better comments

* better comments

* fix error on channel name

* fix output for MergeBamRecal

* fix MergeBamRecal output

* fix TSV file

* update comments and docs

* add warning for sentieon only processes

* nf-core bump-version . 2.5.2

* manual bump-version . 2.5.2

* update workflow image

* downgrade tools for release

* update CHANGELOG

* clean up and update workflow image

* allow a

* fix workflow image

* Apply suggestions from code review

* Apply suggestions from code review

* Update docs/output.md

* Apply suggestions from code review

* Apply suggestions from code review

* Apply suggestions from code review

* Reformats `bwa mem | samtools sort` command; WIP suboptimal resource usage

* Addresses #5; WIP

* Removes max_ resource alloc labels from MarkDuplicatesSpark

* Replaces .md.bam.bai->.md.bai (same as nf-core)

* Add ${markdup_java_options} to MarkDuplicatesSpark (same as nf-core MarkDuplicates)

* Changes MarkDuplicates --verbosity, DEBUG->INFO

* Changes intervalBed.simpleName->intervalBed.baseName; nf-cored

* Removes label cpus_1 from BaseRecalibratorSpark

* Remove cpus_2 labels from ApplyBQSRSpark; DEBUG->INFO

* Changes pseudo file "no_vepFile.txt" from https to s3 link

* Removes java options from ApplyBQSRSpark

* Removes java options from MarkDupesSpark

* Add java-options to MarkDupesSpark; verbosity INFO->ERROR

* Fixes dupe --java-options; 🤦

* Attempt to fix MarkDupesSpark; "--lower-case"->"-CAP"; Removed tmp

* Adds soft-coded allocation of resources to MapReads

* Initialise params for MapReads split resource alloc

* Adds neglected curlies around params

* Adds neglected \ to bash vars

* Adds neglected \ to bash vars

* WIP; MapReads optimisations

* Implement resource alloc between bwa and samtools

* Adds max, med soft coded resource alloc

* Re-labels processes (from hard coded resources to soft)

* Adds extra curlies to address priority of eval

* Add explicit declaration of maxForks/process

* Update med resource allocation function

* Add echo true and echo of ${bwa_cpus} and ${sort_cpus}

* Hard code heap in MarkDuplicatesSpark at 8g

* Correct expected output bai in MarkDupes

* Removes Spark versions; Not stable with low resources

* Removes sorting; Picard might sort?

* Do not assume sorting in MarkDupes

* Adds explicit --ASSUME_SORT_ORDER unsorted

* Adds missing \\

* Omits -k 23

* Bringing sorted back

* Eliminating pipes in mapping step

* Adds bwa -k 23 and GenomeChronicler as tool (cgpu#16)

- [x] Adds -k 23 (bwa mem seed length)
- [x] Exposes as params bwa_cpus, sort_cpus
- [x] Adds GenomeChronicler in tools (sarek logic)

Co-authored-by: Alexander Peltzer <[email protected]>
Co-authored-by: Maxime Garcia <[email protected]>
Co-authored-by: Szilveszter Juhos <[email protected]>

* Replaces intervals process to be flexible

* Updated nextflow.config with new intervals process

* Updates conf/base.config; Removes dynamic resource alloc

Co-authored-by: Alexander Peltzer <[email protected]>
Co-authored-by: Maxime Garcia <[email protected]>
Co-authored-by: Szilveszter Juhos <[email protected]>
4 people authored Jan 21, 2020
1 parent d54f075 commit 036348c
Showing 3 changed files with 70 additions and 165 deletions.
34 changes: 22 additions & 12 deletions conf/base.config
@@ -10,9 +10,7 @@
*/

process {
cpus = {check_resource(params.cpus * task.attempt)}
memory = {check_resource((params.singleCPUMem as nextflow.util.MemoryUnit) * task.attempt)}
time = {check_resource(24.h * task.attempt)}

shell = ['/bin/bash', '-euo', 'pipefail']

errorStrategy = {task.exitStatus in [143,137,104,134,139] ? 'retry' : 'finish'}
@@ -21,9 +19,11 @@ process {

withLabel:cpus_1 {
cpus = {check_resource(1)}
memory = 7.5.GB
}
withLabel:cpus_2 {
cpus = {check_resource(2)}
memory = 15.GB
}
withLabel:cpus_4 {
cpus = {check_resource(4)}
@@ -37,18 +37,28 @@ process {
withLabel:cpus_max {
cpus = {params.max_cpus}
}

withLabel:memory_singleCPU_2_task {
memory = {check_resource((params.singleCPUMem as nextflow.util.MemoryUnit) * 2 * task.attempt)}
}
withLabel:memory_singleCPU_task_sq {
memory = {check_resource((params.singleCPUMem as nextflow.util.MemoryUnit) * task.attempt * task.attempt)}
}

withLabel:memory_max {
memory = {params.max_memory}
}

withName: MarkDuplicates {
maxForks = 2
}
withName: BaseRecalibrator {
maxForks = 32
}
withName: ApplyBQSR {
maxForks = 32
}
withName: ScatterIntervalList {
container = 'us.gcr.io/broad-gotc-prod/genomes-in-the-cloud:2.4.1-1540490856'
}
withName: HaplotypeCaller {
cpus = {check_resource(1)}
memory = 7.5.GB
maxForks = 30
maxRetries = params.preemptible_tries
container = 'us.gcr.io/broad-gatk/gatk:4.0.10.1'
}
withName:ConcatVCF {
// For unknown reasons, ConcatVCF sometimes fails with SIGPIPE
// (exit code 141). Rerunning the process will usually work.
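The base.config hunk above scales resource requests with `task.attempt` and caps them via `check_resource` (defined elsewhere in Sarek, not shown in this diff). A minimal Python sketch of that capping behaviour, under the assumption that it simply clamps a request at the configured ceiling (the real function also handles memory units):

```python
def check_resource(requested, maximum):
    # Clamp a per-attempt resource request at the configured ceiling,
    # mirroring how cpus/memory grow with task.attempt but must never
    # exceed params.max_cpus / params.max_memory.
    return min(requested, maximum)

# Retry scaling as in the config: the base request grows with task.attempt.
base_cpus, max_cpus = 8, 16
requests = [check_resource(base_cpus * attempt, max_cpus) for attempt in (1, 2, 3)]
# requests == [8, 16, 16]: the second and third retries hit the cap
```

The values (8 base CPUs, cap of 16) are illustrative only; the actual defaults come from `params.cpus` and `params.max_cpus`.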
133 changes: 41 additions & 92 deletions main.nf
@@ -589,78 +589,37 @@ ch_intervals = params.no_intervals ? "null" : params.intervals && !('annotate' in step) ? Channel.value(file(params.intervals)) : (fastaFai ? Channel.fromPath(fastaFai).map{it -> "null"} : "null")

// STEP 0: CREATING INTERVALS FOR PARALLELIZATION (PREPROCESSING AND VARIANT CALLING)

process CreateIntervalBeds {
tag {intervals.fileName}
// This task calls picard's IntervalListTools to scatter the input interval list into scatter_count sub interval lists
// Note that the number of sub interval lists may not be exactly equal to scatter_count. There may be slightly more or less.
// Thus we have the block of python to count the number of generated sub interval lists.

input:
file(intervals) from ch_intervals

output:
file '*.bed' into bedIntervals mode flatten

when: (!params.no_intervals) && step != 'annotate'
process ScatterIntervalList {
tag "$interval_list"

script:
// If the interval file is BED format, the fifth column is interpreted to
// contain runtime estimates, which is then used to combine short-running jobs
if (hasExtension(intervals, "bed"))
"""
awk -vFS="\t" '{
t = \$5 # runtime estimate
if (t == "") {
# no runtime estimate in this row, assume default value
t = (\$3 - \$2) / ${params.nucleotidesPerSecond}
}
if (name == "" || (chunk > 600 && (chunk + t) > longest * 1.05)) {
# start a new chunk
name = sprintf("%s_%d-%d.bed", \$1, \$2+1, \$3)
chunk = 0
longest = 0
}
if (t > longest)
longest = t
chunk += t
print \$0 > name
}' ${intervals}
"""
else if (hasExtension(intervals, "interval_list"))
"""
grep -v '^@' ${intervals} | awk -vFS="\t" '{
name = sprintf("%s_%d-%d", \$1, \$2, \$3);
printf("%s\\t%d\\t%d\\n", \$1, \$2-1, \$3) > name ".bed"
}'
"""
else
"""
awk -vFS="[:-]" '{
name = sprintf("%s_%d-%d", \$1, \$2, \$3);
printf("%s\\t%d\\t%d\\n", \$1, \$2-1, \$3) > name ".bed"
}' ${intervals}
"""
}
input:
file(interval_list) from ch_intervals

bedIntervals = bedIntervals
.map { intervalFile ->
def duration = 0.0
for (line in intervalFile.readLines()) {
final fields = line.split('\t')
if (fields.size() >= 5) duration += fields[4].toFloat()
else {
start = fields[1].toInteger()
end = fields[2].toInteger()
duration += (end - start) / params.nucleotidesPerSecond
}
}
[duration, intervalFile]
}.toSortedList({ a, b -> b[0] <=> a[0] })
.flatten().collate(2)
.map{duration, intervalFile -> intervalFile}
output:
file('out/*/*.interval_list') into scattered_interval_list mode flatten

bedIntervals = bedIntervals.dump(tag:'bedintervals')
script:
"""
mkdir out
java -Xms1g -jar /usr/gitc/picard.jar \
IntervalListTools \
SCATTER_COUNT=${params.scatter_count} \
SUBDIVISION_MODE=BALANCING_WITHOUT_INTERVAL_SUBDIVISION_WITH_OVERFLOW \
UNIQUE=true \
SORT=true \
BREAK_BANDS_AT_MULTIPLES_OF=${params.break_bands_at_multiples_of} \
INPUT=${interval_list} \
OUTPUT=out
if (params.no_intervals && step != 'annotate') bedIntervals = Channel.from(file("no_intervals.bed"))
count_intervals.py
"""
}

(intBaseRecalibrator, intApplyBQSR, intHaplotypeCaller, intMpileup, bedIntervals) = bedIntervals.into(5)
(intBaseRecalibrator, intApplyBQSR, intHaplotypeCaller, intMpileup, scattered_interval_list) = scattered_interval_list.into(5)
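The removed `bedIntervals` block above estimates a runtime for each interval file (the fifth column when present, otherwise interval length divided by `params.nucleotidesPerSecond`) and schedules the longest-running files first. A rough Python sketch of that heuristic — the file names and values here are made up for illustration:

```python
def interval_duration(lines, nucleotides_per_second=1000.0):
    # Sum per-interval runtime estimates: use column 5 when present,
    # otherwise estimate from the interval length, as in main.nf.
    total = 0.0
    for line in lines:
        fields = line.split('\t')
        if len(fields) >= 5:
            total += float(fields[4])
        else:
            start, end = int(fields[1]), int(fields[2])
            total += (end - start) / nucleotides_per_second
    return total

# Schedule the longest-running interval files first (descending duration),
# so the slowest scatter jobs start as early as possible.
files = {"a.bed": ["chr1\t0\t5000"], "b.bed": ["chr2\t0\t1000\tx\t9.0"]}
ordered = sorted(files, key=lambda f: interval_duration(files[f]), reverse=True)
```

Here `a.bed` gets an estimate of 5000 / 1000 = 5.0 while `b.bed` carries an explicit estimate of 9.0 in column 5, so `b.bed` is scheduled first.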

// PREPARING CHANNELS FOR PREPROCESSING AND QC

@@ -768,7 +727,6 @@ else inputPairReadsSentieon.close()
process MapReads {
label 'cpus_max'
label 'memory_max'
echo true

tag {idPatient + "-" + idRun}

@@ -1053,7 +1011,7 @@ process SentieonDedup {
// STEP 3: CREATING RECALIBRATION TABLES

process BaseRecalibrator {
label 'med_resources'
label 'cpus_1'

tag {idPatient + "-" + idSample + "-" + intervalBed.baseName}

@@ -1105,8 +1063,8 @@ if (params.no_intervals) {
// STEP 3.5: MERGING RECALIBRATION TABLES

process GatherBQSRReports {
label 'memory_singleCPU_2_task'
label 'cpus_2'

label 'cpus_1'

tag {idPatient + "-" + idSample}

@@ -1174,8 +1132,7 @@ bamApplyBQSR = bamApplyBQSR.dump(tag:'BAM + BAI + RECAL TABLE + INT')

process ApplyBQSR {

label 'memory_singleCPU_2_task'
label 'cpus_2'
label 'cpus_1'

tag {idPatient + "-" + idSample + "-" + intervalBed.baseName}

@@ -1296,7 +1253,8 @@ bamRecalSentieonSampleTSV
// STEP 4.5.1: MERGING THE RECALIBRATED BAM FILES

process MergeBamRecal {
label 'med_resources'
label 'cpus_max'
label 'memory_max'

tag {idPatient + "-" + idSample}

@@ -1415,7 +1373,7 @@ bamRecalSampleTSV
// STEP 5: QC

process SamtoolsStats {
label 'cpus_2'
label 'cpus_1'

tag {idPatient + "-" + idSample}

@@ -1440,8 +1398,7 @@ samtoolsStatsReport = samtoolsStatsReport.dump(tag:'SAMTools')
bamBamQC = bamMappedBamQC.mix(bamRecalBamQC)

process BamQC {
label 'memory_max'
label 'cpus_max'
label 'med_resources'

tag {idPatient + "-" + idSample}

@@ -1508,10 +1465,9 @@ bamHaplotypeCaller = bamRecalAllTemp.combine(intHaplotypeCaller)

process HaplotypeCaller {

label 'forks_max'
label 'cpus_1'
label 'cpus_1'

tag {idSample + "-" + intervalBed.baseName}
tag {idSample + "-" + interval_list.baseName}

input:
set idPatient, idSample, file(bam), file(bai), file(intervalBed) from bamHaplotypeCaller
@@ -1548,6 +1504,8 @@ else gvcfHaplotypeCaller = gvcfHaplotypeCaller.dump(tag:'GVCF HaplotypeCaller')
// STEP GATK HAPLOTYPECALLER.2

process GenotypeGVCFs {
label 'memory_max'

tag {idSample + "-" + intervalBed.baseName}

input:
@@ -1561,7 +1519,7 @@ process GenotypeGVCFs {
output:
set val("HaplotypeCaller"), idPatient, idSample, file("${intervalBed.baseName}_${idSample}.vcf") into vcfGenotypeGVCFs

when: 'haplotypecaller' in tools
when: !(params.noGVCF) && ('haplotypecaller' in tools)

script:
// Using -L is important for speed and we have to index the interval files also
@@ -1838,7 +1796,7 @@ pairBam = pairBam.dump(tag:'BAM Somatic Pair')
// Manta, Strelka, Mutect2
(pairBamManta, pairBamStrelka, pairBamStrelkaBP, pairBamCalculateContamination, pairBamFilterMutect2, pairBamTNscope, pairBam) = pairBam.into(7)

intervalPairBam = pairBam.spread(bedIntervals)
intervalPairBam = pairBam.spread(scattered_interval_list)

bamMpileup = bamMpileup.spread(intMpileup)

@@ -1849,7 +1807,6 @@

process FreeBayes {

label 'forks_max'
label 'cpus_1'

tag {idSampleTumor + "_vs_" + idSampleNormal + "-" + intervalBed.baseName}
@@ -1889,7 +1846,6 @@ vcfFreeBayes = vcfFreeBayes.groupTuple(by:[0,1,2])
process Mutect2 {
tag {idSampleTumor + "_vs_" + idSampleNormal + "-" + intervalBed.baseName}

label 'forks_max'
label 'cpus_1'


@@ -2019,7 +1975,6 @@ vcfConcatenated = vcfConcatenated.dump(tag:'VCF')
process PileupSummariesForMutect2 {
tag {idSampleTumor + "_vs_" + idSampleNormal + "_" + intervalBed.baseName }

label 'forks_max'
label 'cpus_1'

input:
@@ -2052,7 +2007,6 @@ pileupSummaries = pileupSummaries.groupTuple(by:[0,1])

process MergePileupSummaries {

label 'forks_max'
label 'cpus_1'

tag {idPatient + "_" + idSampleTumor}
@@ -2082,7 +2036,6 @@ process MergePileupSummaries {

process CalculateContamination {

label 'forks_max'
label 'cpus_1'

tag {idSampleTumor + "_vs_" + idSampleNormal}
@@ -2112,7 +2065,6 @@ process CalculateContamination {

process FilterMutect2Calls {

label 'forks_max'
label 'cpus_1'

tag {idSampleTN}
@@ -2388,7 +2340,7 @@ vcfStrelkaBP = vcfStrelkaBP.dump(tag:'Strelka BP')
// Run commands and code from Malin Larsson
// Based on Jesper Eisfeldt's code
process AlleleCounter {
label 'memory_singleCPU_2_task'
label 'cpus_2'

tag {idSample}

@@ -2711,7 +2663,6 @@ vcfKeep = Channel.empty().mix(

process BcftoolsStats {

label 'forks_max'
label 'cpus_1'

tag {"${variantCaller} - ${vcf}"}
@@ -2736,7 +2687,6 @@ bcftoolsReport = bcftoolsReport.dump(tag:'BCFTools')

process Vcftools {

label 'forks_max'
label 'cpus_1'

tag {"${variantCaller} - ${vcf}"}
@@ -3055,8 +3005,7 @@ compressVCFOutVEP = compressVCFOutVEP.dump(tag:'VCF')

process MultiQC {

label 'cpus_max'
label 'memory_max'
label 'cpus_2'

publishDir "${params.outdir}/MultiQC", mode: params.publishDirMode

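Several hunks above pair the sample channels with the scattered intervals via Nextflow's `spread()`, i.e. a cartesian product: every tumor/normal pair is combined with every interval list, so each variant-calling task works on one pair restricted to one genomic region. Illustratively, in Python (sample and file names are hypothetical):

```python
from itertools import product

tumor_normal_pairs = [("patient1", "tumor1", "normal1")]
interval_lists = ["scatter_0001.interval_list", "scatter_0002.interval_list"]

# spread() emits every (pair, interval) combination; results per pair are
# later gathered back together (e.g. by ConcatVCF in the workflow).
interval_pair_bam = [pair + (iv,) for pair, iv in product(tumor_normal_pairs, interval_lists)]
```

With one pair and two interval lists this yields two work items, one per scattered region.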