Absorb refactoring for flexible interval creation (#8)
* Absorbs latest nfcore and cgpu fork changes (#6)

* nf-core bump-version . 2.5.1dev

* Remove PublishDirMode from test profile (nf-core#40)

* remove PublishDirMode from test profile

* update all tools

* minor updates + typo fix (nf-core#42)

* minor updates + typo fix

* fix VEP automated builds

* add location for abstracts

* remove reference to old buil.nf script

* update CHANGELOG

* Update docs/reference.md

Co-Authored-By: Szilveszter Juhos <[email protected]>

* Update docs/reference.md

* Update docs/reference.md

* Workflow (nf-core#45)

* Add workflow figure
* Include workflow figure in readme
* Update CHANGELOG

* add minimal genome and update some processes

* Start adding mouse data

* Update iGenomes.config

* Add tbi

* Drop ASCAT files

* apply changes from 2.5.1 to dev

* bump version to 2.5.2dev

* update CHANGELOG

* update tiddit to 2.8.1

* Use Version 98 of Mouse

* Add for grcm38

* Adjust mus musculus DB

* Annotation

* add smallerGRCh37 and minimalGRCh37

* use bwa aln when no knownIndels, otherwise use bwa mem; noIntervals currently being added everywhere

* don't use bwa aln

* add automatic generation of intervals file based on fastaFai file

* Adjusted genomes.config

* Should be list

* Set genomes_base to something

* Revert back

* enable CreateIntervalsBed for intervals_list from GATK Bundle

* Add proper calling list

* Use the bed file

* remove temp file

* update CHANGELOG

* Fix genome fa.fai

* Add in mgpv5

* Try short track

* Add in species handling

* Document new parameter species

* Add changelog

* Fix iGenomes stuff

* Add in note about GRCm38

* Fix small fai index issue

* Adjusted quotes in genomes.config

* And the same for igenomes

* Better folder structure for Mouse Genome Project data

* Minor adjustment to proper paths

* Apply suggestions from code review

Add changes by Maxime

Co-Authored-By: Maxime Garcia <[email protected]>

* Remove space

* Move it up

* Update CHANGELOG.md

Co-Authored-By: Maxime Garcia <[email protected]>

* add minimal tests

* fix processes with no intervals

* add comments

* params noIntervals -> no_intervals

* sort genomes + add news

* code polishing

* update CHANGELOG

* add split_fastq params to split the FASTQ files with the splitFastq() Nextflow method

* add tests

* temporarily remove TIDDIT tests

* add sentieon for bwa mem

* disable docker and singularity

* disable container

* add fastaFai for bwamem

* remove module samtools from label sentieon

* fix output from bwa mem

* fix output channel BamMapped from MapReads

* set params.sentieon to null by default

* add SentieonDedup process

* fix typo

* add fastaFai to SentieonDedup process

* fix bam indexing

* fix bam indexing

* fix bam indexing

* add SentieonBQSR

* add label sentieon to SentieonBQSR

* fix metrics output for SentieonBQSR

* increase cpus for Sentieon BQSR

* remove indexing

* add index for dedup

* bwa mem sentieon specific process

* TSV file for sentieon Dedup

* TSV for every step for Sentieon

* recal -> deduped

* fix input for TSV recalibrate

* enable restart from recalibrate with TSV with Sentieon

* fix sentieon variant calling from mapping and recalibrate

* code polishing

* add dump tag for input sample

* add dump tag for bamDedupedSentieon

* code polishing

* code polishing

* code polishing

* code polishing

* remove when statement

* fix typo

* remove tsv for recalibrate with sentieon

* add dnascope dnaseq

* fix dnascope

* add TNscope process

* fix TNscope output

* add pon for TNscope

* add params.pon_index

* add annotation for sentieon DNAseq, DNAscope, TNscope

* add default pon_index

* typo

* fix typo

* improve automatic annotation

* typo

* typo

* add condition on when statement on TNscope

* clean up

* code polish

* add CODEOWNERS file

* add when statement on all sentieon processes with params.sentieon

* remove munin sentieon specific configs from config

* load sarek specific config

* update path to specific config

* update docs

* remove Freebayes

* update workflow image

* remove old logo

* fix tests

* add docs about params split_fastq

* update CHANGELOG

* improve docs

* more tests but less NF versions

* actually run the tests

* typo

* simplify configs

* add test for mpileup

* go crazy with tests

* fix tests

* include test.config

* restore FreeBayes

* remove label memory_max from BaseRecalibrator process to fix nf-core#72

* add --skipQC all and --tools Manta,mpileup,Strelka to minimal genome tests

* update Nextflow version

* update Nextflow version

* update Nextflow

* add --step annotation to profile

* don't need to specify step here

* move params initialization

* add docs

* fix markdownlint

* more complete docs + sort genomes

* improve tests

* update docs

* update CHANGELOG

* improve script

* fix tests

* better comments

* better comments

* fix error on channel name

* fix output for MergeBamRecal

* fix MergeBamRecal output

* fix TSV file

* update comments and docs

* add warning for sentieon only processes

* nf-core bump-version . 2.5.2

* manual bump-version . 2.5.2

* update workflow image

* downgrade tools for release

* update CHANGELOG

* clean up and update workflow image

* allow a

* fix workflow image

* Apply suggestions from code review

* Apply suggestions from code review

* Update docs/output.md

* Apply suggestions from code review

* Apply suggestions from code review

* Apply suggestions from code review

* Reformats `bwa mem | samtools sort` command; WIP suboptimal resource usage

* Addresses #5; WIP

* Removes max_ resource alloc labels from MarkDuplicatesSpark

* Replaces .md.bam.bai->.md.bai (same as nf-core)

* Add ${markdup_java_options} to MarkDuplicatesSpark (same as nf-core MarkDuplicates)

* Changes MarkDuplicates --verbosity, DEBUG->INFO

* Changes intervalBed.simpleName->intervalBed.baseName; nf-cored

* Removes label cpus_1 from BaseRecalibratorSpark

* Remove cpus_2 labels from ApplyBQSRSpark; DEBUG->INFO

* Changes pseudo file "no_vepFile.txt" from https to s3 link

* Removes java options from ApplyBQSRSpark

* Removes java options from MarkDupesSpark

* Add java-options to MarkDupesSpark; verbosity INFO->ERROR

* Fixes dupe --java-options; 🤦

* Attempt to fix MarkDupesSpark; "--lower-case"->"-CAP"; Removed tmp

* Adds soft-coded allocation of resources to MapReads

* Initialise params for MapReads split resource alloc

* Adds neglected curlies around params

* Adds neglected \ to bash vars

* Adds neglected \ to bash vars

* WIP; MapReads optimisations

* Implement resource alloc between bwa and samtools

* Adds max, med soft coded resource alloc

* Re-labels processes (from hard coded resources to soft)

* Adds extra curlies to address priority of eval

* Add explicit declaration of maxForks/process

* Update med resource allocation function

* Add echo true and echo of ${bwa_cpus} and ${sort_cpus}

* Hard code heap in MarkDuplicatesSpark at 8g

* Correct expected output bai in MarkDupes

* Removes Spark versions; Not stable with low resources

* Removes sorting; Picard might sort?

* Do not assume sorting in MarkDupes

* Adds explicit --ASSUME_SORT_ORDER unsorted

* Adds missing \\

* Omits -k 23

* Bringing sorted back

* Eliminating pipes in mapping step

* Adds bwa -k 23 and GenomeChronicler as tool (cgpu#16)

- [x] Adds -k 23 (bwa mem seed length)
- [x] Exposes as params bwa_cpus, sort_cpus
- [x] Adds GenomeChronicler in tools (sarek logic)

Co-authored-by: Alexander Peltzer <[email protected]>
Co-authored-by: Maxime Garcia <[email protected]>
Co-authored-by: Szilveszter Juhos <[email protected]>

* Replaces intervals process to be flexible

* Updated nextflow.config with new intervals process

* Updates conf/base.config; Removes dynamic resource alloc

Co-authored-by: Alexander Peltzer <[email protected]>
Co-authored-by: Maxime Garcia <[email protected]>
Co-authored-by: Szilveszter Juhos <[email protected]>
4 people authored Jan 21, 2020
1 parent d54f075 commit 036348c
Showing 3 changed files with 70 additions and 165 deletions.
34 changes: 22 additions & 12 deletions conf/base.config
@@ -10,9 +10,7 @@
*/

process {
cpus = {check_resource(params.cpus * task.attempt)}
memory = {check_resource((params.singleCPUMem as nextflow.util.MemoryUnit) * task.attempt)}
time = {check_resource(24.h * task.attempt)}

shell = ['/bin/bash', '-euo', 'pipefail']

errorStrategy = {task.exitStatus in [143,137,104,134,139] ? 'retry' : 'finish'}
@@ -21,9 +19,11 @@ process {

withLabel:cpus_1 {
cpus = {check_resource(1)}
memory = 7.5.GB
}
withLabel:cpus_2 {
cpus = {check_resource(2)}
memory = 15.GB
}
withLabel:cpus_4 {
cpus = {check_resource(4)}
@@ -37,18 +37,28 @@ process {
withLabel:cpus_max {
cpus = {params.max_cpus}
}

withLabel:memory_singleCPU_2_task {
memory = {check_resource((params.singleCPUMem as nextflow.util.MemoryUnit) * 2 * task.attempt)}
}
withLabel:memory_singleCPU_task_sq {
memory = {check_resource((params.singleCPUMem as nextflow.util.MemoryUnit) * task.attempt * task.attempt)}
}

withLabel:memory_max {
memory = {params.max_memory}
}

withName: MarkDuplicates {
maxForks = 2
}
withName: BaseRecalibrator {
maxForks = 32
}
withName: ApplyBQSR {
maxForks = 32
}
withName: ScatterIntervalList {
container = 'us.gcr.io/broad-gotc-prod/genomes-in-the-cloud:2.4.1-1540490856'
}
withName: HaplotypeCaller {
cpus = {check_resource(1)}
memory = 7.5.GB
maxForks = 30
maxRetries = params.preemptible_tries
container = 'us.gcr.io/broad-gatk/gatk:4.0.10.1'
}
withName:ConcatVCF {
// For unknown reasons, ConcatVCF sometimes fails with SIGPIPE
// (exit code 141). Rerunning the process will usually work.
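The base.config hunk above scales resource requests with `task.attempt` and caps them via `check_resource` (defined elsewhere in Sarek, not shown in this diff). A minimal Python sketch of that capping behaviour, under the assumption that it simply clamps a request at the configured ceiling (the real function also handles memory units):

```python
def check_resource(requested, maximum):
    # Clamp a per-attempt resource request at the configured ceiling,
    # mirroring how cpus/memory grow with task.attempt but must never
    # exceed params.max_cpus / params.max_memory.
    return min(requested, maximum)

# Retry scaling as in the config: the base request grows with task.attempt.
base_cpus, max_cpus = 8, 16
requests = [check_resource(base_cpus * attempt, max_cpus) for attempt in (1, 2, 3)]
# requests == [8, 16, 16]: the second and third retries hit the cap
```

The values (8 base CPUs, cap of 16) are illustrative only; the actual defaults come from `params.cpus` and `params.max_cpus`.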
133 changes: 41 additions & 92 deletions main.nf
@@ -589,78 +589,37 @@ ch_intervals = params.no_intervals ? "null" : params.intervals && !('annotate' in step) ? Channel.value(file(params.intervals)) : (fastaFai ? Channel.fromPath(fastaFai).map{it -> "null"} : "null")

// STEP 0: CREATING INTERVALS FOR PARALLELIZATION (PREPROCESSING AND VARIANT CALLING)

process CreateIntervalBeds {
tag {intervals.fileName}
// This task calls picard's IntervalListTools to scatter the input interval list into scatter_count sub interval lists
// Note that the number of sub interval lists may not be exactly equal to scatter_count. There may be slightly more or less.
// Thus we have the block of python to count the number of generated sub interval lists.

input:
file(intervals) from ch_intervals

output:
file '*.bed' into bedIntervals mode flatten

when: (!params.no_intervals) && step != 'annotate'
process ScatterIntervalList {
tag "$interval_list"

script:
// If the interval file is BED format, the fifth column is interpreted to
// contain runtime estimates, which is then used to combine short-running jobs
if (hasExtension(intervals, "bed"))
"""
awk -vFS="\t" '{
t = \$5 # runtime estimate
if (t == "") {
# no runtime estimate in this row, assume default value
t = (\$3 - \$2) / ${params.nucleotidesPerSecond}
}
if (name == "" || (chunk > 600 && (chunk + t) > longest * 1.05)) {
# start a new chunk
name = sprintf("%s_%d-%d.bed", \$1, \$2+1, \$3)
chunk = 0
longest = 0
}
if (t > longest)
longest = t
chunk += t
print \$0 > name
}' ${intervals}
"""
else if (hasExtension(intervals, "interval_list"))
"""
grep -v '^@' ${intervals} | awk -vFS="\t" '{
name = sprintf("%s_%d-%d", \$1, \$2, \$3);
printf("%s\\t%d\\t%d\\n", \$1, \$2-1, \$3) > name ".bed"
}'
"""
else
"""
awk -vFS="[:-]" '{
name = sprintf("%s_%d-%d", \$1, \$2, \$3);
printf("%s\\t%d\\t%d\\n", \$1, \$2-1, \$3) > name ".bed"
}' ${intervals}
"""
}
input:
file(interval_list) from ch_intervals

bedIntervals = bedIntervals
.map { intervalFile ->
def duration = 0.0
for (line in intervalFile.readLines()) {
final fields = line.split('\t')
if (fields.size() >= 5) duration += fields[4].toFloat()
else {
start = fields[1].toInteger()
end = fields[2].toInteger()
duration += (end - start) / params.nucleotidesPerSecond
}
}
[duration, intervalFile]
}.toSortedList({ a, b -> b[0] <=> a[0] })
.flatten().collate(2)
.map{duration, intervalFile -> intervalFile}
output:
file('out/*/*.interval_list') into scattered_interval_list mode flatten

bedIntervals = bedIntervals.dump(tag:'bedintervals')
script:
"""
mkdir out
java -Xms1g -jar /usr/gitc/picard.jar \
IntervalListTools \
SCATTER_COUNT=${params.scatter_count} \
SUBDIVISION_MODE=BALANCING_WITHOUT_INTERVAL_SUBDIVISION_WITH_OVERFLOW \
UNIQUE=true \
SORT=true \
BREAK_BANDS_AT_MULTIPLES_OF=${params.break_bands_at_multiples_of} \
INPUT=${interval_list} \
OUTPUT=out
if (params.no_intervals && step != 'annotate') bedIntervals = Channel.from(file("no_intervals.bed"))
count_intervals.py
"""
}

(intBaseRecalibrator, intApplyBQSR, intHaplotypeCaller, intMpileup, bedIntervals) = bedIntervals.into(5)
(intBaseRecalibrator, intApplyBQSR, intHaplotypeCaller, intMpileup, scattered_interval_list) = scattered_interval_list.into(5)
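The removed `bedIntervals` block above estimates a runtime for each interval file (the fifth column when present, otherwise interval length divided by `params.nucleotidesPerSecond`) and schedules the longest-running files first. A rough Python sketch of that heuristic — the file names and values here are made up for illustration:

```python
def interval_duration(lines, nucleotides_per_second=1000.0):
    # Sum per-interval runtime estimates: use column 5 when present,
    # otherwise estimate from the interval length, as in main.nf.
    total = 0.0
    for line in lines:
        fields = line.split('\t')
        if len(fields) >= 5:
            total += float(fields[4])
        else:
            start, end = int(fields[1]), int(fields[2])
            total += (end - start) / nucleotides_per_second
    return total

# Schedule the longest-running interval files first (descending duration),
# so the slowest scatter jobs start as early as possible.
files = {"a.bed": ["chr1\t0\t5000"], "b.bed": ["chr2\t0\t1000\tx\t9.0"]}
ordered = sorted(files, key=lambda f: interval_duration(files[f]), reverse=True)
```

Here `a.bed` gets an estimate of 5000 / 1000 = 5.0 while `b.bed` carries an explicit estimate of 9.0 in column 5, so `b.bed` is scheduled first.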

// PREPARING CHANNELS FOR PREPROCESSING AND QC

@@ -768,7 +727,6 @@ else inputPairReadsSentieon.close()
process MapReads {
label 'cpus_max'
label 'memory_max'
echo true

tag {idPatient + "-" + idRun}

@@ -1053,7 +1011,7 @@ process SentieonDedup {
// STEP 3: CREATING RECALIBRATION TABLES

process BaseRecalibrator {
label 'med_resources'
label 'cpus_1'

tag {idPatient + "-" + idSample + "-" + intervalBed.baseName}

@@ -1105,8 +1063,8 @@ if (params.no_intervals) {
// STEP 3.5: MERGING RECALIBRATION TABLES

process GatherBQSRReports {
label 'memory_singleCPU_2_task'
label 'cpus_2'

label 'cpus_1'

tag {idPatient + "-" + idSample}

@@ -1174,8 +1132,7 @@ bamApplyBQSR = bamApplyBQSR.dump(tag:'BAM + BAI + RECAL TABLE + INT')

process ApplyBQSR {

label 'memory_singleCPU_2_task'
label 'cpus_2'
label 'cpus_1'

tag {idPatient + "-" + idSample + "-" + intervalBed.baseName}

@@ -1296,7 +1253,8 @@ bamRecalSentieonSampleTSV
// STEP 4.5.1: MERGING THE RECALIBRATED BAM FILES

process MergeBamRecal {
label 'med_resources'
label 'cpus_max'
label 'memory_max'

tag {idPatient + "-" + idSample}

@@ -1415,7 +1373,7 @@ bamRecalSampleTSV
// STEP 5: QC

process SamtoolsStats {
label 'cpus_2'
label 'cpus_1'

tag {idPatient + "-" + idSample}

@@ -1440,8 +1398,7 @@ samtoolsStatsReport = samtoolsStatsReport.dump(tag:'SAMTools')
bamBamQC = bamMappedBamQC.mix(bamRecalBamQC)

process BamQC {
label 'memory_max'
label 'cpus_max'
label 'med_resources'

tag {idPatient + "-" + idSample}

@@ -1508,10 +1465,9 @@ bamHaplotypeCaller = bamRecalAllTemp.combine(intHaplotypeCaller)

process HaplotypeCaller {

label 'forks_max'
label 'cpus_1'
label 'cpus_1'

tag {idSample + "-" + intervalBed.baseName}
tag {idSample + "-" + interval_list.baseName}

input:
set idPatient, idSample, file(bam), file(bai), file(intervalBed) from bamHaplotypeCaller
@@ -1548,6 +1504,8 @@ else gvcfHaplotypeCaller = gvcfHaplotypeCaller.dump(tag:'GVCF HaplotypeCaller')
// STEP GATK HAPLOTYPECALLER.2

process GenotypeGVCFs {
label 'memory_max'

tag {idSample + "-" + intervalBed.baseName}

input:
@@ -1561,7 +1519,7 @@ process GenotypeGVCFs {
output:
set val("HaplotypeCaller"), idPatient, idSample, file("${intervalBed.baseName}_${idSample}.vcf") into vcfGenotypeGVCFs

when: 'haplotypecaller' in tools
when: !(params.noGVCF) && ('haplotypecaller' in tools)

script:
// Using -L is important for speed and we have to index the interval files also
@@ -1838,7 +1796,7 @@ pairBam = pairBam.dump(tag:'BAM Somatic Pair')
// Manta, Strelka, Mutect2
(pairBamManta, pairBamStrelka, pairBamStrelkaBP, pairBamCalculateContamination, pairBamFilterMutect2, pairBamTNscope, pairBam) = pairBam.into(7)

intervalPairBam = pairBam.spread(bedIntervals)
intervalPairBam = pairBam.spread(scattered_interval_list)

bamMpileup = bamMpileup.spread(intMpileup)

@@ -1849,7 +1807,6 @@

process FreeBayes {

label 'forks_max'
label 'cpus_1'

tag {idSampleTumor + "_vs_" + idSampleNormal + "-" + intervalBed.baseName}
@@ -1889,7 +1846,6 @@ vcfFreeBayes = vcfFreeBayes.groupTuple(by:[0,1,2])
process Mutect2 {
tag {idSampleTumor + "_vs_" + idSampleNormal + "-" + intervalBed.baseName}

label 'forks_max'
label 'cpus_1'


@@ -2019,7 +1975,6 @@ vcfConcatenated = vcfConcatenated.dump(tag:'VCF')
process PileupSummariesForMutect2 {
tag {idSampleTumor + "_vs_" + idSampleNormal + "_" + intervalBed.baseName }

label 'forks_max'
label 'cpus_1'

input:
@@ -2052,7 +2007,6 @@ pileupSummaries = pileupSummaries.groupTuple(by:[0,1])

process MergePileupSummaries {

label 'forks_max'
label 'cpus_1'

tag {idPatient + "_" + idSampleTumor}
@@ -2082,7 +2036,6 @@ process MergePileupSummaries {

process CalculateContamination {

label 'forks_max'
label 'cpus_1'

tag {idSampleTumor + "_vs_" + idSampleNormal}
@@ -2112,7 +2065,6 @@ process CalculateContamination {

process FilterMutect2Calls {

label 'forks_max'
label 'cpus_1'

tag {idSampleTN}
@@ -2388,7 +2340,7 @@ vcfStrelkaBP = vcfStrelkaBP.dump(tag:'Strelka BP')
// Run commands and code from Malin Larsson
// Based on Jesper Eisfeldt's code
process AlleleCounter {
label 'memory_singleCPU_2_task'
label 'cpus_2'

tag {idSample}

@@ -2711,7 +2663,6 @@ vcfKeep = Channel.empty().mix(

process BcftoolsStats {

label 'forks_max'
label 'cpus_1'

tag {"${variantCaller} - ${vcf}"}
@@ -2736,7 +2687,6 @@ bcftoolsReport = bcftoolsReport.dump(tag:'BCFTools')

process Vcftools {

label 'forks_max'
label 'cpus_1'

tag {"${variantCaller} - ${vcf}"}
@@ -3055,8 +3005,7 @@ compressVCFOutVEP = compressVCFOutVEP.dump(tag:'VCF')

process MultiQC {

label 'cpus_max'
label 'memory_max'
label 'cpus_2'

publishDir "${params.outdir}/MultiQC", mode: params.publishDirMode

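Several hunks above pair the sample channels with the scattered intervals via Nextflow's `spread()`, i.e. a cartesian product: every tumor/normal pair is combined with every interval list, so each variant-calling task works on one pair restricted to one genomic region. Illustratively, in Python (sample and file names are hypothetical):

```python
from itertools import product

tumor_normal_pairs = [("patient1", "tumor1", "normal1")]
interval_lists = ["scatter_0001.interval_list", "scatter_0002.interval_list"]

# spread() emits every (pair, interval) combination; results per pair are
# later gathered back together (e.g. by ConcatVCF in the workflow).
interval_pair_bam = [pair + (iv,) for pair, iv in product(tumor_normal_pairs, interval_lists)]
```

With one pair and two interval lists this yields two work items, one per scattered region.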