Markdups process marks aligner output as markdups output incorrectly #104

mvanniekerkHartwig · 2024-10-18T09:51:02Z

Description of the bug

When markdups is finished, the nextflow markdups process incorrectly marks the aligner outputs (inputs for markdups) as output files. This causes unnecessary copying of the bams, especially when using a "scratch directory".

Command used and terminal output

No response

Relevant files

No response

System information

No response

scwatts · 2024-10-19T02:31:29Z

Hi @mvanniekerkHartwig, thanks for opening the issue and PR. I haven't been able to reproduce publishing of MarkDups input BAMs - here are the published outputs that I see when trying to replicate:

$ tree output.*/subject_a/
output.1.0.0/subject_a/
└── alignments
    └── dna
        ├── subject_a.normal.duplicate_freq.tsv
        ├── subject_a.normal.markdups.bam
        ├── subject_a.normal.markdups.bam.bai
        ├── subject_a.tumor.duplicate_freq.tsv
        ├── subject_a.tumor.markdups.bam
        └── subject_a.tumor.markdups.bam.bai
output.dev/subject_a/
└── alignments
    └── dna
        ├── subject_a.normal.duplicate_freq.tsv
        ├── subject_a.normal.markdups.bam
        ├── subject_a.normal.markdups.bam.bai
        ├── subject_a.tumor.duplicate_freq.tsv
        ├── subject_a.tumor.markdups.bam
        └── subject_a.tumor.markdups.bam.bai

These outputs were generated with the following commands:

Click to show commands

# Generate samplesheet
cat <<EOF > samplesheet.csv
group_id,subject_id,sample_id,sample_type,sequence_type,filetype,info,filepath
subject_a,subject_a,subject_a.tumor,tumor,dna,fastq,library_id:library_1;lane:001,https://github.com/nf-core/test-datasets/raw/oncoanalyser/sample_data/simulated_reads/wgts/fastq/single_lane_single_library/subject_a.tumor.dna.R1.fastq.gz;https://github.com/nf-core/test-datasets/raw/oncoanalyser/sample_data/simulated_reads/wgts/fastq/single_lane_single_library/subject_a.tumor.dna.R2.fastq.gz
subject_a,subject_a,subject_a.normal,normal,dna,fastq,library_id:library_1;lane:001,https://github.com/nf-core/test-datasets/raw/oncoanalyser/sample_data/simulated_reads/wgts/fastq/single_lane_single_library/subject_a.normal.dna.R1.fastq.gz;https://github.com/nf-core/test-datasets/raw/oncoanalyser/sample_data/simulated_reads/wgts/fastq/single_lane_single_library/subject_a.normal.dna.R2.fastq.gz
EOF

# Run with oncoanalyser dev branch
nextflow run nf-core/oncoanalyser \
  -revision dev \
  -latest \
  -profile docker \
  \
  --genome GRCh38_hmf \
  --mode wgts \
  \
  --processes_manual \
  --processes_include alignment,markdups \
  \
  --max_cpus 4 \
  --max_memory 30.GB \
  \
  --input samplesheet.csv \
  --outdir output.dev/

# Run with oncoanalyser 1.0.0
nextflow run nf-core/oncoanalyser \
  -revision 1.0.0 \
  -latest \
  -profile docker \
  \
  --genome GRCh38_hmf \
  --mode wgts \
  \
  --processes_manual \
  --processes_include alignment,markdups \
  \
  --max_cpus 4 \
  --max_memory 30.GB \
  \
  --input samplesheet.csv \
  --outdir output.1.0.0/

To help understand, could you post your output directory contents and command used where you were finding MarkDups inputs being published?

scwatts · 2024-10-19T02:32:49Z

For some further information the Nextflow output path glob patterns will exclude inputs from matches by default, and this behaviour should only change after setting includeInputs i.e. path(pattern, includeInputs: true)

mvanniekerkHartwig · 2024-10-21T07:05:47Z

Hey Stephen, you are correct, my description was imprecise. Indeed, the markdups output directory does not contain the input BAMs.

The purpose of my change was to avoid copying the input BAMs from the scratch to the work directory. We are testing out running oncoanalyser in kubernetes in GCP. The work directory lives in a cloud storage bucket, so it can be shared among all processes. The scratch directory is ephemeral storage on a local SSD. This code change avoids the copying of the input BAMs from the scratch to the work directory. I found it by inspecting the generated shell script that runs the process.

Showing the full config for our setup is not trivial, since the configuration is dependent on the k8s yaml, the gcp k8s cluster and cloud storage setup, a nextflow config file, and a samplesheet. If you are interested, I can schedule a call where I walk you through our setup? So far, I've ran into some minor issues, but nothing blocking. I'm quite confident we should be able to run oncoanalyser in GCP in a cost-effective and stable manner.

scwatts · 2024-10-24T21:52:10Z

Forgive the slow response, I'm back from leave today

Thanks for the additional context. This seems like something that should be fixed upstream if I understand correctly, though happy to accommodate at this level as well. If possible, would you be able to test that your changes have the desired effect?

Are you able to also give some information on the shell script you inspected? e.g. name, location, source

I'll take you up on that offer and email you to organise a call next week, I'm particularly interested to understand your GCP / k8s setup and whether there are other improvements to be made for oncoanalyser

mvanniekerkHartwig · 2024-10-25T11:20:43Z

I agree, ideally this would be fixed by nextflow itself.

Nextflow output path glob patterns will exclude inputs from matches by default

It seems like the copying from scratch to workdir does not respect this, but I'm not experienced enough with nextflow to declare this is definitely a bug.

The copying behaviour from scratch to workdir is done by the .command.run script for the process. I'll paste the relevant snippet from the script (this is from the version of oncoanalyser that includes my patch):

nxf_unstage() {
    true
    cp .command.out /workspace/oncoanalyser/executions/colo-37-fastq/work/ca/c0a18788ba4e8421cf07e4baf49be9/.command.out || true
    cp .command.err /workspace/oncoanalyser/executions/colo-37-fastq/work/ca/c0a18788ba4e8421cf07e4baf49be9/.command.err || true
    cp .command.trace /workspace/oncoanalyser/executions/colo-37-fastq/work/ca/c0a18788ba4e8421cf07e4baf49be9/.command.trace || true
    [[ ${nxf_main_ret:=0} != 0 ]] && return
    IFS=$'\n'
    for name in $(eval "ls -1d *.markdups.bam *.markdups.bam.bai versions.yml *.tsv" | sort | uniq); do
        nxf_fs_copy "$name" /workspace/oncoanalyser/executions/colo-37-fastq/work/ca/c0a18788ba4e8421cf07e4baf49be9 || true
    done
    unset IFS
}

Note the

for name in $(eval "ls -1d *.markdups.bam *.markdups.bam.bai versions.yml *.tsv" | sort | uniq); do
    nxf_fs_copy "$name" /workspace/oncoanalyser/executions/colo-37-fastq/work/ca/c0a18788ba4e8421cf07e4baf49be9 || true
done

So the file selector that copies from the scratch to the work dir is pretty much an exact copy from the output directive in the process. Before my change this was:

for name in $(eval "ls -1d *bam *bai versions.yml *.tsv" | sort | uniq); do
    nxf_fs_copy "$name" /workspace/oncoanalyser/executions/colo-37-fastq/work/ca/c0a18788ba4e8421cf07e4baf49be9 || true
done

This extra copy affects the total runtime of the markdups process quite severely, because effectively it's copying the bam twice from the scratch to the workdir.

Interestingly (and completely out of scope for this specific issue), I found that the copying from the workdir to the outputdir is done by the "main" nextflow process, and not included in the .command.run script. It would make sense for nextflow to change this behaviour to let this be done by each individual process since (especially when copying BAMs around) this can be quite a resource-intensive task.

scwatts · 2024-10-30T00:01:13Z

Resolved in #105

mvanniekerkHartwig added the bug Something isn't working label Oct 18, 2024

mvanniekerkHartwig mentioned this issue Oct 18, 2024

Fix "Markdups process marks aligner output as markdups output incorrectly" #105

Merged

scwatts closed this as completed Oct 30, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Markdups process marks aligner output as markdups output incorrectly #104

Markdups process marks aligner output as markdups output incorrectly #104

mvanniekerkHartwig commented Oct 18, 2024

scwatts commented Oct 19, 2024

scwatts commented Oct 19, 2024

mvanniekerkHartwig commented Oct 21, 2024

scwatts commented Oct 24, 2024

mvanniekerkHartwig commented Oct 25, 2024

scwatts commented Oct 30, 2024

Markdups process marks aligner output as markdups output incorrectly #104

Markdups process marks aligner output as markdups output incorrectly #104

Comments

mvanniekerkHartwig commented Oct 18, 2024

Description of the bug

Command used and terminal output

Relevant files

System information

scwatts commented Oct 19, 2024

scwatts commented Oct 19, 2024

mvanniekerkHartwig commented Oct 21, 2024

scwatts commented Oct 24, 2024

mvanniekerkHartwig commented Oct 25, 2024

scwatts commented Oct 30, 2024