Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Markdups process marks aligner output as markdups output incorrectly #104

Closed
mvanniekerkHartwig opened this issue Oct 18, 2024 · 6 comments
Labels
bug Something isn't working

Comments

@mvanniekerkHartwig
Copy link

Description of the bug

When markdups is finished, the nextflow markdups process incorrectly marks the aligner outputs (inputs for markdups) as output files. This causes unnecessary copying of the bams, especially when using a "scratch directory".

Command used and terminal output

No response

Relevant files

No response

System information

No response

@scwatts
Copy link
Collaborator

scwatts commented Oct 19, 2024

Hi @mvanniekerkHartwig, thanks for opening the issue and PR. I haven't been able to reproduce publishing of MarkDups input BAMs - here are the published outputs that I see when trying to replicate:

$ tree output.*/subject_a/
output.1.0.0/subject_a/
└── alignments
    └── dna
        ├── subject_a.normal.duplicate_freq.tsv
        ├── subject_a.normal.markdups.bam
        ├── subject_a.normal.markdups.bam.bai
        ├── subject_a.tumor.duplicate_freq.tsv
        ├── subject_a.tumor.markdups.bam
        └── subject_a.tumor.markdups.bam.bai
output.dev/subject_a/
└── alignments
    └── dna
        ├── subject_a.normal.duplicate_freq.tsv
        ├── subject_a.normal.markdups.bam
        ├── subject_a.normal.markdups.bam.bai
        ├── subject_a.tumor.duplicate_freq.tsv
        ├── subject_a.tumor.markdups.bam
        └── subject_a.tumor.markdups.bam.bai

These outputs were generated with the following commands:

Click to show commands
# Generate samplesheet
cat <<EOF > samplesheet.csv
group_id,subject_id,sample_id,sample_type,sequence_type,filetype,info,filepath
subject_a,subject_a,subject_a.tumor,tumor,dna,fastq,library_id:library_1;lane:001,https://github.com/nf-core/test-datasets/raw/oncoanalyser/sample_data/simulated_reads/wgts/fastq/single_lane_single_library/subject_a.tumor.dna.R1.fastq.gz;https://github.com/nf-core/test-datasets/raw/oncoanalyser/sample_data/simulated_reads/wgts/fastq/single_lane_single_library/subject_a.tumor.dna.R2.fastq.gz
subject_a,subject_a,subject_a.normal,normal,dna,fastq,library_id:library_1;lane:001,https://github.com/nf-core/test-datasets/raw/oncoanalyser/sample_data/simulated_reads/wgts/fastq/single_lane_single_library/subject_a.normal.dna.R1.fastq.gz;https://github.com/nf-core/test-datasets/raw/oncoanalyser/sample_data/simulated_reads/wgts/fastq/single_lane_single_library/subject_a.normal.dna.R2.fastq.gz
EOF

# Run with oncoanalyser dev branch
nextflow run nf-core/oncoanalyser \
  -revision dev \
  -latest \
  -profile docker \
  \
  --genome GRCh38_hmf \
  --mode wgts \
  \
  --processes_manual \
  --processes_include alignment,markdups \
  \
  --max_cpus 4 \
  --max_memory 30.GB \
  \
  --input samplesheet.csv \
  --outdir output.dev/

# Run with oncoanalyser 1.0.0
nextflow run nf-core/oncoanalyser \
  -revision 1.0.0 \
  -latest \
  -profile docker \
  \
  --genome GRCh38_hmf \
  --mode wgts \
  \
  --processes_manual \
  --processes_include alignment,markdups \
  \
  --max_cpus 4 \
  --max_memory 30.GB \
  \
  --input samplesheet.csv \
  --outdir output.1.0.0/

To help understand, could you post your output directory contents and command used where you were finding MarkDups inputs being published?

@scwatts
Copy link
Collaborator

scwatts commented Oct 19, 2024

For some further information the Nextflow output path glob patterns will exclude inputs from matches by default, and this behaviour should only change after setting includeInputs i.e. path(pattern, includeInputs: true)

@mvanniekerkHartwig
Copy link
Author

Hey Stephen, you are correct, my description was imprecise. Indeed, the markdups output directory does not contain the input BAMs.

The purpose of my change was to avoid copying the input BAMs from the scratch to the work directory. We are testing out running oncoanalyser in kubernetes in GCP. The work directory lives in a cloud storage bucket, so it can be shared among all processes. The scratch directory is ephemeral storage on a local SSD. This code change avoids the copying of the input BAMs from the scratch to the work directory. I found it by inspecting the generated shell script that runs the process.

Showing the full config for our setup is not trivial, since the configuration is dependent on the k8s yaml, the gcp k8s cluster and cloud storage setup, a nextflow config file, and a samplesheet. If you are interested, I can schedule a call where I walk you through our setup? So far, I've ran into some minor issues, but nothing blocking. I'm quite confident we should be able to run oncoanalyser in GCP in a cost-effective and stable manner.

@scwatts
Copy link
Collaborator

scwatts commented Oct 24, 2024

Forgive the slow response, I'm back from leave today

Thanks for the additional context. This seems like something that should be fixed upstream if I understand correctly, though happy to accommodate at this level as well. If possible, would you be able to test that your changes have the desired effect?

Are you able to also give some information on the shell script you inspected? e.g. name, location, source

I'll take you up on that offer and email you to organise a call next week, I'm particularly interested to understand your GCP / k8s setup and whether there are other improvements to be made for oncoanalyser

@mvanniekerkHartwig
Copy link
Author

I agree, ideally this would be fixed by nextflow itself.

Nextflow output path glob patterns will exclude inputs from matches by default

It seems like the copying from scratch to workdir does not respect this, but I'm not experienced enough with nextflow to declare this is definitely a bug.

The copying behaviour from scratch to workdir is done by the .command.run script for the process. I'll paste the relevant snippet from the script (this is from the version of oncoanalyser that includes my patch):

nxf_unstage() {
    true
    cp .command.out /workspace/oncoanalyser/executions/colo-37-fastq/work/ca/c0a18788ba4e8421cf07e4baf49be9/.command.out || true
    cp .command.err /workspace/oncoanalyser/executions/colo-37-fastq/work/ca/c0a18788ba4e8421cf07e4baf49be9/.command.err || true
    cp .command.trace /workspace/oncoanalyser/executions/colo-37-fastq/work/ca/c0a18788ba4e8421cf07e4baf49be9/.command.trace || true
    [[ ${nxf_main_ret:=0} != 0 ]] && return
    IFS=$'\n'
    for name in $(eval "ls -1d *.markdups.bam *.markdups.bam.bai versions.yml *.tsv" | sort | uniq); do
        nxf_fs_copy "$name" /workspace/oncoanalyser/executions/colo-37-fastq/work/ca/c0a18788ba4e8421cf07e4baf49be9 || true
    done
    unset IFS
}

Note the

for name in $(eval "ls -1d *.markdups.bam *.markdups.bam.bai versions.yml *.tsv" | sort | uniq); do
    nxf_fs_copy "$name" /workspace/oncoanalyser/executions/colo-37-fastq/work/ca/c0a18788ba4e8421cf07e4baf49be9 || true
done

So the file selector that copies from the scratch to the work dir is pretty much an exact copy from the output directive in the process. Before my change this was:

for name in $(eval "ls -1d *bam *bai versions.yml *.tsv" | sort | uniq); do
    nxf_fs_copy "$name" /workspace/oncoanalyser/executions/colo-37-fastq/work/ca/c0a18788ba4e8421cf07e4baf49be9 || true
done

This extra copy affects the total runtime of the markdups process quite severely, because effectively it's copying the bam twice from the scratch to the workdir.

Interestingly (and completely out of scope for this specific issue), I found that the copying from the workdir to the outputdir is done by the "main" nextflow process, and not included in the .command.run script. It would make sense for nextflow to change this behaviour to let this be done by each individual process since (especially when copying BAMs around) this can be quite a resource-intensive task.

@scwatts
Copy link
Collaborator

scwatts commented Oct 30, 2024

Resolved in #105

@scwatts scwatts closed this as completed Oct 30, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants