-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Markdups process marks aligner output as markdups output incorrectly #104
Comments
Hi @mvanniekerkHartwig, thanks for opening the issue and PR. I haven't been able to reproduce publishing of MarkDups input BAMs - here are the published outputs that I see when trying to replicate:
These outputs were generated with the following commands: Click to show commands# Generate samplesheet
cat <<EOF > samplesheet.csv
group_id,subject_id,sample_id,sample_type,sequence_type,filetype,info,filepath
subject_a,subject_a,subject_a.tumor,tumor,dna,fastq,library_id:library_1;lane:001,https://github.com/nf-core/test-datasets/raw/oncoanalyser/sample_data/simulated_reads/wgts/fastq/single_lane_single_library/subject_a.tumor.dna.R1.fastq.gz;https://github.com/nf-core/test-datasets/raw/oncoanalyser/sample_data/simulated_reads/wgts/fastq/single_lane_single_library/subject_a.tumor.dna.R2.fastq.gz
subject_a,subject_a,subject_a.normal,normal,dna,fastq,library_id:library_1;lane:001,https://github.com/nf-core/test-datasets/raw/oncoanalyser/sample_data/simulated_reads/wgts/fastq/single_lane_single_library/subject_a.normal.dna.R1.fastq.gz;https://github.com/nf-core/test-datasets/raw/oncoanalyser/sample_data/simulated_reads/wgts/fastq/single_lane_single_library/subject_a.normal.dna.R2.fastq.gz
EOF
# Run with oncoanalyser dev branch
nextflow run nf-core/oncoanalyser \
-revision dev \
-latest \
-profile docker \
\
--genome GRCh38_hmf \
--mode wgts \
\
--processes_manual \
--processes_include alignment,markdups \
\
--max_cpus 4 \
--max_memory 30.GB \
\
--input samplesheet.csv \
--outdir output.dev/
# Run with oncoanalyser 1.0.0
nextflow run nf-core/oncoanalyser \
-revision 1.0.0 \
-latest \
-profile docker \
\
--genome GRCh38_hmf \
--mode wgts \
\
--processes_manual \
--processes_include alignment,markdups \
\
--max_cpus 4 \
--max_memory 30.GB \
\
--input samplesheet.csv \
--outdir output.1.0.0/ To help understand, could you post your output directory contents and command used where you were finding MarkDups inputs being published? |
For some further information the Nextflow output path glob patterns will exclude inputs from matches by default, and this behaviour should only change after setting includeInputs i.e. |
Hey Stephen, you are correct, my description was imprecise. Indeed, the markdups output directory does not contain the input BAMs. The purpose of my change was to avoid copying the input BAMs from the scratch to the work directory. We are testing out running oncoanalyser in kubernetes in GCP. The work directory lives in a cloud storage bucket, so it can be shared among all processes. The scratch directory is ephemeral storage on a local SSD. This code change avoids the copying of the input BAMs from the scratch to the work directory. I found it by inspecting the generated shell script that runs the process. Showing the full config for our setup is not trivial, since the configuration is dependent on the k8s yaml, the gcp k8s cluster and cloud storage setup, a nextflow config file, and a samplesheet. If you are interested, I can schedule a call where I walk you through our setup? So far, I've ran into some minor issues, but nothing blocking. I'm quite confident we should be able to run oncoanalyser in GCP in a cost-effective and stable manner. |
Forgive the slow response, I'm back from leave today Thanks for the additional context. This seems like something that should be fixed upstream if I understand correctly, though happy to accommodate at this level as well. If possible, would you be able to test that your changes have the desired effect? Are you able to also give some information on the shell script you inspected? e.g. name, location, source I'll take you up on that offer and email you to organise a call next week, I'm particularly interested to understand your GCP / k8s setup and whether there are other improvements to be made for oncoanalyser |
I agree, ideally this would be fixed by nextflow itself.
It seems like the copying from scratch to workdir does not respect this, but I'm not experienced enough with nextflow to declare this is definitely a bug. The copying behaviour from scratch to workdir is done by the nxf_unstage() {
true
cp .command.out /workspace/oncoanalyser/executions/colo-37-fastq/work/ca/c0a18788ba4e8421cf07e4baf49be9/.command.out || true
cp .command.err /workspace/oncoanalyser/executions/colo-37-fastq/work/ca/c0a18788ba4e8421cf07e4baf49be9/.command.err || true
cp .command.trace /workspace/oncoanalyser/executions/colo-37-fastq/work/ca/c0a18788ba4e8421cf07e4baf49be9/.command.trace || true
[[ ${nxf_main_ret:=0} != 0 ]] && return
IFS=$'\n'
for name in $(eval "ls -1d *.markdups.bam *.markdups.bam.bai versions.yml *.tsv" | sort | uniq); do
nxf_fs_copy "$name" /workspace/oncoanalyser/executions/colo-37-fastq/work/ca/c0a18788ba4e8421cf07e4baf49be9 || true
done
unset IFS
} Note the for name in $(eval "ls -1d *.markdups.bam *.markdups.bam.bai versions.yml *.tsv" | sort | uniq); do
nxf_fs_copy "$name" /workspace/oncoanalyser/executions/colo-37-fastq/work/ca/c0a18788ba4e8421cf07e4baf49be9 || true
done So the file selector that copies from the scratch to the work dir is pretty much an exact copy from the output directive in the process. Before my change this was: for name in $(eval "ls -1d *bam *bai versions.yml *.tsv" | sort | uniq); do
nxf_fs_copy "$name" /workspace/oncoanalyser/executions/colo-37-fastq/work/ca/c0a18788ba4e8421cf07e4baf49be9 || true
done This extra copy affects the total runtime of the markdups process quite severely, because effectively it's copying the bam twice from the scratch to the workdir. Interestingly (and completely out of scope for this specific issue), I found that the copying from the workdir to the outputdir is done by the "main" nextflow process, and not included in the |
Resolved in #105 |
Description of the bug
When markdups is finished, the nextflow markdups process incorrectly marks the aligner outputs (inputs for markdups) as output files. This causes unnecessary copying of the bams, especially when using a "scratch directory".
Command used and terminal output
No response
Relevant files
No response
System information
No response
The text was updated successfully, but these errors were encountered: