Sfitz by readgroup and add FastQC #62
Conversation
I'll review in greater depth in the next couple of days but a suggestion in the meantime:
module/stats_samtools.nf (Outdated)
if (task.process == "run_stats_SAMtools_sample") {
    output_filename = generate_standard_filename("SAMtools-${params.samtools_version}",
        params.dataset_id,
        sm_id,
        [:])
    outdir = "."
    log_suffix = "${sm_id}"
} else if (task.process == "run_stats_SAMtools_library") {
    output_filename = generate_standard_filename("SAMtools-${params.samtools_version}",
        params.dataset_id,
        lib_id,
        [:])
    outdir = sm_id
    log_suffix = "${sm_id}/${lib_id}"
} else if (task.process == "run_stats_SAMtools_readgroup") {
    output_filename = generate_standard_filename("SAMtools-${params.samtools_version}",
        params.dataset_id,
        rg_id,
        [:])
    outdir = "${sm_id}/${lib_id}"
    log_suffix = "${sm_id}/${lib_id}/${rg_id}"
}
suggestion: Consider re-organizing this block to something like:
if (params.stat_mode == 'library') {
filename_id = lib_id
outdir = ...
log_suffix = ...
} else if (params.stat_mode == 'readgroup') {
filename_id = rg_id
outdir = ...
log_suffix = ...
} else {
filename_id = sm_id
outdir = ...
log_suffix = ...
}
output_filename = generate_standard_filename("SAMtools-${params.samtools_version}",
params.dataset_id,
filename_id,
[:])
and then using `addParams` when importing the process to set `stat_mode`. This reduces the amount of duplicated code (e.g. the standard filename generation function is only written once) and also removes the need for the process to be imported with a specific name; its behaviour is instead controlled by a parameter set during import.
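For concreteness, a minimal sketch of what the import side of that suggestion could look like; the module path follows this PR, while the base process name, the aliases, and the `stat_mode` values are assumptions:

```groovy
// Sketch only: each use still gets an alias so it can be invoked more than once,
// but the per-level behaviour is driven by the stat_mode parameter, not the name.
include { run_stats_SAMtools as run_statsSamples_SAMtools } from './module/stats_samtools' addParams(stat_mode: 'sample')
include { run_stats_SAMtools as run_statsLibraries_SAMtools } from './module/stats_samtools' addParams(stat_mode: 'library')
include { run_stats_SAMtools as run_statsReadgroups_SAMtools } from './module/stats_samtools' addParams(stat_mode: 'readgroup')
```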
@yashpatel6 this is ready for review
Channel
    .fromList(params.libraries_to_process)
    .map { lib ->
        def rg_arg = lib.rgs.collect { "-r ${it}" }.join(' ')
I forgot to change this. In a separate PR I will make this a list of readgroups and add the `-r` within the processes.
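Roughly what I have in mind; the channel keys and variable names here are assumptions, not the final code:

```groovy
// Sketch only: pass the raw readgroup IDs through the channel...
Channel
    .fromList(params.libraries_to_process)
    .map { lib -> tuple(lib.lib_id, lib.rgs) }

// ...and assemble the samtools -r arguments inside the process script instead:
// def rg_arg = rgs.collect { "-r ${it}" }.join(' ')
```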
Some of the GIAB samples, for example, have > 1000 readgroups. I think I should impose a maximum number of readgroups, and I suppose libraries, that will be processed separately by `SAMtools stats`. For `FastQC`, however, they will all be processed separately, as this is equivalent to running `FastQC` on the original FASTQ files.
I added parameters for the maximum number of readgroups and libraries to process separately, with defaults = 20. I also parameterized the choice of level at which to run `FastQC`.
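As a rough sketch of those options in a params template (`stats_max_rgs_per_sample` and `fastqc_level` appear elsewhere in this PR; the library cap name and the default level shown here are assumptions):

```groovy
stats_max_rgs_per_sample  = 20            // cap on readgroups run separately through SAMtools stats
stats_max_libs_per_sample = 20            // cap on libraries run separately (name assumed)
fastqc_level              = 'readgroup'   // level at which FastQC is run: 'sample', 'library' or 'readgroup'
```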
module/fastqc.nf (Outdated)
set -euo pipefail
mkdir -p ${sm_id}
samtools view -F 0x900 -h ${rg_arg} ${path} | \
samtools fastq | \
fastqc \
    --outdir "${sm_id}" \
    --format fastq \
    --extract \
    --delete \
    ${params.fastqc_additional_options} \
    stdin:${output_filename}
@sorelfitzgibbon question: I think I understand what this command tries to achieve, but can you elaborate on it a bit (can this `-F 0x900` be optional, etc.)? Also, have you tried to incorporate threading? This would probably be really slow for large samples. Is it possible to use `fastqc -format bam` with threading for the sample level run?
Thanks @tyamaguchi-ucla

- `-F 0x900` removes secondary and supplementary alignments. It is actually the default, but I thought I'd leave it here so there is no question that it's being done.
- Surprisingly, `fastqc -format bam` does not allow per-readgroup analysis.
- I have not tried to incorporate threading. All three commands allow threading. I will try it and test the speedup.
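For anyone wanting to double-check the mask, `samtools flags` decodes it (exact output formatting may vary by samtools version):

```bash
samtools flags 0x900
# 0x900   2304    SECONDARY,SUPPLEMENTARY
```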
UPDATE 2: After discussion with @yashpatel6:

- Since, within metapipeline, this pipeline will eventually be used to gate the running of further pipelines, `realtime` will be more important than `cpu hours`, thus the default will be 2 `cpus`. More than 2 `cpus` provides very little time advantage for the large (more typical size) readgroup.
- `cpus` will be adjusted down to 1 for input with large numbers of readgroups.
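A hedged sketch of how that default could be expressed in a resource config; the process selector matches the trace tables below, and the values simply restate the decision above:

```groovy
process {
    withName: assess_ReadQuality_FastQC {
        cpus = 2   // dropped to 1 for inputs with large numbers of readgroups
    }
}
```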
UPDATE: I now think it would be better to avoid threading entirely, at least when running in `metapipeline`.
- Testing on a sample with 112 readgroups took twice as long with 2 threads vs 1 (9.5 hours vs 4.7 hours).
  - /hot/software/pipeline/pipeline-generate-SQC-BAM/Nextflow/development/unreleased/sfitz-add-mosdepth/slurm-184021.out
  - /hot/software/pipeline/pipeline-generate-SQC-BAM/Nextflow/development/unreleased/sfitz-add-mosdepth/slurm-184029.out
- The per-cpu benefit of 2 threads on samples with fewer readgroups is minimal, if even real.
Below are the results for running with various threads on `SGSALASC000015-T011-O04-F.bam` (185 GB).

- This sample has five readgroups, with one being much larger than the others.
- All output here: /hot/software/pipeline/pipeline-generate-SQC-BAM/Nextflow/development/unreleased/sfitz-by-readgroup
- It looks like the sweet spot is 2 threads.
Speedup (single thread realtime/realtime)
threads | 1 | 2 | 3 | 4 | 6 | 12 |
---|---|---|---|---|---|---|
small readgroup | 1 | 1.9 | 2.1 | 3.8 | 4 | 3.4 |
large readgroup | 1 | 2 | 2.5 | 2.2 | 2.6 | 2.5 |
total cpu hours | 9.3 | 9.2 | 12.2 | 12.4 | 16.7 | 37.0 |
1 thread; full run CPU hours = 9.3
name | threads | realtime | %cpu | peak_rss | peak_vmem | rchar | wchar |
---|---|---|---|---|---|---|---|
assess_ReadQuality_FastQC (2) | 1 | 1h 24m 33s | 97.50% | 444.2 MB | 3.3 GB | 369.5 GB | 185.2 GB |
assess_ReadQuality_FastQC (5) | 1 | 1h 27m 31s | 99.70% | 444.8 MB | 3.3 GB | 369.6 GB | 185.4 GB |
assess_ReadQuality_FastQC (4) | 1 | 1h 27m 31s | 99.70% | 447.5 MB | 3.3 GB | 369.6 GB | 185.4 GB |
assess_ReadQuality_FastQC (3) | 1 | 1h 27m 31s | 99.80% | 445.4 MB | 3.3 GB | 370.4 GB | 186.2 GB |
assess_ReadQuality_FastQC (1) | 1 | 3h 31m 22s | 99.60% | 444.5 MB | 3.3 GB | 1.2 TB | 1 TB |
2 threads; full run CPU hours = 9.2
name | threads | realtime | %cpu | peak_rss | peak_vmem | rchar | wchar |
---|---|---|---|---|---|---|---|
assess_ReadQuality_FastQC (3) | 2 | 44m 11s | 188.30% | 459.8 MB | 3.9 GB | 370.4 GB | 186.2 GB |
assess_ReadQuality_FastQC (4) | 2 | 44m 11s | 190.20% | 465.7 MB | 3.9 GB | 369.6 GB | 185.4 GB |
assess_ReadQuality_FastQC (5) | 2 | 44m 10s | 189.90% | 465.2 MB | 3.9 GB | 369.6 GB | 185.4 GB |
assess_ReadQuality_FastQC (2) | 2 | 38m 54s | 191.90% | 462 MB | 3.9 GB | 369.5 GB | 185.2 GB |
assess_ReadQuality_FastQC (1) | 2 | 1h 45m 35s | 200.20% | 459.1 MB | 3.9 GB | 1.2 TB | 1 TB |
3 threads; full run CPU hours = 12.2
name | threads | realtime | %cpu | peak_rss | peak_vmem | rchar | wchar |
---|---|---|---|---|---|---|---|
assess_ReadQuality_FastQC (2) | 3 | 40m 48s | 182.00% | 877.1 MB | 4.9 GB | 369.5 GB | 185.2 GB |
assess_ReadQuality_FastQC (3) | 3 | 40m 48s | 181.90% | 885.5 MB | 4.9 GB | 370.4 GB | 186.2 GB |
assess_ReadQuality_FastQC (4) | 3 | 38m 42s | 187.30% | 885.5 MB | 4.9 GB | 369.6 GB | 185.4 GB |
assess_ReadQuality_FastQC (5) | 3 | 38m 43s | 187.50% | 880.5 MB | 5 GB | 369.6 GB | 185.4 GB |
assess_ReadQuality_FastQC (1) | 3 | 1h 24m 36s | 237.80% | 894.7 MB | 5 GB | 1.2 TB | 1 TB |
4 threads; full run CPU hours = 12.4
name | threads | realtime | %cpu | peak_rss | peak_vmem | rchar | wchar |
---|---|---|---|---|---|---|---|
assess_ReadQuality_FastQC (2) | 4 | 22m 33s | 350.10% | 1.1 GB | 5.7 GB | 369.5 GB | 185.2 GB |
assess_ReadQuality_FastQC (3) | 4 | 22m 19s | 358.50% | 905.9 MB | 5.8 GB | 370.4 GB | 186.2 GB |
assess_ReadQuality_FastQC (5) | 4 | 22m 7s | 356.40% | 970.9 MB | 5.7 GB | 369.6 GB | 185.4 GB |
assess_ReadQuality_FastQC (4) | 4 | 22m 10s | 353.60% | 971.2 MB | 5.7 GB | 369.6 GB | 185.4 GB |
assess_ReadQuality_FastQC (1) | 4 | 1h 37m 8s | 233.10% | 1.1 GB | 5.7 GB | 1.2 TB | 1 TB |
6 threads; full run CPU hours = 16.7
name | threads | realtime | %cpu | peak_rss | peak_vmem | rchar | wchar |
---|---|---|---|---|---|---|---|
assess_ReadQuality_FastQC (2) | 6 | 21m 36s | 332.00% | 1.5 GB | 6.9 GB | 369.5 GB | 185.2 GB |
assess_ReadQuality_FastQC (3) | 6 | 21m 36s | 319.90% | 1.4 GB | 6.8 GB | 370.4 GB | 186.2 GB |
assess_ReadQuality_FastQC (5) | 6 | 21m 36s | 332.20% | 1.5 GB | 6.9 GB | 369.6 GB | 185.4 GB |
assess_ReadQuality_FastQC (4) | 6 | 21m 36s | 318.70% | 1.5 GB | 6.9 GB | 369.6 GB | 185.4 GB |
assess_ReadQuality_FastQC (1) | 6 | 1h 20m 29s | 263.50% | 1.4 GB | 7 GB | 1.2 TB | 1 TB |
12 threads; full run CPU hours = 37.0
name | threads | realtime | %cpu | peak_rss | peak_vmem | rchar | wchar |
---|---|---|---|---|---|---|---|
assess_ReadQuality_FastQC (5) | 12 | 25m 22s | 271.30% | 2.5 GB | 10.4 GB | 369.6 GB | 185.4 GB |
assess_ReadQuality_FastQC (2) | 12 | 25m 22s | 265.00% | 2.3 GB | 10.6 GB | 369.5 GB | 185.2 GB |
assess_ReadQuality_FastQC (4) | 12 | 25m 22s | 265.00% | 2.4 GB | 10.5 GB | 369.6 GB | 185.4 GB |
assess_ReadQuality_FastQC (3) | 12 | 25m 22s | 271.30% | 2.5 GB | 10.4 GB | 370.4 GB | 186.2 GB |
assess_ReadQuality_FastQC (1) | 12 | 1h 23m 44s | 250.30% | 2.1 GB | 10.5 GB | 1.2 TB | 1 TB |
> -F 0x900 removes secondary and supplementary alignments. It is actually the default but I thought I'd leave it here so there is no question that it's being done.

Thanks for the benchmarking results. I wasn't aware of this default. Do we want to include them though (i.e. `-f`?) to get a complete result for the RG-level FASTQs, etc.?
No, I don't think we do want to include them. They are duplicates of reads that are already represented. Removing them creates the output that directly matches the original FASTQs.
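As a quick sanity check outside the pipeline, standard `samtools view -c` counting shows how many records the mask actually drops for a given BAM (file name is a placeholder):

```bash
samtools view -c input.bam            # all records, including secondary/supplementary alignments
samtools view -c -F 0x900 input.bam   # primary records only, i.e. what is passed to FastQC here
```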
@sorelfitzgibbon (suggestion/question: non-blocking) You've generated lots of benchmarking results and we should post them on GitHub discussions as well. Also, do you typically adjust for caching performance when conducting benchmarking?
> do you typically adjust for caching performance when conducting benchmarking?
Is there a recommended way of doing that? I ran almost all of it on nodes that were already running and had space available. They were run at more or less the same time. I could run them all concurrently on a single F72, although I assume even that background will change as jobs start to finish.
Caching adjustments are hard to control in general; a way around it could be to make the tests totally independent of the cache by copying the input files to /scratch and then running all benchmarking runs using the files under /scratch.
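Something along these lines, with paths as placeholders rather than the real locations:

```bash
# Stage the input BAM on node-local /scratch so every benchmarking run reads from
# the same place, independent of whatever is already sitting in the page cache.
mkdir -p /scratch/$USER/benchmark
cp /path/to/input.bam /scratch/$USER/benchmark/
# Point the benchmarking configs at /scratch/$USER/benchmark/input.bam and repeat each thread setting.
```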
I will do some cache-independent benchmarking.
> Caching adjustments are hard to control in general; a way around it could be to make the tests totally independent of the cache by copying the input files to /scratch and then running all benchmarking runs using the files under /scratch.

Yeah, I was going to suggest this. Maybe worth creating a Confluence page for general benchmark recommendations? We've discussed pre-caching, etc. with OHIA and MSFT a while ago, but using /scratch would be the easiest solution.
@yashpatel6 I think this is ready for another review. I will update based on cache-independent benchmarking in the next PR.
Actually hold off. I need to stop single libraries and readgroups from being run multiple times.
Gotcha, feel free to re-request the review when ready!
I started a Discussion about goals for integrating sample level QC into `metapipeline`.
A couple of comments and suggestions but otherwise looks good:
config/F16.config (Outdated)
}
}
}
withName: run_stats_SAMtools_readgroup {
I'm ok with maintaining the three sections and using run_statsReadgroups_SAMtools
reference = '/hot/ref/reference/GRCh38-BI-20160721/Homo_sapiens_assembly38.fasta'
output_dir = '/path/to/output/directory'
blcds_registered_dataset = false // if you want the output to be registered
save_intermediate_files = true

// SAMtools stats options
samtools_remove_duplicates = false
samtools_stats_additional_options = ''
stats_max_rgs_per_sample = 20
suggestion: Do we also want to add the max libs parameter here?
@yashpatel6 I left it off in the interest of streamlining the template since library numbers that are large enough to be an issue are currently unlikely. It is in the README. Do you want me to add it?
For consistency, I would suggest adding it
done and tested:
Generally looks good technically! The specific usage in terms of failing/passing QC we can discuss in the metapipeline context
Description

- Processing by `samples`, `libraries` and `readgroups`: `SAMtools stats` now runs three times, `by sample`, `by library` and `by readgroup`, unless there is only a single library or readgroup.
- `FastQC` runs at a single level set by input parameter `fastqc_level`.
- `CollectWgsMetrics` and `Qualimap` are functionally unchanged.
- `HG003 downsampled` added to `NFTest`.
- `algorithms` standardized to `algorithm`.

Closes #42 #44 #45 #50 #63 #66 #67
Testing Results

- `HG003_0.05x-selected-readgroups` = `readgroups` (3 `libraries`) selected out of 1005 total `readgroups` (-> 11 MB BAM, 33,400 reads). 1 sample (HG003) with...
  - `library1`: 1 `readgroup`
  - `library2`: 2 `readgroups`
  - `library3`: 3 `readgroups`
- `A_mini`: 2 samples, each with one library and one readgroup
- `hg003-all-tools`, sample level `FastQC`
- `a_mini-all-tools`
- `a_mini-all-tools-multiple-samples`
- /hot/software/pipeline/pipeline-generate-SQC-BAM/Nextflow/development/unreleased/sfitz-by-readgroup/log-nftest-20240626T210242Z.log
- `hg003-fastqc`, readgroup level `FastQC`
- `a_mini-fastqc`, sample level `FastQC`
Checklist
- I have read the code review guidelines and the code review best practice on GitHub check-list.
- I have reviewed the Nextflow pipeline standards.
- The name of the branch is meaningful and well formatted following the standards, using [AD_username (or 5 letters of AD if AD is too long)]-[brief_description_of_branch].
- I have set up or verified the branch protection rule following the GitHub standards before opening this pull request.
- I have added my name to the contributors listings in the `manifest` block in the `nextflow.config` as part of this pull request, am listed already, or do not wish to be listed. (This acknowledgement is optional.)
- I have added the changes included in this pull request to the `CHANGELOG.md` under the next release version or unreleased, and updated the date.
- I have updated the version number in the `metadata.yaml` and `manifest` block of the `nextflow.config` file following semver, or the version number has already been updated. (Leave it unchecked if you are unsure about the new version number and discuss it with the infrastructure team in this PR.)
- I have tested the pipeline using NFTest, or I have justified why I did not need to run NFTest above.