A lot of our WDL tasks have been failing due to insufficient resources when handling data outside our usual expected size range.
It's been proposed (thanks Scott) that rather than dialling up all the resources requested by WDL tasks, a more "scientific" approach be used: check the size of the input before running a task on it, then set that task's resource request appropriately.
For example, there was a recent failure in the VcfMerge task, which takes two VCF files and merges them.
The WDL task that calls this command is set up to use 20 GB of memory.
VCF files from WGS typically contain around 5 million records. A recent run against some NIH data produced VCF files containing over 12 million records; 20 GB was not enough for this dataset and the job ran out of memory.
If there was a means of checking the size of the VCF files before they are merged, the resources for the merge WDL task could be adjusted appropriately.
Describe the solution you'd like
WDL's standard library provides `size()` for file sizes, and a preliminary task could produce line/record counts. We would then need to come up with some cutoffs and resource parameters.
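As a starting point, here is a minimal sketch of how `size()` could drive the memory request directly in the runtime block. The task body, the `vcf-merge` command, the 2 GB-per-GB multiplier, and the 20 GB floor are all illustrative assumptions, not our actual VcfMerge task:

```wdl
version 1.0

task VcfMerge {
  input {
    File vcf_a
    File vcf_b
  }

  # Assumed heuristic: ~2 GB of memory per GB of input, with a 20 GB floor.
  Int needed_gb = ceil(2 * size([vcf_a, vcf_b], "GB"))
  Int mem_gb = if needed_gb > 20 then needed_gb else 20

  command <<<
    # Placeholder merge command; substitute the real one.
    vcf-merge ~{vcf_a} ~{vcf_b} > merged.vcf
  >>>

  output {
    File merged = "merged.vcf"
  }

  runtime {
    memory: mem_gb + " GB"
  }
}
```

Note this only covers on-disk file size, which `size()` can evaluate before the command runs; record counts would still need a separate measuring task.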
Describe alternatives you've considered
Bump all wdl task resources when running with large datasets.
Additional context
??
I first discussed this with Ross a couple of months ago when the NIH bams hit, and I like the idea in principle. Bespoke resource requests per task, based on measuring inputs, would be ideal. However, because the runtime block is evaluated before the command runs, you can't measure within the task itself: you'd have to create a separate custom task that measured input sizes and produced the appropriate resource values to pass to the following task that actually did the work.
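The two-task pattern described above could look something like this. The record cutoff, memory tiers, task names, and the `vcf-merge` command are all hypothetical placeholders:

```wdl
version 1.0

# Small preliminary task: count non-header records in a (possibly gzipped) VCF.
task MeasureVcf {
  input {
    File vcf
  }
  command <<<
    zcat -f ~{vcf} | grep -vc '^#'
  >>>
  output {
    Int n_records = read_int(stdout())
  }
  runtime {
    memory: "1 GB"
  }
}

# The real work, with memory supplied by the workflow.
task Merge {
  input {
    File vcf_a
    File vcf_b
    Int mem_gb
  }
  command <<<
    # Placeholder merge command; substitute the real one.
    vcf-merge ~{vcf_a} ~{vcf_b} > merged.vcf
  >>>
  output {
    File merged = "merged.vcf"
  }
  runtime {
    memory: mem_gb + " GB"
  }
}

workflow SizedMerge {
  input {
    File vcf_a
    File vcf_b
  }
  call MeasureVcf as measure_a { input: vcf = vcf_a }
  call MeasureVcf as measure_b { input: vcf = vcf_b }

  # Assumed cutoff: 20 GB up to 10M combined records, 40 GB beyond that.
  Int mem_gb = if measure_a.n_records + measure_b.n_records > 10000000 then 40 else 20

  call Merge { input: vcf_a = vcf_a, vcf_b = vcf_b, mem_gb = mem_gb }

  output {
    File merged = Merge.merged
  }
}
```

The extra measuring calls cost a little scheduling overhead per input, but they're cheap single-pass `wc`-style jobs compared to re-running a failed merge.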
I also considered exactly the alternative suggested above: a single float-valued "scaling" parameter in the workflow, defaulting to 1, passed to all tasks to multiply their memory and walltime requests. It'd be less accurate, but simpler.
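For comparison, that scaling-parameter alternative is a much smaller change per task. The task name, command, and 20 GB baseline here are illustrative:

```wdl
version 1.0

task SomeTask {
  input {
    File in_file
    # Workflow-wide knob, defaulting to 1; a run against oversized data
    # might pass e.g. resource_scale = 2.0 to every task.
    Float resource_scale = 1.0
  }

  Int mem_gb = ceil(20 * resource_scale)  # 20 GB baseline, illustrative

  command <<<
    # Placeholder command.
    do_work ~{in_file}
  >>>

  runtime {
    memory: mem_gb + " GB"
  }
}
```

One knob covers every task, at the cost of over-provisioning the tasks that didn't need scaling.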
I'm happy to take this on, with the caveat that it's a learning task and I have one or two higher-priority things in flight. That said, I should have plenty of time between tests to tinker with this, so hopefully it won't take too long.