During some troubleshooting on Slack re: a fork/derivative of this workflow, I realized that the workflow doesn't do anything to exclude the pre-computed input files it downloads at the start from being re-uploaded to S3 at the end of the AWS Batch jobs we use in production. For example, from the logs of 90f43cc5-da33-44a5-b0e6-9fda03ac0806 printed by `nextstrain build` (with some non-standard timestamping added):
While `zip` seems to skip trying to compress those files further and not waste CPU time (they add only ~40s to the `zip` total), the files still bloat the size of the uploaded archive by quite a bit. While it's theoretically nice to have the exact inputs preserved with the exact outputs so we could track detailed provenance or troubleshoot by exact replication, in practice I'm not sure we need to do either of those things.
Any file deletion solution below will want to condition on running in the context of a) an internal Nextstrain profile and b) AWS Batch. These conditions can maybe best be expressed by introducing a new config var (e.g. `delete_inputs_after_use` or some better name) that defaults to disabled but that we enable for our production runs. We could also detect AWS Batch by looking for the presence of an env var (e.g. `NEXTSTRAIN_AWS_BATCH_WORKDIR_URL` would work currently), but I think a single config var to opt in to the behaviour is better.
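For example, a minimal sketch of that gating logic near the top of the Snakefile (the config var name is just the placeholder suggested above, not an existing option):

```python
import os

# Opt-in cleanup flag: disabled by default, enabled in our production profiles.
delete_inputs_after_use = bool(config.get("delete_inputs_after_use", False))

# Alternative env-var detection, shown for comparison but not preferred:
# on_aws_batch = "NEXTSTRAIN_AWS_BATCH_WORKDIR_URL" in os.environ
```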
For actually doing the deletion, I see two good solutions for the short term (hypothetical sketches of both follow the list):

1. Delete these files (conditionally) from within both an `onsuccess` and an `onerror` handler. This is maybe the most obvious.
2. Mark the files (conditionally) as `temp(…)` so Snakemake automatically cleans them up when no further rules need them. This is maybe the easiest/least extra code, but it's not clear if in practice there are pitfalls with `temp()` or if it's even supported with Snakemake remote files, which we use to download/materialize/localize files.
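A minimal sketch of option 1, assuming the `delete_inputs_after_use` config var from above; `downloaded_inputs` and its paths are placeholders for whatever pre-computed inputs the workflow actually materializes:

```python
import os

downloaded_inputs = [
    "data/sequences.fasta.xz",   # placeholder paths
    "data/metadata.tsv.gz",
]

def delete_downloaded_inputs():
    if not config.get("delete_inputs_after_use", False):
        return
    for path in downloaded_inputs:
        if os.path.exists(path):
            print(f"Deleting downloaded input {path}")
            os.remove(path)

# Snakemake runs these handlers at the end of the workflow, on success and
# failure respectively, so the inputs are excluded from the upload either way.
onsuccess:
    delete_downloaded_inputs()

onerror:
    delete_downloaded_inputs()
```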
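And a sketch of option 2, with an illustrative `maybe_temp()` helper wrapped around the output of a hypothetical download rule (this sidesteps the remote-files question by assuming a plain shell download; the rule and S3 URL are made up):

```python
def maybe_temp(path):
    # Wrap in temp() only when cleanup is opted into via config, so Snakemake
    # deletes the file once no remaining rules need it.
    return temp(path) if config.get("delete_inputs_after_use", False) else path

rule download_sequences:
    output:
        sequences = maybe_temp("data/sequences.fasta.xz"),
    shell:
        "aws s3 cp s3://example-bucket/sequences.fasta.xz {output.sequences:q}"
```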
Longer term, I think it might be reasonable for the AWS Batch machinery in nextstrain/cli and nextstrain/docker-base to support some sort of ignores file, but there's a bit more to consider there in terms of the right interface and so maybe we'll always want to leave it up to the workflow to handle.
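For illustration only, such an ignores file might end up looking like gitignore-style patterns, though the name and syntax here are entirely made up:

```
# hypothetical .nextstrain-batch-ignore
data/*.fasta.xz
data/*.tsv.gz
```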