During some troubleshooting on Slack re: a fork/derivative of this workflow, I realized that the workflow doesn't do anything to exclude the pre-computed input files it downloads at the start from being re-uploaded to S3 at the end of the AWS Batch jobs we use in production. For example, from the logs of 90f43cc5-da33-44a5-b0e6-9fda03ac0806 printed by `nextstrain build` (with some non-standard timestamping added):
While `zip` seems to skip trying to compress those files further and not waste CPU time (they add only ~40s to the `zip` total), the files still bloat the size of the uploaded archive by quite a bit. While it's theoretically nice to have the exact inputs preserved with the exact outputs so we could track detailed provenance or troubleshoot by exact replication, in practice I'm not sure we need to do either of those things.
Any file deletion solution below will want to condition on running in the context of a) an internal Nextstrain profile and b) AWS Batch. These conditions can maybe best be expressed by introducing a new config var (e.g. `delete_inputs_after_use` or some better name) that defaults to disabled but that we enable for our production runs. We could also detect AWS Batch by looking for the presence of an env var (e.g. `NEXTSTRAIN_AWS_BATCH_WORKDIR_URL` would work currently), but I think a single config var to opt in to the behaviour is better.
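For example, a minimal sketch of that gating logic near the top of the Snakefile (the config var name is just the placeholder suggested above, not an existing option):

```python
import os

# Opt-in cleanup flag: disabled by default, enabled in our production profiles.
delete_inputs_after_use = bool(config.get("delete_inputs_after_use", False))

# Alternative env-var detection, shown for comparison but not preferred:
# on_aws_batch = "NEXTSTRAIN_AWS_BATCH_WORKDIR_URL" in os.environ
```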
For actually doing the deletion, I see two good solutions for the short term (hypothetical sketches of both follow the list):

1. Delete these files (conditionally) from within both an `onsuccess` and an `onerror` handler. This is maybe the most obvious.
2. Mark the files (conditionally) as `temp(…)` so Snakemake automatically cleans them up when no further rules need them. This is maybe the easiest/least extra code, but it's not clear if in practice there are pitfalls with `temp()` or if it's even supported with Snakemake remote files, which we use to download/materialize/localize files.
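A minimal sketch of option 1, assuming the `delete_inputs_after_use` config var from above; `downloaded_inputs` and its paths are placeholders for whatever pre-computed inputs the workflow actually materializes:

```python
import os

downloaded_inputs = [
    "data/sequences.fasta.xz",   # placeholder paths
    "data/metadata.tsv.gz",
]

def delete_downloaded_inputs():
    if not config.get("delete_inputs_after_use", False):
        return
    for path in downloaded_inputs:
        if os.path.exists(path):
            print(f"Deleting downloaded input {path}")
            os.remove(path)

# Snakemake runs these handlers at the end of the workflow, on success and
# failure respectively, so the inputs are excluded from the upload either way.
onsuccess:
    delete_downloaded_inputs()

onerror:
    delete_downloaded_inputs()
```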
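And a sketch of option 2, with an illustrative `maybe_temp()` helper wrapped around the output of a hypothetical download rule (this sidesteps the remote-files question by assuming a plain shell download; the rule and S3 URL are made up):

```python
def maybe_temp(path):
    # Wrap in temp() only when cleanup is opted into via config, so Snakemake
    # deletes the file once no remaining rules need it.
    return temp(path) if config.get("delete_inputs_after_use", False) else path

rule download_sequences:
    output:
        sequences = maybe_temp("data/sequences.fasta.xz"),
    shell:
        "aws s3 cp s3://example-bucket/sequences.fasta.xz {output.sequences:q}"
```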
Longer term, I think it might be reasonable for the AWS Batch machinery in nextstrain/cli and nextstrain/docker-base to support some sort of ignores file, but there's a bit more to consider there in terms of the right interface and so maybe we'll always want to leave it up to the workflow to handle.
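For illustration only, such an ignores file might end up looking like gitignore-style patterns, though the name and syntax here are entirely made up:

```
# hypothetical .nextstrain-batch-ignore
data/*.fasta.xz
data/*.tsv.gz
```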