
[batch] Azure storage requirements beyond tempdisk for standing worker result in NotImplementedError #14522

Open
jeremiahwander opened this issue May 3, 2024 · 1 comment

@jeremiahwander

What happened?

Note: this is an Azure-specific issue.

When submitting a batch/job that requests more storage than is available on the temp disk of any standing worker, but doesn't request a specific number of cores or amount of memory, a NotImplementedError is raised in batch/cloud/azure/worker/disk.py.

See this Batch record for an example of the issue in action: https://batch.azure.hail.is/batches/4563654/jobs/1. A minimal reproduction:

import hailtop.batch as hb

backend = hb.ServiceBackend(billing_project="<YOUR BILLING PROJECT>")
b = hb.Batch(backend=backend, name="storage_test")
j = b.new_job()
j.image("ubuntu:20.04")
j.storage("700GiB")  # exceeds the 600 GiB temp disk of the largest standing worker
j.command("df -h")
b.run(wait=False)

On the cluster azure.hail.is this job gets scheduled on a Standard_D16ds_v4 instance, which has a 600 GiB temp disk.

On GCP, when a storage request exceeds the temp disk capacity, a data disk is provisioned to serve the request. While this is feasible on Azure and could be implemented, it may not be the recommended solution: temp disks are much better suited to ephemeral workloads than data disks.

On clusters with a smaller standing worker (i.e. fewer cores) there is a workaround, which also suggests a reasonable partial solution. The workaround is to request a number of cores that forces a larger VM of the same family, and therefore a larger temp disk, to be provisioned for the job (see the sketch below). The corresponding partial solution would be for the scheduler to take each VM type's temp disk size into account and provision a larger VM when a job's storage requirement warrants it.
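A minimal sketch of the workaround, assuming the shared pool draws from the Ddsv4 family as above; the specific core and storage values here are illustrative:

j = b.new_job()
j.image("ubuntu:20.04")
j.cpu(16)            # force a Standard_D16ds_v4, whose temp disk is 600 GiB
j.storage("500GiB")  # fits on the larger VM's temp disk
j.command("df -h")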

Given the current 16-core limit on worker VMs, this approach implies a ceiling of 600 GiB on the storage that can be allocated to any job in Azure. Beyond that point it would be necessary to allocate a data disk.

This issue reproduces on both azure.hail.is and our own Azure cluster.

Version

0.2.126-cdd2c132bfa2

Relevant log output

No response

@jeremiahwander jeremiahwander added the needs-triage label May 3, 2024
@daniel-goldstein daniel-goldstein removed the needs-triage label May 20, 2024
@daniel-goldstein
Contributor

For posterity, the proposed mitigation is to promote a job request to a number of cores necessary to get the requested storage, much in the same way that we do for memory. If that requires more cores than can be allocated in our shared pools, we reject the request. In such a scenario, the user should alter their job to use a job-private instance with sufficient disk space.
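For illustration, a minimal sketch of that promotion logic, assuming a Ddsv4-family shared pool. The temp disk sizes come from Azure's published specs for that family, but the function and its name are hypothetical, not Hail's actual scheduler code:

# Hypothetical sketch, not Hail code: promote a job's core request until
# the VM's temp disk covers the requested storage, or reject the job.
TEMP_DISK_GIB_BY_CORES = {2: 75, 4: 150, 8: 300, 16: 600}  # Standard_Ddsv4 family

def promote_cores_for_storage(requested_cores: int, storage_gib: int) -> int:
    for cores in sorted(TEMP_DISK_GIB_BY_CORES):
        if cores >= requested_cores and TEMP_DISK_GIB_BY_CORES[cores] >= storage_gib:
            return cores
    raise ValueError(
        f"{storage_gib} GiB exceeds the largest shared-pool temp disk; "
        "use a job-private instance with sufficient disk space")

promote_cores_for_storage(1, 500)  # -> 16: the job is promoted to 16 cores
promote_cores_for_storage(1, 700)  # -> ValueError: reject, suggest job-private instance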
