Use fallocate to enforce disk space #42

Open
julianhess opened this issue Mar 20, 2020 · 1 comment
Labels
enhancement New feature or request

Comments

@julianhess
Collaborator

Slurm can track disk usage as a consumable resource, but it only checks for available space before launching a task. This is a problem: if 100 GB remain on the NFS disk and 20 tasks that each require 10 GB launch concurrently, the disk will eventually fill, because Slurm sees 100 GB free at each launch and does not monitor each task's disk consumption afterward.

Qing had the idea of using fallocate to reserve, ahead of time, the full amount of space each job will use (as specified by the user). We would then iteratively shrink the file created by fallocate by the amount the job's workspace directory grows.
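A minimal sketch of the reserve-then-shrink idea, assuming the placeholder file lives outside the workspace directory and that the filesystem honors `posix_fallocate` (on some NFS versions glibc emulates it by writing blocks, which is slow). The names `reserve`, `dir_size`, and `shrink_reservation` are hypothetical, not part of any existing code:

```python
import os


def reserve(path, nbytes):
    """Create a placeholder file with nbytes of allocated blocks."""
    fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o600)
    try:
        if nbytes > 0:  # posix_fallocate rejects a zero length
            os.posix_fallocate(fd, 0, nbytes)
    finally:
        os.close(fd)


def dir_size(root):
    """Sum the apparent sizes of all files under root."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                total += os.lstat(os.path.join(dirpath, name)).st_size
            except FileNotFoundError:
                pass  # file vanished between listing and stat
    return total


def shrink_reservation(path, reserved_bytes, workspace):
    """Shrink the placeholder by however much the workspace has grown.

    Returns (bytes still reserved, whether the job exceeded its request).
    """
    used = dir_size(workspace)
    remaining = max(reserved_bytes - used, 0)
    os.truncate(path, remaining)
    return remaining, used > reserved_bytes
```

Calling `shrink_reservation` periodically from the monitoring process keeps the sum of placeholder plus workspace roughly constant, so the space Slurm sees as free already accounts for what the job has been promised.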

If the user underestimated the total amount of space used by a job, the job should be killed. Because the monitoring process is backgrounded, we would need to trap a signal sent from it.
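One way the signal handshake could look, sketched in Python for illustration (the real job wrapper may well be a shell script using `trap`). The choice of `SIGUSR1` and the names `DiskQuotaExceeded`, `install_quota_handler`, and `notify_over_quota` are assumptions:

```python
import os
import signal


class DiskQuotaExceeded(Exception):
    """Raised in the job process when the monitor reports overuse."""


def install_quota_handler(signum=signal.SIGUSR1):
    """Trap the monitor's signal in the job process (main thread only)."""
    def _handler(received_signum, frame):
        raise DiskQuotaExceeded("job exceeded its requested disk space")
    signal.signal(signum, _handler)


def notify_over_quota(pid, signum=signal.SIGUSR1):
    """Called by the backgrounded monitor to tell the job to stop."""
    os.kill(pid, signum)
```

The job process would call `install_quota_handler()` before starting work and treat `DiskQuotaExceeded` as a fatal error, cleaning up its placeholder file on the way out.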

Because the overhead of monitoring disk usage can be high (e.g., for a workspace directory with many subfolders or many small files), this should be disabled by default and only activated if the user explicitly requests disk space as a consumable resource. This also means that we should set the default disk CRES in Slurm to 0. Finally, we should caution users to reserve disk space only for tasks that have nontrivial output sizes (e.g., localization, aligners, etc.).

@julianhess added the enhancement and Triage labels on Mar 20, 2020
@agraubert removed the Triage label on Apr 2, 2020
@julianhess
Collaborator Author

I think it would be much easier to run a daemon that monitors how full the NFS disk is and increases the disk's size as it fills up. The wasted space would be much cheaper than the developer time it would take to implement something this complicated.

This would be especially easy to implement now because the NFS is served from the controller.
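A rough sketch of such a daemon loop, assuming it runs on the controller where the NFS export is a local mount. How the disk is actually grown depends on the cloud provider and filesystem, so `grow` is a hypothetical caller-supplied callback; the 90% threshold and 60-second interval are also just placeholder values:

```python
import shutil
import time


def fraction_used(path):
    """Return the used fraction (0.0 to 1.0) of the filesystem holding path."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return usage.used / usage.total


def monitor(path, grow, threshold=0.9, interval=60.0, max_checks=None):
    """Poll the filesystem and call grow(path) whenever it is too full.

    max_checks bounds the loop for testing; None means run forever.
    Returns the number of times grow was called.
    """
    checks = 0
    grown = 0
    while max_checks is None or checks < max_checks:
        if fraction_used(path) >= threshold:
            grow(path)
            grown += 1
        checks += 1
        if max_checks is None or checks < max_checks:
            time.sleep(interval)
    return grown
```

The threshold would need enough headroom that concurrently running tasks cannot fill the remaining space within one polling interval plus the time a resize takes.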
