Slurm can track disk usage as a consumable resource, but it only checks for available space before launching a task. This is problematic: if 100 GB remain on the NFS disk and 20 concurrently launching tasks each require 10 GB, the NFS disk will ultimately fill, since Slurm sees 100 GB free at each task's launch and does not continuously monitor each task's disk consumption.
Qing had the idea of using fallocate to reserve the full amount of space each job will use ahead of time (as specified by the user). We would then iteratively shrink the file created by fallocate by the amount the job's workspace directory grows in size.
If the user underestimates the total amount of space a job will use, the job should be killed. Because the monitoring process is backgrounded, we would need to trap a signal sent from it.
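A minimal sketch of what this could look like in a bash job wrapper, assuming the reservation file lives in the workspace on the NFS disk; the names (`WORKSPACE`, `RESERVED_GB`), the polling interval, and the use of `SIGUSR1` are placeholders, not a settled design:

```bash
#!/bin/bash
# Rough sketch only: WORKSPACE, RESERVED_GB, and the 30 s poll are placeholders.

RESERVED_BYTES=$(( RESERVED_GB * 1024**3 ))
RESERVE_FILE="$WORKSPACE/.disk_reservation"

# Kill the job if the background monitor signals that the reservation is exhausted.
trap 'echo "job exceeded its reserved disk space" >&2; rm -f "$RESERVE_FILE"; exit 1' USR1

# Reserve the full user-requested amount up front on the NFS disk.
fallocate -l "$RESERVED_BYTES" "$RESERVE_FILE"

# Background monitor: shrink the reservation by however much the workspace has
# grown; if the workspace outgrows the reservation, signal the job script.
(
  while sleep 30; do
    used=$(du -sb --exclude=.disk_reservation "$WORKSPACE" | cut -f1)
    remaining=$(( RESERVED_BYTES - used ))
    if (( remaining <= 0 )); then
      kill -USR1 $$   # $$ is the parent job script, not the subshell
      break
    fi
    truncate -s "$remaining" "$RESERVE_FILE"
  done
) &
```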
Because the overhead of monitoring disk usage can be high (e.g., a workspace directory with many subfolders or many small files), this should be disabled by default and only activated if the user explicitly requests disk space as a consumable resource. This also means we should set the default disk CRES in Slurm to 0. Finally, we should caution users to reserve disk space only for tasks with nontrivial output sizes (e.g., localization, aligners, etc.).
I think it would be much easier to run a daemon that monitors how full the NFS is and increases its size as it fills up. The wasted space would be much cheaper than the amount of developer time it would take to implement something this complicated.
This would be especially easy to implement now because the NFS is served from the controller.
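Not prescribing an implementation, but a daemon like that could be very small. Everything below is a placeholder: the mount point, disk name, zone, 80% threshold, and the assumption that the cluster runs on GCP with an ext4-backed persistent disk:

```bash
#!/bin/bash
# Rough sketch: watch how full the NFS export on the controller is and grow
# the backing disk when usage crosses a threshold. All names are assumptions.

NFS_MOUNT=/mnt/nfs           # assumed mount point of the exported directory
DISK_NAME=controller-nfs     # assumed name of the backing persistent disk
ZONE=us-east1-b              # assumed zone of that disk
GROW_GB=100                  # grow by this much each time

while sleep 60; do
  pct=$(df --output=pcent "$NFS_MOUNT" | tail -1 | tr -dc '0-9')
  if (( pct >= 80 )); then
    cur_gb=$(df --output=size -BG "$NFS_MOUNT" | tail -1 | tr -dc '0-9')
    gcloud compute disks resize "$DISK_NAME" --zone "$ZONE" \
      --size "$(( cur_gb + GROW_GB ))GB" --quiet
    # Grow the filesystem to match the new disk size (assumes ext4).
    resize2fs "$(findmnt -no SOURCE "$NFS_MOUNT")"
  fi
done
```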