DASK Deployment using SLURM with GPUs #1381
Could you please report the output of print_affinity.py?

print_affinity.py:

import pynvml
from dask_cuda.utils import get_cpu_affinity

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    # get_cpu_affinity returns the CPU cores NVML associates with GPU i
    cpu_affinity = get_cpu_affinity(i)
    print(type(cpu_affinity), cpu_affinity)
Hi @pentschev, I have forgotten to mention that I have disabled the "os.sched_setaffinity(0, self.cores)" line, as attached below.
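For context, the line being commented out sits in dask-cuda's CPUAffinity worker plugin. The sketch below is paraphrased from memory rather than copied from the upstream source, so treat it as approximate:

import os

from distributed.diagnostics.plugin import WorkerPlugin


class CPUAffinity(WorkerPlugin):
    """Pin each worker process to the CPU cores NVML associates with its GPU."""

    def __init__(self, cores):
        self.cores = cores

    def setup(self, worker=None):
        # Commenting this call out disables CPU pinning for the worker.
        os.sched_setaffinity(0, self.cores)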
Keep in mind doing that will likely result in degraded performance. Here's a previous comment I wrote about this on a similar issue.
Thank you @pentschev for the reply about my disabling of os.sched_setaffinity. I will probably need some time to report the output. Regarding "print_affinity.py":
Hi @pentschev, here are the reports:

nvidia-smi topo -m output

print_affinity.py output
@AquifersBSIM, can you clarify what you mean by "I have not enabled the os.sched_setaffinity"? Do you mean that when you ran the above you had the line commented out, as in your previous #1381 (comment)? If so, that doesn't really matter for the experiment above. In any case, that unfortunately didn't clarify whether the failure was in obtaining the CPU affinity or whether something else happened. Would you please run the following modified version of the script on the compute node?

print_affinity2.py

Furthermore, the output of
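The modified script itself is not reproduced in this extract. A minimal sketch of the kind of check it likely performs, calling NVML's affinity query directly so that a failure in that call is distinguishable from a problem elsewhere in dask-cuda, could look like this (an illustration, not the script from the thread):

import math
import os

import pynvml

pynvml.nvmlInit()
n_words = math.ceil(os.cpu_count() / 64)  # NVML packs the affinity into 64-bit words
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    try:
        words = pynvml.nvmlDeviceGetCpuAffinity(handle, n_words)
        # Unpack the bitmask into a plain list of core indices
        cores = [64 * w + b for w, word in enumerate(words) for b in range(64) if (word >> b) & 1]
        print(f"GPU {i}: affinity {cores}")
    except pynvml.NVMLError as e:
        print(f"GPU {i}: NVML affinity query failed: {e}")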
Hello @pentschev, regarding "os.sched_setaffinity", I had the line commented out. Regarding the question "do you know if you're getting just a partition of the node or if you should have the full node with exclusive access for your allocation?": I am sure I am just getting a partition of the node.

Information from

Information from
So if you're getting only a partition of the node, does that mean you don't have access to all the CPU cores as well? That could be the reason why properly determining the CPU affinity fails; to be honest, I have no experience with that sort of partitioning and don't know whether it is supported by NVML either. If you know the details, can you provide more information about the CPU status, e.g., how many physical CPUs (i.e., sockets) there are, and how many cores you actually see with
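As an aside, one quick way to see how many cores the job's allocation actually grants, versus the node total, is a check like the following; this is an illustrative snippet, not something posted in the thread:

import os

# Cores this process is allowed to run on, i.e. what the SLURM/cgroup
# allocation grants the job (Linux only).
allowed = os.sched_getaffinity(0)
print("cores allowed by the allocation:", sorted(allowed))
print("number of allowed cores:", len(allowed))

# Total logical cores present on the node, regardless of the allocation.
print("total cores on the node:", os.cpu_count())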
Hi @pentschev, FWIW, here is the information that I have gotten from my admin.

This is regarding why the CPU affinity fails: "most likely dask doesn't understand cgroups, which are used extensively in HPC, so it's trying to bind processes to the wrong cores, so affinity fails. Affinity is VERY difficult to do correctly with modern NUMA and chiplets and cgroups and PCIe irq affinities and everything else."

I believe this would be an explanation of the topology of the system/cluster: "affinity tries to lock a task to a core (or set of cores) and not let the kernel move it around. The idea is to keep a task right next to specific hardware, like a GPU or RAM, so that it runs marginally faster. slurmd+cgroups give the job a fixed set of e.g. 8 cores, whatever your job requests."
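The failure mode the admin describes, binding a process to cores it is not actually allowed to use, can be reproduced outside of Dask. The snippet below is purely illustrative and assumes the job runs inside a cpuset-restricted cgroup:

import os

allowed = os.sched_getaffinity(0)
outside = set(range(os.cpu_count())) - allowed

if outside:
    core = next(iter(outside))
    try:
        # Pinning to a core outside the job's cgroup raises OSError (EINVAL),
        # which is essentially what a mismatched affinity setting runs into.
        os.sched_setaffinity(0, {core})
    except OSError as e:
        print(f"pinning to core {core} outside the allocation failed: {e}")
else:
    print("this job can see every core on the node")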
Thanks for the details @AquifersBSIM, this is indeed helpful. You are partly right: Dask does not know anything about cgroups, nor should it (I think); all the handling is done via NVML. I inquired with the NVML team and it is not clear yet, but it could be a bug. I've been asked to get more details from you so we can confirm this. Could you help answer the following questions?
Hi @pentschev, these are my answers to the questions:
Thanks @AquifersBSIM for the information. We have tried to reproduce this on our end with cgroups but have been unsuccessful. To investigate this further we need to reproduce the issue on our end; could you please also confirm the following?
Hello @pentschev, thanks for the question and your help. I think I fixed the issue by requesting the whole node. Have a look at the following output:

Allow me to send a new, easier script to run:

import os
import socket
import time
from contextlib import contextmanager

import dask.array as da
from dask.distributed import Client
from distributed.scheduler import logger

from dask_cuda import LocalCUDACluster


@contextmanager
def timed(txt):
    # Simple wall-clock timer around a block of work
    t0 = time.time()
    yield
    t1 = time.time()
    print("%32s time: %8.5f" % (txt, t1 - t0))


def example_function():
    print("start example")
    x = da.random.random((100_000, 100_000, 10), chunks=(10_000, 10_000, 5))
    y = da.random.random((100_000, 100_000, 10), chunks=(10_000, 10_000, 5))
    z = (da.arcsin(x) + da.arccos(y)).sum(axis=(1, 2)).compute()
    print(z)


if __name__ == "__main__":
    # One worker per GPU visible on this node
    cluster = LocalCUDACluster()
    client = Client(cluster)

    with timed("test"):
        example_function()

And this is my .sh
This is the output and traceback. Correct me if I am wrong, but I think Dask has worked, because the calculation actually started?
Information from
Thank you @AquifersBSIM, I appreciate the additional information, and I agree the affinity looks closer to what we expect. That now means you have 8 CPUs for each GPU, and those probably match how your cluster admin partitioned the CPUs/GPUs. Can you clarify whether the only changes you've made since the initial report are the

In your latest message you reported:

Is that all, or did you have other changes? I would also appreciate it if you could share as much of the information I requested previously in #1381 (comment) as you can; that would be valuable for identifying the behavioral difference and also for providing better instructions on setting up partitioning to match proper affinity, which seems like something our documentation is currently lacking.
Describe the issue:
I am running into an issue with deploying Dask using LocalCUDACluster() on an HPC system. I am trying to do a RandomForest, and the amount of data I am inputting exceeds the limit of a single GPU. Hence, I am trying to utilize several GPUs to split the datasets. To start with, the following is just an example script (from the Dask GitHub front page), which is shown in the code:
Minimal Complete Verifiable Example:
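The example script itself is not included in this extract. A minimal sketch of the kind of script described above, combining LocalCUDACluster with a front-page-style Dask array computation, might look like the following; array sizes and names are illustrative, not the user's actual code:

import dask.array as da
from dask.distributed import Client

from dask_cuda import LocalCUDACluster

if __name__ == "__main__":
    # One Dask worker per GPU on the node
    cluster = LocalCUDACluster()
    client = Client(cluster)

    # Front-page-style array example; sizes are placeholders
    x = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))
    y = (x + x.T).mean(axis=0)
    print(y.compute())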
In addition to that, I have this submission script
Error Message:
Anything else we need to know?:
The traceback was pretty long; I have given only a snippet of it.
Environment: