Multi-node parallelism on Slurm clusters #290
I am encountering an issue where, even though I specify nodelist=node41,node42 and allocate 2 nodes, the task ranks do not seem to be split across the nodes. Instead, each node appears to execute the same ranks. Below are the logs:
As a result, I cannot achieve true parallel computation across nodes.
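A quick way to see what each process is actually given is to print the Slurm environment on every task. This is a generic diagnostic sketch, not from the thread; the variables are standard ones Slurm sets per task / array element:

```python
import os

# Print the identity Slurm gave this process. If two nodes print the
# same SLURM_PROCID (or SLURM_ARRAY_TASK_ID for array jobs), the same
# rank really is being launched twice instead of split across nodes.
for var in ("SLURMD_NODENAME", "SLURM_NODEID", "SLURM_PROCID",
            "SLURM_NTASKS", "SLURM_ARRAY_TASK_ID"):
    print(f"{var}={os.environ.get(var, '<unset>')}")
```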
Hi, do you know if your cluster is configured to always run 1 task per node? Ideally, different tasks should be able to share CPU resources on the same node rather than always taking a node exclusively.
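One way to check that (a sketch assuming the standard `scontrol` CLI is available on the cluster): if a partition reports `OverSubscribe=EXCLUSIVE`, Slurm hands out whole nodes and tasks cannot share them.

```python
import subprocess

# List each partition's OverSubscribe setting. EXCLUSIVE means every
# job gets a whole node to itself, which would explain 1 task per node.
out = subprocess.run(["scontrol", "show", "partition", "--oneliner"],
                     capture_output=True, text=True, check=True).stdout
for line in out.splitlines():
    fields = dict(f.split("=", 1) for f in line.split() if "=" in f)
    print(fields.get("PartitionName"), fields.get("OverSubscribe"))
```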
Can you show the config you are using?
Here: #292
Hi,
BTW, I found another way to do multi-node parallelism. I turned to use . Do you think it is a good way?
Hi,
Let's say I have a Slurm cluster with 100 nodes, each with 100 cores, and I have 10000 tasks.
This is my current code:
I find that `workers` is the number of nodes rather than the number of CPU cores. Is my understanding correct? If so, it seems my 10000 tasks will be executed node by node: for example, node_1 will process task_1, then task_101, then task_201, which does not fully utilize all the CPU cores of a single node. I expect all of those tasks (task_1, task_101, task_201, ...) to be assigned to node_1 at the beginning and executed in parallel.
If I want to distribute my jobs across the nodes and fully utilize every core, what should I change in my code?
Thanks!
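For what it's worth, the `workers` parameter discussed here behaves like a concurrency throttle on a Slurm job array, not a node count. Assuming this is datatrove's SlurmPipelineExecutor (an assumption, since the thread does not show the code, and parameter names may differ in your version), a sketch of a setup that lets Slurm pack many 1-CPU tasks onto each node:

```python
from datatrove.executor import SlurmPipelineExecutor

# Hypothetical values throughout: the partition name, time limit and
# logging path are placeholders. The key idea: each of the 10000 tasks
# becomes one Slurm array element requesting a single CPU, and `workers`
# only caps how many elements run at once (an --array=0-9999%N
# throttle). Without an exclusive-node policy, Slurm can pack ~100 such
# tasks onto each 100-core node on its own.
executor = SlurmPipelineExecutor(
    pipeline=[...],        # your pipeline steps go here
    tasks=10_000,          # one array element per task
    workers=-1,            # -1 = no throttle on concurrent tasks
    cpus_per_task=1,       # small request so tasks can share a node
    partition="cpu",       # hypothetical partition name
    time="01:00:00",
    logging_dir="logs/",
)
executor.run()
```

If tasks still land one per node with a setup like this, the exclusive-node partition setting checked earlier in the thread is the more likely culprit.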