hang in MPI_Init with unbalanced ranks #222
Comments
-N might be flipping where the unused core is located, e.g. with 2 nodes of 4 cores each. It might be worth doing a little audit here to see if anything stands out with these layouts in mind. |
I think @garlick meant to put this comment here:
I wonder if the two jobs have the same R? I'll try to reproduce this. |
yes sorry! |
Hm, this is interesting (did we know this and just forgot?)

```
$ flux run -N2 -n 7 /bin/true
$ flux job info $(flux job last) R
{"version": 1, "execution": {"R_lite": [{"rank": "0-1", "children": {"core": "0-3"}}], "starttime": 1727383096.7284338, "expiration": 0.0, "nodelist": ["corona[82,82]"]}}
```

versus the same job without -N:

```
$ flux run -n 7 /bin/true
$ flux job info $(flux job last) R
{"version": 1, "execution": {"R_lite": [{"rank": "0", "children": {"core": "0-3"}}, {"rank": "1", "children": {"core": "0-2"}}], "starttime": 1727383280.7969263, "expiration": 0.0, "nodelist": ["corona[82,82]"]}}
```

This seems to be explicit in the jobspec created by the first case:

```
$ flux run -N2 -n7 --dry-run hostname | jq .resources
[
  {
    "type": "node",
    "count": 2,
    "with": [
      {
        "type": "slot",
        "count": 4,
        "with": [
          {
            "type": "core",
            "count": 1
          }
        ],
        "label": "task"
      }
    ]
  }
]
```

There is even a comment in the code:

```python
if num_nodes is not None:
    num_slots = int(math.ceil(num_tasks / float(num_nodes)))
    if num_tasks % num_nodes != 0:
        # N.B. uneven distribution results in wasted task slots
        task_count_dict = {"total": num_tasks}
    else:
        task_count_dict = {"per_slot": 1}
    slot = cls._create_slot("task", num_slots, children)
    resource_section = cls._create_resource(
        "node", num_nodes, [slot], exclusive
    )
```
|
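For reference, here is the arithmetic from the snippet above worked through for the -N2 -n7 case. This is a standalone sketch mirroring the quoted calculation, not the actual flux code path: ceil(7/2) gives 4 slots per node, so 8 slots total, one of which goes unused.

```python
import math

# Sketch of the slot calculation for `flux run -N2 -n7` (illustrative only)
num_tasks = 7
num_nodes = 2

num_slots = int(math.ceil(num_tasks / float(num_nodes)))  # 4 slots per node
total_slots = num_slots * num_nodes                        # 8 slots across 2 nodes
wasted_slots = total_slots - num_tasks                     # 1 unused task slot

# uneven distribution, so the jobspec carries a total task count
task_count_dict = (
    {"total": num_tasks} if num_tasks % num_nodes != 0 else {"per_slot": 1}
)

print(num_slots, total_slots, wasted_slots, task_count_dict)
# 4 8 1 {'total': 7}
```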
Anyway, maybe the extra task slot is confusing the taskmap stuff into running the wrong number of tasks on one of the nodes? |
I think the taskmaps are actually correct and I was confused. Fluxion is packing 4 ranks onto the first node in both cases, and 3 on the second, but for some reason when -N is specified, the order of nodes is reversed. |
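To make the 4-and-3 layout concrete, a Flux taskmap can be expanded into a rank-to-node list with a short sketch like the one below. The RFC 34-style block format [nodeid, nnodes, ppn, repeat] and the example taskmap are assumptions for illustration, not values copied from the jobs above.

```python
def expand_taskmap(taskmap):
    """Expand RFC 34-style taskmap blocks into a list: index = rank, value = node."""
    nodes = []
    for nodeid, nnodes, ppn, repeat in taskmap:
        for _ in range(repeat):
            for node in range(nodeid, nodeid + nnodes):
                nodes.extend([node] * ppn)
    return nodes

# 7 tasks packed 4 + 3 over two nodes:
print(expand_taskmap([[0, 1, 4, 1], [1, 1, 3, 1]]))  # [0, 0, 0, 0, 1, 1, 1]
```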
FWIW I ran the two cases and dumped the apinfo. The apinfo comes out the same for both jobs (on both nodes). The environment seems to only differ in the expected ways. I did notice that slurm is now up to version 5 of the apinfo struct, while we are on version 0. |
slurm also sets several PMI variables that we don't set. I assumed these would be set by cray's libpmi if at all, since they are normally for the benefit of MPI, but since we're seeing a problem it's maybe worth noting. |
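If it helps to enumerate what each launcher provides, a small sketch (standard library only; run it under both launchers and diff the output; the prefix list is a guess at the relevant variable families):

```python
import os

# Print launcher/PMI-related environment variables for side-by-side comparison.
PREFIXES = ("PMI_", "PMI2_", "PALS_", "SLURM_", "FLUX_")
for name in sorted(os.environ):
    if name.startswith(PREFIXES):
        print(f"{name}={os.environ[name]}")
```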
Looks like cray pmi prints debug on stderr when its debug variable is set. In @ryanday36's failing case above, the first rank on the second node (96) correctly identifies the address and port that the PMI rank 0 node is listening on, apparently successfully connects, and sends a barrier request there on behalf of its other local ranks. The rank 0 node never sees the connection. Where did it connect? In the good case, the connection is logged, a barrier release packet is returned, and things proceed.

Sanitized failing log of PE_0 and PE_96 with some noise removed
Sanitized log of PE_0 and PE_95 (1st rank on second node) for good case:
|
FWIW, this does not require MPI to reproduce. This example reproduces the hang with logging using a client that only uses PMI:
|
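For illustration, a minimal PMI-only client (not necessarily the one referred to above) might look like the following sketch. It assumes a PMI2 library is available as libpmi2.so and exposes the standard PMI2_Init/PMI2_Finalize entry points; the library name and its behavior under cray PMI are assumptions.

```python
import ctypes

# Load the PMI2 library provided by the launcher (name is an assumption).
pmi2 = ctypes.CDLL("libpmi2.so", mode=ctypes.RTLD_GLOBAL)

spawned = ctypes.c_int()
size = ctypes.c_int()
rank = ctypes.c_int()
appnum = ctypes.c_int()

# PMI2_Init performs the cross-node wire-up; in the failing case this is
# where the hang would be expected to show up.
rc = pmi2.PMI2_Init(
    ctypes.byref(spawned), ctypes.byref(size), ctypes.byref(rank), ctypes.byref(appnum)
)
if rc != 0:
    raise SystemExit(f"PMI2_Init failed with rc={rc}")
print(f"rank {rank.value} of {size.value} initialized")
pmi2.PMI2_Finalize()
```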
I tried setting |
I found a bug in how the plugin writes to the apinfo file. If you do a flux alloc with N ranks and then a flux run with fewer ranks, the plugin writes to the apinfo file that there are N ranks. This causes PMI to hang on initialization as it's waiting for all N ranks to show up. I think the bug is in cray_pals.c:create_apinfo() but am not confident of the fix. |
Oh nice @RaymondMichael! Any chance that you could come up with a regression test for this? I looked at the plugin in depth early on and didn't spot a problem. But I probably missed something! |
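One possible shape for such a regression check is sketched below. It assumes the apinfo has been dumped to JSON (as apinfo_checker.py does) with top-level "pes" and "nodes" arrays; the field names and the hypothetical script name are assumptions.

```python
import json
import sys

# Usage (hypothetical): apinfo_check.py EXPECTED_NTASKS < apinfo.json
expected_ntasks = int(sys.argv[1])
apinfo = json.load(sys.stdin)

npes = len(apinfo.get("pes", []))
nnodes = len(apinfo.get("nodes", []))

if npes != expected_ntasks:
    sys.exit(f"apinfo lists {npes} PEs but the job has {expected_ntasks} tasks")
print(f"ok: {npes} PEs across {nnodes} nodes")
```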
I need to correct my previous statement. We're seeing the problem when Flux is run under Slurm:

```
srun -N 2 -n 6 --pty flux start
```

I'm not sure how you would use the test harness to mimic this situation, but you would do something like:

```
flux run -n 5 ${PYTHON:-python3} ${SHARNESS_TEST_SRCDIR}/scripts/apinfo_checker.py | sed -n "s/^1: //p") &&
```
|
Ok, I will go try this out on some of our Slurm clusters and try to reproduce. I wonder though if it has something to do with the fact that you are running flux with multiple brokers per node. I wonder what would happen if you ran with one broker per node instead (to modify your example). This commit is relevant. |
I could be mistaken but I wouldn't think this would be related to how the flux instance is started. It more likely has to do with how the MPI job's resources are translated to apinfo. |
Just checking: is the slurm case a cray system where flux is being run without the cray-pals plugin? IOW is cray PMI finding the apinfo file put there by slurm for the allocation? |
Yeah, I was wondering if maybe having duplicate entries in the apinfo structures was somehow confusing it. I wasn't able to reproduce. I got the following, which looks correct:
|
It looks like you're right. I'm dealing with an incomplete local installation of Flux and was using the apinfo file from Slurm. I'll work on getting that fixed. |
Can I get a copy of an apinfo file from the hanging case that started this issue? Feel free to email it to me at [email protected]. |
I can't reproduce this anymore. I wonder what has changed. Any thoughts @ryanday36? |
Running
I don't know what version we had when it was not working. Someone should confirm my results though. I wonder if I'm doing something wrong in my attempt to repro. |
Just as a data point, I tried walking through all the cray pmi2 versions, to no avail. |
Ugh, sorry for the noise. I was doing it wrong; with the correct reproducer I can reproduce the hang. Let me get that apinfo file. |
Here's the json decoded version |
The binary - decode with |
One thing I am noticing is that although the apinfo json dumps from rank 0 and rank 190 are the same, the files themselves have different checksums.
|
Just to make sure, these are the two apinfo files from the two different nodes? You don't have a separate apinfo file for each rank do you? It wouldn't cause the hang, but that would be something to fix. |
Just one per node. Sorry, the rank suffix was misleading. |
If the data fields are the same between the two files, then it's probably a difference of the white space between fields. I wouldn't mind checking the other file, but the bug is probably not in the apinfo file handling. |
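An easy way to confirm that is to parse both JSON dumps and compare the resulting objects, which ignores any whitespace differences. A sketch (filenames are placeholders):

```python
import json
import sys

# Usage (hypothetical): compare_apinfo.py dump_node0.json dump_node1.json
with open(sys.argv[1]) as f:
    a = json.load(f)
with open(sys.argv[2]) as f:
    b = json.load(f)

print("same content" if a == b else "content differs")
```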
Yeah it happens every time for the reported test case. I'm fine trying a debug libpmi if that helps, although this might go quicker if you can reproduce it? The way to reproduce on nodes like elcap/tuo while selecting the pmi library and leaving MPICH out of it is (in a 2 node allocation) e.g.:
|
We resolved some issues and I was able to install the flux-coral2 package on a Slurm system. I'm running into issues with nodes[] in the apinfo file. If I do:

```
srun -N 2 -n 4 flux/bin/flux start
flux/bin/flux run -n4 -opmi=cray-pals ./apinfo_checker.py | jq
```

then the apinfo file contains 4 hosts, 2 for each actual one. If I only launch one "flux start" on each of the 2 hosts, then flux run won't let me use more than 2 ranks. Playing around with --exclusive and --ntasks-per-node doesn't help. |
If you run flux resource list in the flux instance, what do you see? If you see the resources of only one node, then maybe the two flux brokers are not bootstrapping as a single instance, and you need to tell srun to provide PMI (e.g. --mpi=pmi2) so that they can.
|
|
Oh maybe you need --exclusive. I think that requests that slurm give you all the cores on a node, in configurations where slurm would normally let jobs share nodes. |
--exclusive gives you the whole node, but we run into the same problems with the apinfo file.

```
srun -N2 -n 2 --exclusive flux/bin/flux start
flux run -n4 -opmi=cray-pals ./apinfo_checker.py | jq

srun -N2 -n 4 --exclusive flux/bin/flux start
flux run -n4 -opmi=cray-pals ./apinfo_checker.py | jq
```
|
I found that if I add "-c N" to the srun, then it gives each flux broker N cores to use on each node. Now if I do

```
srun -N2 -c 2 --exclusive flux/bin/flux start
flux run -n4 -opmi=cray-pals ./apinfo_checker.py | jq
```

it runs but lists 4 hosts in the apinfo file, and all of them are the same physical host. |
Assuming the node has more than four cores, wouldn't the flux run be expected to use only one node? Might be good to see what flux resource list shows in your flux instance now. |
Running with one node is fine. What's broken is that the apinfo file's nodes[] has one entry for each process running on the node. For example
|
Oh yeah. Do we have more than one broker per node again? |
No. I'm running

```
srun --mpi=pmi2 -N2 -c 2 --exclusive flux/bin/flux start
flux resource list
```
|
If the NODELIST contains duplicates, that means there is more than one broker per node. |
I think -c means "cores per task" so I would think that would run 1/2 as many brokers per node as there are cores? Edit: my slurm-fu is weak. Might need to call in a lifeline. @grondo? @ryanday36? |
Adding --ntasks-per-node=1 and playing with -c seems to have solved it. There's still some oddness, but I can make progress. |
Sorry that was more painful than expected! |
Slurm is very different at different sites. This might provide some ideas for enhancing the doc/guide/start.rst file. |
Yeah, notice all the WARNINGs in the srun(1) docs for -c/--cpus-per-task. |
When the launch is hung, are you able to ssh into the two nodes and do a lsof on the lowest rank on each node? That could help us understand what's going on with the sockets. |
OK, in a 2 node allocation:
Then for each node, the lowest-numbered pid among the job tasks, which should be the lowest rank:
Edit: I was just confirming the task ranks when my allocation timed out, but I did get the first one.
|
That's crazy. The second host is supposed to connect to the first host. I could be missing something, but it looks like it connected to itself. |
Something looks off here. This is a repeat of the same test, again 191 tasks over 2 nodes. R looks like this:

```
{
  "version": 1,
  "execution": {
    "R_lite": [
      {
        "rank": "5-6",
        "children": {
          "core": "0-95",
          "gpu": "0-3"
        }
      }
    ],
    "nodelist": [
      "tuolumne[1005-1006]"
    ],
    "properties": {
      "pall": "5-6",
      "pdebug": "5-6"
    },
    "starttime": 1737987848,
    "expiration": 1737989648
  }
}
```

The apinfo shows

```
"nodes": [
  {
    "id": 0,
    "hostname": "tuolumne1005"
  },
  {
    "id": 1,
    "hostname": "tuolumne1006"
  }
],
```

and to summarize the rest of the apinfo, ranks 0-95 are on nodeidx 0 (tuo1005) and the remaining 95 ranks (local ranks 0-94) are on nodeidx 1 (tuo1006). But visiting tuo1005 and looking in the environment of the lowest pid, I see it is rank 96. I would have expected to see rank 96 on tuo1006. Weirdly the shell seems to have put it on tuo1005, and that is not consistent with what's in the apinfo.

Edit: adding taskmaps for the above hostname runs (in order):
|
Correct me if I'm wrong @RaymondMichael but it seems like this could explain the lsof observations since rank 96 is being told by the apinfo file that it's on the other node, so it connects to itself. I've opened a flux-core bug on this - it looks like it must be our bug! |
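A sketch of how one might cross-check this on a node: print where the apinfo says each rank should be and compare with the hostname the task actually landed on. The field names ("nodes", "id", "hostname", "pes", "nodeidx") and treating the rank as the index into the PE list are assumptions based on the dumps quoted above.

```python
import json
import socket
import sys

# Usage (hypothetical): apinfo_ranks.py < apinfo.json
apinfo = json.load(sys.stdin)
hostname_by_id = {n["id"]: n["hostname"] for n in apinfo["nodes"]}

here = socket.gethostname()
for rank, pe in enumerate(apinfo.get("pes", [])):
    expected = hostname_by_id[pe["nodeidx"]]
    marker = " (this node)" if expected == here else ""
    print(f"rank {rank}: apinfo says {expected}{marker}")
```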
Yah, that would do it. I'm surprised that PMI didn't run into other issues with this before it got to the network code. I'll see what I can do to add code to check for this situation. |
OK, I confirmed that the PMI hang does not reproduce with @grondo's fix applied. I think we can close this. Sorry for the wild goose chase @RaymondMichael - and really appreciate your help. |
No worries, we're here to help. This gave me a chance to re-learn the involved code and clean some things up in the process. |
As described in https://rzlc.llnl.gov/jira/browse/ELCAP-705 (these are all run with -o mpibind=off, fwiw):

In a two node allocation (flux alloc -N2), running flux run -n190 ... puts 96 tasks on one node and 94 on the other and hangs until I ctrl-c. If I run with flux run -N2 -n190 ..., flux puts 95 tasks on each node and things run fine (if slowly). If I use flux's pmi2 (-o pmi=pmi2) instead of whatever cray mpi is using by default, the original case runs fine.

I did some good old fashioned printf debugging, and it looks like the hang is in MPI_Init, but I haven't gotten any deeper than that. I suspect that this is an HPE issue, but I'm opening it here too in case you all have any insight. The bit that seems extra confusing is that flux run -n191 ... hangs, but flux run -N2 -n191 ... doesn't. Both of those should have 96 tasks on one node and 95 on the other, so that doesn't fit super well with my characterization of this as an issue with unbalanced ranks / node.
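For reference, the placements described above can be reproduced with plain arithmetic. This is a sketch assuming 96 cores per node (as in the R fragments quoted earlier); it mirrors the observed behavior, not fluxion's actual algorithm.

```python
import math

CORES_PER_NODE = 96

def packed(ntasks):
    """Fill the first node, spill the rest (what was observed without -N)."""
    first = min(ntasks, CORES_PER_NODE)
    return first, ntasks - first

def per_node_slots(ntasks, nnodes=2):
    """Slots-per-node placement (what was observed with -N2)."""
    slots = math.ceil(ntasks / nnodes)
    counts = []
    remaining = ntasks
    for _ in range(nnodes):
        counts.append(min(slots, remaining))
        remaining -= counts[-1]
    return tuple(counts)

print(packed(190), per_node_slots(190))  # (96, 94) (95, 95)
print(packed(191), per_node_slots(191))  # (96, 95) (96, 95)
```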