hang in MPI_Init with unbalanced ranks #222
Comments
-N might be flipping where the unused core is located, e.g. with 2 nodes of 4 cores each. It might be worth doing a little audit here to see if anything stands out with these layouts in mind. |
I think @garlick meant to put this comment here:
I wonder if the two jobs have the same R? I'll try to reproduce this. |
yes sorry! |
Hm, this is interesting (did we know this and just forgot?)

```
$ flux run -N2 -n 7 /bin/true
$ flux job info $(flux job last) R
{"version": 1, "execution": {"R_lite": [{"rank": "0-1", "children": {"core": "0-3"}}], "starttime": 1727383096.7284338, "expiration": 0.0, "nodelist": ["corona[82,82]"]}}
```

versus the same job without -N:

```
$ flux run -n 7 /bin/true
$ flux job info $(flux job last) R
{"version": 1, "execution": {"R_lite": [{"rank": "0", "children": {"core": "0-3"}}, {"rank": "1", "children": {"core": "0-2"}}], "starttime": 1727383280.7969263, "expiration": 0.0, "nodelist": ["corona[82,82]"]}}
```

This seems to be explicit in the jobspec created by the first case:

```
$ flux run -N2 -n7 --dry-run hostname | jq .resources
[
  {
    "type": "node",
    "count": 2,
    "with": [
      {
        "type": "slot",
        "count": 4,
        "with": [
          {
            "type": "core",
            "count": 1
          }
        ],
        "label": "task"
      }
    ]
  }
]
```

There is even a comment in the code:

```python
if num_nodes is not None:
    num_slots = int(math.ceil(num_tasks / float(num_nodes)))
    if num_tasks % num_nodes != 0:
        # N.B. uneven distribution results in wasted task slots
        task_count_dict = {"total": num_tasks}
    else:
        task_count_dict = {"per_slot": 1}
    slot = cls._create_slot("task", num_slots, children)
    resource_section = cls._create_resource(
        "node", num_nodes, [slot], exclusive
    )
```
|
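For reference, here is the arithmetic from the snippet above worked through for the -N2 -n7 case. This is a standalone sketch mirroring the quoted calculation, not the actual flux code path: ceil(7/2) gives 4 slots per node, so 8 slots total, one of which goes unused.

```python
import math

# Sketch of the slot calculation for `flux run -N2 -n7` (illustrative only)
num_tasks = 7
num_nodes = 2

num_slots = int(math.ceil(num_tasks / float(num_nodes)))  # 4 slots per node
total_slots = num_slots * num_nodes                        # 8 slots across 2 nodes
wasted_slots = total_slots - num_tasks                     # 1 unused task slot

# uneven distribution, so the jobspec carries a total task count
task_count_dict = (
    {"total": num_tasks} if num_tasks % num_nodes != 0 else {"per_slot": 1}
)

print(num_slots, total_slots, wasted_slots, task_count_dict)
# 4 8 1 {'total': 7}
```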
Anyway, maybe the extra task slot is confusing the taskmap stuff into running the wrong number of tasks on one of the nodes? |
I think the taskmaps are actually correct and I was confused. Fluxion is packing 4 ranks onto the first node in both cases, and 3 on the second, but for some reason when -N is specified, the order of nodes is reversed. |
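To make the 4-and-3 layout concrete, a Flux taskmap can be expanded into a rank-to-node list with a short sketch like the one below. The RFC 34-style block format [nodeid, nnodes, ppn, repeat] and the example taskmap are assumptions for illustration, not values copied from the jobs above.

```python
def expand_taskmap(taskmap):
    """Expand RFC 34-style taskmap blocks into a list: index = rank, value = node."""
    nodes = []
    for nodeid, nnodes, ppn, repeat in taskmap:
        for _ in range(repeat):
            for node in range(nodeid, nodeid + nnodes):
                nodes.extend([node] * ppn)
    return nodes

# 7 tasks packed 4 + 3 over two nodes:
print(expand_taskmap([[0, 1, 4, 1], [1, 1, 3, 1]]))  # [0, 0, 0, 0, 1, 1, 1]
```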
FWIW I ran the two cases and dumped the apinfo. The apinfo comes out the same for both jobs (on both nodes). The environment seems to only differ in the expected ways. I did notice that slurm is now up to version 5 of the apinfo struct, while we are on version 0. |
slurm also sets several PMI variables that we don't set. I assumed these would be set by cray's libpmi if at all, since they are normally for the benefit of MPI, but since we're seeing a problem it's maybe worth noting. |
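If it helps to enumerate what each launcher provides, a small sketch (standard library only; run it under both launchers and diff the output; the prefix list is a guess at the relevant variable families):

```python
import os

# Print launcher/PMI-related environment variables for side-by-side comparison.
PREFIXES = ("PMI_", "PMI2_", "PALS_", "SLURM_", "FLUX_")
for name in sorted(os.environ):
    if name.startswith(PREFIXES):
        print(f"{name}={os.environ[name]}")
```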
Looks like cray pmi prints debug on stderr when its debug variable is set. In @ryanday36's failing case above, the first rank on the second node (96) correctly identifies the address and port that the PMI rank 0 node is listening on, apparently successfully connects, and sends a barrier request there on behalf of its other local ranks. The rank 0 node never sees the connection. Where did it connect? In the good case, the connection is logged, a barrier release packet is returned, and things proceed.

Sanitized failing log of PE_0 and PE_96 with some noise removed
Sanitized log of PE_0 and PE_95 (1st rank on second node) for good case:
|
FWIW, this does not require MPI to reproduce. This example reproduces the hang with logging using a client that only uses PMI:
|
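For illustration, a minimal PMI-only client (not necessarily the one referred to above) might look like the following sketch. It assumes a PMI2 library is available as libpmi2.so and exposes the standard PMI2_Init/PMI2_Finalize entry points; the library name and its behavior under cray PMI are assumptions.

```python
import ctypes

# Load the PMI2 library provided by the launcher (name is an assumption).
pmi2 = ctypes.CDLL("libpmi2.so", mode=ctypes.RTLD_GLOBAL)

spawned = ctypes.c_int()
size = ctypes.c_int()
rank = ctypes.c_int()
appnum = ctypes.c_int()

# PMI2_Init performs the cross-node wire-up; in the failing case this is
# where the hang would be expected to show up.
rc = pmi2.PMI2_Init(
    ctypes.byref(spawned), ctypes.byref(size), ctypes.byref(rank), ctypes.byref(appnum)
)
if rc != 0:
    raise SystemExit(f"PMI2_Init failed with rc={rc}")
print(f"rank {rank.value} of {size.value} initialized")
pmi2.PMI2_Finalize()
```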
I tried setting |
I found a bug in how the plugin writes to the apinfo file. If you do a flux alloc with N ranks and then a flux run with fewer ranks, the plugin writes to the apinfo file that there are N ranks. This causes PMI to hang on initialization as it's waiting for all N ranks to show up. I think the bug is in cray_pals.c:create_apinfo() but am not confident of the fix. |
Oh nice @RaymondMichael! Any chance that you could come up with a regression test for this? I looked at the plugin in depth early on and didn't spot a problem. But I probably missed something! |
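One possible shape for such a regression check is sketched below. It assumes the apinfo has been dumped to JSON (as apinfo_checker.py does) with top-level "pes" and "nodes" arrays; the field names and the hypothetical script name are assumptions.

```python
import json
import sys

# Usage (hypothetical): apinfo_check.py EXPECTED_NTASKS < apinfo.json
expected_ntasks = int(sys.argv[1])
apinfo = json.load(sys.stdin)

npes = len(apinfo.get("pes", []))
nnodes = len(apinfo.get("nodes", []))

if npes != expected_ntasks:
    sys.exit(f"apinfo lists {npes} PEs but the job has {expected_ntasks} tasks")
print(f"ok: {npes} PEs across {nnodes} nodes")
```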
I need to correct my previous statement. We're seeing the problem when Flux is run under Slurm:

```
srun -N 2 -n 6 --pty flux start
```

I'm not sure how you would use the test harness to mimic this situation, but you would do something like:

```
flux run -n 5 ${PYTHON:-python3} ${SHARNESS_TEST_SRCDIR}/scripts/apinfo_checker.py | sed -n "s/^1: //p") &&
```
|
Ok, I will go try this out on some of our Slurm clusters and try to reproduce. I wonder though if it has something to do with the fact that you are running flux with multiple brokers per node. I wonder what would happen if you ran with one broker per node instead (to modify your example). This commit is relevant. |
I could be mistaken but I wouldn't think this would be related to how the flux instance is started. It more likely has to do with how the MPI job's resources are translated to apinfo. |
Just checking: is the slurm case a cray system where flux is being run without the cray-pals plugin? IOW is cray PMI finding the apinfo file put there by slurm for the allocation? |
Yeah, I was wondering if maybe having duplicate entries in the apinfo structures was somehow confusing it. I wasn't able to reproduce. I got the following, which looks correct:
|
It looks like you're right. I'm dealing with an incomplete local installation of Flux and was using the apinfo file from Slurm. I'll work on getting that fixed. |
Can I get a copy of an apinfo file from the hanging case that started this issue? Feel free to email it to me at [email protected]. |
I can't reproduce this anymore. I wonder what has changed. Any thoughts @ryanday36? |
Running
I don't know what version we had when it was not working. Someone should confirm my results though. I wonder if I'm doing something wrong in my attempt to repro. |
Just as a data point, I tried walking through all the cray pmi2 versions, to no avail. |
Ugh, sorry for the noise. I was doing it wrong; with the correct reproducer I can reproduce the hang. Let me get that apinfo file. |
Here's the json decoded version |
The binary - decode with |
One thing I am noticing is that although the apinfo json dumps from rank 0 and rank 190 are the same, the files themselves have different checksums.
|
Just to make sure, these are the two apinfo files from the two different nodes? You don't have a separate apinfo file for each rank do you? It wouldn't cause the hang, but that would be something to fix. |
Just one per node. Sorry, the rank suffix was misleading. |
If the data fields are the same between the two files, then it's probably a difference of the white space between fields. I wouldn't mind checking the other file, but the bug is probably not in the apinfo file handling. |
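An easy way to confirm that is to parse both JSON dumps and compare the resulting objects, which ignores any whitespace differences. A sketch (filenames are placeholders):

```python
import json
import sys

# Usage (hypothetical): compare_apinfo.py dump_node0.json dump_node1.json
with open(sys.argv[1]) as f:
    a = json.load(f)
with open(sys.argv[2]) as f:
    b = json.load(f)

print("same content" if a == b else "content differs")
```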
Yeah it happens every time for the reported test case. I'm fine trying a debug libpmi if that helps, although this might go quicker if you can reproduce it? The way to reproduce on nodes like elcap/tuo while selecting the pmi library and leaving MPICH out of it is (in a 2 node allocation) e.g.:
|
We resolved some issues and I was able to install the flux-coral2 package on a Slurm system. I'm running into issues with nodes[] in the apinfo file. If I do:

```
srun -N 2 -n 4 flux/bin/flux start
flux/bin/flux run -n4 -opmi=cray-pals ./apinfo_checker.py | jq
```

then the apinfo file contains 4 hosts, 2 for each actual one. If I only launch one "flux start" on each of the 2 hosts, then flux run won't let me use more than 2 ranks. Playing around with --exclusive and --ntasks-per-node doesn't help. |
If you run flux resource list in the flux instance, what do you see? If you see the resources of only one node, then maybe the two flux brokers are not bootstrapping as a single instance, and you need to tell srun to provide PMI (e.g. --mpi=pmi2) so that they can.
|
|
Oh maybe you need --exclusive. I think that requests that slurm give you all the cores on a node, in configurations where slurm would normally let jobs share nodes. |
--exclusive gives you the whole node, but we run into the same problems with the apinfo file.

```
srun -N2 -n 2 --exclusive flux/bin/flux start
flux run -n4 -opmi=cray-pals ./apinfo_checker.py | jq

srun -N2 -n 4 --exclusive flux/bin/flux start
flux run -n4 -opmi=cray-pals ./apinfo_checker.py | jq
```
|
I found that if I add "-c N" to the srun, then it gives each flux broker N cores to use on each node. Now if I do

```
srun -N2 -c 2 --exclusive flux/bin/flux start
flux run -n4 -opmi=cray-pals ./apinfo_checker.py | jq
```

it runs but lists 4 hosts in the apinfo file, and all of them are the same physical host. |
Assuming the node has more than four cores, wouldn't the flux run be expected to use only one node? Might be good to see what flux resource list shows in your flux instance now. |
Running with one node is fine. What's broken is that the apinfo file's nodes[] has one entry for each process running on the node. For example
|
Oh yeah. Do we have more than one broker per node again? |
No. I'm running

```
srun --mpi=pmi2 -N2 -c 2 --exclusive flux/bin/flux start
flux resource list
```
|
If the NODELIST contains duplicates, that means there is more than one broker per node. |
I think -c means "cores per task" so I would think that would run 1/2 as many brokers per node as there are cores? Edit: my slurm-fu is weak. Might need to call in a lifeline. @grondo? @ryanday36? |
Adding --ntasks-per-node=1 and playing with -c seems to have solved it. There's still some oddness, but I can make progress. |
Sorry that was more painful than expected! |
Slurm is very different at different sites. This might provide some ideas for enhancing the doc/guide/start.rst file. |
Yeah, notice all the WARNINGs in the srun(1) docs for -c/--cpus-per-task. |
When the launch is hung, are you able to ssh into the two nodes and do a lsof on the lowest rank on each node? That could help us understand what's going on with the sockets. |
OK, in a 2 node allocation:
Then for each node, the lowest-numbered pid among the job tasks, which should be the lowest rank:
Edit: I was just confirming the task ranks when my allocation timed out, but I did get the first one.
|
That's crazy. The second host is supposed to connect to the first host. I could be missing something, but it looks like it connected to itself. |
Something looks off here. This is a repeat of the same test, again 191 tasks over 2 nodes. R looks like this:

```
{
  "version": 1,
  "execution": {
    "R_lite": [
      {
        "rank": "5-6",
        "children": {
          "core": "0-95",
          "gpu": "0-3"
        }
      }
    ],
    "nodelist": [
      "tuolumne[1005-1006]"
    ],
    "properties": {
      "pall": "5-6",
      "pdebug": "5-6"
    },
    "starttime": 1737987848,
    "expiration": 1737989648
  }
}
```

The apinfo shows

```
"nodes": [
  {
    "id": 0,
    "hostname": "tuolumne1005"
  },
  {
    "id": 1,
    "hostname": "tuolumne1006"
  }
],
```

and to summarize the rest of the apinfo, ranks 0-95 are on nodeidx 0 (tuo1005) and the remaining 95 ranks (local ranks 0-94) are on nodeidx 1 (tuo1006). But visiting tuo1005 and looking in the environment of the lowest pid, I see it is rank 96. I would have expected to see rank 96 on tuo1006. Weirdly the shell seems to have put it on tuo1005, and that is not consistent with what's in the apinfo.

Edit: adding taskmaps for the above hostname runs (in order):
|
Correct me if I'm wrong @RaymondMichael but it seems like this could explain the lsof observations since rank 96 is being told by the apinfo file that it's on the other node, so it connects to itself. I've opened a flux-core bug on this - it looks like it must be our bug! |
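A sketch of how one might cross-check this on a node: print where the apinfo says each rank should be and compare with the hostname the task actually landed on. The field names ("nodes", "id", "hostname", "pes", "nodeidx") and treating the rank as the index into the PE list are assumptions based on the dumps quoted above.

```python
import json
import socket
import sys

# Usage (hypothetical): apinfo_ranks.py < apinfo.json
apinfo = json.load(sys.stdin)
hostname_by_id = {n["id"]: n["hostname"] for n in apinfo["nodes"]}

here = socket.gethostname()
for rank, pe in enumerate(apinfo.get("pes", [])):
    expected = hostname_by_id[pe["nodeidx"]]
    marker = " (this node)" if expected == here else ""
    print(f"rank {rank}: apinfo says {expected}{marker}")
```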
Yah, that would do it. I'm surprised that PMI didn't run into other issues with this before it got to the network code. I'll see what I can do to add code to check for this situation. |
OK, I confirmed that the PMI hang does not reproduce with @grondo's fix applied. I think we can close this. Sorry for the wild goose chase @RaymondMichael - and really appreciate your help. |
No worries, we're here to help. This gave me a chance to re-learn the involved code and clean some things up in the process. |
As described in https://rzlc.llnl.gov/jira/browse/ELCAP-705 (these are all run with -o mpibind=off, fwiw):

In a two node allocation (flux alloc -N2), running flux run -n190 ... puts 96 tasks on one node and 94 on the other and hangs until I ctrl-c. If I run with flux run -N2 -n190 ..., flux puts 95 tasks on each node and things run fine (if slowly). If I use flux's pmi2 (-o pmi=pmi2) instead of whatever cray mpi is using by default, the original case runs fine.

I did some good old fashioned printf debugging, and it looks like the hang is in MPI_Init, but I haven't gotten any deeper than that. I suspect that this is an HPE issue, but I'm opening it here too in case you all have any insight. The bit that seems extra confusing is that flux run -n191 ... hangs, but flux run -N2 -n191 ... doesn't. Both of those should have 96 tasks on one node and 95 on the other, so that doesn't fit super well with my characterization of this as an issue with unbalanced ranks / node.
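For reference, the placements described above can be reproduced with plain arithmetic. This is a sketch assuming 96 cores per node (as in the R fragments quoted earlier); it mirrors the observed behavior, not fluxion's actual algorithm.

```python
import math

CORES_PER_NODE = 96

def packed(ntasks):
    """Fill the first node, spill the rest (what was observed without -N)."""
    first = min(ntasks, CORES_PER_NODE)
    return first, ntasks - first

def per_node_slots(ntasks, nnodes=2):
    """Slots-per-node placement (what was observed with -N2)."""
    slots = math.ceil(ntasks / nnodes)
    counts = []
    remaining = ntasks
    for _ in range(nnodes):
        counts.append(min(slots, remaining))
        remaining -= counts[-1]
    return tuple(counts)

print(packed(190), per_node_slots(190))  # (96, 94) (95, 95)
print(packed(191), per_node_slots(191))  # (96, 95) (96, 95)
```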