MPI 5.0.x can't initialize on x86 host on heterogeneous system #12947
Comments
Here is your problem:
[wolpy09:1073276] [prterun-wolpy10-1356013@0,1] DMODX REQ FOR prterun-wolpy10-1356013@1:0
[wolpy09:1073276] [prterun-wolpy10-1356013@0,1] DMODX REQ REFRESH FALSE REQUIRED KEY pml.ucx.5.0
[wolpy10:1356013] [prterun-wolpy10-1356013@0,0] dmdx:recv processing request from proc [prterun-wolpy10-1356013@0,1] for proc prterun-wolpy10-1356013@1:0
[wolpy10:1356013] [prterun-wolpy10-1356013@0,0] dmdx:recv checking for key pml.ucx.5.0
[wolpy09:1073276] prted/pmix/pmix_server_fence.c:268 MY REQ INDEX IS 0 FOR KEY pml.ucx.5.0
[wolpy10:1356013] [prterun-wolpy10-1356013@0,0] dmdx:recv key pml.ucx.5.0 not found - delaying
UCX is looking for a particular key that the other process is failing to provide - last time I saw this, it was because the OMPI version on one side was different from the version on the other side (i.e., the pml/ucx component has a different version, which means the key name is different since it has the version number in it). Afraid you'll need help from the OMPI UCX folks from here - has nothing to do with PMIx.
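To illustrate why a version mismatch would make the lookup fail, here is a minimal sketch of a versioned key. It assumes the key is simply "framework.component.major.minor", matching the `pml.ucx.5.0` string in the log. The helper names `build_modex_key` and `keys_match` are hypothetical and are not Open MPI or PMIx functions.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not an Open MPI API): build a key of the form
 * "<framework>.<component>.<major>.<minor>", mirroring the
 * "pml.ucx.5.0" key seen in the log. Returns 0 on success and -1 if
 * the buffer is too small or formatting fails. */
int build_modex_key(char *buf, size_t len, const char *framework,
                    const char *component, int major, int minor)
{
    int n = snprintf(buf, len, "%s.%s.%d.%d", framework, component, major, minor);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Returns 1 if a publisher at version A and a reader at version B
 * would use the same pml/ucx key, 0 otherwise. */
int keys_match(int major_a, int minor_a, int major_b, int minor_b)
{
    char a[64];
    char b[64];
    if (build_modex_key(a, sizeof a, "pml", "ucx", major_a, minor_a) != 0 ||
        build_modex_key(b, sizeof b, "pml", "ucx", major_b, minor_b) != 0) {
        return 0;
    }
    return strcmp(a, b) == 0;
}
```

Under that assumption, a 5.0 process asking a 4.1 process for its key would look up a name the peer never published, which would produce a "not found - delaying" message like the one in the log.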
Thanks @rhc54 for looking over this. The thing is, this occurs even when running without ucx compiled. It looks like I also provided some outdated debug output containing a ucx configuration error that has since been fixed (in that old output, the hang occurs later, on the fallback to btl.tcp). Here is a non-ucx output, run just now:
Notice the remote response in between. And here is the with-ucx output for comparison:
As this occurs with and without ucx, I assume it has nothing to do with that. Here are the ompi_info and pmix_info outputs from the non-ucx compiled version:
and from the ARM DPU:
There are quite a few differences in ompi_info due to one of them being automatically built with fortran support. The DPU additionally has a shortfloat extension. The DPU is also missing the xpmem, hcoll and avx MCAs. Other than that, the gcc versions differ (which works when using openmpi 4.1.7). The difference I can see in pmix is pstat test on the DPU instead of linux on the x86 host.
I don't know why one would insist this has something to do with PMIx or with mpirun - all the output reads as perfectly fine. Procs are started, info is requested and returned to the MPI layer. At that point, PMIx and the runtime are done. What is odd is that OMPI is doing its fence during MPI_Init, but then using dmodex to retrieve the values - which implies that OMPI didn't exchange info during the fence. There are some flags for doing that, but I don't see them set on your cmd line (could perhaps be in your environment instead). Still, the data is being returned - how you got it doesn't matter in the end. Anyway, I think you have a problem with the MPI layer - someone else will have to address it. This doesn't appear to have anything to do with PMIx or the runtime.
I didn't want to insist it has anything to do with pmix. I merely stated that I tried to debug it with colleagues who work on pmix, because they are experienced with mpi initialization, but we couldn't find the issue. The only thing I know for sure is it's an issue with openmpi 5.0.x (specifically tested 5.0.6, 5.0.5, 5.0.3) and that using the exact same installation and runtime steps works with openmpi 4.1.7 (I literally executed the same commands in order from history, just switching out version numbers).
Understood - my only point was that I see no indication of any problems in the non-MPI areas. However, I also don't see any indication of an error during initialization, so I'm not sure where that conclusion is coming from. All I can see from the info provided thus far is that your x86 host isn't doing something you expect. For the MPI folks here to help, it might be useful if you could explain a bit more about what the x86 host is failing to do. It sounds like you are saying it was supposed to print something immediately after
yes, to clarify:
hello world never prints on the host, while it does print on the dpu. Happy to rerun my tests with additional/different debug flags and provide the output, if anyone sends me what they need.
You might want to put a
@jsquyres fair enough. I added it and reran the test. Same result. The full code looks like this so would've had a newline after a successful
int main(int argc, char** argv) {
    int rank;
    MPI_Init(&argc, &argv);
    printf("hello world\n");
    fflush(stdout);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    char* system_info = NULL;
    if(get_system_info(&system_info) != 0){
        perror("Failed to get system information.");
        system_info = NULL;
    }
    printf("rank %i on %s says hello\n", rank, system_info);
    free(system_info);
    MPI_Finalize();
    return 0;
}
(get_system_info calls uname)
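The body of `get_system_info` is not shown in the thread. A minimal sketch of what such a helper might look like follows, assuming it wraps `uname(2)` and hands back a heap-allocated string that the caller frees, as the `free(system_info)` call above suggests. The output format is an assumption.

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/utsname.h>

/* Hypothetical reconstruction of get_system_info(): on success it
 * stores a malloc'ed "sysname nodename release machine" string in
 * *out and returns 0. On failure it returns -1 and leaves *out
 * untouched. The caller owns the string and must free() it. */
int get_system_info(char **out)
{
    struct utsname u;
    if (out == NULL || uname(&u) < 0) {
        return -1;
    }

    /* First pass: ask snprintf how many characters are needed. */
    int n = snprintf(NULL, 0, "%s %s %s %s",
                     u.sysname, u.nodename, u.release, u.machine);
    if (n < 0) {
        return -1;
    }

    char *buf = malloc((size_t)n + 1);
    if (buf == NULL) {
        return -1;
    }

    /* Second pass: format into the exactly sized buffer. */
    snprintf(buf, (size_t)n + 1, "%s %s %s %s",
             u.sysname, u.nodename, u.release, u.machine);
    *out = buf;
    return 0;
}
```

One detail worth noting in the calling code: if the helper fails, `system_info` stays NULL and is then passed to `printf` with `%s`, which is undefined behavior in C. Substituting a fallback string such as "unknown" in the failure branch avoids that.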
This is a heterogeneous setup. You need to compile OMPI with heterogeneous support. However, I know it will allow the OB1 to work properly (or at least it did not long ago), but I'm not sure the UCX PML supports that.
I'll recompile and see if that fixes it. Any idea why that's not necessary for 4.1.7, but would be necessary for 5.0.x?
@bosilca could you clarify what you mean by that and what I should do? There is Furthermore, the documentation https://docs.open-mpi.org/en/main/installing-open-mpi/configure-cli-options/misc.html says: So enabling that might be actively harmful and not necessary to begin with. So what should I compile with?
@dfherr Here is what you can do in order to troubleshoot this problem
The stack trace of the hang looks like this on the host:
Please try without the hcoll collective component, aka.
@dfherr Unless you built with an external PMIx library, the
Yes, that fixes the issue. Should I report the issue to hcoll developers somewhere?
HCOLL is part of HPCX (the NVIDIA HPC solution). @janjust do you know how to report a bug related to HCOLL?
Consider it reported, but this specific setup (host/dpu) we will not fix, if it's an issue at all. Having said that: this should work just fine. And you don't need to build with heterogeneous support.
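One rough way to check whether two nodes differ in data representation is to compare byte order and basic type sizes on each. The sketch below assumes that heterogeneous support is mainly about such representation differences, which is how the configure documentation linked earlier in the thread reads. The function names are hypothetical.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Returns 1 on a little-endian machine, 0 on a big-endian one, by
 * inspecting the first byte in memory of a 32-bit integer set to 1. */
int is_little_endian(void)
{
    const uint32_t probe = 1;
    unsigned char first;
    memcpy(&first, &probe, 1);
    return first == 1;
}

/* Writes a one-line summary of byte order and basic type sizes into
 * buf so it can be compared across nodes. Returns the value returned
 * by snprintf (number of characters needed, or negative on error). */
int describe_data_layout(char *buf, size_t len)
{
    return snprintf(buf, len, "endian=%s int=%zu long=%zu ptr=%zu",
                    is_little_endian() ? "little" : "big",
                    sizeof(int), sizeof(long), sizeof(void *));
}
```

Printing the summary on both the x86 host and the DPU and comparing the two lines would show whether they disagree. x86_64 and aarch64 Linux systems are typically both little-endian with the same integer and pointer sizes.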
I mean, yeah, it should work, but it doesn't. Just to clarify: it's not specific to UCX. This happens with and without ucx compiled. I guess my mpi 4.1.7 works because it doesn't use hcoll?
yeah, looks that way
I'm going to try my best to describe the issue. We tried to debug this internally with people working on pmix and didn't really get to a solution other than downgrading to openmpi 4.1.7 (which works as expected).
Background information
I am working on two x86 nodes running Rocky 9.1 with two Nvidia Bluefield-2 DPUs (one per node) running a recent Nvidia provided bfb image (Linux 5.4.0-1023-bluefield #26-Ubuntu SMP PREEMPT Wed Dec 1 23:59:51 UTC 2021 aarch64, to be precise). The NIC/DPU is configured in InfiniBand mode and the ssh connection between all 4 hosts is functional. Launching a simple MPI hello world works using openmpi 4.1.7 (tried with separate installations of --without-ucx and --with-ucx=version 1.17.0).
What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)
I tried 5.0.3, 5.0.5 and 5.0.6 from the official tarballs on the openmpi download page. Each was compiled with --with-pmix=internal --with-hwloc=internal, once with ucx 1.17.0 and once without ucx (I'm working with ucx, so I wanted separate mpi installs to compare).
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
root installation:
Please describe the system on which you are running
Details of the problem
I'm trying to launch mpi processes between the DPU and the host. The process starts on both ranks. The remote (dpu) rank finishes initialization and prints a debug output containing its rank, and the remote stdout is captured and arrives back at the mpirun host, but the x86 host never finishes initialization. Whether ucx is available or not makes no difference.
The host to host setup works. The dpu to dpu setup works (when mpirun is started from the host, NOT when it is started from the dpu), and host to dpu hangs. All commands from host:
All of the above commands, including starting mpirun on the DPUs, work with openmpi 4.1.7 both with and without ucx compiled.
With additional debug output, the hang seems to always occur after a dmdx key exchange was done (see the comment below for up-to-date debug output).
Happy to provide further debug output. For now I'm fine running openmpi 4.1.7, but I felt I should report this issue with openmpi 5.0.x regardless.