Unable to run osu-micro-benchmarks collectives with OMPI-v5.0.5 #12717
Comments
The error is coming from the hcoll library, due to a failure to detect GPU memory. It seems the CUDA version on the system does not match the CUDA version supported by the HPC-X package (and the hcoll library in it) that is being used.
@yosefe Thanks for the comments. The CUDA version installed on this machine is v11.2. I haven't installed HPC-X; I just cloned ucx-v1.17.0 and OMPI-v5.0.5 from git. These are my config commands:
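(The configure commands themselves were stripped from this comment during extraction. Below is a plausible reconstruction of a CUDA-enabled UCX + Open MPI build; the install prefixes and CUDA path are assumptions, not the reporter's actual values.)

```sh
# Hypothetical reconstruction -- the actual commands were lost.
# Paths and prefixes are assumptions.

# UCX v1.17.0 with CUDA support
cd ucx
./contrib/configure-release --prefix=$HOME/ucx-install \
    --with-cuda=/usr/local/cuda
make -j && make install

# Open MPI v5.0.5 built against that UCX, with CUDA support
cd ../ompi
./configure --prefix=$HOME/ompi-install \
    --with-ucx=$HOME/ucx-install \
    --with-cuda=/usr/local/cuda
make -j && make install
```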
@goutham-kuncham Perhaps there is an hcoll installed on the system from MLNX_OFED, since the backtrace shows mca_coll_hcoll_allreduce.
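(A way to check where hcoll is coming from, not part of the original exchange: ompi_info lists the components Open MPI can load, and MLNX_OFED typically installs hcoll under /opt/mellanox/hcoll.)

```sh
# List any hcoll-related components Open MPI knows about
ompi_info | grep -i hcoll
# MLNX_OFED's usual hcoll install location
ls /opt/mellanox/hcoll 2>/dev/null
```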
@yosefe It seems to work if I disable hcoll at runtime.
But when I run the benchmark with validation, it fails.
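(The exact command was not preserved; disabling hcoll at runtime would typically look like the following, with the benchmark path as a placeholder.)

```sh
# Disable the hcoll collective component for this run; an alternative is
# to exclude it from the coll framework entirely with --mca coll ^hcoll.
mpirun -np 2 --mca coll_hcoll_enable 0 ./osu_allreduce -d cuda
```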
@goutham-kuncham Does the data validation error happen only with CUDA memory?
I got the same validation failure when I ran the CPU benchmark with hcoll disabled.
However, when I re-enable hcoll, the CPU benchmark validation passes.
With CUDA memory, I got the same behavior with and without hcoll.
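(The exact benchmark invocations were not preserved. In OSU micro-benchmarks 7.x, data validation is enabled with the -c flag; an assumed reproduction would look like this.)

```sh
# CPU (host-memory) run with data validation (-c)
mpirun -np 2 ./osu_allreduce -c
# CUDA-memory run with data validation
mpirun -np 2 ./osu_allreduce -c -d cuda
```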
Regarding the TCP issue, can you try setting the network device using […]?
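(The parameter name was lost from the comment above. The usual way to pin the network device in UCX is the UCX_NET_DEVICES environment variable; the device name below is a placeholder.)

```sh
# Pin UCX to a specific HCA and port; list available devices with `ucx_info -d`.
mpirun -np 2 -x UCX_NET_DEVICES=mlx5_0:1 ./osu_allreduce -c
```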
@yosefe Sorry, I missed that. I get the same validation failure after setting the device. I tried with both […].
So it seems to be some issue with an Open MPI collective component; does it happen on older Open MPI versions?
I've noticed a very similar data validation issue at 16B for osu_collectives! I am using Open MPI 5.0.3 with the OPX OFI provider. This 16B data validation failure only occurs with OSU 7.4, not OSU 7.3. I've tested on AMD CPUs and Intel CPUs; both reproduce it. Also, it only occurs for me with the […]. This issue occurs for me with or without the […]. I have been debating whether or not to report this to the MVAPICH community.
@wenduwan Whoops, my comment removed the […].
It looks like this issue is expecting a response, but hasn't gotten one yet. If there are no responses in the next 2 weeks, we'll assume that the issue has been abandoned and will close it.
Per the above comment, it has been a month with no reply on this issue. It looks like this issue has been abandoned, so I'm going to close it. If I'm wrong and this issue is not abandoned, please feel free to re-open it. Thank you!
Thank you for taking the time to submit an issue!
Background information
What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)
OMPI Version: v5.0.5
UCX Version: v1.17.0
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Installed from git clone
If you are building/installing from a git clone, please copy-n-paste the output from `git submodule status`.
Please describe the system on which you are running
OS: CentOS 7 (RHEL family)
Arch: x86_64 (CPU: Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz)
GPU: Tesla V100-PCIE-32GB
Network: InfiniBand
Details of the problem
Please describe, in detail, the problem that you are having, including the behavior you expect to see, the actual behavior that you are seeing, steps to reproduce the problem, etc. It is most helpful if you can attach a small program that a developer can use to reproduce your problem.
I am unable to run the osu-micro-benchmarks collectives GPU version (specifically, I am interested in osu_reduce, osu_allreduce, and osu_allgather).
Below are my configuration and run commands:
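(The commands themselves were lost in extraction. A hypothetical reconstruction of a two-node CUDA run over UCX follows; the hostnames and binary path are placeholders.)

```sh
# Hypothetical reconstruction -- hostnames and binary path are placeholders.
mpirun -np 2 --host node1,node2 --mca pml ucx ./osu_allreduce -d cuda
```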
Note: If you include verbatim output (or a code block), please use a GitHub Markdown code block.