Segmentation Faults Due to Exceeding Libfabric Tx/Rx Context Limits in SST RDMA Transport #4329
Comments
Hi, and sorry for the delay in responding. Can you tell me what machine you're running on? I don't know of a workaround for this in libfabric right now (finding one would probably require us to reproduce the problem first), but you might try the MPI data plane if it's available. (MPI has some significant disadvantages when used in SST, but it does have the advantage that on HPC resources it has probably seen a lot of machine-specific optimization...)
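As a rough illustration of the suggestion above, the SST data plane is normally requested through the engine parameter `DataTransport`. The sketch below only builds and validates the parameter dictionary; the list of accepted values is taken from the ADIOS2 documentation as I understand it and should be checked against your version, and the helper name `sst_engine_parameters` is made up for this example.

```python
# Values accepted by SST's "DataTransport" engine parameter, per the ADIOS2
# documentation at the time of writing (an assumption; check your version).
KNOWN_DATA_TRANSPORTS = {"WAN", "RDMA", "UCX", "MPI"}


def sst_engine_parameters(data_transport="MPI"):
    """Return SST engine parameters that request a specific data plane.

    Hypothetical helper for illustration; it does not talk to ADIOS2 itself.
    """
    name = data_transport.upper()
    if name not in KNOWN_DATA_TRANSPORTS:
        raise ValueError(f"unknown SST data transport: {data_transport!r}")
    return {"DataTransport": name}


params = sst_engine_parameters("mpi")
print(params)
```

The resulting dictionary would be handed to the IO object's parameter setter (`adios2::IO::SetParameters` in C++) after selecting the SST engine. Requesting a data plane does not guarantee it is used: if that data plane was not compiled in, SST falls back to another one, which matches the behavior reported later in this thread.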
Hi, I am running on the Jean Zay supercomputer with these specs. I tried the MPI option with SST, but it always reverts to RDMA and hits the same segmentation fault.
Should ADIOS be compiled without libfabric so that it doesn't fall back to RDMA? The MPI available on the cluster does not use libfabric (
From the log, the MPI data plane isn't available in this build, likely because CMake decided it wasn't likely to work. You can force it to be included by passing "-DADIOS2_HAVE_MPI_CLIENT_SERVER=true" when configuring. This is probably worth trying because our test for that is sometimes too conservative. Compiling without libfabric won't affect this, and it doesn't matter whether MPI uses libfabric or not.
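A minimal sketch of what that reconfigure step might look like, assembled as an argument list so it can be inspected before running. Only the `-DADIOS2_HAVE_MPI_CLIENT_SERVER=true` flag comes from this thread; the source and build directory names are placeholders, and any other options your site needs are left out.

```python
import shlex


def adios2_configure_command(source_dir, build_dir):
    """Build a CMake configure command that forces the SST MPI data plane on.

    source_dir and build_dir are placeholders for your own paths.
    """
    return [
        "cmake",
        "-S", source_dir,
        "-B", build_dir,
        # Flag quoted from the maintainer's reply above.
        "-DADIOS2_HAVE_MPI_CLIENT_SERVER=true",
    ]


cmd = adios2_configure_command("ADIOS2", "adios2-build")
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

After rebuilding, it is worth checking the configure summary or the SST startup log again to confirm that the MPI data plane is now listed.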
Hi, sorry, I was on vacation. Okay, thanks for sharing this. I will try to recompile. But using ADIOS with Omni-Path has been difficult, to say the least. There is always some segfault popping up.
If libfabric fulfilled its promise of being a nice universal interface to all RDMA hardware, life would be easier. Unfortunately it falls far short, requiring code to be customized for each variation. That makes things difficult on the more research-oriented end of things, like data streaming.
I am running into segmentation faults (SIGSEGV), most likely due to running out of libfabric contexts with the SST RDMA transport. I am running a simple reader-writer scenario from your examples, where I submit 80 writers (on 2 nodes) and 1 reader on a separate node. Each job runs a single MPI rank.
After increasing the log level for libfabric, this is what I see:
If I run 40 clients, the scripts work fine. Beyond 40, segfaults occur. Given the context-related warning, we are probably running out of resources. Looking at the libfabric documentation, I see that there is a way to share contexts among multiple endpoints, but I am not sure how to proceed.
Is there any workaround for this situation?
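For anyone trying to reproduce this, the libfabric log level mentioned above is controlled by the `FI_LOG_LEVEL` environment variable. The sketch below prepares an environment for a child process with that variable set; the helper name and the executable in the commented-out launch line are hypothetical.

```python
import os

# Log levels documented by libfabric for FI_LOG_LEVEL.
FI_LOG_LEVELS = {"warn", "trace", "info", "debug"}


def env_with_libfabric_logging(level="debug", base_env=None):
    """Return a copy of the environment with libfabric logging enabled."""
    if level not in FI_LOG_LEVELS:
        raise ValueError(f"unsupported FI_LOG_LEVEL: {level!r}")
    env = dict(os.environ if base_env is None else base_env)
    env["FI_LOG_LEVEL"] = level
    return env


env = env_with_libfabric_logging("debug", base_env={})
print(env)
# Hypothetical launch of one writer with verbose libfabric output:
# subprocess.run(["./sst_writer"], env=env_with_libfabric_logging("debug"))
```

Running with 40 and then 41 or more writers under this setting should make it easier to see at which point the context-related warning first appears.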
Specs
Node details