User feedback on ofi-omnipath configuration #14

Closed
elliottslaughter opened this issue Jan 30, 2023 · 20 comments

Comments

@elliottslaughter
Contributor

I am mirroring user feedback on the use of the ofi-omnipath network configuration here so that we can keep in sync.

Following up on this, I compiled Regent and HTR with two versions of GASNet: 1.32.0 (conduit psm) and 2022.9.2 (conduit ofi-omnipath).

Running a simple HTR test case with the old gasnet config works as expected. However, with the newer version and conduit, I get an error.

gcc/10.2.1 openmpi/2.0.0 python/3.8.2
[melone1@quartz2498:WeakScaling]$ cat slurm-12969093.out
Sending output to /usr/workspace/melone1/codes/quartz-new-gasnet/htr/testcases/scalingTest/WeakScaling/output/1
Invoking Legion on 1 rank(s), 1 node(s) (1 rank(s) per node), as follows:
/usr/workspace/melone1/codes/quartz-new-gasnet/htr/src/prometeo_ConstPropMix.exec -i /usr/workspace/melone1/codes/quartz-new-gasnet/htr/testcases/scalingTest/WeakScaling/output/1.json -o /usr/workspace/melone1/codes/quartz-new-gasnet/htr/testcases/scalingTest/WeakScaling/output/1 -ll:cpu 1 -ll:ocpu 2 -ll:onuma 1 -ll:othr 34 -ll:ostack 8 -ll:util 2 -ll:io 1 -ll:bgwork 1 -ll:cpu_bgwork 100 -ll:util_bgwork 100 -ll:csize 100000 -ll:rsize 512 -ll:ib_rsize 512 -ll:gsize 0 -ll:stacksize 8 -lg:sched -1 -lg:hysteresis 0
*** FATAL ERROR (proc 0): in gasnetc_ofi_init() at language/gasnet/GASNet-2022.9.2/ofi-conduit/gasnet_ofi.c:974: fi_endpoint for rdma failed: -95(Operation not supported)
*** NOTICE (proc 0): Before reporting bugs, run with GASNET_BACKTRACE=1 in the environment to generate a backtrace.
*** NOTICE (proc 0): We recommend linking the debug version of GASNet to assist you in resolving this application issue.
srun: error: quartz2652: task 0: Aborted (core dumped)

I have asked the user to proceed with the GASNET_DEBUG=1 test, but if there is something else we should do please let us know.

CC @PHHargrove @bonachea

@PHHargrove
Contributor

First I want to ask about the following:
Invoking Legion on 1 rank(s), 1 node(s) (1 rank(s) per node), as follows:
That makes me suspicious that this is either NOT the intended run at all, or that your user might have launched on a frontend instead of a compute node.
The rest of this message assumes this really did run on a compute node.

I don't recall ever seeing a -95(Operation not supported) failure.
So, I don't currently have any good suggestions.
I will see if I can find anything in old runs, or maybe try some corner cases (including single-node runs) on the OPA system(s) I have access to.

The line number and fi_endpoint for rdma failed are sufficient to uniquely identify what has failed.
So, I am not confident the debug output (backtrace in particular) will provide any new information.
However, I do still want to see it when it is available.
If it is not too late, please also ask them to set GASNET_VERBOSEENV=1 GASNET_SPAWN_VERBOSE=1 GASNET_TRACEFILE=- GASNET_TRACEMASK=I in the debug-mode runs. The resulting trace output will help me understand what is going on prior to the failure.

Please also ask the user for the output of fi_info -l run on both a compute node and the node where GASNet and the application are compiled. This will give version numbers which might help me to reproduce, as well as check for the possibility of a mismatch.
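
One way to apply the diagnostic settings requested above is to wrap the launch command so that the variables are set only for that run. This is a minimal sketch: the variable names and values are the ones given in this thread (GASNET_BACKTRACE comes from the error output itself), while the function name run_with_gasnet_diagnostics and the commented launch line are hypothetical and would need to be adapted to the actual job script.

```shell
# Run any command with the GASNet diagnostics from this thread enabled.
# For an external command, the assignments apply only to that command
# and do not leak into the calling shell.
run_with_gasnet_diagnostics() {
    GASNET_BACKTRACE=1 \
    GASNET_VERBOSEENV=1 \
    GASNET_SPAWN_VERBOSE=1 \
    GASNET_TRACEFILE=- \
    GASNET_TRACEMASK=I \
    "$@"
}

# Hypothetical usage; replace with the real launch line from the job script:
# run_with_gasnet_diagnostics srun -N 1 ./my_app.exec > debug_run.log 2>&1
```

Capturing both stdout and stderr in one log, as in the commented usage line, keeps the trace output next to the fatal error message.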

@cmelone

cmelone commented Jan 31, 2023

Hi,

I reported this issue.

That makes me suspicious that this is either NOT the intended run at all, or that your user might have launched on a frontend instead of a compute node.

This is standard language in HTR for execution on a compute node, so I am confident about that.

I will re-run the program in debug mode with your suggested flags and get back to you with more information.

Tagging @mariodirenzo, HTR's author.

@cmelone

cmelone commented Jan 31, 2023

Output of fi_info -l on the node where GASNet and the app were compiled (it is the same on the compute node):

psm2:
    version: 1.7
psm:
    version: 1.7
usnic:
    version: 1.0
ofi_rxm:
    version: 1.0
ofi_rxd:
    version: 1.0
verbs:
    version: 1.0
UDP:
    version: 1.1
sockets:
    version: 2.0
tcp:
    version: 0.1
ofi_perf_hook:
    version: 1.0
ofi_noop_hook:
    version: 1.0
shm:
    version: 1.0
ofi_mrail:
    version: 1.0

I've also attached the logfile for the execution of the program with the suggested debug environment.

slurm-12999457.log

@PHHargrove
Contributor

Thanks @cmelone

Version 1.7 of libfabric is pretty old. So, my next step is to see if I can reproduce just by building an old libfabric.
I don't (yet?) see anything else concerning in the log you provided.

@PHHargrove
Contributor

I've confirmed that I can reproduce the reported error with libfabric versions 1.7.0, 1.8.0 and 1.9.0.
With libfabric 1.10.0 and newer I get correct operation.

@cmelone do you have access (such as via module load libfabric/...) to a sufficiently recent libfabric?

Use of a recent libfabric is my recommended fix, if possible. However, I have determined why older versions are failing and believe I can make GASNet-EX work with older libfabric if that is necessary. Please let me know.
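
A quick way to tell whether a given node falls on the failing side of that boundary is to pull the psm2 version out of fi_info -l and compare it against 1.10. The sketch below assumes, as this thread does, that the psm2 provider version printed by fi_info -l tracks the libfabric release, and that the output has the "name:" / "version: X.Y" layout shown earlier. The function names provider_version and version_at_least are hypothetical helpers, not part of libfabric or GASNet.

```shell
# Print the version listed for one provider in `fi_info -l`-style text on stdin.
provider_version() {
    awk -v name="$1:" '
        # A one-field line ending in ":" starts a new provider section.
        NF == 1 && $1 ~ /:$/ { found = ($1 == name); next }
        found && $1 == "version:" { print $2; exit }
    '
}

# Succeed if dotted numeric version $1 is >= $2 (missing parts count as 0).
version_at_least() {
    awk -v have="$1" -v want="$2" 'BEGIN {
        n = split(have, h, "."); m = split(want, w, ".")
        for (i = 1; i <= (n > m ? n : m); i++) {
            if ((h[i] + 0) > (w[i] + 0)) exit 0
            if ((h[i] + 0) < (w[i] + 0)) exit 1
        }
        exit 0
    }'
}

# Example (requires fi_info on PATH):
# v=$(fi_info -l | provider_version psm2)
# if version_at_least "$v" 1.10; then
#     echo "psm2 $v: should work unpatched"
# else
#     echo "psm2 $v: older than 1.10, expect the fi_endpoint failure without a patch"
# fi
```

The comparison is done field by field rather than as a string or a decimal, so that 1.7 correctly sorts below 1.10.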

@cmelone

cmelone commented Jan 31, 2023

Thanks, Paul. I got this info from LLNL, so I'll be building my own libfabric and testing with a newer version, unless Elliott or Mario have thoughts about the feasibility of having users build their own libfabric.

The libfabric that is currently installed is part of the OS distribution and not something that we can update. That said, quartz will be updated from the TOSS 3 OS to TOSS 4 in the coming months, which will also update libfabric to 1.16.1.

@bonachea
Contributor

@cmelone if you're not comfortable building your own libfabric, then as Paul says I think we can deploy a patch to allow GASNet-EX to work with older libfabric versions that would get you past the error at hand.

However we cannot speak to what bug fixes libfabric's psm2 provider has made since Jan 2019 that might affect the execution of your application. So independently from this particular error, it's probably still advisable to explore installing a newer libfabric to pull in the last four years of libfabric development.

@cmelone

cmelone commented Jan 31, 2023

If you could deploy the patch, I'd be happy to test it. Building libfabric for myself is not a concern, but the application has many users on Quartz, so I might err on the side of reducing the complexity of the installation instructions for now and wait for the admins to update the internal libfabric version in the coming months. Thanks.

@PHHargrove
Contributor

@cmelone The patch is a one-line change included in the GASNet-EX bug report I've created to track this issue:
Bug 4567 - broken support for psm2 provider in libfabric < 1.10

I am working now to craft a more "surgical" version for inclusion in the set of patches applied by the Makefile in this repo.

@PHHargrove
Contributor

Proposed fix in #15

@elliottslaughter
Contributor Author

I have merged #15.

@cmelone Please update to the latest control_replication, delete your copy of language/gasnet, and rerun setup_env.py.

@cmelone

cmelone commented Feb 1, 2023

The small test case I've been running succeeded with the patch applied. I'm going to run some larger-scale tests in release mode to confirm.

@cmelone

cmelone commented Feb 1, 2023

Can confirm that this issue is resolved. Thank you for the assistance @PHHargrove @elliottslaughter @bonachea

@elliottslaughter
Contributor Author

@cmelone Thanks for confirming. I'd like to check whether this means that you'll be able to migrate away from GASNet 1.32.0 on this machine. I believe you are the last major user using this version of GASNet.

@cmelone

cmelone commented Feb 1, 2023

As far as I'm aware, yes, but I would like @mariodirenzo to confirm.

@cmelone

cmelone commented Feb 6, 2023

Apologies for the late update, but I will be running a few more performance tests before Mario and I can confirm.

@bonachea
Contributor

bonachea commented Feb 6, 2023

@cmelone @mariodirenzo you might already know this, but one of the benefits of migrating to a current version of GASNet-EX (aside from dropping reliance on unmaintained code) is that it also allows you to enable Legion/Realm's newly rewritten gasnetex communication backend. IIUC, the legacy gasnet1 Realm backend you've probably been using should still work. However, if you're running performance tests, I recommend also evaluating the new gasnetex backend, which includes many new optimizations and might result in a measurable improvement.

The details of how to select that backend might vary depending on your build system, but it's probably something like: REALM_NETWORKS=gasnetex

@cmelone

cmelone commented Feb 6, 2023

Is REALM_NETWORKS a runtime flag, or do I need to compile the application with it?

@elliottslaughter
Contributor Author

That's a build flag: REALM_NETWORKS=... make.

If you're using CMake, the flag is -DLegion_NETWORKS=....
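
Filling in the elided values with the gasnetex backend mentioned earlier in this thread, the two build invocations might look like the output of the sketch below. The helper legion_network_build_cmd is hypothetical and only prints the command line rather than running it; the <path-to-source> placeholder and any other build options your project needs are left for you to supply.

```shell
# Print the build invocation that selects a Realm network backend (sketch).
# $1: build system ("make" or "cmake"); $2: backend name, default "gasnetex".
legion_network_build_cmd() {
    build_system=$1
    network=${2:-gasnetex}
    case "$build_system" in
        make)  printf 'REALM_NETWORKS=%s make\n' "$network" ;;
        cmake) printf 'cmake -DLegion_NETWORKS=%s <path-to-source>\n' "$network" ;;
        *)     printf 'unknown build system: %s\n' "$build_system" >&2; return 1 ;;
    esac
}

legion_network_build_cmd make
legion_network_build_cmd cmake
```

Either way, the choice is made when the application is built, so switching backends means rebuilding rather than changing a runtime option.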

@cmelone

cmelone commented Feb 7, 2023

Thank you, everyone, for the assistance. We are officially moving off GASNet 1.32.0 on Quartz.
