User feedback on ofi-omnipath configuration #14
First I want to ask about the following: I don't recall ever seeing a … The line number and … Please also ask the user for the output of …
Hi, I reported this issue.
This is standard language in HTR for execution on a compute node, so I am confident about that. I will re-run the program in debug mode with your suggested flags and get back to you with more information. Tagging @mariodirenzo, HTR's author.
Output of …
I've also attached the log file from running the program with the suggested debug environment.
Thanks @cmelone. Version 1.7 of libfabric is pretty old, so my next step is to see whether I can reproduce the error just by building an old libfabric.
I've confirmed that I can reproduce the reported error with libfabric versions 1.7.0, 1.8.0 and 1.9.0. @cmelone do you have access (such as via …) … Using a recent libfabric is my recommended fix, if possible. However, I have determined why older versions are failing and believe I can make GASNet-EX work with older libfabric if that is necessary. Please let me know.
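The comment above reports the error with libfabric 1.7.0, 1.8.0 and 1.9.0 and recommends a recent libfabric. A site build script that wants to warn about an old libfabric has to compare dotted versions numerically, because a plain string comparison ranks "1.9.0" above "1.10.0". Below is a minimal sketch of such a check in C. The function names and the "1.10.0" cutoff are hypothetical illustrations, not a minimum version established in this thread, and obtaining the version string from the installed libfabric is left out.

```c
#include <stdio.h>

/* Hypothetical cutoff for illustration only: the thread reproduced the
 * error with 1.7.0, 1.8.0 and 1.9.0 but did not establish a minimum. */
#define EXAMPLE_MIN_LIBFABRIC "1.10.0"

/* Parse "MAJOR.MINOR.PATCH" into v[0..2]. Missing trailing fields stay 0
 * (so "1.9" parses as 1.9.0). Returns 0 on success, -1 if not even the
 * major number could be read. */
static int parse_version(const char *s, int v[3])
{
    v[0] = v[1] = v[2] = 0;
    return sscanf(s, "%d.%d.%d", &v[0], &v[1], &v[2]) >= 1 ? 0 : -1;
}

/* Returns 1 if `installed` is numerically older than `minimum`,
 * 0 if it is the same or newer, and -1 if either string fails to parse. */
static int version_older_than(const char *installed, const char *minimum)
{
    int a[3], b[3];
    if (parse_version(installed, a) != 0 || parse_version(minimum, b) != 0)
        return -1;
    for (int i = 0; i < 3; i++) {
        if (a[i] != b[i])
            return a[i] < b[i]; /* first differing field decides */
    }
    return 0; /* all three fields are equal */
}
```

For example, version_older_than("1.9.0", EXAMPLE_MIN_LIBFABRIC) returns 1, while version_older_than("1.17.1", EXAMPLE_MIN_LIBFABRIC) returns 0. Comparing field by field as integers is what makes 1.9.0 sort below 1.10.0.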
Thanks, Paul. I got this info from LLNL, so I'll be building my own libfabric and testing with a newer version, unless Elliott or Mario have thoughts about the feasibility of having users build their own libfabric.
@cmelone if you're not comfortable building your own libfabric, then as Paul says, I think we can deploy a patch that allows GASNet-EX to work with older libfabric versions and would get you past the error at hand. However, we cannot speak to what bug fixes libfabric's psm2 provider has made since Jan 2019 that might affect the execution of your application. So, independently of this particular error, it's probably still advisable to explore installing a newer libfabric to pull in the last four years of libfabric development.
If you could deploy the patch, I'd be happy to test it. Building libfabric for myself is not a concern; however, the application has many users on Quartz, so I would rather keep the installation instructions simple for now and wait for the admins to update the internal libfabric version in the coming months. Thanks!
@cmelone The patch is a one-line change included in the GASNet-EX bug report I've created to track this issue: … I am now working on a more "surgical" version for inclusion in the set of patches applied by the …
Proposed fix in #15.
The small test case I've been running succeeded with the patch applied. I'm going to run some larger-scale tests in release mode to confirm.
I can confirm that this issue is resolved. Thank you for the assistance, @PHHargrove, @elliottslaughter, @bonachea!
@cmelone Thanks for confirming. I'd like to check whether this means that you'll be able to migrate away from GASNet 1.32.0 on this machine. I believe you are the last major user of this version of GASNet.
As far as I'm aware, yes, but I would like @mariodirenzo to confirm.
Apologies for the late update, but I will be running a few more performance tests before Mario and I can confirm.
@cmelone @mariodirenzo you might already know this, but one of the benefits of migrating to a current version of GASNet-EX (aside from dropping reliance on unmaintained code) is that it also allows you to enable Legion/Realm's newly rewritten … The details of how to select that backend might vary depending on your build system, but it's probably something like: …
Is …
That's a build flag: … If you're using CMake, the flag is …
Thank you, everyone, for the assistance. We are officially moving off GASNet 1.32.0 on Quartz.
I am mirroring user feedback on the use of the ofi-omnipath network configuration here so that we can keep in sync. I have asked the user to proceed with the GASNET_DEBUG=1 test, but if there is something else we should do, please let us know.
CC @PHHargrove @bonachea