Obsolete GASNet versions in repository #10
Comments
Summarizing from the linked comment... we think the only remaining use case for the ancient/unsupported GASNet 1.32.0 dependency is OPA systems (which for several years were not supported by GASNet-EX). GASNet-EX restored support for ofi-conduit in v2022.3.0, so we believe that GASNet-EX 2022.3.0+ should be "at least as good as" that ancient G-1 version (and possibly much better). There is ongoing work to improve GASNet-EX's use of the new Cornelis stack (and their new OPX provider), but we believe the ancient GASNet-1 version is not any better off in this regard. |
While no release of GASNet (1 or EX) currently supports the "opx" libfabric provider written by Cornelis, I have no reason to doubt that the "psm2" provider is supported at least as well in 2022.9.2 as in any GASNet-1 release. In fact, I would not be surprised if support is better in the new release due to general improvements. |
I am trying to contact users of LLNL Quartz to see what the current situation is over there. So to summarize:
If we can migrate users away from GASNet 1 for this purpose, I will remove it from this repository, which should enable further modernizations.
@elliottslaughter That's a correct summary, and sounds like the right plan forward.
I just noticed there's a configs/config.psm.release file that enables psm-conduit (GASNet directly over Omni-Path PSM2). This conduit was officially deprecated in GASNet 1.30.0 (7a86d93d5b) after Intel (who originally contributed the conduit implementation for their hardware) ceased development on PSM2 and this conduit, in favor of ofi-conduit. psm-conduit appeared as a deprecated conduit in the 1.32.0 release, but has not been maintained since mid-2017 and had several known performance/stability issues at that time. psm-conduit was never ported to the GASNet-EX branch. @elliottslaughter Hopefully none of your 1.32.0 users are still using psm-conduit? If they are I'd strongly recommend their upgrading to ofi-conduit in GASNet-EX 2022.3.0+, for reasons of performance, stability and support. Note this presumably requires creating a new |
Fwiw, searching through old email I found mention of another OmniPath system at LLNL:
I am in contact with some of the relevant users now and have asked them to test the new configuration.
FWIW the new For better or worse the current The configure-established default can alternatively be overridden at runtime (assuming the PMI library was detected) by setting envvar |
PR #13 contains my proposed resolution for this issue, assuming the relevant users confirm that ofi-conduit is working for them.
I seem to recall the reason for I realize in a theoretically pure world, the spawner is independent from the conduit, but in our practical usage they seem to be highly correlated (few users ask us to change these defaults), and with PSM2 specifically, I think it can be argued that PMI is the only sensible option, particularly because I'm not actually aware of any Omni-Path machines besides Quartz and Ruby. Am I missing anything?
@elliottslaughter you are correct that PSM2 has a one-endpoint-per-process restriction that makes it harder to use MPI spawner (but not impossible, as many MPIs provide envvar overrides that force MPI to use TCP/IP instead of PSM2, avoiding the problem). In any case due to this restriction I'd never recommend mpi-spawner or
What you are missing is the "third option" which is ssh-based spawning (the default spawner for psm/ofi/ibv/ucx-conduits when MPI spawning is disabled). ssh spawner uses fork and TCP sockets for spawning so it (often) does not consume precious resources on the high-speed network, and perhaps more importantly it does not require library dependencies on other software such as a working PMI library+daemon or working MPI library. The main downsides of ssh-spawner are (1) it requires passwordless (e.g. host or agent authenticated) ssh connections to the compute nodes (which can sometimes run afoul of system firewall rules), and (2) it generally requires setting at least one envvar (GASNET_SSH_NODEFILE or GASNET_SSH_SERVERS) and in rarer cases others to get a working spawn.
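As a rough illustration of downside (2), here is a minimal Python sketch of preparing the environment for an ssh-spawned job. Only the two variable names GASNET_SSH_NODEFILE and GASNET_SSH_SERVERS come from the comment above; the helper name `ssh_spawner_env`, the host names, and the space-delimited host list format are assumptions made for the example and should be checked against the ssh-spawner documentation.

```python
import os


def ssh_spawner_env(hosts=None, nodefile=None, base_env=None):
    """Return a copy of the environment with one ssh-spawner host variable set.

    Hypothetical helper: exactly one of `hosts` (a list of host names) or
    `nodefile` (a path to a file listing hosts) must be given.
    """
    if (hosts is None) == (nodefile is None):
        raise ValueError("provide exactly one of hosts or nodefile")
    env = dict(os.environ if base_env is None else base_env)
    if nodefile is not None:
        env["GASNET_SSH_NODEFILE"] = str(nodefile)
    else:
        # Assumption: the host list is passed as a space-delimited string.
        env["GASNET_SSH_SERVERS"] = " ".join(hosts)
    return env


if __name__ == "__main__":
    # Placeholder host names; pass the result as env= to subprocess.run(...)
    env = ssh_spawner_env(hosts=["node1", "node2"], base_env={})
    print(env)
```

The returned dictionary could be handed to whatever launcher invocation a site uses; the sketch deliberately stops short of running anything, since the launch command itself is site-specific.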
It's true that big Omni-Path installations are rare in DOE, but they remain more common in Europe and elsewhere. 35 of the Nov 2022 top500 are using Omni-Path, including a German system that is currently number 29, and TACC's Stampede2 which is number 51.
The statement above has definitely been true for conduits like aries-conduit, gemini-conduit, pami-conduit, etc which target a proprietary network that is exclusively sold as part of a vendor-integrated system with a software environment that includes a uniform job spawner. However it's far less true for conduits like ibv-conduit, psm-conduit, ucx-conduit, ofi-conduit (non-Cray providers) that target commodity or near-commodity NICs that are often sold separately from the system and the admin is left to design their own software environment with job spawner. These conduits operate in the "wild west" of job spawning, and we routinely see each of ibv-conduit's three spawning options chosen as the best fit for a given system. Anecdotally Omni-Path is also somewhat popular in smaller departmental-style clusters. Such clusters are less likely to be professionally managed, making job spawning flexibility more important.
Thanks, @bonachea, for the context. I appreciate it. My gut feeling right now is that we need to serve the users we have, before we try to serve (hypothetical) users who might run into issues later on. As you point out, spawners are a larger issue than just the OmniPath network, and it seems to me that this is a larger refactoring that needs to be done in this repository. One option would be to add a For now I plan to add |
To be clear, I was not intending to argue for the refactoring you describe. The Makefile already provides the
Note that second argument would be a change in behavior relative to the Note the current behavior of |
Ok, thanks. Given this situation I'll leave it as is for now, and be in contact with the main users of OmniPath to see if this works for them. Is there a way to set a preference order on the default spawner? Usually we'll probably want something like:
SSH as the top default doesn't make a lot of sense for most Legion users unless nothing else was successfully configured.
@elliottslaughter I want to note that the list you suggest is one I'd not favor as a default, because I've seen enough systems where the PMI libraries exist but are useless for job launch. So, the best option might be to provide the means to express the desired precedence, rather than make that decision in advance. Prior to this discussion, I'd not given any thought to a prioritized list. However, use of a colon-delimited list (as for |
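To make the prioritized-list idea concrete, here is a minimal Python sketch of how a colon-delimited precedence string could be resolved against the spawners that were actually configured. This is not an existing GASNet feature or option: the function name `choose_spawner`, the example preference strings, and the spawner labels are all hypothetical.

```python
def choose_spawner(preference, available):
    """Return the first spawner in a colon-delimited preference string
    that is also in `available`, or None if none of them are usable.

    Hypothetical illustration only; GASNet does not define this interface.
    """
    for name in preference.split(":"):
        name = name.strip().lower()
        if name and name in available:
            return name
    return None


if __name__ == "__main__":
    # A site whose PMI library is unusable lists only what really works.
    print(choose_spawner("pmi:mpi:ssh", {"mpi", "ssh"}))
```

Expressing the order as data, rather than hard-coding it, lets a site where PMI is installed but unusable for job launch simply leave it out of the list.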
#14 was the last known user of 1.32.0. I will now remove that version from the repository.
I merged #13. If there are any other modernizations you'd like to do, let me know.
@elliottslaughter I think all our main configure modernization concerns are now resolved. Thanks!
Per #9 (comment), we have some obsolete GASNet versions in this repository. As of this moment, the oldest versions are 1.32.0 and 2021.3.0.
Per #9 (comment) we may be able to replace GASNet 1.32.0 with a new version of EX (at least 2022.3.0), but this depends on the driver used on Quartz.