WeeklyTelcon_20220906
Geoffrey Paulsen edited this page Oct 4, 2022
- Dialup Info: (Do not post to public mailing list or public wiki)
- Brendan Cunningham (Cornelis Networks)
- Christoph Niethammer (HLRS)
- David Bernhold (ORNL)
- Edgar Gabriel (UoH)
- Geoffrey Paulsen (IBM)
- Hessam Mirsadeghi (UCX/nVidia)
- Howard Pritchard (LANL)
- Jeff Squyres (Cisco)
- Josh Hursey (IBM)
- Matthew Dosanjh (Sandia)
- Thomas Naughton (ORNL)
- Todd Kordenbrock (Sandia)
- Tommy Janjusic (nVidia)
- William Zhang (AWS)
- Akshay Venkatesh (NVIDIA)
- Artem Polyakov (nVidia)
- Aurelien Bouteiller (UTK)
- Austen Lauria (IBM)
- Brandon Yates (Intel)
- Brian Barrett (AWS)
- Charles Shereda (LLNL)
- Erik Zeiske
- George Bosilca (UTK)
- Harumi Kuno (HPE)
- Jan (Sandia - ULT support in Open MPI)
- Jingyin Tang
- Joseph Schuchart
- Josh Fisher (Cornelis Networks)
- Marisa Roman (Cornelis Networks)
- Mark Allen (IBM)
- Matias Cabral (Intel)
- Michael Heinz (Cornelis Networks)
- Nathan Hjelm (Google)
- Noah Evans (Sandia)
- Raghu Raja (AWS)
- Ralph Castain (Intel)
- Sam Gutierrez (LLNL)
- Scott Breyer (Sandia?)
- Shintaro Iwasaki
- Xin Zhao (nVidia)
- Thursday HAN/Adapt wrapup decision.
- Contact Geoff Paulsen if you need webex info
- Multiple weeks on CVE from nvidia.
- v4.1.5
- Schedule: targeting ~6 mon (Nov?)
- No driver on schedule yet.
- Potential CVE from a 4-year-old issue in libevent, but we might not need to do anything.
- Update: one company reported that their scanner didn't flag anything.
- Waiting on confirmation that the patches to remove the dead code were enough.
- SLURM allocation.
- RC this week.
- Sept 30.
- Finally swapped out the PRRTE submodule pointer to point to the v3.0 branch
- Did it without the SLURM fix, but there was some traction there.
- Posted Issue Open-MPI #10698 with about 13 issues that will need to be addressed.
- NEED an mpirun manpage
- NEED mpirun --help
- Need all these fixes before PRTE ships v3.0.0
- Any of these issues complex?
- Testing mpirun command line options.
- Supposed to do automatic translations from old command line options to new options.
- Are we planning to get rid of options at some point?
- Not printing deprecated warning by default.
- We've made new options (that are the new way), but if we're not encouraging people to move to them, why have them?
- Can we even map old options to new options one-to-one?
- We "own" the schizo component, so we could ditch the new options and only use the old options if we want.
- Before we force any change, we should get users' feedback.
- Old ones had auto-completion.
- If we have old options that map to new options, it's weird that we don't print the messages.
- v5.0 was supposed to be pretty disruptive, but if we go back and make it less disruptive, that's fine, but we are kinda saying that the old options are the way.
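The automatic translation discussed above could be sketched as a simple lookup table. This is illustrative only, not PRRTE's actual schizo implementation; the option strings shown are examples of known Open MPI deprecations, and `translate_option` is a hypothetical helper name.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical old->new command-line option translation table.
   The entries are examples; the real mapping lives in PRRTE. */
struct option_xlate {
    const char *old_opt;
    const char *new_opt;
};

static const struct option_xlate xlate_table[] = {
    { "--npernode", "--map-by ppr:N:node" },  /* N taken from the old argument */
    { "--bynode",   "--map-by node" },
    { NULL, NULL }
};

/* Return the new-style spelling for a deprecated option,
   or NULL if the option is not in the table. */
const char *translate_option(const char *old_opt) {
    for (int i = 0; xlate_table[i].old_opt != NULL; ++i) {
        if (strcmp(old_opt, xlate_table[i].old_opt) == 0) {
            /* This is also where a deprecation warning could be printed. */
            return xlate_table[i].new_opt;
        }
    }
    return NULL;
}
```

Whether such a warning prints by default is exactly the open question from the discussion above.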
- Do we want HW_GUIDED in v5?
- No discussion.
- It'd be nice to make a test suite that assumes 2-4 nodes with 4 ppr or so.
- Schedule:
- PMIx and PRRTE changes coming at end of August.
- PMIx v3.2 released.
- Try to have bugfixes PRed end of August, to give time to iterate and merged.
- Still using the Critical v5.0.x Issues project (https://github.com/open-mpi/ompi/projects/3) as of yesterday.
- Docs
- mpirun --help is OUT OF DATE.
- Have to do this relatively quickly, before PRRTE releases.
- Austen, Geoff, and Tomi will be working on this.
- REASON for this is that the mpirun command line is implemented in PRRTE.
- mpirun manpage needs to be re-written.
- Docs are online and can be updated asynchronously.
- Jeff posted PR to document runpath vs rpath
- Our configure checks some linker flags, but there might be defaults in the linker or the system that really govern what happens.
- Symbol Pollution - Need an issue posted.
- OPAL_DECLSPEC - Do we have docs on this?
- No. Intent is where do you want a symbol available?
- Outside of your library, then use OPAL_DECLSPEC (like Windows DECLSPEC)
- I want you to export this symbol.
- From the Open MPI community's perspective, our ABI is just the MPI_ symbols.
- Still unfortunate; we need to clean up as much as possible.
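A minimal sketch of the visibility-macro pattern that OPAL_DECLSPEC follows, assuming a GCC/Clang toolchain with `-fvisibility=hidden` as the baseline. The macro definition here is simplified for illustration; Open MPI's real definition lives in its generated config headers, and the helper function names are hypothetical.

```c
/* Simplified sketch of an export macro like OPAL_DECLSPEC
   (comparable to Windows __declspec(dllexport)). */
#if defined(__GNUC__) || defined(__clang__)
#  define OPAL_DECLSPEC __attribute__((visibility("default")))
#else
#  define OPAL_DECLSPEC
#endif

/* Marked: this symbol is exported, i.e. visible to code
   outside the library. */
OPAL_DECLSPEC int opal_exported_helper(void) { return 1; }

/* Unmarked: with -fvisibility=hidden, this symbol stays
   internal to the library and is not part of its ABI. */
static int opal_internal_helper(void) { return 2; }
```

The rule of thumb from the discussion: use the macro only when you want a symbol available outside your library; everything else should stay hidden to limit symbol pollution.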
- Case of Qthreads, where they need a recursive lock.
- A configury problem was fixed.
- Not merged into main or v5 yet.
- still a couple of discussion points.
- No discussion. Still some changes needed before we can retest/re-review.
- Show-load errors came out of this.
- Intent is to turn this error off by default.
- In Open MPI v5, we've slurped all MCA libraries into libmpi (components can still be built separately via configure)
- If you build them as a dso (say cuda component)
- dlopen will fail because cuda isn't there.
- and mca framework will emit a warning on STDERR.
- Accelerators are expensive, and therefore you might not have them on all nodes.
- BUT customers have hit this ERROR in the field.
- What if, in this case, we make this switch not a boolean (always show the warning vs. never show the warning)?
- Jeff posted 10763.
- Two mechanisms... accelerators could be built as DSOs.
- Because if it's in libmpi.so, the whole job will not run.
- Overall Edgar likes the ideas of the PR.
- How is Open MPI (or PRTE) dealing with slurm?
- Because slurm component is built every time, even if it doesn't find slurm.
- Slurm Headers/libs are GPL
- So Open MPI fork/execs srun.
- MCA component can still do a dlopen on required libraries
- HCOLL component must be dlopening also
- If we don't get Accelerator Framework in v5, is there any AMD accelerator support?
- Not much... just some specific derived-datatype support.
- No streams, no abstraction, etc.
- Would be a big gap.
- William will try
- Edgar also has a follow up commit.
- Waiting until big commit is merged into main, to not further complicate this commit.
- Any testing with libfabric and accelerator support?
- Edgar is hoping to test this week.
- If something is missing, it'd probably be on the libfabric side.
- Switching to builtin atomics,
- 10613 - Preferred PR. GCC / Clang should have that.
- Next step would be to refactor the atomics for post v5.0.
- Waiting on Brian's review and CI fixes.
- Joseph will post some additional info thing in the ticket
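The switch discussed above targets the compiler's builtin atomics (the GCC/Clang `__atomic` builtins) in place of hand-rolled assembly. A minimal sketch of the pattern, with an illustrative function name not taken from the Open MPI code base:

```c
#include <stdint.h>

static int64_t counter = 0;

/* Atomically add v to the shared counter and return the new value,
   using the compiler's builtin atomics with sequential consistency. */
int64_t counter_add(int64_t v) {
    return __atomic_add_fetch(&counter, v, __ATOMIC_SEQ_CST);
}
```

Relying on the builtins lets the compiler pick the right instruction sequence per architecture, which is what makes the post-v5.0 refactor attractive.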
- We're probably not getting together in person anytime soon.
- So we'll send around a doodle to have time to talk about our rules.
- The rules reflect the way we worked several years ago, but not the way we work now.
- we're to review the admin steering committee in July (per our rules):
- we're to review the technical steering committee in July (per our rules):
- We should also review all the OMPI github, slack, and coverity members during the month of July.
- Jeff will kick that off sometime this week or next week.
- In the call we mentioned this, but no real discussion.
- Wiki for face to face: https://github.com/open-mpi/ompi/wiki/Meeting-2022
- Might be better to do a half-day/day-long virtual working session.
- Due to companies' travel policies, and for convenience.
- Could do administrative tasks here too.