WeeklyTelcon_20211116
Geoffrey Paulsen edited this page Nov 21, 2021 · 1 revision
- Austen Lauria (IBM)
- Brendan Cunningham (Cornelis Networks)
- Brian Barrett (AWS)
- Christoph Niethammer (HLRS)
- Corey A. Henderson (AWS)
- David Bernholdt (ORNL)
- Geoffrey Paulsen (IBM)
- George Bosilca (UTK)
- Howard Pritchard (LANL)
- Jeff Squyres (Cisco)
- Joseph Schuchart (HLRS)
- Josh Hursey (IBM)
- Sriraj Paul (Intel)
- Thomas Naughton (ORNL)
- Todd Kordenbrock (Sandia)
- Tomislav Janjusic (NVIDIA)
- Akshay Venkatesh (NVIDIA)
- Artem Polyakov (NVIDIA)
- Aurelien Bouteiller (UTK)
- Brandon Yates (Intel)
- Charles Shereda (LLNL)
- Edgar Gabriel (UH)
- Erik Zeiske (HPE)
- Geoffroy Vallee (ARM)
- Harumi Kuno (HPE)
- Hessam Mirsadeghi (NVIDIA)
- Joshua Ladd (NVIDIA)
- Marisa Roman (Cornelis Networks)
- Mark Allen (IBM)
- Matias Cabral (Intel)
- Matthew Dosanjh (Sandia)
- Michael Heinz (Cornelis Networks)
- Nathan Hjelm (Google)
- Noah Evans (Sandia)
- Raghu Raja
- Ralph Castain (Intel)
- Sam Gutierrez (LANL)
- Scott Breyer (Sandia?)
- Shintaro Iwasaki
- William Zhang (AWS)
- Xin Zhao (NVIDIA)
- Schedule: Just Released 4.0.7 Monday (Nov 15)
- Started a v4.0.8 milestone/Checklist
- Schedule: next (final?) rc later this week or next week?
- All outstanding PRs as of yesterday were merged.
- Bug about alltoallw with MPI_IN_PLACE
- Fixed in master, but has another issue.
- Brian and George working on PR for this.
- Will bring the whole Alltoallw series into v4.1.x
- Should also cleanup alltoallv issues in v5.0.x
- hcoll 9619 went back to v4.1.x yesterday.
- George saw intermittent hangs in MPI_Comm_spawn and will open an issue.
- IBM _inter tests
- Once all of the Alltoall[v|w] fixes are merged to master and cherry-picked back to v4.1.x, will roll another rc.
- Schedule: slipped to Q1, 2022
- A lot of fixes went into master but were not cherry-picked back.
- Austen will investigate and open PRs later this week.
- Jeff, Brian and others are working on PMIx/PRRTE integration.
- Sent an email to devel
- https://github.com/open-mpi/ompi/issues/9540 might be ready on v5.0.x
- 8 PRs open.
- PR 9594 - Fixes some BTL issues (against master) will take a few days to review.
- Issue #9554 Jeff asked about Partitions support going to v5.0 or not?
- Matthew is interested
- PR #9495 TCP Onesided for master.
- Tommy's still pushing on UCX Onesided.
- PR 9576 - Ralph filed a ticket about building packages externally.
- Working with Fedora packagers. Will be a v5.0.x
- Might need some back and forth with PMIx. The way he updated PMIx might need massive change to OMPI.
- Ball is somewhat in Jeff's court.
- Across OMPI/PMIx/PRRTE - Just need to
- MPI Info stuff that Joseph and Howard are working on.
- Marking a few MPI_ calls as deprecated.
- Never mind: since we're not MPI 4.0 compliant yet, don't mark them as deprecated yet.
- No additional discussion.
- Documentation
- Got a needed change into the Sphinx tools. Not sure if there's a release with it yet.
- This fixes outputting issues in manpages.
- Process to update FAQ is to talk to Jeff or Harumi.
- Any changes in README or FAQ let them know to make changes in NEW docs.
- For now, make changes in ompi-www and README as usual and let them know.
- Issue 9501 regression, needs to be fixed or reverted.
- No test for building from tarball, ensure we don't need pandoc.
- Github Project of [critical v5.0.x issues|https://github.com/open-mpi/ompi/projects/3]
- Issue #8983 If we partially disable OSC/TCP BTL - Not breaking MPI compliance, just breaking One-sided performance badly.
- https://github.com/open-mpi/ompi/pull/8984
- https://github.com/open-mpi/ompi/issues/7830
- users could fall back to using UCX or OFI, and not the BTLs.
- But that's a different can-of-worms
- Brian will take a look at issue.
- Described approach of rc1 on Sept 23: disable any functionality that is a blocker, to allow for the rc.
- Worried that blockers might not be fixed in time, so will put in code to issue an error at runtime to prevent getting into those paths, and document it heavily.
- RDMA Onesided might be stalled.
- He's identified the core issues
- A bunch of cleanup work, he's done about half of.
- Understand and have written down the problems.
- All BTL completion semantics stuff.
- Who has time.
- Regression and Silent data corruption.
- Would it be worth sending an email to devel list?
- Time and Date of BOF Nov 16 @ 12:15pm US Eastern Time.
- Everyone who's involved has been preparing in the SC21_BOF Slack channel.
- 140 people registered for Open MPI BoF, usually 75-100.
- Jeff will post PDFs of slides on
- Jeff will drive the slides.
- 3000 ish on-site. in
- Brian and Jeff are official reps for legal ownership
- Usually pinged in first quarter of the year.
- Todo item in Q1. Do an audit of infrastructure.
- Who has what permissions, etc.
- Who owns DNS domains
- Few other resources, that someone is managing on their own, and we don't know until it breaks.
- Also consider consolidating because we have a lot of infrastructure.
- Document!
- Reviewed and Approved against master: https://github.com/open-mpi/ompi/pulls?q=is%3Apr+is%3Aopen+base%3Amaster+review%3Aapproved
- https://github.com/open-mpi/ompi/pull/8926 - not ready.
- Still need to fix the Fortran bindings (temp arrays, not released on nonblocking arrays) as cleanup.
- Build issue might have been fixed by something else.
- Rebasing might fix it?
- George will take a look.
- Awaiting Review: https://github.com/open-mpi/ompi/pulls?q=is%3Apr+is%3Aopen+base%3Amaster+review%3Anone
- A number of PRs being worked...
- Remind a few people on older ones.
- No update
- Don't do the old system, use this new system for v5.0.0
- [Open MPI 4.0 API Compliance Github Project|https://github.com/open-mpi/ompi/projects/2]
- Howard opened the project and discussed
- MPI_T events #8057 - This needs to be rebased and merged.
- ECP person at Livermore. Either ask him, or Howard will rebase.
- Sessions branch, don't want to merge into master until possibly v5.0.1 gets out.
- It will complicate things in finalize/initialize code.
- Looking okay.
- Cisco tests are re-enabled.
- IBM still seeing Onesided issues.
- Static builds are somewhat broken on master.
- If PMIx is statically linked and compiled with Jansson, it will be broken.
- We're not always pulling in dependencies of our dependencies
- Revive MTT development monthly meetings in January of 2022
- Last few days haven't been able to build with internal PMIx.
- dynamic linking.
- If add -lpmix
- MPI_Comm_spawn - reportedly hanging.
- All IBM _inter tests do a Comm-Spawn (in same comm-world); THIS fails sometimes.
- Blocked in pmix_get function.
- No discussion.