WeeklyTelcon_20161213
Geoffrey Paulsen edited this page Jan 9, 2018
·
1 revision
- Dialup Info: (Do not post to public mailing list or public wiki)
- Geoff Paulsen
- Artem Polyakov
- Jeff Squyres
- Brian Barrett
- Howard
- Jimmy - SPI representative.
- Josh Hursey
- Josh Ladd
- Nathan Hjelm
- Ralph
- Todd Kordenbrock (HPE @ Sandia)
- Introductions.
- Can we leverage their 501(c)(3) non-profit status?
- One difference is that with SPI, Open MPI would keep its current legal status. Just associated with
- With Conservancy, Open MPI would be an activity of the Conservancy.
- Would be reasonable to request non-profit
- GitHub may be willing to treat an organization as a non-profit (through SPI).
- Jimmy doesn't see a meaningful difference between SPI and Conservancy.
- If we join within 60 days of Nov 15th (Ralph is liaison).
- Discussion
- Probably only need SPI's services; Conservancy provides more.
- When started this process, neither organization would reply to Ralph for 6 months.
- If you join SPI, not becoming part of their organization.
- Conservancy would prefer that we have more formal processes.
- Milestones: https://github.com/open-mpi/ompi-release/milestones/v1.10.5
- Pressing need to release 1.10.5
- Waiting on PR from Nathan, then will create RC.
- Master fix is correct, but has to be back ported to 1.10.5.
- Nathan's users want a release by end of week.
- Added regression test for darray bug.
- Mathias: PSM2 is not setting the one-sided bits correctly.
- Blocker Issues: https://github.com/open-mpi/ompi/issues?utf8=%E2%9C%93&q=is%3Aopen+milestone%3Av2.0.0+label%3Ablocker
- Known / ongoing issues to discuss
- Darray Datatype issue - 2.0.2 - do a minor point release
- Early termination is not handled correctly - 2550 - Ralph fixed already. 2552, 2553 (Jeff will clean up)
- osc_pt2pt wrong answer - 2505.
- IBM has a 1 line fix. Mark thinks there is another issue in lock-all.
- Nathan: that sounds like it could be it. Can call Fence, but either in an epoch or not in an epoch. When you try to do a true extent, we return the wrong extent, and wrong lower bound. OMPI was seeing true
- PMIx update
- Last changes went in. Josh is rolling a new RC.
- Josh will update a PR for the v2.x branch.
- Should improve memory usage, but not yet ideal.
- Fuzzy estimate: end of January.
- Strings on KNL are 40KB, and 80KB (per remote peer). This is not fixed in this RC.
- If we do compression, then have to do changes in OMPI. Currently clients don't free it. If we return
- Not sure if we want compression for all strings... for example hwloc output gets put into shared memory.
- Josh and Artem feel like mid-January for PMIx 1.2 + integration in Open MPI v2.1.0.
- Fujitsu was excited about this change. Things should get much much better.
- Fujitsu gets credit for investigating how bad this issue was. Thanks!
- Artem has a PMIx perf tool (in contrib of the PMIx sources). It measures memory consumption.
- Nathan is using an MPI memory usage test. It calls MPI_Init, does some collectives, and then reports process and node memory usage.
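The compression idea discussed under the PMIx update can be illustrated with a minimal sketch. It assumes a zlib-style codec and a made-up, repetitive topology-like string of roughly the size mentioned above. The notes do not say which compression scheme PMIx would use, and the function names and sample data here are hypothetical.

```python
import zlib


def compress_string(text: str) -> bytes:
    """Compress a string payload with zlib (an assumed codec, not necessarily PMIx's)."""
    return zlib.compress(text.encode("utf-8"), 6)


def decompress_string(blob: bytes) -> str:
    """Recover the original string from its compressed form."""
    return zlib.decompress(blob).decode("utf-8")


# Hypothetical stand-in for a large, repetitive per-peer string (tens of KB).
sample = "".join(f"core:{i % 64} cache:L2 numa:{i % 4};" for i in range(1500))

packed = compress_string(sample)
restored = decompress_string(packed)

# Report original size, compressed size, and whether the round trip is lossless.
print(len(sample), len(packed), restored == sample)
```

A reader of a compressed value has to decompress it into its own buffer before use, which may be part of why the notes question compressing every string, for example data that is placed in shared memory.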
- OMPI 2.1
- THE blocking issue is PMIx.
- Focus now is OMPI 2.0.2.
- Milestones: https://github.com/open-mpi/ompi-release/milestones/v2.0.0
- Review Master MTT testing (https://mtt.open-mpi.org/)
- Still no morning messages. Need to pester Brian about it. Apparently not allowed to make changes until after the new year.
- Mail from our AWS instance is not getting to us.
- Biggest failures we saw in 2.0.x and 2.1.x
- OSHMEM - the BTL fix fixed a bunch of things, but there are still a few errors (segv): Put or Get to a non-registered location.
- Jeff will make a ticket for few remaining OSHMEM failures.
- Sylvain seeing a bunch of errors in master oob/ud components
- Mostly timeouts. Not sure if hanging or just really slow.
- Josh - turned on Jenkins testing at IBM, may result in timeouts. Using PGI on PPC64.
- Face to Face in January - https://github.com/open-mpi/ompi/wiki/Meeting-2017-01
- Cisco, ORNL, UTK, NVIDIA
- Mellanox, Sandia, Intel
- LANL, Houston, IBM