WeeklyTelcon_20170207
Geoffrey Paulsen edited this page Jan 9, 2018
- Dialup Info: (Do not post to public mailing list or public wiki)
- Geoff Paulsen
- Jeff Squyres
- Artem Polyakov
- Edgar Gabriel
- Howard
- Josh Hursey
- Joshua Ladd
- Todd Kordenbrock
- Thomas Naughton
- Nathan Hjelm
- Ryan Grant
Review All Open Blockers
Review Milestones v1.10.6
- Ralph put in the approved stuff this morning.
- Still 7 PRs that need review.
- No schedule yet.
- Want to check that 2678 doesn't impact 1.10, but think it might.
Review Milestones v2.0.3
- [PR 2593|https://github.com/open-mpi/ompi/pull/2593] - osc_pt2pt: previous locks must complete in order; not sure what the standard requires.
- Nathan - the code in question does something, then deadlocks. There is a gist of the test code.
- Nathan - low on priority list. osc_rdma succeeds (uses atomics instead of osc_pt2pt locks).
- Nathan - not opposed to the patch, but wants to understand.
Review Milestones v2.1.0
- Nathan trying to get OMPI 2.1 to launch at scale this week.
- PR2932 - would cause PMIx to use dstore by default in v2.1, to match master.
- Nathan knows there is a bug in dstore. It stomps memory. v2.0.x launches okay at scale, but master does not. Possibly related to this.
- It's just add_procs - only happens in 5%-10% of ranks. offset was 0, data segment was 0.
- Artem: Let's use Artem's patch to keep those ranks running so we can attach and see.
- Nathan ran STAT, and showed where it was.
- We all agreed we WANT dstore in OMPI v2.1, but don't want to merge this PR until we have dstore fixed.
- Artem is actively working it.
- 1ppn launch is pretty good, but 64ppn will crash or do bad things because of dstore overwrite.
- Everything else is about bugging folks for reviews.
- Yesterday we started looking at them. Some need rebasing, and many need reviews.
- Schedule - the gating feature is PMIx, and dstore is also a blocker.
- held up because of this dstore / shared memory ppn scaling issue.
- Is the dstore similar enough between PMIx - v1.1.2 and v1.2.1?
- pretty close, but cherry-picks are not clean.
- related to the problem we are solving.
- Hopeful to have a PMIx v1.2.1 RC rolled this week, and PR this into Open MPI v2.1 late this week, or next week.
Review Master Pull Requests
- Artem opened a PR to resolve the dstore problem. Would like to know what happens next with it.
- MPI_Spawn problem
- Sylvain was seeing MTT failures with a very similar error.
- Jeff opened PR 2925. Do we need to wait for Ralph to review it?
- Josh Hursey will review.
- Jeff could narrow this down, because he wasn't seeing it last week.
- Don't know if this affects Open MPI v2.1.x
Review Master MTT testing
- Jeff's MTT tonight is running slower than it usually does. So far doing all master things.
- Don't know if something went in over the past day or three that would cause a noticeable slowdown over 10,000 tests.
- Mac OS X - only 4 processors. Right now only building.
- Are you doing MTT on OS X? - plan to.
- /tmpdir issue - orte is generating a path into /var/lib/tmp - path is too big, so have to manually set tempdir.
- Apple's problem, but we need a work around.
- How long can travis take? Any reason to keep botnybay?
- Would like to see Amazon AWS come online first, before we turn off the Linux side of Travis.
- Travis - paid service is $130 / month = $1560 / year.
- Just like GitHub Enterprise, there is Travis Enterprise.
- Ask friends to see if any open source projects / SPI are using Travis Enterprise.
- Cisco, ORNL, UTK, NVIDIA
- Mellanox, Sandia, Intel
- LANL, Houston, IBM, Fujitsu
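
The /tmpdir workaround noted under Master MTT testing (manually setting the temp directory when the generated path is too long) could be sketched roughly as below. This is only an illustration, not what anyone on the call is running: the helper name `env_with_short_tmpdir`, the `/tmp` fallback, and the 40-character threshold are all hypothetical choices, and the only mechanism assumed is the standard `TMPDIR` environment variable.

```python
import tempfile


def env_with_short_tmpdir(environ, short_dir="/tmp", max_len=40):
    """Return a copy of `environ` with TMPDIR overridden when it is too long.

    `short_dir` and `max_len` are hypothetical defaults chosen for this
    sketch; pick values that suit the machine being tested.
    """
    env = dict(environ)  # copy, so the caller's mapping is not modified
    current = env.get("TMPDIR", tempfile.gettempdir())
    if len(current) > max_len:
        env["TMPDIR"] = short_dir
    return env


# Example: a long per-user temp path gets replaced, a short one is kept.
long_env = env_with_short_tmpdir({"TMPDIR": "/var/folders/" + "x" * 60})
short_env = env_with_short_tmpdir({"TMPDIR": "/tmp/a"})
```

The returned mapping can be passed as the `env=` argument of `subprocess.run` when launching a test, so the override applies only to that child process rather than to the whole session.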