WeeklyTelcon_20160614
Geoff Paulsen edited this page Jun 14, 2016
- Dialup Info: (Do not post to public mailing list or public wiki)
- Geoff Paulsen
- Jeff Squyres
- Arm Patinyasakdikul
- Edgar Gabriel
- Howard
- Joshua Ladd
- Nathan Hjelm
- Ralph
- Ryan Grant
- Sylvain Jeaugey
- Todd Kordenbrock
- Milestones: https://github.com/open-mpi/ompi-release/milestones/v1.10.3
- Appears to be ready to go, except for the PSM signal issue, which is discussed as a separate item.
- Dynamic comm spawn disconnecting: may need both a free and a disconnect.
- On disconnect, the child tries to send a signal to the parent and gets an unreachable error.
- Wiki: https://github.com/open-mpi/ompi/wiki/Releasev20
- Blocker Issues: https://github.com/open-mpi/ompi/issues?utf8=%E2%9C%93&q=is%3Aopen+milestone%3Av2.0.0+label%3Ablocker
- Milestones: https://github.com/open-mpi/ompi-release/milestones/v2.0.0
- Status:
- Merged "everything" in.
- Nathan was cherry-picking commits out of master, and it looks like we got everything.
- Will know after tonight's MTT run.
- Issue with propagation of the error code for persistent requests.
- Want to give Intel time to look at PSM PR to fix signal Handler chaining.
- People, look over NEWS PR for 2.0.0
- Split it into new sections.
- README is a bit behind NEWS.
- Jeff generated a nightly tarball just now, so MTT tests could start today.
- Nathan asked to accept PR for SKIF to reduce priority.
-
Timing of v1.10.3 vs v2.0.0 releases
- coordination of NEWS bullets
-
PSM/PSM2 signal hijacking: fix for v1.10.x and v2.0.0
- Jeff filed a PR that fixes the SEGV. Make sure the wording is good (it mentions the vendor).
- The PR looks in the environment for either the PSM or the PSM2 variable. If the env var is NOT set, it sets it to disabled.
- Default: don't produce PSM backtrace files unless the user asks for them via the env var.
- In JNI onload they dlopen libmpi, so do it for this as well.
- Open MPI has always had its own backtrace handler, and it was never understood where the signal handler was failing.
- For debugging, the PSM and PSM2 libraries getenv() the variable and register signal handlers for various signals.
- PSM2 correctly chains the signal handlers and puts the old handlers back when it is done.
- The only reason protection is needed here for HFI (the PSM2 library): a typo was discovered there, so at finalize it was resetting the signal handler to a random point in memory.
- Intel is pushing fixes back, aiming for the latest Fedora 25 (small window), to eventually get picked up by RHEL 7.3?
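The two mechanisms discussed above (defaulting an env var to "disabled" only when the user has not set it, and chaining a signal handler then restoring the previous one) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from the PR or from PSM/PSM2: the variable name `EXAMPLE_NO_BACKTRACE`, the value `"1"`, and all function names are hypothetical, and only plain (non-`SA_SIGINFO`) previous handlers are chained.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical variable name, for illustration only; the real PSM/PSM2
 * variable names are not given in these notes. */
#define EXAMPLE_NO_BACKTRACE_VAR "EXAMPLE_NO_BACKTRACE"

/* Set the variable to "1" (assumed here to mean "disabled") only if the
 * user has not set it. Returns 0 on success, -1 on failure. */
static int default_backtrace_off(const char *var)
{
    /* overwrite == 0: an existing user setting is left untouched. */
    return setenv(var, "1", 0);
}

static struct sigaction saved_action;     /* handler installed before ours */
static volatile sig_atomic_t debug_hits;  /* calls to the library-style handler */
static volatile sig_atomic_t app_hits;    /* calls to the application handler */

/* Stand-in for a handler the application installed first. */
static void app_handler(int signo)
{
    (void)signo;
    app_hits++;
}

/* Library-style handler: do our own work, then chain to the saved handler. */
static void chained_handler(int signo)
{
    debug_hits++;
    if (saved_action.sa_flags & SA_SIGINFO) {
        return; /* this sketch only chains plain sa_handler-style handlers */
    }
    if (saved_action.sa_handler != SIG_DFL &&
        saved_action.sa_handler != SIG_IGN) {
        saved_action.sa_handler(signo);
    }
}

/* Install our handler for signo and remember the previous one. */
static int install_chained(int signo)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = chained_handler;
    sigemptyset(&sa.sa_mask);
    return sigaction(signo, &sa, &saved_action);
}

/* At finalize, put back exactly what was saved, for the same signal. */
static int restore_saved(int signo)
{
    return sigaction(signo, &saved_action, NULL);
}
```

A caller would run `default_backtrace_off(EXAMPLE_NO_BACKTRACE_VAR)` before the library initializes, call `install_chained(sig)` at init, and `restore_saved(sig)` at finalize. Restoring from the saved `struct sigaction` for the same signal number is the step where a typo could leave a handler pointing at the wrong place.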
-
Next developer’s meeting - Will create a doodle for time.
- Cisco Chicago by O'Hare
- IBM DFW
- Cisco San Jose
-
Begin planning for 3.0 branch - There is a LOT of change on master. Several pages of changes for 2.1.
- Perhaps branch for 3.0 in the August timeframe, since master and 2.x have diverged a lot?
- Well, maybe we don't need to.
- No known ABI break for 3.0, so we won't yet fork for 3.0.
- We need to figure out the procedural issue of getting code changes in; dual check-ins to master and 2.x will be painful.
- Should we look into possibly bringing much of the 2.x branch up to date with master? That is a lot of risk!
-
MTT development - A lot of development
- Do we need an MTT telecon for a while (biweekly)? Ralph will set up a Doodle.
-
Non-member access to ompi-tests
- A non-member is asking for access to the tests in order to do testing. Seems great, but we haven't yet given access to non-members.
- In this case they are working on the contributor's agreement.
-
Open MPI - 2.0 testing is down to about a 0.2% error rate on Jeff's runs.
-
Group Proc Count errors
- It'd be nice to group MTTs that are not common; sparse groups are going into 2.1.
-
Would be nice to have a Known_Failure file of some sort.
-
Also would be nice to "group" certain tests (like via a tag) such that when all of them fail, it's easier to know "That's all of the MPI_Group_create tests".
Review Master MTT testing (https://mtt.open-mpi.org/)
- Still a lot of failures on master.
- Cray failures may be a cluster issue; Howard needs to look at them.
- Cisco, ORNL, UTK, NVIDIA
- Mellanox, Sandia, Intel
- LANL, Houston, IBM