WeeklyTelcon_20160126
Geoff Paulsen edited this page Jan 26, 2016
- Dialup Info: (Do not post to public mailing list or public wiki)
- Geoff Paulsen
- Jeff Squyres
- Brad Benton
- Edgar Gabriel
- Geoffroy Vallee
- Joshua Ladd
- Nathan Hjelm
- Ralph Castain
- Ryan Grant
- Sylvain Jeaugey
- Todd Kordenbrock
- Milestones: https://github.com/open-mpi/ompi-release/milestones/v1.10.3
- 1.10.2 went out the door.
- Already have a bug (Gilles) that Ralph fixed.
- Another bug: broken Fortran F08 bindings, which Jeff saw late last night.
- Issue https://github.com/open-mpi/ompi/issues/1323
- If it's broken, how did it pass testing? Jeff needs a day or two to dig into it.
- Need to verify that library versions are still correct? Jeff took care of it.
- MPI_Abort investigation (Ralph)? The combination of MPI_Abort and MTT periodically misbehaves. Perl is suspect; Ralph will look into Ruby or another language.
- 1.10 C Strided mutex lock issue. (Nathan)?
- High CPU utilization on async progress thread (Ralph)? Ralph fixed it. One-off for 1.10, not in master. In 1.10.2.
- Wiki: https://github.com/open-mpi/ompi/wiki/Releasev20
- Blocker Issues: https://github.com/open-mpi/ompi/issues?utf8=%E2%9C%93&q=is%3Aopen+milestone%3Av2.0.0+label%3Ablocker
- Issue 1252 - Nathan's progression decay function progress? Looking at files today.
- udcm, openib_error_handler - opal_outputs would be sufficient.
- Issue 1215 - Group Comm Errors thing (Ralph) - Deal with race condition in ORTE collectives.
- Launch goes down the tree. Modex goes across the tree.
- So it is possible to receive a modex message before you receive the launch message.
- Milestones: https://github.com/open-mpi/ompi-release/milestones/v2.0.0
- Group Comms weren't working for Comms of powers of 2. (Nathan)? Fixed.
- ROMIO default for OMPI on Lustre (only) PR 896?
- 894, 890, 900, 901 - Jeff and Howard are good with these. Jeff?
- Taking all of those merged.
- Issue 1292 - Asked Ralph if this is right way to fix this. (Ralph)
- Issue 1177 - large message writev, fixed but not merged to master - Test working everywhere but OS X / BSD (George).
- OS X / BSD limits large message total size to 32K?
- Not going to fix for 2.0.0
- Someone can write code to handle OS X / BSD.
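For whoever picks up the OS X / BSD case, here is a minimal sketch of one way to split a large vectored write into several smaller `writev()` calls. It is not the fix from Issue 1177. The names `writev_all_limited` and `MAX_IOV` and the `limit` parameter are hypothetical, and the sketch assumes a blocking file descriptor.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical cap on iovec entries per call (kept well below IOV_MAX). */
#define MAX_IOV 16

/*
 * Write everything described by iov[0..iovcnt) to fd, never passing more
 * than `limit` bytes to a single writev() call.
 * Assumes a blocking descriptor. Modifies the iov array in place.
 * Returns 0 on success, -1 on error.
 */
int writev_all_limited(int fd, struct iovec *iov, int iovcnt, size_t limit)
{
    int idx = 0;

    assert(limit > 0);
    while (idx < iovcnt) {
        struct iovec batch[MAX_IOV];
        size_t budget = limit;
        size_t left;
        ssize_t w;
        int n = 0;

        /* Skip entries that are already fully written or empty. */
        if (iov[idx].iov_len == 0) {
            idx++;
            continue;
        }

        /* Build a batch that respects both the entry cap and the byte cap. */
        for (int i = idx; i < iovcnt && n < MAX_IOV && budget > 0; i++) {
            size_t len = iov[i].iov_len < budget ? iov[i].iov_len : budget;
            batch[n].iov_base = iov[i].iov_base;
            batch[n].iov_len = len;
            budget -= len;
            n++;
        }

        w = writev(fd, batch, n);
        if (w < 0) {
            if (errno == EINTR) {
                continue;           /* interrupted: retry the same batch */
            }
            return -1;
        }
        if (w == 0) {
            return -1;              /* no progress: give up rather than spin */
        }

        /* Advance past whatever was written, including partial entries. */
        left = (size_t)w;
        while (idx < iovcnt && left >= iov[idx].iov_len) {
            left -= iov[idx].iov_len;
            idx++;
        }
        if (left > 0) {
            iov[idx].iov_base = (char *)iov[idx].iov_base + left;
            iov[idx].iov_len -= left;
        }
    }
    return 0;
}
```

The per-call byte limit is a parameter because the notes leave the actual OS X / BSD limit as an open question.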
- Issue 1299 - hang (Nathan)? Need to go ahead and fix this today. Gilles has a patch, Nathan just needs to verify.
- 2.0.0 does not compile on Solaris due to statfs(). Now that we moved to OMPIO, we're now hitting the problem.
- Edgar is working on it. Solaris has different number of args and return code.
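One possible direction, sketched here as an assumption and not as what Edgar is implementing: POSIX `statvfs()` takes the same two arguments on every conforming platform, so it avoids the differing `statfs()` signatures. The helper name `fs_free_bytes` is hypothetical. Note that `statvfs()` does not portably report the filesystem type, so it would not by itself replace a filesystem-detection check.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/statvfs.h>

/*
 * Hypothetical helper: report the bytes available to unprivileged users
 * on the filesystem containing `path`.
 * Returns 0 and fills *free_bytes on success, -1 on failure.
 */
int fs_free_bytes(const char *path, unsigned long long *free_bytes)
{
    struct statvfs buf;

    assert(path != NULL && free_bytes != NULL);

    /* statvfs() returns 0 on success and -1 (with errno set) on failure. */
    if (statvfs(path, &buf) != 0) {
        return -1;
    }

    /* f_bavail is counted in units of f_frsize (the fragment size). */
    *free_bytes = (unsigned long long)buf.f_bavail
                  * (unsigned long long)buf.f_frsize;
    return 0;
}
```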
- Issue 1301 - check max CQ size before creating CQ. (Josh)
- If it passes Jenkins, happy. UD OOB (Mellanox runs). Approved, Pending Jenkins.
- HWThreads - Ralph? Talk to Mike about use case? A commit has been done, and moved to 1.10.
- Pinged Gilles that it should go to 2.0 also.
- Travis Status on 2.0?
- Going well.
- Nathan is good with 2.0 for 1sided
- PR918 - Ralph reviewed on master. Gilles PRed it to 2.0.
- PR919 - hwloc - Ralph will review
- PR911 - use correct endpoint. Just got word from NVIDIA that this is good.
- PR917 - Ryan will look at today. LANL hardware that hits this is going away. Doesn't affect Aries. Aries doesn't have get_alignment(). Want this in.
- BTL flags = 305: performance got horrible? (Edgar) Worked around it by removing this setting on his cluster. Don't understand why. He always used to set it, but now doesn't.
- OMPIO not finding PVFS2 - configure work Edgar is
- Cisco was showing timeouts. Jeff found 2 things on cluster. Specific problem couldn't replicate.
- Not handling OOB on master or 1.10. The Cisco cluster has 4 or 5 IP addresses on each node. eth0 was down on one node, and the timeout on eth0 was taking quite a while. Jeff removed those two nodes. Unusual for the real world. OOB verbosity exposes it.
- Long running problem, need a good solution.
- Cisco - Been working on the cluster and on release issues with Howard. Have a couple of small scalability improvements for usNIC.
- ORNL - Not much to report. Any progress with Ubuntu package ownership? Geoffroy will look on Saturday.
- UTK - Not much to report.
- NVIDIA - Sylvain: not much. A user issue with not finding CUDA. The user got an error message in the log, but the job ran fine.
- Looking at some error
- Discussion about configure summary at end of configure?
- 67 Frameworks and over 200 components.
- 6 major frameworks: RAS, PLM, PML, MTL, BTL, OOB.
- Could someone mock up what they'd like to see?
- Leery of the runtime environment. With Moab on top of SLURM, there are env vars that are not job related.
- ompi_info to lookup how it was built.
- --with behavior to help also.
- Decided to add to Feb Face2Face.
- Cisco, ORNL, UTK, NVIDIA
- Mellanox, Sandia, Intel
- LANL, Houston, HLRS, IBM