WeeklyTelcon_20160126

Geoff Paulsen edited this page Jan 26, 2016 · 12 revisions

Open MPI Weekly Telcon


  • Dialup Info: (Do not post to public mailing list or public wiki)

Attendees

  • Geoff Paulsen
  • Jeff Squyres
  • Brad Benton
  • Edgar Gabriel
  • Geoffroy Vallee
  • Joshua Ladd
  • Nathan Hjelm
  • Ralph Castain
  • Ryan Grant
  • Sylvain Jeaugey
  • Todd Kordenbrock

Agenda

Review 1.10

  • Milestones: https://github.com/open-mpi/ompi-release/milestones/v1.10.3
  • 1.10.2 went out the door.
  • Already have a bug (Gilles), which Ralph fixed.
  • Another bug: broken Fortran F08 bindings, which Jeff saw late last night.
  • Need to verify that library versions are still correct? - Jeff took care of this.
  • MPI_Abort investigation (Ralph)? - Periodically MPI_Abort + MTT has some issue. Perl is suspect; Ralph will look into Ruby or another language.
  • 1.10 C Strided mutex lock issue. (Nathan)?
  • High CPU utilization on Async progress thread (Ralph)? Ralph fixed it. One-off for 1.10, not in master. In 1.10.2.

Review 2.0.x

  • Wiki: https://github.com/open-mpi/ompi/wiki/Releasev20
  • Blocker Issues: https://github.com/open-mpi/ompi/issues?utf8=%E2%9C%93&q=is%3Aopen+milestone%3Av2.0.0+label%3Ablocker
    1. Issue 1252 - Nathan's progression decay function progress? Looking at files today.
      • udcm, openib_error_handler - opal_outputs would be sufficient.
    2. Issue 1215 - Group Comm Errors thing (Ralph) - Deal with race condition in ORTE collectives.
      • Launch goes down the tree. Modex goes across the tree.
      • So it is possible to receive a modex message before you receive the launch message.
  • Milestones: https://github.com/open-mpi/ompi-release/milestones/v2.0.0
  • Group Comms weren't working for Comms of powers of 2. (Nathan)? Fixed.
  • ROMIO default for OMPI on Lustre (only) PR 896?
  • 894, 890, 900, 901 - Jeff and Howard are good with these. Jeff?
    • Taking all of those merged.
  • Issue 1292 - Asked Ralph if this is right way to fix this. (Ralph)
  • Issue 1177 - large message writev, fixed but not merged to master - Test working everywhere but OS X / BSD (George).
    • OS X / BSD limits large message total size to 32K?
    • Not going to fix for 2.0.0
    • Someone can write code to handle OS X / BSD.
  • Issue 1299 - hang (Nathan)? Need to go ahead and fix this today. Gilles has a patch, Nathan just needs to verify.
  • 2.0.0 does not compile on Solaris due to statfs(). Now that we moved to OMPIO, we're now hitting the problem.
    • Edgar is working on it. Solaris has different number of args and return code.
  • Issue 1301 - check max CQ size before creating CQ. (Josh)
    • If it passes Jenkins, happy. UD OOB (Mellanox runs). Approved, Pending Jenkins.
  • HWThreads - Ralph? Talk to Mike about use case? A commit has been done, and moved to 1.10.
    • Pinged Gilles that it should go to 2.0 also.
  • Travis Status on 2.0?
    • Going well.
  • Nathan is good with 2.0 for one-sided.
  • PR918 - Ralph reviewed on master. Gilles PRed it to 2.0.
  • PR919 - hwloc - Ralph will review
  • PR911 - use correct endpoint. Just got word from NVIDIA that this is good.
  • PR917 - Ryan will look at today. LANL hardware that hits this is going away. Doesn't affect Aries. Aries doesn't have get_alignment(). Want this in.
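
The ordering race noted under Issue 1215 (launch messages travel down the daemon tree while modex messages travel across it, so a modex message can arrive before the launch message) can be illustrated with a small sketch. This is not ORTE code and not necessarily the fix that was applied; the `Daemon` class and its method names are hypothetical. It shows one common way to tolerate such reordering: hold early messages for a job that is not yet known, then replay them once the launch message arrives.

```python
from collections import defaultdict


class Daemon:
    """Hypothetical daemon that tolerates modex-before-launch reordering."""

    def __init__(self):
        self.launched = set()             # job ids whose launch message has arrived
        self.pending = defaultdict(list)  # job id -> modex payloads held back
        self.delivered = []               # (job id, payload) in processing order

    def on_modex(self, job, payload):
        # A modex message for a job we have not been told about yet is
        # buffered rather than dropped or treated as an error.
        if job not in self.launched:
            self.pending[job].append(payload)
            return
        self.delivered.append((job, payload))

    def on_launch(self, job):
        # The launch message makes the job known; replay anything that
        # arrived early, preserving arrival order.
        self.launched.add(job)
        for payload in self.pending.pop(job, []):
            self.delivered.append((job, payload))


d = Daemon()
d.on_modex("job1", "peer-A")   # arrives before the launch message
d.on_launch("job1")
d.on_modex("job1", "peer-B")   # arrives after launch, handled directly
print(d.delivered)
```

The design choice here is to treat an early message as "not yet deliverable" instead of as a protocol error, which keeps the two message paths independent of each other's timing.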

Review Master?

  • BTL flags = 305 perf got horrible? Edgar? Worked around by removing this on his cluster. Don't understand why. He always used to set it, but now doesn't.
  • OMPIO not finding PVFS2 - configure work Edgar is

MTT status:

  • Cisco was showing timeouts. Jeff found 2 things on cluster. Specific problem couldn't replicate.
    • Not handling OOB on master or 1.10. The Cisco cluster has 4 or 5 IP addresses on each node; eth0 was down on one node, and the timeout on eth0 was taking quite a while. Jeff removed those two nodes. Unusual for real world. OOB verbosity exposes it.
    • Long running problem, need a good solution.
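
The eth0 timeout described above is an instance of a general pattern: when a node advertises several addresses and one interface is down, waiting out a long timeout on the dead address stalls the whole connection attempt. A minimal sketch of trying each address with a short per-address timeout follows. It is an illustration only, not how the Open MPI OOB works and not a fix proposed on the call; `connect_first_reachable`, the 2-second default, and the example addresses are all assumptions.

```python
import socket


def connect_first_reachable(addresses, port, timeout=2.0,
                            connect=socket.create_connection):
    """Return (address, connection) for the first address that accepts.

    `connect` is injectable so the logic can be exercised without a network.
    """
    errors = {}
    for addr in addresses:
        try:
            # A short per-address timeout bounds the cost of a dead interface.
            conn = connect((addr, port), timeout)
            return addr, conn
        except OSError as exc:  # covers timeouts and refused connections
            errors[addr] = exc
    raise ConnectionError(f"no reachable address: {errors}")


def fake_connect(target, timeout):
    # Stand-in for a node whose first interface is down (hypothetical addresses).
    if target[0] == "10.0.0.1":
        raise socket.timeout("simulated dead interface")
    return f"connected to {target[0]}:{target[1]}"


addr, conn = connect_first_reachable(["10.0.0.1", "10.0.1.1"], 5000,
                                     connect=fake_connect)
print(addr, conn)
```

With the default `connect`, the same function would open a real TCP connection; the fake connector is used here only so the example runs anywhere.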

Status Updates:

  • Cisco - Been working on the cluster, and on release issues with Howard. Have a couple of small scalability improvements for usNIC.
  • ORNL - Not much to report. Any progress with Ubuntu package ownership? Geoffroy will look on Saturday.
  • UTK - Not much to report.
  • NVIDIA - Sylvain: not much. A user issue with not finding CUDA: the user got an error message in the log, but the job ran fine.

Status Update Rotation

  1. Cisco, ORNL, UTK, NVIDIA
  2. Mellanox, Sandia, Intel
  3. LANL, Houston, HLRS, IBM

Back to 2016 WeeklyTelcon-2016