
WeeklyTelcon_20160126

Geoff Paulsen edited this page Jan 26, 2016 · 12 revisions

Open MPI Weekly Telcon


  • Dialup Info: (Do not post to public mailing list or public wiki)

Attendees

  • Geoff Paulsen
  • Jeff Squyres
  • Brad Benton
  • Edgar Gabriel
  • Geoffroy Vallee
  • Joshua Ladd
  • Nathan Hjelm
  • Ralph Castain
  • Ryan Grant
  • Sylvain Jeaugey
  • Todd Kordenbrock

Agenda

Review 1.10

  • Milestones: https://github.com/open-mpi/ompi-release/milestones/v1.10.3
  • 1.10.2 went out the door.
  • Already have one bug (Gilles), which Ralph fixed.
  • Another bug, in Fortran: broken F08 bindings, which Jeff saw late last night.
  • Need to verify that library versions are still correct? Jeff took care of it.
  • MPI_Abort investigation (Ralph)? Periodically see a problem with MPI_Abort under MTT. Perl is the suspect; Ralph will look into Ruby or another language.
  • 1.10 C Strided mutex lock issue. (Nathan)?
  • High CPU utilization on async progress thread (Ralph)? Ralph fixed it. One-off for 1.10, not in master. In 1.10.2.

Review 2.0.x

  • Wiki: https://github.com/open-mpi/ompi/wiki/Releasev20
  • Blocker Issues: https://github.com/open-mpi/ompi/issues?utf8=%E2%9C%93&q=is%3Aopen+milestone%3Av2.0.0+label%3Ablocker
    1. Issue 1252 - Nathan's progression decay function progress? Looking at files today.
      • udcm, openib_error_handler - opal_outputs would be sufficient.
    2. Issue 1215 - Group Comm Errors thing (Ralph) - Deal with race condition in ORTE collectives.
      • Launch goes down the tree. Modex goes across the tree.
      • So it is possible to receive a modex message before receiving the launch message.
  • Milestones: https://github.com/open-mpi/ompi-release/milestones/v2.0.0
  • Group Comms weren't working for Comms of powers of 2. (Nathan)? Fixed.
  • ROMIO default for OMPI on Lustre (only) PR 896?
  • 894, 890, 900, 901 - Jeff and Howard are good with these. Jeff?
    • Taking all of those merged.
  • Issue 1292 - Asked Ralph if this is right way to fix this. (Ralph)
  • Issue 1177 - large message writev, fixed but not merged to master - Test working everywhere but OS X / BSD (George).
    • OS X / BSD limits large message total size to 32K?
    • Not going to fix for 2.0.0
    • Someone can write code to handle OS X / BSD.
  • Issue 1299 - hang (Nathan)? Need to go ahead and fix this today. Gilles has a patch; Nathan just needs to verify.
  • 2.0.0 does not compile on Solaris due to statfs(). Now that we have moved to OMPIO, we are hitting the problem.
    • Edgar is working on it. Solaris has a different number of args and a different return code.
  • Issue 1301 - check max CQ size before creating CQ. (Josh)?
  • HWThreads - Ralph? Talk to Mike about use case? A commit has been done, and moved to 1.10.
    • Pinged Gilles that it should go to 2.0 also.
  • Travis Status on 2.0?
    • Going well.
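
For Issue 1177 above, the "someone can write code to handle OS X / BSD" item could amount to splitting one large vectored write into several smaller ones, each under the platform's total-size limit. The following is only a minimal sketch of that idea, not the fix discussed on the call: the function name `plan_writev_chunks` and the `max_bytes` parameter are hypothetical, and the actual limit on OS X / BSD is left as an input because the notes above are unsure of it.

```python
def plan_writev_chunks(buffers, max_bytes):
    """Group buffers into chunks whose total size is at most max_bytes.

    Hypothetical helper for illustration only. A buffer that does not fit
    in the remaining room of the current chunk is sliced (without copying)
    so that it continues in the next chunk.
    """
    if max_bytes <= 0:
        raise ValueError("max_bytes must be positive")

    chunks = []        # list of chunks; each chunk is a list of memoryviews
    current = []       # chunk being filled
    room = max_bytes   # bytes still available in the current chunk

    for buf in buffers:
        view = memoryview(buf)
        while len(view) > 0:
            take = min(len(view), room)
            current.append(view[:take])
            view = view[take:]
            room -= take
            if room == 0:
                # Current chunk is full: close it and start a new one.
                chunks.append(current)
                current = []
                room = max_bytes

    if current:
        chunks.append(current)
    return chunks


# Example with a deliberately tiny limit of 4 bytes per vectored write.
example = plan_writev_chunks([b"abcdef", b"gh", b"ijklm"], 4)
```

Each resulting chunk could then be handed to one vectored write call (for example `os.writev(fd, chunk)` on Unix), with the caller still responsible for handling short writes.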

Review Master?

  • With BTL flags = 305, perf got horrible? Edgar? Worked around by removing this setting on his cluster. Don't understand why. He always used to set it, but now doesn't.
  • OMPIO not finding PVFS2 - configure work (Edgar)
  • PR911 - use correct endpoint. Just got word from NVIDIA that this is good.
  • PR917 - Ryan will look at today. LANL hardware that hits this is going away. Doesn't affect Aries. Aries doesn't have get_alignment().

MTT status:

  • Cisco was showing timeouts.

Status Updates:

  • Cisco
  • ORNL
  • UTK
  • NVIDIA

Status Update Rotation

  1. Cisco, ORNL, UTK, NVIDIA
  2. Mellanox, Sandia, Intel
  3. LANL, Houston, HLRS, IBM

Back to 2016 WeeklyTelcon-2016