WeeklyTelcon_20160419

Geoff Paulsen edited this page Apr 19, 2016 · 5 revisions

Open MPI Weekly Telcon


  • Dialup Info: (Do not post to public mailing list or public wiki)

  • Date: April 19, 2016

Attendees

  • Todd Kordenbrock
  • Geoff Paulsen
  • Jeff Squyres
  • Howard
  • Josh Hursey
  • Joshua Ladd
  • Nathan Hjelm
  • Ralph
  • Sylvain Jeaugey

Agenda

Review 1.10

  • Milestones: https://github.com/open-mpi/ompi-release/milestones/v1.10.3
    • Predefined datatype test still failing on 1.10.
    • Attribute tests keep failing (Cisco)
      • cxxWinAddr test still failing.
    • Omni-Path issue: Ralph supplied a patch file (matias.diff), but Howard could not get it to work.
      • The issue seems to be with PSM2 creating endpoints on a single node.
    • Error signal handler in the PSM libraries: nothing we can do at the OMPI layer.
    • Next 1.10 release: need to fix these issues first; looking like early May.
    • Jeff and Ralph propose stopping the 1.10 series after this release, if we can get 2.0 out.

Review 2.0.x

  • Wiki: https://github.com/open-mpi/ompi/wiki/Releasev20

  • Blocker Issues: https://github.com/open-mpi/ompi/issues?utf8=%E2%9C%93&q=is%3Aopen+milestone%3Av2.0.0+label%3Ablocker

    • Still at 1 remaining blocker: the memory symbol patcher - Nathan / IBM / Mellanox.
      • Everything is looking good, but FreeBSD offers a slightly different ELF format in its elf.h.
      • Nathan: disable it on FreeBSD. Folks there are just using it for dev / TCP, no RDMA.
      • Absoft - hitting a compiler error with GCC 4.1.2.
      • On some systems, something is overriding the default. Most customers are on x86 or Power, so not a big deal.
        • Fujitsu will care someday
      • Everyone happy? yes because they all stack with UCX.
      • Howard, did you add a test to MTT to stress this code path? - Mark gave Nathan a test case he added to IBM.
      • mremap() on Linux takes an extra optional argument when MREMAP_FIXED is set.
        • Nathan made the last argument explicit.
      • SPARC is still having issues, so will need a solution for 2.0.1.
        • Have some time, since Fujitsu isn't moving to 2.x until later this year.
        • Will disable leave-pinned on SPARC.
      • Nathan will work to remove ptmalloc on master
        • Checked into Master.
      • On 2.0.0, Nathan will add an explicit --enable-ptmalloc configure option; ptmalloc will not be built by default.
        • If users configure with --enable-ptmalloc, it would disable the internal memhook frameworks entirely.
        • When this happens, will have to add some early code to tickle ptmalloc.
        • Need to document that with --enable-ptmalloc, munmap() calls may give wrong answers.
      • Nathan will look at README for memory hook stuff.
      • The late opening of mpool has been held off, because ptmalloc is still optional on 2.0.0.
        • The rest of the master stuff will be pulled over in 2.1.0.
      • If Nathan can do NEWS, that'd be great; otherwise Jeff and Howard will get it in.
      • OpenFabrics should get its act together and put something in the kernel to alleviate all of this.
    • Question: do we want the new, prettier ompi_info output? It didn't change the parsable output.
      • Low risk, got contributor agreement (works for SuSE). Can pull 1515, 1516, 1518 into 2.0.
    • Timeframe? If Nathan gets stuff in today, then will make an RC in the next couple of days.
  • Milestones: https://github.com/open-mpi/ompi-release/milestones/v2.0.0
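
For readers unfamiliar with the mremap() point in the blocker notes above: on Linux, mremap() is variadic, and its fifth argument (the new address) is only consulted when MREMAP_FIXED is set. The sketch below illustrates that calling convention only. It is not the Open MPI patcher code, and the helper name remap_with_explicit_target is hypothetical.

```c
#define _GNU_SOURCE   /* exposes mremap() and the MREMAP_* flags in <sys/mman.h> */
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical demo helper (not Open MPI code).
 * Moves one anonymous page to an explicitly chosen address.
 * Returns 0 on success, -1 on any failure. Linux only. */
int remap_with_explicit_target(void)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);

    /* Source mapping: one anonymous page filled with a known pattern. */
    char *src = mmap(NULL, page, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (src == MAP_FAILED)
        return -1;
    memset(src, 0x5a, page);

    /* Reserve a destination range that we own, so MREMAP_FIXED only
     * replaces a mapping created here. */
    char *dst = mmap(NULL, page, PROT_NONE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (dst == MAP_FAILED) {
        munmap(src, page);
        return -1;
    }

    /* With MREMAP_FIXED the fifth argument is required, and
     * MREMAP_MAYMOVE must be set as well. */
    void *moved = mremap(src, page, page,
                         MREMAP_MAYMOVE | MREMAP_FIXED, dst);
    if (moved == MAP_FAILED) {
        munmap(src, page);
        munmap(dst, page);
        return -1;
    }

    /* The page now lives at dst and still holds the pattern. */
    unsigned char *p = moved;
    int ok = (moved == (void *)dst) && p[0] == 0x5a && p[page - 1] == 0x5a;

    munmap(moved, page);
    return ok ? 0 : -1;
}
```

Calling remap_with_explicit_target() from a small main() and checking for a return value of 0 is enough to exercise it. A wrapper that intercepts mremap() has to forward this fifth argument whenever MREMAP_FIXED is passed, which is presumably why the argument was made explicit.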

Review Master?

  • Started looking kinda clean other than other stuff.
    • Allgather issue on intra-collectives.

New discussion

  • Merge GitHub projects/repos into one with master and release branches, with restricted permissions on release branches.
    • After the 2.0.0 release, will look at the logistics of merging the two repos.
    • Delayed - permissions for tagging, pushing branches, etc.?
    • Delayed - always use pull requests, or keep master so anyone can push?

MTT status:

  • Noah at Intel - added some features to python client. Fixed slurm.
  • IBM has a client facing cluster
    • IBM Cluster submitting to Open cluster
    • Switch these over to be publicly viewable this week.
    • Power 8 set of machines.
    • Will also have LSF. Has LSF now, but had some issues.
    • Hopefully will have Jenkins pull request testing working.
      • An issue with where credentials have to live. Josh is negotiating.
      • Having access to that version of Jenkins has been useful.
    • If 'we' can't see console output. Mellanox community CAN see console output.
      • Community would not be happy if can't see output.
      • Had an idea to push output to a GitHub gist.
    • Working on getting Jenkins set up on the IBM side to ensure pull requests get tested on Power also.
    • Hoping to have online this week?
  • Looking at establishing MTT release .tarballs.
    • Intel guys are looking at how to "release" MTT. MTT does not lend itself well to a "release".
    • Because they want to include it in HPC Cluster testing project.
  • How will PMIx 1.1.4 be moved into 2.0.1?

Status Updates:

  1. Mellanox - nothing other than the usual scrambling. Hopefully got the OSHMEM issues resolved.
    • Do you know if, once those things go into master, they will be v1.3 compliant?
    • Multi-subnet routing has been delayed to 2.1.
    • Josh instituted some additional processes and procedures for his team.
    • HPC / Mellanox remove Mike Dubman and use Josh Ladd
  2. Sandia - adding, at the moment, a dependency in the One Sided Component.
    • Looking at creating an OMPI Portals component for all of that to exist in.
    • Have a collective component with a handful of Portals collectives. Figuring out which best ones to add.
    • When to add new collective components? 2.0.1 or 2.1?
      • 2.0.1 would be focused on bug fixes.
      • 2.1.0 will include both features / bugfixes.
  3. Intel - PMIx stuff.
    • Started SCON - Scalable Overlay Network Project.
      • eventually make it a configurable option to the PMIx library (for PMIx to use for communication)
      • will have ____ lots of components, and collective options.
      • Would port whatever makes sense to ORTE.
    • Working with HPC stack guys, for communicating different elements, etc

Status Update Rotation

  1. Cisco, ORNL, UTK, NVIDIA
  2. Mellanox, Sandia, Intel
  3. LANL, Houston, IBM

Back to 2016 WeeklyTelcon-2016
