Meeting 2016 08
- Start: 9am, Tue Aug 16, 2016
- Finish: 1pm, Thu Aug 18, 2016
- Location: IBM facility, Dallas, TX
- Attendance fee: $50/person, see registration link below
- Exactly the same setup as in February
- IBM Dallas Innovation Center web site
- Google Maps
- Street address: 1177 South Beltline Road, Coppell, Texas 75019 USA
- Enter on the East entrance (closest to Beltline Road)
- Hollerith Room - On left after you walk in. (Now all 3 days in the same room!)
- Receptionist should have nametags for everyone.
- Foreign Nationals welcome.
- No need to escort visitors in this area.
Please both register at EventBrite ($50/person) and add your name to the wiki list below if you are coming to the meeting:
- Jeff Squyres, Cisco
- Howard Pritchard, LANL
- Geoffrey Paulsen, IBM
- Ralph Castain, Intel
- George Bosilca, UTK (17 and 18)
- Josh Hursey, IBM
- Edgar Gabriel, UHouston
- Takahiro Kawashima, Fujitsu
- Shinji Sumimoto, Fujitsu
- Brian Barrett, Amazon Web Services
- Nathan Hjelm, LANL
- Sameh Sharkawi, IBM (17 and 18)
- Mark Allen, IBM
- Josh Ladd, Mellanox (17)
- ...please fill in your name here if you're going to attend...
- Annual git committer audit
- Plans for v2.1.0 release
- Need community to contribute what they want in v2.1.0
- Want to release by end of 2016 at the latest
- Present information about IBM Spectrum MPI, processes, etc.
- May have PRs ready to discuss requested changes, but the schedule is tight in July / August for us.
- MTT updates / future direction
- How to help alleviate "drowning in CI data" syndrome?
- One example: https://github.com/open-mpi/ompi/pull/1801
- One suggestion: should we actively market for testers in the community to help wrangle this stuff?
- If Jenkins detects an error, can we get Jenkins to retry the tests without the PR changes, and then compare the results to see if the PR itself is introducing a new error?
- How do we stabilize Jenkins to alleviate all these false positives?
- PMIx roadmap discussions
- Thread-safety design
- Need some good multi-threaded performance tests (per Nathan and Artem discussion)
- Do we need to write them ourselves?
- Review/define the path forward
- Fujitsu status
- Memory consumption evaluation
- MTT status
- PMIx status
- Revive btl/openib memalign hooks?
- Request completion callback and thread safety
- Discuss appropriate default settings for openib BTL
- Email thread on performance conflicts between RMA/openib and SM/Vader
- Ralph offers to give a presentation on "Flash Provisioning of Clusters", if folks are interested
- Cleanup of exposed internal symbols (see https://github.com/open-mpi/ompi/pull/1955)
- Performance Regression tracking
- What do we want to track, and how are we going to do it?
- https://github.com/open-mpi/ompi/issues/1831#issuecomment-229520276
- https://github.com/open-mpi/mtt/issues/445
- Symbol versioning
- Per request from Debian: https://github.com/open-mpi/ompi/pull/1955
- There are 3 issues:
- Symbol visibility. Per his PR, it looks like we're leaking a lot of symbols that do not need to be public.
- `.so` version numbering for the MPI and OSHMEM libraries. We have made promises about this; we just need to honor those promises.
- Symbol versioning.
- Symbol versioning can fix the visibility problem, but that's really a side effect. We should just fix the visibility issue with proper use of DECLSPEC.
- Per the MPI spec, we don't need to version the MPI API calls (because the MPI Forum will not break APIs)
- Will it help to version things like `MPI_Comm`? (e.g., if we grow the size of the communicator struct)
- Fallout from this discussion: should we return to a single library so that the ORTE and OPAL symbols will not be exposed to users?
- We have flip-flopped on 1 vs. 3 libraries multiple times.
- From Feb 2016 meeting notes:
- Beginning of the project: there was just libmpi. Later, it was split into projects, and then the project libraries. Later, the build was unified back into libmpi again.
- In Dec 2012 (here's the commit), we split the build back into 3 libraries. The commit message cites discussion at the Dec 2012 Open MPI dev meeting -- but there are unfortunately no clues as to the rationale why this was done in the wiki notes. Was it just because we developers like having 3 smaller libraries? Or is there some deeper technical issue? Neither Ralph nor Jeff remembers. 😦
- Rationale for bringing this up again: when upstream projects are trying to link against portions of our project, and then also support apps that link against all of it, we run into conflicts (e.g., the ORTE being used by the upstream project may be different than the one being used by the OMPI installation). Slurping it all up into one library -- and making only the MPI API be visible -- would resolve the problem. ...but we cannot recall if there are undesirable side-effects.
- No issue with recombining libraries, need to look closer at whether ORTE/OPAL symbols still need to be public
- What to do about MPI_Info PR from IBM / MPI Forum gyrations about MPI_Info?
- Should we be using Slack.com as a community?
- NOTE: Some notes are included below, but a much more detailed writeup can be found in the meeting minutes.
- Status of v2.0.1 release
- Lots of PRs still...
- From the meeting:
- Closing in on v2.0.1. Most PRs are in. Release next Tuesday (Aug 23, 2016) if possible
- After v2.1.0 release, should we merge from master to the v2.x branch?
- Only if there are no backwards compatibility issues (!)
- This would allow us to close the divergence/gap from master to v2.x, but keep life in the v2.x series (which is attractive to some organizations)
- Alternatively, we might want to fork and create a new 3.x branch.
- From the meeting:
- Long discussion. There seems to be two questions:
- What to call the release after v2.1.x: v2.2.x or v3.x (i.e., whether there are backwards compatibility issues or not)
- Whether to merge `master` into the `v2.x` branch or fork into a new branch (regardless of whether the next release is v2.2.x or v3.x)
- The consensus seems to be that we think (but we don't know for sure because no one has systematically analyzed) there is both:
- A huge amount of code drift from master to v2.x such that a merge may generate tons of conflicts
- A bunch of backwards-incompatible changes (e.g., MCA vars and CLI params)
- Meaning: we think the next release should be v3.x and it should be a fork from master
- Migration to new cloud services update for website, database, etc.
- DONE:
- DNS:
- All 6 domains transferred to Jeff's GoDaddy account
- Web site:
- Migrate www.open-mpi.org to HostGator
- Install initial LetsEncrypt SSL certificates on www.open-mpi.org
- Submit CSR to U Michigan for 3-year SSL certificates on www.open-mpi.org (thank you, U. Michigan!)
- rsync web mirroring method shut down
- Mailing lists:
- Migrate mailman lists to NMC
- Freeze old mailing list archives, add to ompi-www git
- Add old mailing list archives to mail-archive.com
- Setup new mails to archive to mail-archive.com
- Email
- Setup 2 email legacy addresses: rhc@ and jjhursey@
- Infrastructure
- Nightly snapshot tarballs being created on RHC's machine and SCPed to www.open-mpi.org
- Github push notification emails (i.e., "gitdub")
- Converted Ruby gitdub to PHP
- Works for all repos... except ompi-www (due to memory constraints)
- Might well just disable git commit emails for ompi-www
- Contribution agreements
- Stored in Google Drive under [email protected] (and shared to a few others)
- Still to-do:
- Web site:
- Probably going to shut down the mirroring program.
- Possibly host the tarballs at Amazon S3 and put CloudFront in front of them
- Spin up an Amazon EC2 instance (thank you, Amazon!) for:
- Hosting Open MPI community Jenkins master
- Hosting Open MPI community MTT database and web server
- Revamp / consolidate: ompi master:contrib/ -- there are currently 3 subdirs that should really be disambiguated and overlap removed. Perhaps name subdirs by the DNS name where they reside / operate?
- infrastructure
- build server
- nightly
- Spend time documenting where everything is / how it is setup
- Fix OMPI timeline page: https://www.open-mpi.org/software/ompi/versions/timeline.php
- Gilles submitted a PR: https://github.com/open-mpi/ompi-www/pull/14
- DONE!
- Possible umbrella non-profit organization
- Details to be mailed to devel-core/admin: see http://spi-inc.org/
- Update Open MPI contrib agreements
- Created a new contributions@lists. email address, will update agreements
- MCA support as a separate package?
- Now that we have multiple projects (PMIx) and others using MCA plugins, does it make sense to create a separate repo/package for MCA itself? Integrating MCA into these projects was modestly painful (e.g., identifying what other infrastructure - such as argv.h/c - needs to be included) - perhaps a more packaged solution will make it simpler.
- Need to "tag" the component libraries with their project name as library confusion is becoming more prevalent as OMPI begins to utilize MCA-based packages such as PMIx
- From the meeting:
- The need for this has gone down quite a bit: PMIx copied and renamed, Warewulf is going to go python.
- But it seems worthwhile to take the next few steps in spreading the project name throughout the MCA system:
- Put the project name in the component filename: `mca_PROJECT_FRAMEWORK_COMPONENT.la`
- Add some duplicate-checking code in the MCA var base: if someone sets a value for `FRAMEWORK_COMPONENT_VAR`, and there's more than one of those (i.e., the same framework/component/var in two different projects, and the project name was not specified), then we need to error and let a human figure it out.
- Plans for folding the `ompi-release` Github repo back into the `ompi` Github repo
- (Possibly) Remove atomics from `OBJ_RETAIN`/`OBJ_RELEASE` in the `THREAD_SINGLE` case.
- @nysal said he would look at this.
- See https://github.com/open-mpi/ompi/issues/1902.
- NTH: This is already done.
- Continue `--net` `mpirun` CLI option discussion from Feb 2016 meeting
- Originally an IBM proposal.
- Tied to issues of "I just want to use network X" user intent, without needing to educate users on the complexities of PML, MTL, BTL, COLL, ...etc.
- We didn't come to any firm conclusions in February.
- From the meeting:
- There was a long discussion about this in the meeting; see the meeting minutes for more detail.
- MPI_Reduce_Local - move into coll framework.
- From the meeting:
- It isn't in the coll framework already simply because it isn't a collective.
- But IBM would like to have multiple backends to MPI_REDUCE_LOCAL
- The OMPI Way to do this is with a framework / component
- Seems like overkill to have a new framework just for this one MPI function
- So it seems ok to add it to the coll framework