Skip to content

ML-Perf HPC WG Implementation of Mesh-Tensorflow and (buildscripts) for Tensorflow with MPI

Notifications You must be signed in to change notification settings

jbalma/mesh-transformer-mpi

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

4 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

mesh-transformer-mpi

This repo is to track progress of building Tensorflow with MPI support.

The goal is to ensure Tensorflow maps network communication operations to MPI calls, such that the ops used by mesh-tensorflow get mapped to MPI calls as well.

./common-problems.txt: Common problems I've hit while trying to against MPI. Directions to changing tensorflow/configure.py ./tests: Tests for making sure calls are actually using mpi ./runscripts: Launch scripts for Mesh-Tensorflow

Step 1) Setup Configure

-> git clone tensorflow
-> git checkout r1.13
-> If using Cray-MPICH, cp configure.py tensorflow/
-> If using OpenMPI, use default configure.py

Step 2) Build it

-> Edit the paths in buildit.sh to point to the correct MPI dirs and bazel temp dirs
-> run buildit.sh like: ./buildit.sh
-> Answer the questions about the path to Python
-> Answer Yes (y) to the question regarding MPI build support, no additional input should be required for that item if you setup the path correctly

Step 3) Fix common issues are having undeclared dependencies for MPI headers -> Most can be fixed by editing the bazel build files: -> tensorflow/contrib/mpi/BUILD -> tensorflow/contrib/mpi_collectives/BUILD -> Declare dependencies where needed. This has only been necessary when headers outside of mpi.h are required (e.g. MPICH)

Step 4) Once the build completes, do: -> pip install mesh-tensorflow --user -> cd /tests -> Edit runit.sh to point to the correct directories (requires srun / SLURM) -> If you build on Cray XC, the perftools-lite module should be loaded -> Once the test run completes, perftools should produce a profile which shows MPI calls map to tensorflow ops

About

ML-Perf HPC WG Implementation of Mesh-Tensorflow and (buildscripts) for Tensorflow with MPI

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published