
Horovod

Horovod is a distributed deep learning training framework for TensorFlow, Keras, PyTorch, and Apache MXNet. The goal of Horovod is to make distributed deep learning fast and easy to use.

This version of Horovod is SHMEM-enabled and was built by Collin Abidi at NSF SHREC, University of Pittsburgh.

Horovod is hosted by the LF AI & Data Foundation (LF AI & Data). If you are a company that is deeply committed to using open source technologies in artificial intelligence, machine learning, and deep learning, and want to support the communities of open source projects in these domains, consider joining the LF AI & Data Foundation. For details about who's involved and how Horovod plays a role, read the Linux Foundation announcement.

Dependencies

These are the versions of packages and libraries that are required, or known to work, when using Shmorovod. Python packages are installed using pip.

  • Horovod: 0.19.2
  • OpenMPI: 4.0.3
  • UCX
  • OpenSHMEM: 1.4 (included with OpenMPI >= 4.0)
  • gcc: 8.0.3
  • torch: 1.7.0
  • torchvision: 0.8.0
  • tensorflow: 1.x.x, 2.0.0
  • h5py: 2.10.0
  • cffi: 1.14.3
  • cloudpickle: 1.6.0
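
As a convenience, the pip-installed packages above can be pinned in a single command. This is a sketch based on the version list; add the TensorFlow variant you need separately:

    $ pip install torch==1.7.0 torchvision==0.8.0 h5py==2.10.0 \
          cffi==1.14.3 cloudpickle==1.6.0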

Install

  1. Install CMake.

  2. If you've installed TensorFlow from PyPI, make sure that g++-4.8.5 or g++-4.9 or above is installed.

     If you've installed PyTorch from PyPI, make sure that g++-4.9 or above is installed.

     If you've installed either package from Conda, make sure that the gxx_linux-64 Conda package is installed.

  3. Install UCX and OpenMPI with UCX support, following the instructions at https://github.com/openucx/ucx/wiki/OpenMPI-and-OpenSHMEM-installation-with-UCX (see the sketch after this list).

  4. Install SHMEM-based Horovod from source by downloading the repository from GitHub:

     $ git clone --recursive https://github.com/collinabidi/horovod
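
The UCX wiki linked in step 3 has the authoritative instructions. As a rough sketch (assuming release tarballs, with placeholder install prefixes), UCX is built first and OpenMPI is then configured against it so that its OpenSHMEM implementation uses UCX:

     $ cd ucx-<version> && ./configure --prefix=$HOME/ucx && make -j install
     $ cd openmpi-4.0.3 && ./configure --prefix=$HOME/openmpi \
           --with-ucx=$HOME/ucx && make -j install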

Build

  1. Enable PyTorch and/or TensorFlow. Modify the build_mpi.sh and build_shmem.sh scripts to include the proper flags. To build with PyTorch, make sure HOROVOD_WITH_PYTORCH=1 appears on each line of build_mpi.sh and build_shmem.sh; to build with TensorFlow, make sure HOROVOD_WITH_TENSORFLOW=1 appears on each line of build_mpi.sh and build_shmem.sh. To build without one of them, add the HOROVOD_WITHOUT_TENSORFLOW=1 or HOROVOD_WITHOUT_PYTORCH=1 flag instead (a sketch of a typical script line follows this section). These are the same flags used when pip-installing upstream Horovod, for example:

    $ HOROVOD_GPU_OPERATIONS=NCCL pip install horovod

For more details on installing Horovod with GPU support, read Horovod on GPU.

For the full list of Horovod installation options, read the Installation Guide.

If you want to use MPI, read Horovod with MPI.

If you want to use Conda, read Building a Conda environment with GPU support for Horovod.

  2. Build Horovod with MPI or SHMEM.

    To build Horovod with MPI enabled, run the build_mpi.sh script.

    $ ./build_mpi.sh

    To build Horovod with SHMEM enabled, run the build_shmem.sh script.

    $ ./build_shmem.sh
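
For reference, a line in build_mpi.sh might resemble the following. This is a hypothetical sketch using upstream Horovod's documented build flags; the actual lines in the fork's scripts may differ:

    # Hypothetical example: PyTorch on, TensorFlow off, MPI controller on,
    # building from the root of the cloned repository.
    $ HOROVOD_WITH_PYTORCH=1 HOROVOD_WITHOUT_TENSORFLOW=1 HOROVOD_WITH_MPI=1 \
          pip install --no-cache-dir -v .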

Usage

  1. Run Horovod with SLURM.

    If you use SLURM to submit jobs, simply modify the included SLURM script to fit your cluster's configuration (a sketch of such a script follows this list). Make sure to correctly load your environment before executing anything.

    $ sbatch hvd_test_2.slurm
  2. Run Horovod without SLURM.

    If you have admin access to your cluster, you can copy the SLURM script into a shell script, remove the variables at the top, and execute it normally using the horovodrun or oshrun (SHMEM-specific) commands.

    The following is an example of running the pytorch_mnist.py script included in the examples folder on 2 processes (the -np 2 argument). The SHMEM-enabled version of Horovod requires several command-line arguments that may vary from system to system.

    $ oshrun -np 2 -x --mca mpi_cuda_support 0 \
      --mca pml ucx --mca osc ucx \
      --mca atomic ucx --mca orte_base_help_aggregate 0 \
      --mca btl ^vader,tcp,openib,uct python3 pytorch_basic.py --epochs 1 --no-cuda
    With the MPI build, the horovodrun wrapper can be used instead, for example to launch 16 processes across four servers with 4 GPUs each:

     $ horovodrun -np 16 -H server1:4,server2:4,server3:4,server4:4 python train.py
  3. To run using Open MPI without the horovodrun wrapper, see Running Horovod with Open MPI.

  4. To run in Docker, see Horovod in Docker.

  5. To run in Kubernetes, see Kubeflow, MPI Operator, Helm Chart, FfDL, and Polyaxon.

  6. To run on Spark, see Horovod on Spark.

  7. To run on Ray, see Horovod on Ray.

  8. To run in Singularity, see Singularity.

  9. To run in an LSF HPC cluster (e.g. Summit), see LSF.
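
For item 1 above, the included hvd_test_2.slurm is cluster-specific. The following is a minimal hypothetical sketch of such a script (the node counts, module names, and paths are placeholders), ending in the oshrun launch from item 2:

    #!/bin/bash
    #SBATCH --job-name=hvd_shmem_test      # name shown in the queue
    #SBATCH --nodes=2                      # two hosts for the -np 2 example
    #SBATCH --ntasks-per-node=1
    #SBATCH --time=00:30:00                # walltime limit

    # Load your environment before executing anything (placeholders).
    module load gcc openmpi
    source $HOME/venv/bin/activate

    # Launch the SHMEM-enabled example with the arguments from item 2
    # (add -x VAR to export any environment variables the job needs).
    oshrun -np 2 --mca mpi_cuda_support 0 \
      --mca pml ucx --mca osc ucx \
      --mca atomic ucx --mca orte_base_help_aggregate 0 \
      --mca btl ^vader,tcp,openib,uct python3 pytorch_basic.py --epochs 1 --no-cuda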

Guides

  1. Run distributed training in Microsoft Azure using Batch AI and Horovod.
  2. Distributed model training using Horovod.

Send us links to any user guides you want to publish on this site.

Troubleshooting

See Troubleshooting and submit a ticket if you can't find an answer.

Citation

Please cite Horovod in your publications if it helps your research:

@article{sergeev2018horovod,
  Author = {Alexander Sergeev and Mike Del Balso},
  Journal = {arXiv preprint arXiv:1802.05799},
  Title = {Horovod: fast and easy distributed deep learning in {TensorFlow}},
  Year = {2018}
}

Publications

1. Sergeev, A., Del Balso, M. (2017) Meet Horovod: Uber’s Open Source Distributed Deep Learning Framework for TensorFlow. Retrieved from https://eng.uber.com/horovod/

2. Sergeev, A. (2017) Horovod - Distributed TensorFlow Made Easy. Retrieved from https://www.slideshare.net/AlexanderSergeev4/horovod-distributed-tensorflow-made-easy

3. Sergeev, A., Del Balso, M. (2018) Horovod: fast and easy distributed deep learning in TensorFlow. Retrieved from arXiv:1802.05799

References

The Horovod source code was based on the Baidu tensorflow-allreduce repository written by Andrew Gibiansky and Joel Hestness. Their original work is described in the article Bringing HPC Techniques to Deep Learning.
