Horovod is a distributed deep learning training framework for TensorFlow, Keras, PyTorch, and Apache MXNet. The goal of Horovod is to make distributed deep learning fast and easy to use.
This version of Horovod has SHMEM enabled and was built by Collin Abidi at NSF SHREC, University of Pittsburgh.
Horovod is hosted by the LF AI & Data Foundation (LF AI & Data). If you are a company that is deeply committed to using open source technologies in artificial intelligence, machine learning, and deep learning, and want to support the communities of open source projects in these domains, consider joining the LF AI & Data Foundation. For details about who's involved and how Horovod plays a role, read the Linux Foundation announcement.

The following package/library versions are required or known to work with Shmorovod. Python packages are installed using pip:
- Horovod: 0.19.2
- OpenMPI: 4.0.3
- UCX
- OpenSHMEM: 1.4 (included with OpenMPI >= 4.0)
- gcc: 8.0.3
- torch: 1.7.0
- torchvision: 0.8.0
- tensorflow: 1.x.x, 2.0.0
- h5py: 2.10.0
- cffi: 1.14.3
- cloudpickle: 1.6.0
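Since the Python packages above are installed with pip, a pinned install of the Python-level dependencies might look like the following (the exact pin set is an illustration based on the list above; add `tensorflow` only if you need it):

$ pip install torch==1.7.0 torchvision==0.8.0 h5py==2.10.0 cffi==1.14.3 cloudpickle==1.6.0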
- Install CMake
If you've installed TensorFlow from PyPI, make sure that `g++-4.8.5` or `g++-4.9` or above is installed. If you've installed PyTorch from PyPI, make sure that `g++-4.9` or above is installed. If you've installed either package from Conda, make sure that the `gxx_linux-64` Conda package is installed.
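Before building, it can help to sanity-check the toolchain versions (a quick optional check, not part of the original instructions):

$ cmake --version
$ g++ --version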
Install OpenMPI and UCX with OpenMPI following the instructions at https://github.com/openucx/ucx/wiki/OpenMPI-and-OpenSHMEM-installation-with-UCX.
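For reference, a minimal sketch of that build, assuming release sources and local install prefixes (paths and version numbers are placeholders; the linked wiki is the authoritative guide):

$ # UCX
$ git clone https://github.com/openucx/ucx.git && cd ucx
$ ./autogen.sh
$ ./contrib/configure-release --prefix=$HOME/ucx-install
$ make -j8 install && cd ..
$ # Open MPI with UCX support (the OpenSHMEM layer and oshrun ship with OpenMPI >= 4.0)
$ tar xzf openmpi-4.0.3.tar.gz && cd openmpi-4.0.3
$ ./configure --with-ucx=$HOME/ucx-install --prefix=$HOME/ompi-install
$ make -j8 install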
Install SHMEM-based Horovod from source.
Download the repository from GitHub.
$ git clone --recursive https://github.com/collinabidi/horovod
Enable PyTorch and/or TensorFlow. Modify the `build_mpi.sh` and `build_shmem.sh` scripts to include the proper flags. If you want to build with PyTorch, make sure that `HOROVOD_WITH_PYTORCH=1` is in each of the lines of `build_mpi.sh` and `build_shmem.sh`. If you want to build with TensorFlow, make sure that `HOROVOD_WITH_TENSORFLOW=1` is in each of the lines in `build_mpi.sh` and `build_shmem.sh`. If you want to build without one of them, add the `HOROVOD_WITHOUT_TENSORFLOW=1` or `HOROVOD_WITHOUT_PYTORCH=1` flag (an illustrative build line is sketched after the pointers below).

To install Horovod with GPU operations over NCCL:

$ HOROVOD_GPU_OPERATIONS=NCCL pip install horovod
For more details on installing Horovod with GPU support, read Horovod on GPU.
For the full list of Horovod installation options, read the Installation Guide.
If you want to use MPI, read Horovod with MPI.
If you want to use Conda, read Building a Conda environment with GPU support for Horovod.
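As a rough illustration of the build-script flags described above (the actual contents of `build_mpi.sh` and `build_shmem.sh` in this repository may differ, and the pip invocation is an assumption), a PyTorch-only build line run from the cloned repository root could look like:

$ HOROVOD_WITH_PYTORCH=1 HOROVOD_WITHOUT_TENSORFLOW=1 pip install --no-cache-dir -v .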
Build Horovod with MPI or SHMEM.
To build Horovod with MPI enabled, run the `build_mpi.sh` script:

$ ./build_mpi.sh

To build Horovod with SHMEM enabled, run the `build_shmem.sh` script:

$ ./build_shmem.sh
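After either script finishes, you can optionally confirm that the extension built against your framework (this check is not part of the original instructions; `horovodrun --check-build` exists in upstream Horovod 0.19 and is assumed to be preserved in this fork):

$ python3 -c "import horovod.torch as hvd; print('horovod.torch imported OK')"
$ horovodrun --check-build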
Run Horovod with SLURM.
If you use SLURM to submit jobs, simply modify the included SLURM script to fit your cluster's configuration. Make sure to correctly load your environment before executing anything.
$ sbatch hvd_test_2.slurm
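For orientation, a minimal sketch of the kind of batch script `hvd_test_2.slurm` could contain (the actual file in this repository may differ; node counts, module names, paths, and the virtualenv are placeholders for your site):

#!/bin/bash
#SBATCH --job-name=hvd_shmem_test
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --time=00:30:00

# Load the environment first (placeholders; use your cluster's modules/venv).
module load gcc openmpi
source $HOME/venvs/shmorovod/bin/activate

# One Horovod process per SLURM task, launched with the SHMEM-aware oshrun.
oshrun -np $SLURM_NTASKS \
    --mca pml ucx --mca osc ucx --mca atomic ucx \
    python3 example/pytorch_mnist.py --epochs 1 --no-cuda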
Run Horovod without SLURM.
If you have admin access to your cluster, you can copy the SLURM script into a shell script, remove the variables at the top, and execute it normally using the `horovodrun` or `oshrun` (SHMEM-specific) commands.

The following is an example of running the `pytorch_mnist.py` script included in the `example` folder on 2 nodes (denoted by the `-np 2` argument). The SHMEM-enabled version of Horovod has several necessary command-line arguments that may vary from system to system:

$ oshrun -np 2 -x --mca mpi_cuda_support 0 \
    --mca pml ucx --mca osc ucx \
    --mca atomic ucx --mca orte_base_help_aggregate 0 \
    --mca btl ^vader,tcp,openib,uct python3 pytorch_basic.py --epochs 1 --no-cuda

To run with the standard `horovodrun` launcher on 4 machines with 4 processes each:

$ horovodrun -np 16 -H server1:4,server2:4,server3:4,server4:4 python train.py
To run using Open MPI without the `horovodrun` wrapper, see Running Horovod with Open MPI.
To run in Docker, see Horovod in Docker.
To run in Kubernetes, see Kubeflow, MPI Operator, Helm Chart, FfDL, and Polyaxon.
To run on Spark, see Horovod on Spark.
To run on Ray, see Horovod on Ray.
To run in Singularity, see Singularity.
To run on an LSF HPC cluster (e.g. Summit), see LSF.
- Run distributed training in Microsoft Azure using Batch AI and Horovod.
- Distributed model training using Horovod.
Send us links to any user guides you want to publish on this site.
See Troubleshooting and submit a ticket if you can't find an answer.
Please cite Horovod in your publications if it helps your research:
@article{sergeev2018horovod,
  Author  = {Alexander Sergeev and Mike Del Balso},
  Journal = {arXiv preprint arXiv:1802.05799},
  Title   = {Horovod: fast and easy distributed deep learning in {TensorFlow}},
  Year    = {2018}
}
1. Sergeev, A., Del Balso, M. (2017) Meet Horovod: Uber’s Open Source Distributed Deep Learning Framework for TensorFlow. Retrieved from https://eng.uber.com/horovod/
2. Sergeev, A. (2017) Horovod - Distributed TensorFlow Made Easy. Retrieved from https://www.slideshare.net/AlexanderSergeev4/horovod-distributed-tensorflow-made-easy
3. Sergeev, A., Del Balso, M. (2018) Horovod: fast and easy distributed deep learning in TensorFlow. Retrieved from arXiv:1802.05799
The Horovod source code was based off the Baidu tensorflow-allreduce repository written by Andrew Gibiansky and Joel Hestness. Their original work is described in the article Bringing HPC Techniques to Deep Learning.
- Community Slack for collaboration and discussion
- Horovod Announce for updates on the project
- Horovod Technical-Discuss for public discussion