
Horovod

Horovod is a distributed deep learning training framework for TensorFlow, Keras, PyTorch, and Apache MXNet. The goal of Horovod is to make distributed deep learning fast and easy to use.

This version of Horovod is SHMEM-enabled and was built by Collin Abidi at NSF SHREC, University of Pittsburgh.

Horovod is hosted by the LF AI & Data Foundation (LF AI & Data). If you are a company that is deeply committed to using open source technologies in artificial intelligence, machine learning, and deep learning, and want to support the communities of open source projects in these domains, consider joining the LF AI & Data Foundation. For details about who's involved and how Horovod plays a role, read the Linux Foundation announcement.

Dependencies

These are the versions of packages and libraries that are required, or known to work, when using Shmorovod. Python packages are installed using pip.

  • Horovod: 0.19.2
  • OpenMPI: 4.0.3
  • UCX
  • OpenSHMEM: 1.4 (included with OpenMPI >= 4.0)
  • gcc: 8.0.3
  • torch: 1.7.0
  • torchvision: 0.8.0
  • tensorflow: 1.x.x, 2.0.0
  • h5py: 2.10.0
  • cffi: 1.14.3
  • cloudpickle: 1.6.0
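
As a convenience, the pip-installed packages above can be pinned in a single command. This is a sketch based on the version list; add the TensorFlow variant you need separately:

    $ pip install torch==1.7.0 torchvision==0.8.0 h5py==2.10.0 \
          cffi==1.14.3 cloudpickle==1.6.0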

Install

  1. Install CMake.

  2. If you've installed TensorFlow from PyPI, make sure that g++-4.8.5 or g++-4.9 or above is installed.

     If you've installed PyTorch from PyPI, make sure that g++-4.9 or above is installed.

     If you've installed either package from Conda, make sure that the gxx_linux-64 Conda package is installed.

  3. Install UCX and OpenMPI with UCX support, following the instructions at https://github.com/openucx/ucx/wiki/OpenMPI-and-OpenSHMEM-installation-with-UCX (see the sketch after this list).

  4. Install SHMEM-based Horovod from source by downloading the repository from GitHub:

     $ git clone --recursive https://github.com/collinabidi/horovod
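
The UCX wiki linked in step 3 has the authoritative instructions. As a rough sketch (assuming release tarballs, with placeholder install prefixes), UCX is built first and OpenMPI is then configured against it so that its OpenSHMEM implementation uses UCX:

     $ cd ucx-<version> && ./configure --prefix=$HOME/ucx && make -j install
     $ cd openmpi-4.0.3 && ./configure --prefix=$HOME/openmpi \
           --with-ucx=$HOME/ucx && make -j install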

Build

  1. Enable PyTorch and/or TensorFlow. Modify the build_mpi.sh and build_shmem.sh scripts to include the proper flags. To build with PyTorch, make sure HOROVOD_WITH_PYTORCH=1 appears on each line of build_mpi.sh and build_shmem.sh; to build with TensorFlow, make sure HOROVOD_WITH_TENSORFLOW=1 appears on each line of build_mpi.sh and build_shmem.sh. To build without one of them, add the HOROVOD_WITHOUT_TENSORFLOW=1 or HOROVOD_WITHOUT_PYTORCH=1 flag instead (a sketch of a typical script line follows this section). These are the same flags used when pip-installing upstream Horovod, for example:

    $ HOROVOD_GPU_OPERATIONS=NCCL pip install horovod

For more details on installing Horovod with GPU support, read Horovod on GPU.

For the full list of Horovod installation options, read the Installation Guide.

If you want to use MPI, read Horovod with MPI.

If you want to use Conda, read Building a Conda environment with GPU support for Horovod.

  2. Build Horovod with MPI or SHMEM.

    To build Horovod with MPI enabled, run the build_mpi.sh script.

    $ ./build_mpi.sh

    To build Horovod with SHMEM enabled, run the build_shmem.sh script.

    $ ./build_shmem.sh
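
For reference, a line in build_mpi.sh might resemble the following. This is a hypothetical sketch using upstream Horovod's documented build flags; the actual lines in the fork's scripts may differ:

    # Hypothetical example: PyTorch on, TensorFlow off, MPI controller on,
    # building from the root of the cloned repository.
    $ HOROVOD_WITH_PYTORCH=1 HOROVOD_WITHOUT_TENSORFLOW=1 HOROVOD_WITH_MPI=1 \
          pip install --no-cache-dir -v .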

Usage

  1. Run Horovod with SLURM.

    If you use SLURM to submit jobs, simply modify the included SLURM script to fit your cluster's configuration (a sketch of such a script follows this list). Make sure to correctly load your environment before executing anything.

    $ sbatch hvd_test_2.slurm
  2. Run Horovod without SLURM.

    If you have admin access to your cluster, you can copy the SLURM script into a shell script, remove the variables at the top, and execute it normally using the horovodrun or oshrun (SHMEM-specific) commands.

    The following is an example of running the pytorch_mnist.py script included in the examples folder on 2 processes (the -np 2 argument). The SHMEM-enabled version of Horovod requires several command-line arguments that may vary from system to system.

    $ oshrun -np 2 -x --mca mpi_cuda_support 0 \
      --mca pml ucx --mca osc ucx \
      --mca atomic ucx --mca orte_base_help_aggregate 0 \
      --mca btl ^vader,tcp,openib,uct python3 pytorch_basic.py --epochs 1 --no-cuda
    With the MPI build, the horovodrun wrapper can be used instead, for example to launch 16 processes across four servers with 4 GPUs each:

     $ horovodrun -np 16 -H server1:4,server2:4,server3:4,server4:4 python train.py
  3. To run using Open MPI without the horovodrun wrapper, see Running Horovod with Open MPI.

  4. To run in Docker, see Horovod in Docker.

  5. To run in Kubernetes, see Kubeflow, MPI Operator, Helm Chart, FfDL, and Polyaxon.

  6. To run on Spark, see Horovod on Spark.

  7. To run on Ray, see Horovod on Ray.

  8. To run in Singularity, see Singularity.

  9. To run in an LSF HPC cluster (e.g. Summit), see LSF.
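
For item 1 above, the included hvd_test_2.slurm is cluster-specific. The following is a minimal hypothetical sketch of such a script (the node counts, module names, and paths are placeholders), ending in the oshrun launch from item 2:

    #!/bin/bash
    #SBATCH --job-name=hvd_shmem_test      # name shown in the queue
    #SBATCH --nodes=2                      # two hosts for the -np 2 example
    #SBATCH --ntasks-per-node=1
    #SBATCH --time=00:30:00                # walltime limit

    # Load your environment before executing anything (placeholders).
    module load gcc openmpi
    source $HOME/venv/bin/activate

    # Launch the SHMEM-enabled example with the arguments from item 2
    # (add -x VAR to export any environment variables the job needs).
    oshrun -np 2 --mca mpi_cuda_support 0 \
      --mca pml ucx --mca osc ucx \
      --mca atomic ucx --mca orte_base_help_aggregate 0 \
      --mca btl ^vader,tcp,openib,uct python3 pytorch_basic.py --epochs 1 --no-cuda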

Guides

  1. Run distributed training in Microsoft Azure using Batch AI and Horovod.
  2. Distributed model training using Horovod.

Send us links to any user guides you want to publish on this site.

Troubleshooting

See Troubleshooting and submit a ticket if you can't find an answer.

Citation

Please cite Horovod in your publications if it helps your research:

@article{sergeev2018horovod,
  Author = {Alexander Sergeev and Mike Del Balso},
  Journal = {arXiv preprint arXiv:1802.05799},
  Title = {Horovod: fast and easy distributed deep learning in {TensorFlow}},
  Year = {2018}
}

Publications

1. Sergeev, A., Del Balso, M. (2017) Meet Horovod: Uber’s Open Source Distributed Deep Learning Framework for TensorFlow. Retrieved from https://eng.uber.com/horovod/

2. Sergeev, A. (2017) Horovod - Distributed TensorFlow Made Easy. Retrieved from https://www.slideshare.net/AlexanderSergeev4/horovod-distributed-tensorflow-made-easy

3. Sergeev, A., Del Balso, M. (2018) Horovod: fast and easy distributed deep learning in TensorFlow. Retrieved from arXiv:1802.05799

References

The Horovod source code was based on the Baidu tensorflow-allreduce repository written by Andrew Gibiansky and Joel Hestness. Their original work is described in the article Bringing HPC Techniques to Deep Learning.
