Skip to content

petuum/autodist

Repository files navigation

pipeline status coverage report pypi version

Documentation | Examples

AutoDist is a distributed deep learning training engine for TensorFlow. AutoDist provides a user-friendly interface to distribute the training of a wide variety deep learning models across many GPUs with scalability and minimal code change.

Introduction

Different from specialized distributed ML systems, AutoDist is created to speed up a broad range of DL models with excellent all-round performance. AutoDist achieves this goal by:

  • Compilation: AutoDist expresses the parallelization of DL models as a standardized compilation process, optimizing multiple dimensions of ML parallelization including synchronization, partitioning, placement etc.
  • Composable architecture: AutoDist contains a flexible backend that can express various different ML parallelization techniques and allows for composing distribution strategies that blend different distributed ML system architectures.
  • Model and resource awareness: Based on the compilation process, AutoDist analyzes the model and generates more optimal distribution strategies that adapt to both the model properties and the cluster specification.

Besides all these advanced features, AutoDist is designed to isolate the sophistication of distributed systems from ML prototyping and exposes a simple API that makes it easy to use and switch between different distributed ML techniques for users of all levels.

For a closer look at the performance, please refer to our doc.

Using AutoDist

Installation:

pip install autodist

Modifying existing TensorFlow code to use AutoDist is easy:

import tensorflow as tf
from autodist import AutoDist

ad = AutoDist(resource_spec_file="resource_spec.yml")

with tf.Graph().as_default(), ad.scope():
    ########################################################
    # Build your (single-device) model here,
    #   and train it distributedly.
    ########################################################
    sess = ad.create_distributed_session()
    sess.run(...)

Ready to try? Please refer to the examples in our Getting Started page.

References & Acknowledgements

We learned and borrowed insights from a few open source projects including Horovod, Parallax, and tf.distribute.