The `tf.distribute.Strategy` API allows TensorFlow developers to distribute model training across multiple GPUs/TPUs and machines. This repository implements the Metaflow `@tensorflow` decorator, which sets up a multi-node Metaflow step to use this functionality.
Install this experimental module:
pip install metaflow-tensorflow
This package adds a Metaflow extension to your existing Metaflow installation, so you can import and use the `tensorflow` decorator:
from metaflow import FlowSpec, step, tensorflow, ...
The rest of this `README.md` file describes how to use TensorFlow with Metaflow in the single-node case and in the multi-node case, which requires `@tensorflow`.
The examples in this repository are based on the original TensorFlow Examples.
Directory | TensorFlow script description |
---|---|
MirroredStrategy | Synchronous distributed training on multiple GPUs on one machine. |
MultiWorkerMirroredStrategy | Synchronous distributed training across multiple workers, each with potentially multiple GPUs. |
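
For the multi-worker case, TensorFlow typically learns the cluster layout from the `TF_CONFIG` environment variable, a JSON document listing every worker address and the index of the current task. The sketch below only illustrates that format; the helper name `build_tf_config` and the host addresses are hypothetical, and it makes no claim about how the `@tensorflow` decorator fills in this variable.

```python
import json
import os


def build_tf_config(worker_hosts, task_index):
    """Return a TF_CONFIG JSON string for one worker of a multi-worker job.

    Hypothetical helper for illustration only; not part of this package.
    """
    if not 0 <= task_index < len(worker_hosts):
        raise ValueError("task_index must identify one of worker_hosts")
    config = {
        # Every worker sees the same cluster description...
        "cluster": {"worker": list(worker_hosts)},
        # ...and differs only in its own task index.
        "task": {"type": "worker", "index": task_index},
    }
    return json.dumps(config)


# Assumed two-node cluster; real addresses come from your scheduler.
hosts = ["10.0.0.1:12345", "10.0.0.2:12345"]

# Set the variable before a multi-worker strategy is constructed.
os.environ["TF_CONFIG"] = build_tf_config(hosts, task_index=0)
```

Each node would run the same script with a different `task_index`, so that a `MultiWorkerMirroredStrategy` created afterwards can locate its peers.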
Not yet tested. Please reach out to the Outerbounds team if you need help.
From the TensorFlow documentation: Do not install TensorFlow with conda. It may not have the latest stable version. pip is recommended since TensorFlow is only officially released to PyPI.
We have found that the easiest way to install TensorFlow for GPU is to use the pre-made Docker image `tensorflow/tensorflow:latest-gpu`.
See the TensorFlow documentation on fault tolerance.
The TL;DR is to use a flavor of `tf.distribute.Strategy`, which implements mechanisms to handle worker failures gracefully.
`metaflow-tensorflow` is distributed under the Apache License.