`metaflow-ray` is an extension for Metaflow that enables seamless integration with Ray, allowing you to leverage Ray's distributed computing capabilities within your Metaflow flows. With `metaflow-ray`, you can spin up ephemeral Ray clusters on AWS Batch or Kubernetes directly from your Metaflow steps using the `@metaflow_ray` decorator, and run Ray applications that leverage Ray Core, Ray Train, Ray Tune, and Ray Data within your flow.
- Effortless Ray Integration: This extension provides a simple and intuitive way to incorporate Ray into your Metaflow workflows using the `@metaflow_ray` decorator.
- Elastic Ephemeral Ray Clusters: Let Metaflow orchestrate the creation of ephemeral Ray clusters on top of either:
  - AWS Batch multi-node parallel jobs
  - Kubernetes JobSets
- Seamless Ray Initialization: The `@metaflow_ray` decorator handles the initialization of the Ray cluster for you, so you can focus on writing your Ray code without worrying about cluster setup.
- Wide Range of Applications: Run a wide variety of Ray applications, including hyperparameter tuning, distributed data processing, and distributed training.
You can install `metaflow-ray` via `pip` alongside your existing Metaflow installation:

```bash
pip install metaflow-ray
```
- Import the `@metaflow_ray` decorator to enable the integration:

  ```python
  from metaflow import metaflow_ray
  ```
- Decorate your step with `@metaflow_ray` and initialize Ray within your step:

  ```python
  @step
  def start(self):
      self.next(self.train, num_parallel=NUM_NODES)

  @metaflow_ray
  @pypi(packages={"ray": "2.39.0"})
  @batch(**RESOURCES)  # you can even use @kubernetes
  @step
  def train(self):
      import ray
      ray.init()
      # Your step's training code here
      self.next(self.join)

  @step
  def join(self, inputs):
      self.next(self.end)

  @step
  def end(self):
      pass
  ```
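What goes inside the `train` step is ordinary Ray code. As a quick illustration, here is a minimal sketch of what its body could look like once `ray.init()` has connected to the cluster formed by `@metaflow_ray`; the `square` task and the `results` artifact are hypothetical examples, not part of the extension's API:

```python
@metaflow_ray
@pypi(packages={"ray": "2.39.0"})
@batch(**RESOURCES)
@step
def train(self):
    import ray

    ray.init()  # connects to the ephemeral Ray cluster formed by @metaflow_ray

    # Hypothetical Ray Core example: fan a simple task out across the cluster.
    @ray.remote
    def square(x):
        return x * x

    # Each .remote() call is scheduled on the cluster; ray.get gathers the results.
    self.results = ray.get([square.remote(i) for i in range(100)])
    self.next(self.join)
```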
Some things to keep in mind when using `@metaflow_ray`:

- The `num_parallel` argument must always be specified in the step preceding the transition to a step decorated with `@metaflow_ray`. In the example above, the `start` step transitions to the `train` step and includes the `num_parallel` argument because the `train` step is decorated with `@metaflow_ray`. This ensures the `train` step can execute in parallel as intended.
- As a consequence, there must always exist a corresponding `join` step, as highlighted in the snippet above.
- For remote execution environments (i.e. when `@metaflow_ray` is used in conjunction with `@batch` or `@kubernetes`), the value of `num_parallel` should be greater than 1, i.e. at least 2. However, when using the `@metaflow_ray` decorator in a standalone manner, `num_parallel` cannot be greater than 1 on Windows and macOS, because locally spun-up Ray clusters do not support multiple nodes unless the underlying OS is Linux-based.
- Ideally, `ray` should be available in the remote execution environments. If not, one can always use the `@pypi` decorator to introduce `ray` as a dependency.
- If the `@metaflow_ray` decorator is used in a local context, i.e. without `@batch` or `@kubernetes`, a local Ray cluster is spun up, provided that the `ray` library (installable via `pip install ray`) is available in the underlying Python environment. Running the flow again locally could then fail with:

  ```
  ConnectionError: Ray is trying to start at 127.0.0.1:6379, but is already running at 127.0.0.1:6379.
  Please specify a different port using the `--port` flag of `ray start` command.
  ```

  One can simply run `ray stop` in another terminal to terminate the Ray cluster that was spun up locally.
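For local experimentation, a complete minimal flow could look like the sketch below; the flow name and file name are hypothetical, and `num_parallel` is kept at 1 because, as noted above, multi-node local clusters are only supported on Linux:

```python
from metaflow import FlowSpec, step, metaflow_ray

class RayHelloFlow(FlowSpec):
    @step
    def start(self):
        # Keep num_parallel at 1 for local runs (see the notes above);
        # raise it to 2+ only when running remotely with @batch or @kubernetes.
        self.next(self.train, num_parallel=1)

    @metaflow_ray
    @step
    def train(self):
        import ray
        ray.init()  # connects to the locally spun-up Ray cluster
        print(ray.cluster_resources())
        self.next(self.join)

    @step
    def join(self, inputs):
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    RayHelloFlow()
```

Assuming this is saved as, say, `ray_hello_flow.py`, it can be run with `python ray_hello_flow.py run`; remember to run `ray stop` between local runs if you hit the `ConnectionError` above.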
Check out the examples directory for sample Metaflow flows that demonstrate how to use the `metaflow-ray` extension with various Ray applications.
| Directory | Description |
| --- | --- |
| Counter | Run a basic Counter with Ray that increments in Python, then do it inside a Metaflow task! |
| Process Dataframe | Process a large dataframe in chunks with Ray and Python, then do it inside a Metaflow task! |
| Custom Docker Images | Specify custom Docker images on Kubernetes / Batch with Ray on Metaflow. |
| Train XGBoost | Use Ray Train to build XGBoost models on multiple nodes, including CPU and GPU examples. |
| Tune PyTorch | Use Ray Tune to build PyTorch models on multiple nodes, including CPU and GPU examples. |
| PyTorch Lightning | Get started with running a PyTorch Lightning job on the Ray cluster formed in a `@metaflow_ray` step. |
| GPT-J Fine Tuning | Fine-tune the 6B-parameter GPT-J model on a Ray cluster. |
| vLLM Inference | Run inference on Llama models with vLLM and Ray via Metaflow. |
| End-to-end Batch Workflow | Train models, evaluate them, and serve them. See how to use Metaflow workflows and various Ray abstractions together in a complete workflow. |
`metaflow-ray` is distributed under the Apache License.