Long-running data processing and machine learning jobs often present several challenges:
- Failure Recovery: Recovering from failures can be painful and time-consuming.
  - Example: Suppose you're training a deep learning model that takes 12 hours to complete. If the process crashes at the 10-hour mark due to a transient error, without checkpoints you'd have to restart the entire training from scratch.
  - Example: During data preprocessing, you generate intermediate datasets like tokenized text or transformed images. Losing these intermediates means re-running expensive computations, which can be especially problematic if they took hours to create.
- External Dependencies: Jobs may require large external data (e.g., pre-trained models) that is cumbersome to manage.
  - Example: Loading a pre-trained transformer model from the Hugging Face Hub can take significant time and bandwidth. If this model isn't cached, every run or worker node (in a distributed training context) needs to download it separately, leading to inefficiencies.
- Version Control in Multi-User Environments: Managing checkpoints and models in a multi-user setting requires proper version control to prevent overwriting and to ensure the correct artifacts are loaded during failure recovery.
  - Example: If multiple data scientists are training models and saving checkpoints to shared storage, one user's checkpoint might accidentally overwrite another's, leading to confusion and loss of valuable work. Moreover, when a job resumes after a failure, it must load the correct checkpoint corresponding to that specific run and user.
To address these challenges, Metaflow introduces the `@checkpoint`, `@model`, and `@huggingface_hub` decorators, which simplify the process of saving and loading checkpoints and models within your flows. These decorators ensure that your long-running jobs can be resumed seamlessly after a failure, manage external dependencies efficiently, and maintain proper version control in collaborative environments.
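For orientation, here is a minimal sketch of what a checkpointed training step can look like when `@checkpoint` is combined with `@retry`. The flow and step names, the JSON state file, and the exact attributes used on `current.checkpoint` (`is_loaded`, `directory`, `save`) are illustrative assumptions; the examples in this repository show the precise API in working code.

```python
from metaflow import FlowSpec, step, current, checkpoint, retry


class CheckpointedTrainingFlow(FlowSpec):

    @step
    def start(self):
        self.next(self.train)

    @retry(times=3)   # re-run the step on transient failures
    @checkpoint       # persist and reload checkpoints across attempts
    @step
    def train(self):
        import json
        import os

        state = {"epoch": 0}
        # If a checkpoint from a previous attempt was loaded, resume from it
        # instead of starting over (attribute names are assumptions).
        if current.checkpoint.is_loaded:
            with open(os.path.join(current.checkpoint.directory, "state.json")) as f:
                state = json.load(f)

        for epoch in range(state["epoch"], 10):
            # ... one epoch of training would run here ...
            state["epoch"] = epoch + 1
            with open("state.json", "w") as f:
                json.dump(state, f)
            # Persist progress so a retried attempt continues from this epoch.
            current.checkpoint.save("state.json")

        self.next(self.end)

    @step
    def end(self):
        pass


if __name__ == "__main__":
    CheckpointedTrainingFlow()
```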
This repository contains a gallery of examples demonstrating how to leverage `@checkpoint`, `@model`, and `@huggingface_hub` to overcome the aforementioned challenges. By exploring these examples, you'll learn practical ways to integrate checkpointing and model management into your workflows, enhancing robustness, efficiency, and collaboration.
Basic Checkpointing with `@checkpoint`:
- MNIST Training with Vanilla PyTorch
- MNIST Training with Keras
- MNIST Training with PyTorch Lightning
- MNIST Training with Hugging Face Transformers
- Saving XGBoost Models as Part of the Model Registry
These starter examples introduce the fundamentals of checkpointing and model saving. They show how to implement `@checkpoint` in simple training workflows, ensuring that you can recover from failures without losing progress. You'll also see how `@model` helps in saving and loading models and checkpoints effortlessly.
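To give a flavor of these starter examples, the sketch below saves an XGBoost model with `@model` in one step and loads it by artifact name in the next, in the spirit of the model-registry example above. The `load="trained_model"` argument and the `current.model.loaded` mapping are assumptions based on the pattern these examples follow; refer to the example code for the exact API.

```python
from metaflow import FlowSpec, step, current, model


class XGBoostModelFlow(FlowSpec):

    @step
    def start(self):
        self.next(self.train)

    @model
    @step
    def train(self):
        import numpy as np
        import xgboost as xgb

        # Tiny synthetic dataset, just to produce a model file.
        X = np.random.rand(200, 5)
        y = (X.sum(axis=1) > 2.5).astype(int)
        booster = xgb.train({"objective": "binary:logistic"},
                            xgb.DMatrix(X, label=y), num_boost_round=10)
        booster.save_model("model.json")

        # Register the saved file with @model; the returned reference is a
        # plain artifact that downstream steps can load by name (assumed API).
        self.trained_model = current.model.save("model.json")
        self.next(self.score)

    @model(load="trained_model")   # materialize the referenced model locally
    @step
    def score(self):
        # Local path where the saved model contents were placed (layout assumed);
        # you would load the booster from here before scoring.
        local_copy = current.model.loaded["trained_model"]
        print("model restored at", local_copy)
        self.next(self.end)

    @step
    def end(self):
        pass


if __name__ == "__main__":
    XGBoostModelFlow()
```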
Checkpointing with Large Models and Managing External Dependencies:
- Training LoRA Models with Hugging Face
- Training LoRA Models on NVIDIA GPU Cloud with `@nvidia`
- Generating Videos from Text Using Stable Diffusion XL and Stable Diffusion Video
These intermediate examples dive into more complex scenarios where managing large external models becomes crucial. You'll learn how to use `@checkpoint` and `@model` alongside external resources like the Hugging Face Hub (via `@huggingface_hub`).
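As an illustration of how `@huggingface_hub` fits in, the sketch below caches a pre-trained model from the Hugging Face Hub so that repeated runs and other workers reuse the cached copy instead of re-downloading it. The repo id is arbitrary, and the `snapshot_download`/`load` pattern is an assumption drawn from the style of these examples; see the LoRA examples for the exact usage.

```python
from metaflow import FlowSpec, step, current, huggingface_hub, model


class CachedPretrainedModelFlow(FlowSpec):

    @step
    def start(self):
        self.next(self.pull_base_model)

    @huggingface_hub
    @step
    def pull_base_model(self):
        # Download the repo once; the decorator caches it in the Metaflow
        # datastore so later runs and other workers skip the download.
        self.base_model = current.huggingface_hub.snapshot_download(
            repo_id="openai-community/gpt2",   # illustrative repo id
        )
        self.next(self.finetune)

    @model(load="base_model")   # materialize the cached weights on local disk
    @step
    def finetune(self):
        local_path = current.model.loaded["base_model"]
        # ... LoRA fine-tuning would start from local_path here ...
        print("base model available at", local_path)
        self.next(self.end)

    @step
    def end(self):
        pass


if __name__ == "__main__":
    CachedPretrainedModelFlow()
```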
Checkpointing and Failure Recovery in Distributed Training Environments:
The advanced examples focus on distributed training environments, where the complexity of failure recovery and model management increases. You'll explore how `@checkpoint` facilitates seamless recovery across multiple nodes; a rough sketch of the shape of such a flow follows.
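In the sketch below, a gang-scheduled training step lets only rank 0 write checkpoints, and a retried attempt resumes from the latest one. The `num_parallel` gang scheduling, `current.parallel.node_index`, and the rank-0 saving convention are assumptions for illustration; the advanced examples demonstrate the exact behavior in real distributed setups.

```python
from metaflow import FlowSpec, step, current, checkpoint, retry


class DistributedCheckpointFlow(FlowSpec):

    @step
    def start(self):
        # Gang-schedule four workers for the training step.
        self.next(self.train, num_parallel=4)

    @retry(times=2)   # a retried attempt resumes from the latest checkpoint
    @checkpoint
    @step
    def train(self):
        import json
        import os

        rank = current.parallel.node_index   # this worker's rank in the gang

        state = {"epoch": 0}
        if current.checkpoint.is_loaded:
            # Restore shared training state saved by a previous attempt
            # (whether every worker sees rank 0's checkpoint is assumed here).
            with open(os.path.join(current.checkpoint.directory, "state.json")) as f:
                state = json.load(f)

        for epoch in range(state["epoch"], 5):
            # ... one epoch of distributed training runs on every worker ...
            state["epoch"] = epoch + 1
            if rank == 0:
                # Only rank 0 writes checkpoints so workers do not race
                # each other when saving.
                with open("state.json", "w") as f:
                    json.dump(state, f)
                current.checkpoint.save("state.json")

        self.next(self.join)

    @step
    def join(self, inputs):
        self.next(self.end)

    @step
    def end(self):
        pass


if __name__ == "__main__":
    DistributedCheckpointFlow()
```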