
Metaflow @checkpoint/@model/@huggingface_hub Examples

Long-running data processing and machine learning jobs often present several challenges:

  1. Failure Recovery: Recovering from a failure partway through a long job can be painful and time-consuming.

    • Example: Suppose you're training a deep learning model that takes 12 hours to complete. If the process crashes at the 10-hour mark due to a transient error, without checkpoints, you'd have to restart the entire training from scratch.
    • Example: During data preprocessing, you generate intermediate datasets like tokenized text or transformed images. Losing these intermediates means re-running expensive computations, which can be especially problematic if they took hours to create.
  2. External Dependencies: Jobs may require large external data (e.g., pre-trained models) that are cumbersome to manage.

    • Example: Loading a pre-trained transformer model from Hugging Face Hub can take a significant amount of time and bandwidth. If this model isn't cached, every run or worker node (in a distributed training context) would need to download it separately, leading to inefficiencies.
  3. Version Control in Multi-User Environments: Managing checkpoints and models in a multi-user setting requires proper version control to prevent overwriting and ensure correct loading during failure recovery.

    • Example: If multiple data scientists are training models and saving checkpoints to a shared storage, one user's checkpoint might accidentally overwrite another's. This can lead to confusion and loss of valuable work. Moreover, when a job resumes after a failure, it must load the correct checkpoint corresponding to that specific run and user.

To address these challenges, Metaflow introduces the @checkpoint, @model, and @huggingface_hub decorators, which simplify the process of saving and loading checkpoints and models within your flows. These decorators ensure that your long-running jobs can be resumed seamlessly after a failure, manage external dependencies efficiently, and maintain proper version control in collaborative environments.

This repository contains a gallery of examples demonstrating how to leverage @checkpoint, @model, and @huggingface_hub to overcome the aforementioned challenges. By exploring these examples, you'll learn practical ways to integrate checkpointing and model management into your workflows, enhancing robustness, efficiency, and collaboration.


Starter Examples

Basic Checkpointing with @checkpoint:

These starter examples introduce the fundamentals of checkpointing and model saving. They show how to apply @checkpoint in simple training workflows so that you can recover from failures without losing progress, and how @model makes saving and loading models and checkpoints straightforward.
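For a sense of the pattern, here is a minimal sketch of such a flow. It assumes the metaflow-checkpoint extension is installed and exposes the @checkpoint decorator together with current.checkpoint.save(...), current.checkpoint.is_loaded, and current.checkpoint.directory, as used throughout the examples in this repository; the training loop and file names are purely illustrative.

```python
# Minimal sketch of a checkpointed training step. Assumes the metaflow-checkpoint
# extension is installed and injects @checkpoint plus the current.checkpoint API.
import json
import os

from metaflow import FlowSpec, step, current, retry, checkpoint


class BasicCheckpointFlow(FlowSpec):

    @retry(times=2)   # rerun the task on transient failures
    @checkpoint       # persist/restore checkpoints across those retries
    @step
    def start(self):
        state = {"epoch": 0}
        # If a previous attempt of this task saved a checkpoint, resume from it.
        if current.checkpoint.is_loaded:
            with open(os.path.join(current.checkpoint.directory, "state.json")) as f:
                state = json.load(f)

        for epoch in range(state["epoch"], 10):
            # ... one epoch of (hypothetical) training work ...
            state["epoch"] = epoch + 1
            with open("state.json", "w") as f:
                json.dump(state, f)
            # Save the current working state as the latest checkpoint.
            current.checkpoint.save("state.json")
        self.next(self.end)

    @step
    def end(self):
        print("done")


if __name__ == "__main__":
    BasicCheckpointFlow()
```

If the task crashes mid-loop, the retry starts from the last saved epoch instead of epoch 0, which is exactly the 12-hour-training scenario described above.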


Intermediate Examples

Checkpointing with Large Models and Managing External Dependencies:

These intermediate examples dive into more complex scenarios where managing large external models becomes crucial. You'll learn how to use @checkpoint and @model alongside external resources such as the Hugging Face Hub (via @huggingface_hub).
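The sketch below shows the shape of that pattern. It assumes the interfaces used by the examples in this repository: current.huggingface_hub.snapshot_download(...) caches a Hub repo and returns a reference, and @model(load=[...]) materializes that reference on local disk for a later step. The repo id and artifact names here are illustrative, not prescriptive.

```python
# Illustrative sketch: cache a pre-trained model from the Hugging Face Hub once,
# then load it in a downstream step. Interface names follow this repo's examples.
from metaflow import FlowSpec, step, current, huggingface_hub, model


class HuggingFaceCacheFlow(FlowSpec):

    @huggingface_hub
    @step
    def start(self):
        # Download (or reuse the cached copy of) a pre-trained model and keep a
        # reference to it as a task artifact. The repo id is illustrative.
        self.base_model = current.huggingface_hub.snapshot_download(
            repo_id="bert-base-uncased"
        )
        self.next(self.finetune)

    # Materialize the cached model on disk before the step code runs.
    @model(load=["base_model"])
    @step
    def finetune(self):
        model_path = current.model.loaded["base_model"]
        print("model files available at", model_path)
        # ... fine-tune here, then persist the result as a versioned model,
        # e.g. with current.model.save(<path-to-fine-tuned-model>) ...
        self.next(self.end)

    @step
    def end(self):
        print("done")


if __name__ == "__main__":
    HuggingFaceCacheFlow()
```

Because the download is cached and referenced rather than re-fetched, repeated runs (and, in distributed settings, additional workers) avoid pulling the same multi-gigabyte model over and over.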


Advanced Examples

Checkpointing and Failure Recovery in Distributed Training Environments:

The advanced examples focus on distributed training environments where the complexity of failure recovery and model management increases. You'll explore how @checkpoint facilitates seamless recovery across multiple nodes.
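As a shape-only sketch of that pattern: every worker can resume from the restored checkpoint, while only one worker writes new ones. The distributed launcher decorator used in the actual examples is omitted here, the worker rank is read from an environment variable purely for illustration, and the same current.checkpoint interface as in the starter sketch is assumed.

```python
# Shape-only sketch of checkpointing in a distributed training step: rank 0
# writes checkpoints, every worker resumes from the restored one after a retry.
import json
import os

from metaflow import FlowSpec, step, current, retry, checkpoint


class DistributedCheckpointSketch(FlowSpec):

    @step
    def start(self):
        self.next(self.train)

    @retry(times=2)
    @checkpoint
    @step
    def train(self):
        rank = int(os.environ.get("RANK", "0"))  # set by the launcher in real runs
        start_epoch = 0
        if current.checkpoint.is_loaded:
            # All workers resume from the same restored checkpoint directory.
            with open(os.path.join(current.checkpoint.directory, "state.json")) as f:
                start_epoch = json.load(f)["epoch"]

        for epoch in range(start_epoch, 20):
            # ... synchronized distributed training for one epoch ...
            if rank == 0:
                with open("state.json", "w") as f:
                    json.dump({"epoch": epoch + 1}, f)
                current.checkpoint.save("state.json")
        self.next(self.end)

    @step
    def end(self):
        print("done")


if __name__ == "__main__":
    DistributedCheckpointSketch()
```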
