The Resiliency library provides resilient training code used to improve the training goodput (also known as training efficiency) of distributed workloads. If you are unfamiliar with the term "goodput", please read the Google Cloud Goodput blog post. This library provides resiliency features built on top of NVIDIA NeMo 2.0 training recipes. Reproducible benchmarks demonstrating high training goodput with this library are defined in AI-Hypercomputer/gpu-recipes.
The scale of large distributed AI training jobs results in frequent failures and interruptions. This library exists to help minimize the impact of those failures and interruptions, ultimately improving goodput for training workloads on Google Cloud.
This library ties together several popular ML frameworks and adds features on top of them. The following diagram provides a visual representation of the ML resiliency stack and the benefits that come with each layer:
The key features of the resiliency library focus on optimizing checkpointing and increasing the resilience of the training infrastructure through holistic cluster control. The following sections dive deeper into these two focus areas. Please note that some of these features are still experimental.
This library provides several checkpointing optimizations:
- Emergency Checkpointing
- Optimized Asynchronous Checkpointing
- In-Cluster Checkpointing
Emergency Checkpointing allows a terminating workload to immediately save its current state before termination so that no training progress is lost. This functionality is defined in autocheckpoint.py.
Optimized asynchronous checkpointing builds on top of NeMo's asynchronous checkpointing functionality and utilizes several recent changes added to Torch Distributed Checkpointing. The changes supported here define a dedicated checkpointing subprocess that lives for the duration of the training job and asynchronously writes checkpoint state to file once that state has been offloaded from CUDA to host memory. This optimization also offloads checkpoint state as early as possible to minimize the time during which training is blocked. This functionality is defined in persistent_ckpt_proc.py and min_ckpt_overhead.py. To enable optimized asynchronous checkpointing, include the training workload flag --enable-optimized-async-ckpt.
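For example, if the workload flags are collected in a shell variable (as in the launch command shown later in this document), enabling the optimization is a matter of appending this single flag; a minimal sketch:

```
# Sketch: --enable-optimized-async-ckpt is a workload flag consumed by benchmark.py.
WORKLOAD_FLAGS="${WORKLOAD_FLAGS} --enable-optimized-async-ckpt"
```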
In-cluster checkpointing lets the workload save and load checkpoints without a global persistent storage system and is built on top of recent changes to Torch Distributed Checkpointing. Checkpointing within the cluster results in faster read and write times, so training code can save checkpoints more frequently, reducing the potential loss of training progress when the workload is interrupted. This functionality is defined in in_cluster_local_ckpt.py and replication_utils.py. There are separate flags for saving checkpoints to local and persistent storage: setting --local-ckpt-dir and --local-ckpt-interval enables in-cluster checkpointing, while setting --persistent-ckpt-dir and --persistent-ckpt-interval enables checkpointing to persistent storage. A multi-tiered checkpointing solution should checkpoint to both local and persistent storage, which is supported by enabling all four of the flags above, as sketched below.
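A minimal sketch of such a multi-tiered setup is shown here; the directory paths and interval values are illustrative placeholders, and passing these as workload flags is an assumption rather than a documented requirement:

```
# Sketch: placeholder paths and interval values; adjust to your storage layout
# and checkpointing cadence.
WORKLOAD_FLAGS="${WORKLOAD_FLAGS} \
  --local-ckpt-dir /path/to/local/ckpt \
  --local-ckpt-interval 50 \
  --persistent-ckpt-dir /path/to/persistent/ckpt \
  --persistent-ckpt-interval 500"
```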
Holistic cluster control is supported via the Supervisor. The Supervisor comprises three components based on control theory:
- Sensor: Senses signals throughout the cluster and ingests them into the Controller.
- Controller: Defines logic in response to the signals received from the Sensor.
- Actuator: Applies that logic to the training cluster.
This library includes the client-facing portion of the Supervisor in supervisor. One instance each of the Sensor, Controller, and Actuator should be deployed on CPU hosts per cluster. Each training (GPU) host running the training workload should also run host_daemon.py as a daemon deployment. The host daemon on each host is responsible for monitoring the training processes on that host and relays all signals to the Sensor deployment.
The Supervisor builds upon the resiliency support provided by NVIDIA's NvRx library. In particular, it utilizes the heartbeat mechanisms defined in NvRx as a signal of whether each node is training as expected. When interruptions occur, the training job relies on the in-job restart support provided by NvRx.
Once the Supervisor is deployed, include the launcher flag --use-supervisor to attach the training job to the Supervisor.
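For example, as a sketch (note that --use-supervisor is a launcher flag, so it belongs with the launcher flags rather than the workload flags):

```
# Sketch: pass --use-supervisor to launcher.py, not to benchmark.py.
LAUNCHER_FLAGS="${LAUNCHER_FLAGS} --use-supervisor"
```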
This library includes several tools to help with resiliency studies:
- Goodput calculator: resiliency/goodput_measure includes tools to calculate training goodput. Please take a look at its README for more information on how to use the goodput calculator.
- Failure simulator: simulator provides an entrypoint script to simulate failures in the workload. To learn more about the script and its command-line arguments, please run the following:
python3 simulator/simulator_entrypoint.py --help
To use this library, we recommend running it directly from the Docker container us-central1-docker.pkg.dev/deeplearning-images/reproducibility/pytorch-gpu-goodput:latest, which already has all needed dependencies installed. Our goodput training recipes defined in AI-Hypercomputer/gpu-recipes provide instructions for deploying the Supervisor and checkpoint optimization features on GKE.
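A minimal sketch of starting the container interactively is shown below; the GPU and volume-mount options are generic Docker usage rather than project-specific requirements, and the /workspace mount path is an illustrative assumption:

```
# Sketch: run the prebuilt image with GPUs and mount the current checkout.
docker run --rm -it --gpus all \
  -v "$(pwd)":/workspace \
  -w /workspace \
  us-central1-docker.pkg.dev/deeplearning-images/reproducibility/pytorch-gpu-goodput:latest \
  /bin/bash
```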
If you want to use this library from scratch, please install it as a package via
pip install -e .
Within the Docker container, use benchmark.py to run training utilizing the library's resiliency features. We recommend running the training script using the launcher in launcher.py.
The launch command from the root directory looks like the following:
python3 launcher.py \
${LAUNCHER_FLAGS} \
benchmark/benchmark.py \
${WORKLOAD_FLAGS}
where ${LAUNCHER_FLAGS} and ${WORKLOAD_FLAGS} are all flags specific to the launcher and the workload, respectively.
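Putting the pieces together, a concrete invocation that enables the Supervisor and multi-tiered checkpointing might look like the following sketch; the flag values are illustrative placeholders, and the split between launcher and workload flags follows the descriptions above:

```
# Sketch: --use-supervisor is a launcher flag; the remaining flags are passed to
# the workload. Paths and interval values are placeholders.
python3 launcher.py \
    --use-supervisor \
    benchmark/benchmark.py \
    --enable-optimized-async-ckpt \
    --local-ckpt-dir /path/to/local/ckpt \
    --local-ckpt-interval 50 \
    --persistent-ckpt-dir /path/to/persistent/ckpt \
    --persistent-ckpt-interval 500
```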
For a complete list of launcher flags included in launcher.py, please run:
python3 launcher.py --help
For a complete list of all flags included in benchmark.py, please run:
python3 benchmark/benchmark.py --help
If you run into any issues when using this library, please file an issue or create a pull request.