outerbounds/ray-torch-distributed-checkpoint

Setup

pip install outerbounds metaflow-ray==0.1.3

Manual mode

To run training from scratch, without any pre-trained weights:

python train_flow.py --environment=fast-bakery run

Checkpoints are persisted in the Metaflow datastore, in a location unique to each task's runtime. Note the current.ray_storage_path variable exposed in any Metaflow step annotated with @metaflow_ray; it can be passed as the storage_path argument of Ray's RunConfig, as sketched below.
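
A minimal sketch of a training step that wires this together. The import path for metaflow_ray, the training loop, and the scaling settings are assumptions standing in for whatever train_flow.py actually does; only current.ray_storage_path, @metaflow_ray, and RunConfig(storage_path=...) come from the description above.

from metaflow import FlowSpec, current, step
from metaflow import metaflow_ray  # assumed import; the decorator is provided by the metaflow-ray extension


def train_loop_per_worker(config):
    # Standard Ray Train + PyTorch training loop; checkpoints reported via
    # ray.train.report(metrics, checkpoint=...) are written under storage_path.
    ...


class RayTorchTrain(FlowSpec):

    @step
    def start(self):
        self.next(self.train)

    @metaflow_ray
    @step
    def train(self):
        from ray.train import RunConfig, ScalingConfig
        from ray.train.torch import TorchTrainer

        trainer = TorchTrainer(
            train_loop_per_worker,
            scaling_config=ScalingConfig(num_workers=2, use_gpu=True),
            # current.ray_storage_path points at a task-specific location in the
            # Metaflow datastore, so checkpoints are persisted per run.
            run_config=RunConfig(storage_path=current.ray_storage_path),
        )
        trainer.fit()
        self.next(self.end)

    @step
    def end(self):
        pass


if __name__ == "__main__":
    RayTorchTrain()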

To resume training from a previous run's checkpoint (the run ID appears in the Outerbounds UI and in the CLI output):

python train_flow.py --environment=fast-bakery run --from-run RayTorchTrain/<YOUR_RUN_ID>
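
If the run ID is not at hand, the standard Metaflow client can also list previous RayTorchTrain runs. A small helper sketch, not part of the repo:

from metaflow import Flow

# Print recent RayTorchTrain runs with their completion status,
# so a suitable <YOUR_RUN_ID> can be picked for --from-run.
for run in Flow("RayTorchTrain"):
    print(run.id, run.successful, run.finished_at)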

To run the evaluation pipeline against a training run's checkpoint:

python eval_flow.py --environment=fast-bakery evaluate --from-run RayTorchTrain/<YOUR_RUN_ID>

Automate on the production orchestrator

Deploy the training workflow:

python train_flow.py --environment=fast-bakery argo-workflows create

Deploy the evaluation workflow:

python eval_flow.py --environment=fast-bakery argo-workflows create

To trigger the training workflow, go to the Deployments page in the Outerbounds UI and find it there. Alternatively, run this command:

python train_flow.py --environment=fast-bakery argo-workflows trigger

The evaluation workflow will be triggered automatically after the training workflow completes.
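
This chaining is typically expressed with Metaflow's @trigger_on_finish flow-level decorator. A sketch of how eval_flow.py might declare it; the class name RayTorchEval and the step bodies are hypothetical, only the RayTorchTrain flow name comes from this repo:

from metaflow import FlowSpec, step, trigger_on_finish


# Once deployed to Argo Workflows, this flow starts automatically
# whenever a RayTorchTrain deployment finishes successfully.
@trigger_on_finish(flow="RayTorchTrain")
class RayTorchEval(FlowSpec):

    @step
    def start(self):
        # When started by the event, current.trigger.run points at the
        # finishing RayTorchTrain run, e.g. to locate its checkpoint.
        self.next(self.end)

    @step
    def end(self):
        pass


if __name__ == "__main__":
    RayTorchEval()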
