MLbot offers a fast & easy way to train ML models in your cloud, directly from your laptop.
It does this by abstracting away all the complexities of setting up a compute cluster & launching distributed training jobs behind a few simple commands.
So, as a data scientist, running your training code across multiple machines in your own cloud becomes as easy as replacing `python` with `mlbot run` (e.g., `python train.py ...` → `mlbot run --instance-type p3dn.24xlarge --num-nodes 2 train.py ...`).
Everything else -- from packaging/deploying your code to running your job in a fault-tolerant way across multiple nodes -- gets automatically taken care of for you.
At a high level, MLbot glues together Kubernetes & PyTorch Elastic to let you easily launch distributed training jobs in your own infrastructure (as opposed to using a hosted service). This way, you don't need to transfer your data to a third party, and you keep full flexibility & observability over the underlying compute.
Currently, MLbot relies on a few external tools for the heavy lifting (e.g., `eksctl` to manage the EKS cluster, `kubectl` to interact with it, and Docker to package your code), so make sure those are installed. To install MLbot itself, run:

```bash
pip install mlbot-cloud
```
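Once installed, you can sanity-check the CLI by asking one of its sub-commands for its help text (every sub-command supports a `--help` flag, as noted at the end of this guide):

```bash
# Should print the options that `mlbot run` accepts
mlbot run --help
```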
I've included a sample MNIST training script in this repo (adapted from an example provided by PyTorch Lightning) to help you try this tool out quickly.
To get started, run the following commands:

```bash
git clone https://github.com/thecooltechguy/mlbot.git
cd mlbot/examples/pytorch-lightning-image-classifier/
```
For this example, we'll create an AWS Elastic Kubernetes Service (EKS) cluster (running k8s version `1.21`) in the `us-west-2` region to run our training jobs on:

```bash
mlbot create-cluster --cluster testcluster --version 1.21 --region us-west-2 --availability-zones us-west-2a,us-west-2b,us-west-2c --instance-types p3dn.24xlarge --max-gpu-nodes 2 --create
```
This command will:

- Create a new `.mlbot` folder in the current directory
- Generate a new cluster configuration file for `eksctl` and save this to `.mlbot/cluster.yaml`
- Generate a new cluster configuration file for `mlbot` and save this to `.mlbot/cluster.json`
- Call `eksctl` to create a new EKS cluster using the new cluster configuration file
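If you're curious about what was generated, both configuration files live in the new `.mlbot` folder:

```bash
# List the files generated by `mlbot create-cluster`
ls .mlbot/

# View the eksctl cluster definition
cat .mlbot/cluster.yaml
```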
**More info:** If you'd like to only create the cluster configuration file and then separately create the EKS cluster using this file, you can run the same command as shown above, but remove the final `--create` flag. You can then separately run `eksctl create cluster -f .mlbot/cluster.yaml` to manually create the EKS cluster. This can be useful if you would like to edit the file before creating the EKS cluster.
Next, run:

```bash
mlbot setup-cluster
```

This command will make our EKS cluster ready for running distributed PyTorch jobs (e.g., by installing the cluster autoscaler, metrics server, etc.).
Once the above command is done, let's scale up a single non-GPU node in our cluster to run all pending pods:

```bash
eksctl scale nodegroup --cluster=testcluster --nodes=1 standard-ng-1
```
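These are plain `kubectl` checks rather than anything MLbot-specific, but they can confirm that the node came up and that the cluster components are running:

```bash
# The newly scaled-up node should appear as Ready
kubectl get nodes

# The components installed by `mlbot setup-cluster` (cluster autoscaler, metrics
# server, etc.) typically run as pods in the kube-system namespace
kubectl get pods -n kube-system
```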
Now that our cluster is ready, we can run distributed training jobs on it!
First, we need to specify the project name and docker image name that MLbot should use for deploying this project's compute jobs. The project name will be used as the prefix for all of this project's jobs, while the docker image name will be used for the images that MLbot builds & pushes for packaging your code.
```bash
mlbot init --project-name pl-mnist-example --docker-image <the docker image name to use for this project>
```
Note: This command will update the project's `.mlbot/config.json` file, and only needs to be run once per project.
Now that we've configured our project, we can finally run our training script in the cloud with the following command:

```bash
mlbot run --instance-type p3dn.24xlarge --az us-west-2b --num-nodes 2 train.py --trainer.gpus 8 --trainer.num_nodes 2 --trainer.strategy 'ddp' --trainer.max_epochs 100
```
The example command shown above will launch our training job across 2 `p3dn.24xlarge` instances (with a total of 16 GPUs). Note that everything before `train.py` in the above command is essentially what replaces `python` if you were to run this training script locally.
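To make the `python` → `mlbot run` substitution concrete, here's a side-by-side sketch (the local flags are illustrative; adjust them for your own machine):

```bash
# Running locally on a single machine (flags adjusted for one node):
python train.py --trainer.gpus 8 --trainer.strategy 'ddp' --trainer.max_epochs 100

# Running across 2 p3dn.24xlarge nodes in your own cloud (16 GPUs total):
mlbot run --instance-type p3dn.24xlarge --az us-west-2b --num-nodes 2 \
  train.py --trainer.gpus 8 --trainer.num_nodes 2 --trainer.strategy 'ddp' --trainer.max_epochs 100
```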
Once the job is running, you can view its live logs by running `kubectl attach -n elastic-job <project id>-worker-0`. Similarly, to view the live logs of the second node, you can run `kubectl attach -n elastic-job <project id>-worker-1`.

To view the logs of all training nodes, a better way might be to use a tool like Stern and run `stern -n elastic-job <job id> -t`.
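If you don't have Stern installed, a simple loop over the worker pods works too (this assumes the 2-node job from above; adjust the indices for more nodes, and replace `<project id>` with your project's id):

```bash
# Dump the current logs of each worker pod in the elastic-job namespace
for i in 0 1; do
  echo "=== worker-$i ==="
  kubectl logs -n elastic-job <project id>-worker-$i
done
```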
To stop this job, run `mlbot stop <job id>`. This will delete the compute job from the k8s cluster.
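You can double-check that the job's pods were removed with a quick `kubectl` query:

```bash
# After `mlbot stop`, the job's worker pods should no longer be listed
kubectl get pods -n elastic-job
```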
To completely delete the compute cluster we provisioned for this example, simply run:

```bash
mlbot delete-cluster
```

This will delete all of the cloud resources that we had provisioned for this example.
To run your training code on spot instances in a fault-tolerant manner, simply include the `--spot` flag when using `mlbot run` (e.g., `mlbot run --spot --instance-type p3dn.24xlarge --az us-west-2b --num-nodes 2 train.py ...`).

If a spot instance gets reclaimed, the training job will automatically pause and resume once a sufficient number of nodes are available. This applies both to jobs with a fixed number of nodes and to jobs with an elastic number of nodes.
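One way to observe this behaviour (it's not MLbot-specific) is to watch the job's worker pods while the job runs:

```bash
# Stream pod status changes; reclaimed workers should drop out and
# reappear once replacement capacity becomes available
kubectl get pods -n elastic-job -w
```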
To run your training code with a dynamic number of nodes, you can use the `--min-nodes` and `--max-nodes` options instead of a fixed `--num-nodes` when calling `mlbot run`.
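For example, a hypothetical invocation of the same MNIST job with between 1 and 2 nodes might look like the sketch below (the trainer flags you need with an elastic node count may differ; check the PyTorch Lightning / TorchElastic docs for your version):

```bash
# Elastic node count: the job can run with anywhere from 1 to 2 nodes
mlbot run --min-nodes 1 --max-nodes 2 --instance-type p3dn.24xlarge --az us-west-2b \
  train.py --trainer.gpus 8 --trainer.max_epochs 100
```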
To use the same cluster as project A for project B, you can copy the `cluster.yaml` and `config.json` files from project A's `.mlbot` directory into project B's `.mlbot` directory, and re-run `mlbot init --project-name ... --docker-image ...` inside project B's directory.
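As a concrete sketch, assuming project A and project B are sibling directories (the paths and names here are just placeholders for your own layout):

```bash
cd project-b/
mkdir -p .mlbot

# Reuse project A's cluster configuration files
cp ../project-a/.mlbot/cluster.yaml .mlbot/
cp ../project-a/.mlbot/config.json .mlbot/

# Re-initialize this project with its own name and docker image
mlbot init --project-name project-b --docker-image <the docker image name to use for project B>
```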
There will definitely be a much better way of doing this soon, as this tool gets updated & refactored.
To learn more about the arguments & options that the various sub-commands support, you can always run the sub-commands with the `--help` flag.
If you have any other questions, please create a GitHub discussion post and I'd be happy to help!
I'm currently working on updating the tool so that it can work with existing EKS clusters, instead of requiring users to create a new one. This should be ready soon!
Currently, this tool only supports running distributed PyTorch compute jobs on AWS EKS (Elastic Kubernetes Service). And it should work pretty well with PyTorch Lightning without needing any major code changes (since PyTorch Lightning already supports TorchElastic).
However, support for other frameworks and cloud environments will be added soon. If you'd like to see a particular integration added, let me know by creating a GitHub discussion post!
- Update/refactor code
- Add tests
- Add a better way to view live logs
- Add support for other frameworks & cloud environments
Contributions and feedback are always welcome! :)