AKA "Vertex AI Turbo Templates"
This repository provides a reference implementation of Vertex Pipelines for creating a production-ready MLOps solution on Google Cloud. You can take this repository as a starting point for your own ML use cases. The implementation includes:
- Infrastructure-as-Code using Terraform for a typical dev/test/prod setup of Vertex AI and other relevant services
- ML training and prediction pipelines using Kubeflow Pipelines
- Reusable Kubeflow components that can be used in common ML pipelines
- CI/CD using Google Cloud Build for linting, testing, and deploying ML pipelines
- Developer scripts (Makefile, Python scripts etc.)
Get started today by following this step-by-step notebook tutorial! 🚀 In this three-part notebook series you'll deploy a Google Cloud project and run production-ready ML pipelines using Vertex AI without writing a single line of code.
The diagram below shows the cloud architecture for this repository.
There are four different Google Cloud projects in use:
- dev: a shared sandbox environment for use during development
- test: an environment for testing new changes before they are promoted to production. This environment should be treated as much as possible like a production environment.
- prod: the production environment
- admin: a separate Google Cloud project for setting up CI/CD in Cloud Build (since the CI/CD pipelines operate across the different environments)
Vertex Pipelines are scheduled using Google Cloud Scheduler. Cloud Scheduler emits a Pub/Sub message that triggers a Cloud Function, which in turn triggers the Vertex Pipeline to run. In future, this will be replaced with the Vertex Pipelines Scheduler (once there is a Terraform resource for it).
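To make the trigger chain above more concrete, here is a minimal Python sketch of what a Pub/Sub-triggered Cloud Function along these lines might look like. It is not the function shipped in this repository. The payload field names (project, location, display_name, template_path, pipeline_root, parameter_values, enable_caching, service_account) are assumptions made for illustration, and the sketch assumes a first-generation background function signature and the google-cloud-aiplatform package.

```python
import base64
import json


def decode_pubsub_payload(event: dict) -> dict:
    """Decode the base64-encoded JSON body of a Pub/Sub-triggered event."""
    raw = base64.b64decode(event["data"]).decode("utf-8")
    return json.loads(raw)


def trigger_pipeline(event: dict, context=None) -> None:
    """Hypothetical entry point: submit a Vertex pipeline run described by the message."""
    payload = decode_pubsub_payload(event)

    # Imported lazily so the decoding helper can be exercised without the SDK installed.
    from google.cloud import aiplatform

    aiplatform.init(project=payload["project"], location=payload["location"])
    job = aiplatform.PipelineJob(
        display_name=payload["display_name"],
        template_path=payload["template_path"],  # compiled pipeline definition
        pipeline_root=payload["pipeline_root"],  # GCS location for pipeline artifacts
        parameter_values=payload.get("parameter_values", {}),
        enable_caching=payload.get("enable_caching"),  # None defers to the pipeline's own setting
    )
    # submit() returns once the run has been created; it does not wait for completion.
    job.submit(service_account=payload.get("service_account"))
```

In this sketch, the Cloud Scheduler job would carry the JSON payload as its Pub/Sub message body, so changing a schedule's parameters would not require redeploying the function.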
Prerequisites:
- Terraform for managing cloud infrastructure
- tfswitch to automatically choose and download an appropriate Terraform version (recommended)
- Pyenv for managing Python versions
- Poetry for managing Python dependencies
- Google Cloud SDK (gcloud)
- Make
- Cloned repo
Deploy infrastructure:
You will need four Google Cloud projects: dev, test, prod, and admin. The Cloud Build pipelines will run in the admin project, and deploy resources into the dev/test/prod projects. Before your CI/CD pipelines can deploy the infrastructure, you will need to set up a Terraform state bucket for each environment (the example below is for dev):
export DEV_PROJECT_ID=my-dev-gcp-project
export DEV_LOCATION=europe-west2
gsutil mb -l $DEV_LOCATION -p $DEV_PROJECT_ID --pap=enforced gs://$DEV_PROJECT_ID-tfstate && \
gsutil ubla set on gs://$DEV_PROJECT_ID-tfstate
Enable APIs in the admin project:
export ADMIN_PROJECT_ID=my-admin-gcp-project
gcloud services enable cloudresourcemanager.googleapis.com serviceusage.googleapis.com --project=$ADMIN_PROJECT_ID
make deploy env=dev
More details about the infrastructure are given in this guide, which also describes how pipelines are scheduled and how to tear down the infrastructure.
Install dependencies:
pyenv install --skip-existing # install Python
poetry config virtualenvs.prefer-active-python true # configure Poetry
make install # install Python dependencies
cd pipelines && poetry run pre-commit install # install pre-commit hooks
cp env.sh.example env.sh
Update the environment variables for your dev environment in env.sh.
Authenticate to Google Cloud:
gcloud auth login
gcloud auth application-default login
This repository contains example ML training and prediction pipelines which are explained in this guide.
Build containers: The model/ directory contains the code for custom training and prediction container images, including the model training script at model/training/train.py. You can modify this to suit your own use case. Build the training and prediction container images and push them to Artifact Registry with:
make build [ images="training prediction" ]
Optionally specify the images variable to build only one of the images.
Execute pipelines: Vertex AI Pipelines uses Kubeflow to orchestrate your training steps, so you'll need to:
1. Compile the pipeline
2. Build dependent Docker containers
3. Run the pipeline in Vertex AI
Execute the following command to run through steps 1-3:
make run pipeline=training [ build=<true|false> ] [ compile=<true|false> ] [ cache=<true|false> ] [ wait=<true|false> ]
The command has the following true/false flags:
- build: re-build containers for training & prediction code (limit by setting images=training to build only one of the containers)
- compile: re-compile the pipeline to YAML
- cache: cache pipeline steps
- wait: run the pipeline synchronously or asynchronously
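As a rough illustration of what the compile, cache, and wait flags could correspond to in Python, the sketch below uses the KFP SDK compiler and the Vertex AI SDK. This is an assumption about the general approach, not the code behind the Makefile: the function names parse_flag, compile_pipeline, and run_pipeline are hypothetical, and the sketch assumes KFP SDK v2 and google-cloud-aiplatform are installed and that a default project and location are already configured.

```python
def parse_flag(value: str) -> bool:
    """Parse a make-style true/false flag such as cache=true."""
    normalized = value.strip().lower()
    if normalized not in ("true", "false"):
        raise ValueError(f"expected 'true' or 'false', got {value!r}")
    return normalized == "true"


def compile_pipeline(pipeline_func, package_path: str) -> str:
    """Compile a KFP pipeline function to a YAML pipeline definition."""
    from kfp import compiler  # lazy import; assumes KFP SDK v2

    compiler.Compiler().compile(pipeline_func=pipeline_func, package_path=package_path)
    return package_path


def run_pipeline(template_path: str, pipeline_root: str, display_name: str,
                 cache: bool, wait: bool) -> None:
    """Run a compiled pipeline on Vertex AI, optionally blocking until it finishes."""
    from google.cloud import aiplatform  # lazy import; assumes default project/location

    job = aiplatform.PipelineJob(
        display_name=display_name,
        template_path=template_path,
        pipeline_root=pipeline_root,
        enable_caching=cache,  # False forces every step to re-run
    )
    if wait:
        job.run(sync=True)  # block until the pipeline run completes
    else:
        job.submit()  # return as soon as the run has been created
```

With helpers like these, a call such as run_pipeline("training.yaml", "gs://<bucket>/pipeline_root", "training", cache=parse_flag("false"), wait=parse_flag("true")) would re-run every step and block until the run finishes; the file name and bucket path here are placeholders.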
Shortcuts: Use these commands, which support the same options as run, to run the training or prediction pipeline:
make training
make prediction
Unit tests are performed using pytest. The unit tests are run on each pull request. To run them locally you can execute the following command and optionally enable or disable testing of components:
make test [ packages=<pipelines components> ]
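For readers new to pytest, the sketch below shows the general shape of a unit test it would collect. The helper scale_to_unit_range is hypothetical and is not part of this repository; it only stands in for whatever logic a component or pipeline step might contain. pytest discovers functions named test_* and reports any failing assert, so the tests themselves need no imports.

```python
def scale_to_unit_range(values):
    """Hypothetical helper: linearly rescale numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if lo == hi:
        # All values are identical, so there is no range to scale over.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def test_scale_to_unit_range_bounds():
    # The smallest value maps to 0.0 and the largest to 1.0.
    assert scale_to_unit_range([2, 4, 6]) == [0.0, 0.5, 1.0]


def test_scale_to_unit_range_constant_input():
    # Constant input must not divide by zero.
    assert scale_to_unit_range([3, 3]) == [0.0, 0.0]
```

Keeping the logic under test free of cloud SDK calls, as here, is one way to let such tests run quickly on every pull request without Google Cloud credentials.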
For details on setting up CI/CD, see this guide.
For a full walkthrough of the journey from changing the ML pipeline code to having it scheduled and running in production, please see the guide here.
We value your contributions. See this guide for how to contribute to this project.