
Pipelines

Introduction

This repository provides templates and reference implementations of Vertex AI Pipelines for production-grade training and batch prediction pipelines on GCP for:

• TensorFlow
• XGBoost

Further useful documentation:

The sections below provide a general description of the ML pipelines (training and prediction) for both the TensorFlow template and the XGBoost template. The two templates are similar in most respects; a complete overview of their key differences is given in their respective README files:

Training Pipeline

Prerequisites for training pipeline

• Optional: an existing champion model

Components in training pipeline

The training pipeline can be broken down into the following sequence of components at a high level:

Step 1: Generate Queries
• Description: Generate the base preprocessing and train-test-validation split queries for Google BigQuery. This component only needs a .sql file and all parametrized values used in that .sql file (source_dataset, source_table, filter_column, and so on).

Step 2: BQ Query to Table
• Description: Takes the generated query from the previous component and runs it on Google BigQuery to create the required table (preprocessed data / train-test-validation data).
• Input(s): Output from Generate Queries
• Output(s): New Google BigQuery table

Step 3: Extract BQ to Dataset
• Description: Creates CSV file(s) in Google Cloud Storage from the Google BigQuery tables created in the previous component.
• Input(s): BigQuery table created by BQ Query to Table
• Output(s): The BigQuery table exported to Google Cloud Storage as CSV/JSONL files, plus the corresponding file directory and path

Step 4: Vertex Training
• Description: Runs a Vertex Training job on the train/validation data, using a training component wrapped in a ContainerOp from google-cloud-pipeline-components.
• Input(s): Train/validation data (Google Cloud Storage CSV files for train_data and valid_data from Extract BQ to Dataset); model parameters (model-specific parameters for training)
• Output(s): Trained challenger model object(s)/binaries stored in Google Cloud Storage

Step 5: Challenger Model Predictions
• Description: Uses the trained model to generate challenger predictions for evaluation purposes.
• Input(s): Test data (Google Cloud Storage CSV files for test_data from Extract BQ to Dataset); trained model (challenger model binaries stored in Google Cloud Storage from Vertex Training)
• Output(s): Challenger predictions on the test data, stored as CSV files in Google Cloud Storage

Step 6: Calculate Evaluation Metrics
• Description: Uses the predictions from the previous component to compute user-defined evaluation metrics. This component uses TFMA.
• Input(s): Test-data predictions stored as CSV files in Google Cloud Storage from Challenger Model Predictions; data slices, if required
• Output(s): Evaluation metrics stored in Google Cloud Storage; plots for all evaluation metrics and slices stored as HTML files in Google Cloud Storage

Step 7: Lookup Model
• Description: Fetches the required model and its resource name if a previous champion model exists in Vertex AI.
• Input(s): Base model name to check for in Vertex AI
• Output(s): Model resource name as a string if a champion model exists, otherwise an empty string

Step 8: Champion-Challenger Comparison
• Description: All model training up to this point has produced a new challenger model. The idea is to compare this newly trained challenger model with the existing champion model and decide, based on the comparison, whether the challenger should be deployed (see the sketch after this table):
  • If no champion model exists: upload the challenger model as the new champion model to Vertex AI.
  • If a champion model exists:
    • Calculate evaluation metrics: compute the user-defined evaluation metrics for the champion model on the same test_data.
    • Compare models: compare the computed evaluation metrics of the champion and the challenger.
      • If the challenger performs better: upload the challenger model as the new champion model to Vertex AI.
      • If the champion performs better: end the pipeline (the champion model remains as-is in Vertex AI).
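
The champion/challenger branching above maps naturally onto conditional steps in KFP. Below is a minimal sketch, assuming the KFP v2 SDK (on older SDK versions the dsl module is imported from kfp.v2) and using hypothetical component names and signatures rather than the templates' actual ones:

```python
from kfp import dsl

# Toy stand-ins for the templates' real components (names and signatures are hypothetical).
@dsl.component
def lookup_model(model_name: str) -> str:
    """Return the champion model's resource name, or '' if none exists."""
    return ""  # placeholder

@dsl.component
def compare_models(champion_resource_name: str) -> bool:
    """Return True if the challenger outperforms the champion."""
    return True  # placeholder

@dsl.component
def upload_model(model_name: str):
    """Upload the challenger as the new champion (placeholder)."""
    print(f"uploading {model_name}")

@dsl.pipeline(name="champion-challenger-sketch")
def champion_challenger_sketch(model_name: str = "my-model"):
    lookup = lookup_model(model_name=model_name).set_display_name("Lookup Model")

    # No champion yet: the freshly trained challenger becomes the champion.
    with dsl.Condition(lookup.output == "", name="no-champion"):
        upload_model(model_name=model_name).set_display_name("Upload Challenger")

    # A champion exists: evaluate it and only upload the challenger if it wins.
    with dsl.Condition(lookup.output != "", name="champion-exists"):
        comparison = compare_models(champion_resource_name=lookup.output)
        with dsl.Condition(comparison.output == True, name="challenger-wins"):
            upload_model(model_name=model_name).set_display_name("Upload Challenger")
```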

Each of these components can be connected either explicitly, using .after({component_name}), or implicitly, when the output of a preceding component is used as an input. Every component can also be given a display name using .set_display_name({display_name}). Note that this applies to both the training pipeline and the prediction pipeline.
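
For example, a minimal sketch of wiring two KFP components together (component and parameter names here are illustrative, not the templates' actual ones):

```python
from kfp import dsl

@dsl.component
def generate_query(source_table: str) -> str:
    """Illustrative component: render a SQL query for the given table."""
    return f"SELECT * FROM `{source_table}`"

@dsl.component
def run_query(query: str):
    """Illustrative component: pretend to run the query."""
    print(query)

@dsl.pipeline(name="wiring-sketch")
def wiring_sketch(source_table: str = "project.dataset.table"):
    # Implicit dependency: run_query consumes generate_query's output.
    gen = generate_query(source_table=source_table).set_display_name("Generate Queries")
    run = run_query(query=gen.output).set_display_name("BQ Query to Table")

    # Explicit dependency: force ordering even without a data dependency.
    run.after(gen)
```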

Prediction Pipeline

Prerequisites for prediction pipeline

• A champion model
• A successful run of the training pipeline

Components in prediction pipeline

The prediction pipeline can be broken down into the following sequence of components at a high level:

Step 1: Generate Queries
• Description: Generate the base preprocessing and prediction-data creation queries for Google BigQuery. This component only needs a .sql file and all parametrized values used in that .sql file (source_dataset, source_table, filter_column, etc.).

Step 2: BQ Query to Table
• Description: Takes the generated query from the previous component and runs it on Google BigQuery to create the required table (preprocessed data / prediction data).
• Input(s): Output from Generate Queries
• Output(s): New Google BigQuery table

Step 3: Extract BQ to Dataset
• Description: Creates CSV/JSONL file(s) in Google Cloud Storage from the Google BigQuery tables created in the previous component.
• Input(s): BigQuery table created by BQ Query to Table; the full table name is required, i.e. {project_id}.{dataset_id}.{table_id}
• Output(s): The BigQuery table exported to Google Cloud Storage as CSV/JSONL files, plus the corresponding file directory and path

Step 4: Lookup Model
• Description: Fetches the resource name of the champion model in Vertex AI. Since the prediction pipeline always runs after the training pipeline, a champion model will always exist.
• Input(s): Base champion model name as a string
• Output(s): Champion model resource name as a string

Step 5: Vertex Batch Predictions from Google Cloud Storage (TensorFlow prediction pipeline)
• Description: Runs a Vertex Batch Prediction job with the prediction data as input.
• Input(s): Prediction data (URIs of the Google Cloud Storage CSV/JSONL files for prediction_data from Extract BQ to Dataset); model (champion model as per Vertex AI)
• Output(s): A batch prediction job artifact with metadata: resourceName (the batch prediction job ID) and gcsOutputDirectory (the output JSONL files in Google Cloud Storage)

Step 6: Vertex Batch Predictions from BigQuery (XGBoost prediction pipeline)
• Description: Runs a Vertex Batch Prediction job with the prediction data as input.
• Input(s): Prediction data (Google BigQuery table of prediction_data from Extract BQ to Dataset); model resource name (champion model as per Vertex AI)
• Output(s): A batch prediction job artifact with metadata: resourceName (the batch prediction job ID) and bigqueryOutputTable (the output BigQuery table)

Step 7: Load Dataset to BQ (TensorFlow prediction pipeline)
• Description: Uploads the batch predictions stored in Google Cloud Storage to a BigQuery table.
• Input(s): Batch predictions stored as JSONL files in Google Cloud Storage from Vertex Batch Predictions; the exact URI can be read from the batch prediction job's metadata
• Output(s): Google BigQuery table containing the batch predictions and the input instances/features


The final loading stage has an optional argument dataset_location, which is the location in which to run the load job; it must match the location of the destination table. It defaults to "EU" in the component definition.

Trigger

• The trigger script can be found here

Pipeline configuration

In order to orchestrate machine learning (ML) pipelines on Vertex AI, you need to configure a few things.

env.sh

The first thing to do is to copy env.sh.example to env.sh and fill in env.sh with the values relevant to your development/sandbox environment. These environment variables are used when you trigger pipeline runs in Vertex AI.

Pipeline input parameters

The ML pipelines have input parameters. As you can see in the pipeline definition files (pipelines/<xgboost|tensorflow>/<training|prediction>/pipeline.py), they have default values, some of which are derived from environment variables (which in turn are defined in env.sh, as described above).
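
As an illustration only (the environment variable and parameter names below are hypothetical; the real ones live in the templates' pipeline.py files), defaults derived from environment variables typically look something like this:

```python
import os

from kfp import dsl

# Hypothetical environment variable names; see env.sh and the templates' pipeline.py
# files for the real ones.
PROJECT_ID = os.environ.get("VERTEX_PROJECT_ID", "my-dev-project")
DATASET_ID = os.environ.get("BQ_DATASET_ID", "my_dataset")

@dsl.component
def log_params(project_id: str, dataset_id: str, model_name: str):
    """Illustrative step that just echoes the resolved parameter values."""
    print(project_id, dataset_id, model_name)

@dsl.pipeline(name="parameter-defaults-sketch")
def parameter_defaults_sketch(
    project_id: str = PROJECT_ID,   # default resolved from env.sh via the environment
    dataset_id: str = DATASET_ID,
    model_name: str = "my-model",
):
    log_params(project_id=project_id, dataset_id=dataset_id, model_name=model_name)
```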

When triggering ad hoc runs in your dev/sandbox environment, or when running the E2E tests in CI, these default values are used. For the test and production deployments, the pipeline parameters are defined in the Terraform code for the Cloud Scheduler jobs (envs/<test|prod>/variables.auto.tfvars).

Python packages and Docker images

You can specify the Python base image and the packages required for KFP components in the @component decorator, using the base_image and packages_to_install arguments respectively.
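
For example, a minimal sketch of a component that pins its own base image and dependencies (the image tag and package version here are placeholders):

```python
from kfp import dsl

@dsl.component(
    base_image="python:3.10",               # placeholder image
    packages_to_install=["pandas==2.1.4"],  # placeholder dependency
)
def summarise_csv(csv_path: str) -> int:
    """Count the rows of a CSV file using the installed pandas package."""
    import pandas as pd

    return len(pd.read_csv(csv_path))
```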

Compute resources configuration in pipeline

In general, there are two ways to configure compute resources in each pipeline. The first is to set the machine_type variable in the XGBoost training pipeline, XGBoost prediction pipeline, TensorFlow training pipeline, or TensorFlow prediction pipeline. The default value is n1-standard-4, which has 4 vCPUs and 15 GB of memory. The second is to set resources on individual pipeline steps, since some steps may need more computational resources than others. You can increase the CPU and memory limits of any component by applying .set_cpu_limit({CPU_LIMIT}) and .set_memory_limit({MEMORY_LIMIT}), as shown in the sketch after the list below.

• CPU_LIMIT: The maximum CPU limit for this operator. This string value can be a number (an integer number of CPUs) or a number followed by "m", meaning 1/1000 of a CPU (millicores). You can specify at most 96 CPUs.
• MEMORY_LIMIT: The maximum memory limit for this operator. This string value can be a number, or a number followed by "K" (kilobyte), "M" (megabyte), or "G" (gigabyte). At most 624 GB is supported.
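
A minimal sketch of raising the limits on a single step (the component name and the exact values are illustrative):

```python
from kfp import dsl

@dsl.component
def train_model(train_data_path: str):
    """Illustrative training step that needs more resources than the default."""
    print(f"training on {train_data_path}")

@dsl.pipeline(name="resources-sketch")
def resources_sketch(train_data_path: str = "gs://bucket/train.csv"):
    (
        train_model(train_data_path=train_data_path)
        .set_display_name("Vertex Training")
        .set_cpu_limit("8")        # up to 96 CPUs
        .set_memory_limit("32G")   # up to 624 GB
    )
```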

For more information, please refer to the guide on specifying machine types for a pipeline step.

Cache Usage in pipeline

When Vertex AI Pipelines runs a pipeline, it checks whether an execution with the same interface (cache key) already exists in Vertex ML Metadata for each pipeline step (component). If the component and its arguments are exactly the same as in a previous execution, the task can be skipped and the outputs of the earlier run reused. Since most ML projects involve long-running and computationally expensive steps, using the cache is cost-effective when you are sure that the cached outputs of the components are still correct.

Caching behaviour can be controlled either per component or for the entire pipeline (all components). To control caching for an individual component, add .set_caching_options(<True|False>) to that component when building the pipeline, as sketched below. To change the caching behaviour of ALL components within a pipeline, specify this when you trigger the pipeline, like so: make run pipeline=<training|prediction> enable_caching=<true|false>

It is suggested to disable caching of components during development, until you have a good idea of how the caching behaviour works, as it can otherwise lead to unexpected results.
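
A minimal sketch of per-component cache control (the component name is illustrative):

```python
from kfp import dsl

@dsl.component
def ingest_data(ingestion_dataset_id: str) -> str:
    """Illustrative ingestion step."""
    return f"ingested {ingestion_dataset_id}"

@dsl.pipeline(name="caching-sketch")
def caching_sketch(ingestion_dataset_id: str = "my_dataset"):
    # Always re-run this step, even if its arguments are unchanged.
    ingest = ingest_data(ingestion_dataset_id=ingestion_dataset_id)
    ingest.set_caching_options(False)
```

Pipeline-wide caching is then set at trigger time, e.g. make run pipeline=training enable_caching=false.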

Note: Disable caching if you want to use new data. When caching is enabled for the pipeline, changing the timestamp or the source of the data (such as ingestion_dataset_id) will only change the output of the corresponding step (for example, Ingest data). The other components, which receive the same arguments as before, will reuse their cached outputs instead of processing the new data.

Champion / Challenger evaluation

In the training pipelines, the Champion-Challenger evaluation compares models that share the same name pattern within the same project. Explore lookup_model.py for more detailed information. In practice, you should be aware of this and give the model a new name, specific to the ML project you are working on, whenever the new model is not comparable with the previous models. For example, when you want to train a new model using different features, the best practice is to change the model name in the pipeline input parameters.
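
To illustrate the idea behind the lookup (not the template's exact implementation in lookup_model.py), finding an existing champion by display name with the Vertex AI SDK looks roughly like this:

```python
from google.cloud import aiplatform

def lookup_champion_resource_name(model_name: str, project: str, location: str) -> str:
    """Return the resource name of the newest model with this display name, or '' if none exists."""
    aiplatform.init(project=project, location=location)
    models = aiplatform.Model.list(
        filter=f'display_name="{model_name}"',
        order_by="create_time desc",
    )
    return models[0].resource_name if models else ""
```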