README: ML Pipeline with BigQuery ML and Dataform

Overview

This project demonstrates how to build a machine learning (ML) pipeline using BigQuery ML within a Dataform environment. We use a public dataset of Chicago Taxi Trips to train, evaluate, and predict fare amounts based on features like trip duration, distance, and payment type. The ML models are trained and executed within Google BigQuery using SQL, and the overall workflow is managed with Dataform for version control and reproducibility.

Files and Structure

1. Data Extraction

We prepare three datasets: training, evaluation, and testing, by applying necessary transformations and feature engineering on the raw Chicago Taxi Trips data.
Files:
- definitions/sources/chicago_taxi_trips.sqlx: Declares the public dataset as a source.
```
config {
    type: "declaration",
    database: "bigquery-public-data",
    schema: "chicago_taxi_trips",
    name: "taxi_trips"
}
```
- definitions/models/01_data_extraction/chicago_taxi_trips_training.sqlx: Prepares the training dataset by selecting trips between 2 and 5 months before the most recent date in the dataset and applying filtering and transformations.
```
config {
    type: "table",
    schema: "df_bqml_chicago_taxi_trips"
}

-- Transforms and filters data for training
```
- definitions/models/01_data_extraction/chicago_taxi_trips_evaluation.sqlx: Extracts the evaluation dataset, selecting trips between 1 and 2 months before the most recent date, following similar filtering and transformation steps as the training set.
```
config {
    type: "table",
    schema: "df_bqml_chicago_taxi_trips"
}

-- Transforms and filters data for evaluation
```
- definitions/models/01_data_extraction/chicago_taxi_trips_testing.sqlx: Prepares the testing dataset, using the most recent month of trips. This dataset is used for model prediction after training.
```
config {
    type: "table",
    schema: "df_bqml_chicago_taxi_trips"
}

-- Transforms and filters data for testing
```

2. Model Training

The machine learning model is trained using BigQuery ML. Specifically, we implement a DNN Regressor (Deep Neural Network) model to predict the total fare based on various features extracted from the taxi trips.
Files:
- definitions/models/02_model_training/dnn_reg.sqlx: Trains a Deep Neural Network (DNN) regressor model with the extracted training dataset. The target label is the fare, including tips, tolls, and extras.
```
CREATE OR REPLACE MODEL ${self()}  
OPTIONS (
   MODEL_TYPE = 'DNN_REGRESSOR',
   ACTIVATION_FN = 'RELU',
   HIDDEN_UNITS = [64, 64, 64, 64],
   INPUT_LABEL_COLS = ['label']
  )
```

3. Model Evaluation

Once the model is trained, we evaluate its performance using the ML.EVALUATE function in BigQuery ML. The evaluation focuses on metrics like mean absolute error (MAE), and assertions are added to validate model performance.
Files:
- definitions/assertions/evaluate_model.sqlx: Evaluates the performance of the trained model on the evaluation dataset. The assertion ensures that the mean absolute error (MAE) is below a specified threshold.
```
SELECT
   *
FROM
   ML.EVALUATE(
      MODEL ${ref('dnn_reg')}, 
      (
         SELECT * FROM ${ref('chicago_taxi_trips_evaluation')}
      )
   )
WHERE mean_absolute_error > 5
```

4. Model Prediction

Finally, the trained model is used to predict fares on the testing dataset. The predicted values are rounded and included in the output.

Files:

definitions/models/03_model_prediction/predict.sqlx: Uses the trained DNN Regressor to predict fares on the testing dataset and rounds the predictions for better readability.

   SELECT
      dayofweek,
      hourofday,
      trip_distance,
      trip_miles,
      trip_seconds,
      payment_type,
      company,
      ROUND(predicted_label, 2) AS predicted_fare
   FROM
      ML.PREDICT(
         MODEL ${ref('dnn_reg')},
         (
            SELECT * FROM ${ref('chicago_taxi_trips_testing')}
         )
      )

Workflow and Pipeline

Data Extraction: Extracts and transforms the Chicago Taxi Trips data from BigQuery for three datasets: training, evaluation, and testing.
Model Training: A DNN model is trained using the training dataset to predict the total fare amount.
Model Evaluation: The model is evaluated on the evaluation dataset, and an assertion is made to ensure acceptable model performance.
Model Prediction: The model is used to predict fares on the testing dataset, and the results are output as a new table.

Technologies Used

Dataform: Manages the SQL transformations and model training as part of a reproducible and version-controlled pipeline.
BigQuery ML: Used for training, evaluating, and predicting with machine learning models directly within BigQuery using SQL.
SQL: All transformations, data extraction, and model training are done using SQL scripts managed by Dataform.

Running the Pipeline

To execute the pipeline, ensure you have access to the Google BigQuery project and Dataform workspace. The entire process from data extraction to prediction is automated through Dataform's environment, ensuring a fully managed ML pipeline.

Conclusion

This project showcases how to leverage BigQuery ML and Dataform to implement a fully automated ML pipeline for predicting taxi fares in Chicago. The combination of data extraction, transformation, model training, evaluation, and prediction in a single pipeline makes it easy to manage and scale ML workflows in a cloud-based environment.

Name		Name	Last commit message	Last commit date
Latest commit History 16 Commits
definitions		definitions
.gitignore		.gitignore
README.md		README.md
workflow_settings.yaml		workflow_settings.yaml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

README: ML Pipeline with BigQuery ML and Dataform

Overview

Files and Structure

1. Data Extraction

2. Model Training

3. Model Evaluation

4. Model Prediction

Workflow and Pipeline

Technologies Used

Running the Pipeline

Conclusion

About

Releases

Packages

teamdatatonic/dta_bqml_df

Folders and files

Latest commit

History

Repository files navigation

README: ML Pipeline with BigQuery ML and Dataform

Overview

Files and Structure

1. Data Extraction

2. Model Training

3. Model Evaluation

4. Model Prediction

Workflow and Pipeline

Technologies Used

Running the Pipeline

Conclusion

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Packages