josiah-chua/doggo-recog

MLOps pipeline for the training and deployment of a CV model for dog recognition

Overview

This personal MLOps project implements an end-to-end pipeline for a Telegram bot that uses deep learning to differentiate dog breeds. The main objectives are to test the effectiveness of Neptune.ai, an ML model monitoring and storage system, and to familiarise myself with PyTorch Lightning, a PyTorch wrapper package that allows for faster implementation of deep learning model training.

MLOps Pipeline

(MLOps pipeline diagram)

File structure

doggo-recog/
├── credentials/
├── data/
│   ├── deployment_model/
│   ├── model_data/
│   ├── new_data/
│   └── temp_data/
├── src/
│   ├── a_preprocessing.py
│   ├── b_experiment.py
│   ├── c_training.py
│   ├── d_api.py
│   └── utils/
│       ├── utils_data.py
│       ├── utils_envvar.py
│       ├── utils_gcs.py
│       ├── utils_model.py
│       ├── utils_preprocessing.py
│       └── utils_training.py
├── .env
├── venv
├── .gitignore
├── README.md
└── .requirements.txt

Original Dataset

The dataset can be found in the dataset branch. It is from the Stanford Dogs dataset, but each filename has been cleaned up to include the label and a unique ID (based on the datetime in milliseconds), and every photo has been converted to JPG. A similar naming convention will also be used in future implementations of data collection from users. The dataset was cleaned beforehand and formatted correctly into its 3 channels (RGB), as a couple of photos were found to have 4 channels (PNG photos contain a 4th opacity channel).
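The naming convention could be sketched like this; the exact filename layout and function name are assumptions, not taken from the repo:

```python
from datetime import datetime


def make_filename(label: str, taken_at: datetime) -> str:
    """Build a cleaned filename: the breed label plus a unique
    datetime-in-milliseconds ID, saved as .jpg."""
    # e.g. 2023-01-02 03:04:05.678 -> "20230102030405678"
    unique_id = taken_at.strftime("%Y%m%d%H%M%S") + f"{taken_at.microsecond // 1000:03d}"
    return f"{label}_{unique_id}.jpg"
```

The RGB cleanup mentioned above would be a separate step (e.g. `Image.open(...).convert("RGB")` with Pillow) before saving as JPG.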

(sample images from the dataset)

Preprocessing

Since the current dataset is quite clean, the preprocessing script currently only has a placeholder preprocessing function. However, should user data be stored, further preprocessing may be needed and will be added to the function as a future enhancement.

After preprocessing, the script converts the photos into NumPy arrays, stores the .npy files in the GCS preprocessed bucket, and stores the raw images in the raw-images bucket. There should be 4 different GCS buckets: new-raw-images, new-preprocessed-images, raw-images, and preprocessed-images. In this first iteration, the buckets prefixed with new- are not used, as they are reserved for new user data collected for retraining; the files are stored in the other 2 buckets.

There are 2 make executors for the preprocessing stage:

make preprocess_training_files: stores the preprocessed files in the raw-images and preprocessed-images buckets

make preprocess_new_files: stores the preprocessed files in the new-raw-images and new-preprocessed-images buckets
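The routing behind the two make targets can be sketched as a small helper; the bucket names come from this README, while the function name and flag are assumptions:

```python
def target_buckets(new_files: bool) -> tuple:
    """Return (raw_bucket, preprocessed_bucket) for an upload run.

    new_files=True corresponds to `make preprocess_new_files`,
    new_files=False to `make preprocess_training_files`.
    """
    if new_files:
        return ("new-raw-images", "new-preprocessed-images")
    return ("raw-images", "preprocessed-images")
```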

Experimentation

The experiment and training scripts are similar; the experimentation stage exists because deep learning models take a long time to train. To evaluate the effectiveness of particular models and hyperparameters, we can run smaller experiments with fewer classes and less data per class to get a rough gauge of performance. While better performance on a smaller dataset does not guarantee better results on the full dataset, the training metrics, combined with prior knowledge of the model architecture, give a sense of how effective a configuration is.

This is where PyTorch Lightning and Neptune.ai come in. PyTorch Lightning lets us package our models neatly and easily attach loggers and callbacks to monitor training progress. Training metrics can then be sent to the Neptune server through the NeptuneLogger and visualized on a dashboard.
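Attaching the NeptuneLogger is roughly a one-liner; this is a sketch rather than code from the repo, and the project name and token variable are placeholders:

```python
import os

import pytorch_lightning as pl
from pytorch_lightning.loggers import NeptuneLogger

# Placeholder project path; the API token is read from the environment.
neptune_logger = NeptuneLogger(
    api_key=os.environ["NEPTUNE_API_TOKEN"],
    project="your-workspace/doggo-recog",
    log_model_checkpoints=False,
)

trainer = pl.Trainer(max_epochs=10, logger=neptune_logger)
# trainer.fit(model, datamodule)  # metrics stream to the Neptune dashboard
```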

(Neptune dashboard showing training metrics)

For the experimentation stage, one can experiment with different model architectures (VGG, EfficientNet, ResNet); these are initialized with pre-trained ImageNet weights (IMAGENET1K_V1) for transfer learning, reducing the training time needed.
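With torchvision, loading such a backbone and swapping its classifier head for the number of breeds looks roughly like this (a sketch; the 120-class count for Stanford Dogs and the variable names are assumptions):

```python
import torch.nn as nn
from torchvision import models

NUM_BREEDS = 120  # Stanford Dogs has 120 breeds

# Load EfficientNet-B0 with the IMAGENET1K_V1 pre-trained weights.
backbone = models.efficientnet_b0(weights=models.EfficientNet_B0_Weights.IMAGENET1K_V1)

# Replace the final linear layer so the output matches the breed count.
in_features = backbone.classifier[1].in_features
backbone.classifier[1] = nn.Linear(in_features, NUM_BREEDS)
```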

The other hyperparameters to try are:

  • image input size (default 224)
  • batch size (default 32)
  • learning rate (default 1e-3)
  • transformation probability (default 0.025): the chance that a training image goes through a random augmentation

These hyperparameters will also be stored in Neptune in the corresponding run.
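Collected into one dict, the defaults above can be logged to the corresponding Neptune run in a single assignment (the key names here are assumptions):

```python
# Default hyperparameters from the list above.
DEFAULT_HPARAMS = {
    "image_size": 224,
    "batch_size": 32,
    "learning_rate": 1e-3,
    "transform_probability": 0.025,
}

# With a live Neptune run object this would be:
#   run["hyperparameters"] = DEFAULT_HPARAMS
```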

Training

Before training, the images are converted into PyTorch datasets with transform functions such as contrast, perspective, and brightness changes, blur, etc. These augmentations occur randomly each time the dataset is loaded every epoch, allowing the model to generalize features better and become more robust. Hence there is not much need to generate transformed data in the preprocessing stage, although you can still do so if you want.
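A train-time transform in this style could be sketched with torchvision, each augmentation gated by the transformation probability from the hyperparameters (the specific parameter values are assumptions):

```python
from torchvision import transforms

p = 0.025  # default transformation probability

train_transform = transforms.Compose([
    # Each augmentation is applied with probability p, independently.
    transforms.RandomApply([transforms.ColorJitter(contrast=0.3, brightness=0.3)], p=p),
    transforms.RandomApply([transforms.RandomPerspective(distortion_scale=0.2, p=1.0)], p=p),
    transforms.RandomApply([transforms.GaussianBlur(kernel_size=3)], p=p),
    transforms.ToTensor(),
])
```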

The training process is handled by the PyTorch Lightning Trainer, and the callbacks used are early stopping and checkpointing. The patience and max epochs can both be set for training and early stopping. Checkpointing saves only the last training weights and the weights of the best-performing epoch (the latter is saved only in the training stage). The model uses MulticlassROCAUC for evaluation.
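The callback setup described above could look like this; the monitored metric name and the numbers are assumptions:

```python
import pytorch_lightning as pl
from pytorch_lightning.callbacks import EarlyStopping, ModelCheckpoint

callbacks = [
    # Stop once the monitored metric stops improving for `patience` epochs.
    EarlyStopping(monitor="val_loss", mode="min", patience=5),
    # Keep the best epoch's weights plus the last epoch's weights.
    ModelCheckpoint(monitor="val_loss", mode="min", save_top_k=1, save_last=True),
]

trainer = pl.Trainer(max_epochs=50, callbacks=callbacks)
# trainer.fit(model, datamodule)
```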

In the training stage, deployment artifacts are also stored in Neptune's model registry. The best checkpoint (for retraining), its TorchScript model file (for deployment), and the label dict used to convert labels to the corresponding dog breeds are all saved to the model registry.

In the model registry, deploying a different model is straightforward: just change that model's stage to production. When deployed, the API searches for models in the production stage and uses the first one it finds.
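With Neptune's model registry API, looking up the first production-stage version might look like the following sketch; the model key and project are placeholders, and this is an assumption about how the repo's lookup works rather than its actual code:

```python
import neptune

# Placeholder model key and project path.
model = neptune.init_model(with_key="DOGREC", project="your-workspace/doggo-recog")

# List all registered versions and keep only those in the production stage.
versions = model.fetch_model_versions_table().to_pandas()
production = versions[versions["sys/stage"] == "production"]

first_id = production["sys/id"].iloc[0]  # the version the API would load
```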

(Neptune model registry view)

Usage

Data

The data can be downloaded from the data branch, and its contents should be placed into data/model_data. Then run make preprocess_training_files to preprocess the files.

Also note that all data folders are git-ignored, so you will have to create them yourself. Once the Terraform code is done it will create the folders, but for now they must be made manually.

Set up GCP storage bucket

Create the 4 Google Cloud Storage buckets with the same names defined in your environment variables. Terraform code to deploy the infrastructure will be added soon.

Environment Variable

There is a sample environment variable file where you can add the environment variables; rename it to .env. The Google Cloud service account key should be base64-encoded to a string and placed as an environment variable.
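One way to do the base64 encoding (the key filename is a placeholder):

```python
import base64


def encode_key(path: str) -> str:
    """Read the service-account JSON and return it as a base64 string
    suitable for a .env entry."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("ascii")


def decode_key(encoded: str) -> bytes:
    """Reverse the encoding at runtime to recover the JSON bytes."""
    return base64.b64decode(encoded)


# e.g. print(encode_key("service-account.json")) and paste the output into .env
```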

Deployment

To deploy this Telegram bot, first follow the first few steps of this tutorial until you have generated the Telegram bot token, and put it in your environment variables.

Then set up ngrok as shown here and create a tunnel to port 8000 using the CLI: ngrok http 8000.

Then run make launch_bot and the bot should be up.

Future Improvements

🟢 - Done, 🟡 - In Progress, 🔴 - Yet To Implement

🟡 Retraining Script

🟡 Cloud deployment on GCP Kubecluster with docker images

🔴 Training on Cloud Compute

🔴 Data Collection from users, feedback, and retraining (depending on the data collection policies)

Relevant Reading Materials

Neptune overview:

Pytorch lightning overview:
