The dataset can be downloaded from Kaggle here.
This is the implementation of the capstone project for mlops-zoomcamp from DataTalksClub. The project provides a diabetes prediction service for hospital patients based on their medical information. The aim is to predict whether a patient has diabetes and to take the necessary actions based on the result.
The main focus of the project is to apply MLOps principles such as experiment tracking, training pipelines, and model monitoring to a machine learning project, rather than to achieve state-of-the-art accuracy.
You can see the complete system design below.
- Name - Min Khant Maung Maung
- Email - [email protected]
I have reproduced this process on Ubuntu 22.04 on EC2 instances with Python 3.9.
To run this repository on cloud servers like EC2 instances, you need to do the following steps:
sudo apt update
sudo apt install make git
sudo apt-get install libpq-dev python3-dev
Since this project uses many dockerized services, Docker and Docker Compose need to be installed.
Follow the Install using the apt repository steps from here. You also need to complete Docker's post-installation steps from here.
If your server doesn't have pip, install it with `sudo apt install python3-pip`.
If you get a warning like `WARNING: The scripts pip, pip3, pip3.10 and pip3.11 are installed in '/home/ubuntu/.local/bin' which is not on PATH. Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location`, then you have to:
- Open ~/.bashrc by
nano ~/.bashrc
- Add this line at the bottom of the file.
export PATH="$HOME/.local/bin:$PATH"
- Exit the editor and run the following for the change to take effect.
source ~/.bashrc
You can install pipenv by
pip install -U pip
pip install pipenv --user
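To confirm that pipenv was installed and that `~/.local/bin` is on your PATH, a quick sanity check:

```bash
# Should print the installed pipenv version; if the command is not found,
# re-check the PATH export step above.
pipenv --version
```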
Clone the repository.
git clone https://github.com/Michael95-m/mlops-capstone-project.git
Go to the root directory by `cd mlops-capstone-project`.
This project is developed using Python 3.9. If your server doesn't have Python 3.9, it's better to install it. You can install Python 3.9 by using Anaconda's conda environment.
You can install Anaconda as described here and create a conda environment as described here.
Then create an environment with pipenv (replace the path below with the Python path on your server):
pipenv --python=/path/to/anaconda/environment/python3.9
Then install dependencies by using
pipenv install --dev
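You can then verify that the environment picked up the right interpreter (the exact patch version may differ):

```bash
# Run Python inside the pipenv environment and print its version
pipenv run python --version
```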
Warning: The following instructions need to be run from the root directory of the repo.
This project uses several environment variables such as AWS credentials. You need to export them before running the project; the easiest way is to create a file named .env at the root level of the repository (you can also export them manually; see the sketch after the list below).
[email protected]
EMAIL_PASSWORD=1234567
AWS_ACCESS_KEY_ID=abcdefg
AWS_SECRET_ACCESS_KEY=123456
AWS_DEFAULT_REGION=ap-southeast-1
BUCKET_NAME=mlops-capstone-project-bucket
OBJECT_NAME=diabetes_prediction_dataset.csv
You need to copy the above variables into the .env file and replace the values with your own credentials.
- EMAIL_USERNAME and EMAIL_PASSWORD are used for creating the email block.
- AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY and AWS_DEFAULT_REGION are used for running the training pipeline with an S3 bucket. BUCKET_NAME and OBJECT_NAME are used to identify the bucket and the data placed inside it.
- If you don't specify these, you will get errors when running the training pipeline with the S3 bucket and when creating the monitoring pipeline.
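If you prefer to export the variables directly instead of relying on the .env file, a minimal shell sketch:

```bash
# Export every KEY=value pair defined in .env into the current shell session
set -a
source .env
set +a
```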
Note: The following processes need to be carried out sequentially.
If you want to know more about the training pipeline, take a look at this readme.
- Run `make setup-model-registry`. (Do it only the first time you run the project; you don't need to do it again.) It just creates a folder named mnt inside the home directory to save the metrics and artifacts from the mlflow_server service.
- Start the prefect server in another terminal to run the training pipeline. Start it by `make prefect-server-start`.
- Run the training pipeline, which trains an XGBoost model and registers the best model from the experiment in the MLflow model registry. You can run it by `make run-training-pipeline`.
- To run the training pipeline with the data downloaded from S3 (additionally, it will also upload the training, validation and test data to the S3 bucket), you can use `make run-training-pipeline-s3`.
- You need to create the work pool named train-pool by using `make create-workpool`.
- After that, you can deploy the training pipeline by using `make deploy-training-pipeline`. (This step is only needed once if it succeeds.)
- In order to run the deployed training pipeline, you need to start a worker in a separate terminal by `make start-worker`.
- After starting the worker, you can run the deployed training pipeline by using `make run-deployed-training-pipeline`.
- It has the same behaviour as 3.1. In order to run it, use `make run-deployed-training-pipeline-s3`.
- The trained model will be deployed as an HTTP service by using Flask and gunicorn.
- Note: In order to deploy the model as a service, you need to run the training pipeline at least once so that the production model exists in the MLflow model registry. You also need the mlflow_server service to be up to access this model (it is already up if you followed the process above).
If you want to know more about model serving, take a look at this readme.
- You can start the diabetes-service by running `make start-diabetes-service`.
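Once the service is up, you can send it a test request. The sketch below is hypothetical: the `/predict` endpoint and the feature names (taken from the Kaggle diabetes prediction dataset columns) are assumptions, so check the model serving readme for the exact schema.

```bash
# Hypothetical request against the diabetes-service on its default port (5010).
# The endpoint path and feature names are assumptions based on the Kaggle
# dataset columns; adjust them to match the actual serving code.
curl -X POST http://127.0.0.1:5010/predict \
  -H "Content-Type: application/json" \
  -d '{
        "gender": "Female",
        "age": 54,
        "hypertension": 0,
        "heart_disease": 0,
        "smoking_history": "never",
        "bmi": 27.3,
        "HbA1c_level": 6.2,
        "blood_glucose_level": 140
      }'
```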
For model monitoring, the diabetes-service from the model serving part and the mlflow_server service need to be up.
If you want to know more about model monitoring, take a look at this readme.
- You need to copy the validation data named valid.parquet from the data folder to the data folder inside monitoring. (This has actually already been done for you.)
- Then start the diabetes-service by `make start-diabetes-service`. If this service is already up, you don't need to do this.
- After that, run `make prepare-reference` to prepare the reference data.
- All the other services inside the docker compose file also need to be up; you can start them by using `make start-all-services`.
- You need to create the database named production to save the prediction results for monitoring purposes. This prediction log will become the current data for checking the data drift. You can create the database by using `make create-db`.
- You have to send the simulation data to the monitoring API for the purpose of model monitoring. You can send it by using `make send-data-monitoring-api` in another terminal or inside a screen session.
- While sending the data, you can check the data inside the table named prediction_log inside the database.
- You can check the data drift and target drift once there is a certain amount of data inside the table. You can inspect the data with the adminer tool (see the example query after this list).
- You can log in to adminer by using
  - postgresql for system
  - db for server
  - admin for username
  - example for password
  - production for database
- Then go to the streamlit service and you will see the Streamlit UI. Click the Data Drift button to check the report about data drift.
- Then click the Target Drift button to check the report about target drift.
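For a quick look at the prediction log without the adminer UI, you can query the database directly. A sketch, assuming the compose service name monitoring_db from the ports table below and the credentials listed above:

```bash
# Count the rows collected so far in the prediction log table.
# Service name, user and database come from the compose setup described
# in this readme; adjust them if your compose file differs.
docker compose exec monitoring_db psql -U admin -d production \
  -c "SELECT COUNT(*) FROM prediction_log;"
```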
- We can deploy the monitoring pipeline, which can send an email as an alert if drift is detected on the current data. You can implement it by deploying the workflow in Prefect.
- Create an email block by using `make create-email-block`. Before creating the email block, you need to set the environment variables
  - EMAIL_USERNAME
  - EMAIL_PASSWORD
- The easiest way to set them is using the .env file; you can set the values inside that file. EMAIL_PASSWORD is not your email password; it's called an app password. You can check how to generate it here.
- If you want to check the data drift for yesterday's data, just run `make run-monitoring-pipeline`.
- If you want to check a specific day's data, run `pipenv run python monitoring/send_alerts.py -d <day> -m <month> -y <year>`. Replace <day>, <month> and <year> with the date you want to check. E.g. `pipenv run python monitoring/send_alerts.py -d 9 -m 8 -y 2023` will run the data drift check for 9 August 2023.
- First, you need to deploy the monitoring pipeline with `make deploy-monitoring-pipeline`.
- In order to run the monitoring pipeline, you need to start a worker by `make start-worker`.
- You can run the deployed workflow by `make run-deployed-monitoring-pipeline`, and it will check the data drift for yesterday.
- For a specific day, run `pipenv run prefect deployment run -p day=<day> -p month=<month> -p year=<year> send-alert/deploy_monitor`, replacing the parameters with the date you want; see the example below.
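For example, the same 9 August 2023 check as above, but through the deployed pipeline:

```bash
# Run the deployed monitoring flow for a specific date (9 August 2023)
pipenv run prefect deployment run \
  -p day=9 -p month=8 -p year=2023 \
  send-alert/deploy_monitor
```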
You have now set up the MLOps pipeline. You can stop the services with `make stop-all-services`.
You can easily restart all these services next time by running `make start-all-services`; you don't need to repeat all the steps above.
If you want to know more about the test cases, take a look at this readme.
- You can run the unit tests by `make run-unit-test`.
- You can run the integration test by `make run-integration-test`. In order to run the integration test, you need to bring down all docker services first. You can do this by `make stop-all-services`.
- You can run the quality checks by `make quality-check`.
All these services except prefect can be started by using docker compose with `make start-all-services`.
If you run from server instances like EC2, you need to replace 127.0.0.1 with the IP address of your server.
| Service | Port | Interface | Description |
|---|---|---|---|
| mlflow_server | 5000 | 127.0.0.1 | MLflow experiment tracking and model registry |
| prefect | 4200 | 127.0.0.1 | Prefect workflow orchestration |
| diabetes_service | 5010 | 127.0.0.1 | Diabetes prediction service |
| monitoring_service | 5020 | 127.0.0.1 | Monitoring service (uses the prediction service above and saves the result) |
| monitoring_db | 5432 | 127.0.0.1 | PostgreSQL database |
| monitoring_adminer | 8080 | 127.0.0.1 | Adminer tool (to check inside the database) |
| streamlit_service | 8501 | 127.0.0.1 | Streamlit web service to visualize the data and target drift |
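A quick way to sanity-check that the web UIs came up, using the ports from the table above (replace 127.0.0.1 with your server's IP on EC2):

```bash
# Print the HTTP status code for each UI; 200 means the service is reachable.
# Some services may return a redirect code instead, which is also fine.
curl -s -o /dev/null -w "mlflow: %{http_code}\n"    http://127.0.0.1:5000
curl -s -o /dev/null -w "prefect: %{http_code}\n"   http://127.0.0.1:4200
curl -s -o /dev/null -w "adminer: %{http_code}\n"   http://127.0.0.1:8080
curl -s -o /dev/null -w "streamlit: %{http_code}\n" http://127.0.0.1:8501
```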
For scoring purposes, if you want to find out which steps are implemented throughout this project, check this.