Commit

update readme
Michael95-m committed Aug 21, 2023
1 parent 151ea93 commit 22d03ec
Showing 2 changed files with 154 additions and 40 deletions.
187 changes: 151 additions & 36 deletions README.md
@@ -6,69 +6,173 @@ Dataset can be downloaded from kaggle at [here](https://www.kaggle.com/datasets/

This is the implementation of the capstone project for mlops-zoomcamp from [DataTalksClub](https://datatalks.club/). The project provides a **diabetes prediction service** for hospital patients, based on their medical information. Let's imagine that the aim is to predict whether a patient has diabetes and to take the necessary actions based on the result.

The main focus of the project is to apply the MLops principles like experiment tracking, training pipeline, model monitoring concepts to the machine learing projects rather than getting state-of-the-art accuracy.
The main focus of the project is to apply the MLops principles like experiment tracking, training pipeline, model monitoring concepts to the machine learning projects rather than getting state-of-the-art accuracy.

## Process Diagram

You can see the complete system design below.

![](docs/system_design.png)<br>

## Prerequisites

## Training Pipeline
To run this repository on a cloud server such as an **EC2 instance**, you need to complete the following steps first:

```
sudo apt update
```

### 1. Installation of make and git

```
sudo apt install make git
```

### 2. Installation of necessary libraries for the sqlalchemy library

```
sudo apt-get install libpq-dev python3-dev
```

### 3. Installation of docker and docker compose

Since this project uses many dockerized services, docker and docker compose need to be installed.

Follow the steps under **Install using the apt repository** from [here](https://docs.docker.com/engine/install/ubuntu/). You also need to complete the docker post-installation steps from [here](https://docs.docker.com/engine/install/linux-postinstall/).

### 1. Install the library at the root level.
### 4. Installation of pip

If your server doesn't have **pip**, install it with ```sudo apt install python3-pip```.

### 5. Solving Warning

If you get a warning like ```WARNING: The scripts pip, pip3, pip3.10 and pip3.11 are installed in '/home/ubuntu/.local/bin' which is not on PATH.
Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location```, add that directory to your PATH as follows:

1. Open **~/.bashrc** by

```shell
nano ~/.bashrc
```

2. Add this line at the bottom of the file.
```bash
pip install --user pipenv
pipenv install --python=3.9
export PATH="$HOME/.local/bin:$PATH"
```

3. Exit the editor and run the following command for the change to take effect.
```shell
source ~/.bashrc
```

### 6. Installation of pipenv

You can install pipenv by
```
pip install -U pip
pip install pipenv --user
```

### 7. Cloning the git repository.

Clone the repository.
```shell
git clone https://github.com/Michael95-m/mlops-capstone-project.git
```

Go to the root directory with ```cd mlops-capstone-project```.

### 8. Creation of virtual environment using pipenv

This project is developed using **python3.9**. If your server doesn't have python3.9, it's better to install it. You can install python3.9 by using anaconda's conda environment.

You can install **anaconda** as described [here](https://linuxhint.com/install-anaconda-ubuntu-22-04/) and create a conda environment as described [here](https://stackoverflow.com/questions/63216201/how-to-install-python-with-conda).

Then create an environment with **pipenv** (replace the path with the python path on your server).
```shell
pipenv --python=/path/to/anaconda/environment/python3.9
```

Then install the dependencies with
```shell
pipenv install --dev
```

**Warning:** The following instructions need to be run from the **root directory** of the repo.

### 9. Dealing with environment variables

This project uses several environment variables, such as AWS credentials. You need to export them before running the project; the easiest way is to create a file named **.env** at the root level of the directory.

```
EMAIL_USERNAME=[email protected]
EMAIL_PASSWORD=1234567
AWS_ACCESS_KEY_ID=abcdefg
AWS_SECRET_ACCESS_KEY=123456
AWS_DEFAULT_REGION=ap-southeast-1
BUCKET_NAME=mlops-capstone-project-bucket
OBJECT_NAME=diabetes_prediction_dataset.csv
```

Copy the above variables into the .env file and replace the values with your own credentials.

- **EMAIL_USERNAME** and **EMAIL_PASSWORD** are used for creating the email block.
- **AWS_ACCESS_KEY_ID**, **AWS_SECRET_ACCESS_KEY** and **AWS_DEFAULT_REGION** are used for running the training pipeline with the s3 bucket. **BUCKET_NAME** and **OBJECT_NAME** are used to identify the bucket and the data placed inside it.
- If you don't specify these, you will get errors when running the *training pipeline with s3 bucket* and when *creating monitoring pipeline*.
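
Before starting the pipelines, it can help to confirm that the **.env** file actually defines every variable listed above. The following is a minimal sketch using only the Python standard library; it is not how the project itself loads the file, and the helper names `parse_env_text` and `missing_keys` are hypothetical.

```python
from pathlib import Path

# Variable names taken from the example .env file described in this README.
REQUIRED_KEYS = [
    "EMAIL_USERNAME",
    "EMAIL_PASSWORD",
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_DEFAULT_REGION",
    "BUCKET_NAME",
    "OBJECT_NAME",
]


def parse_env_text(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def missing_keys(values, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty."""
    return [key for key in required if not values.get(key)]


if __name__ == "__main__":
    env_path = Path(".env")  # assumed to sit in the current (root) directory
    text = env_path.read_text() if env_path.exists() else ""
    missing = missing_keys(parse_env_text(text))
    print("Missing variables:", ", ".join(missing) if missing else "none")
```

Running it from the root directory prints the names of any variables that still need a value.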


## Training Pipeline

### 2. Setup model's registry requirements
### 1. Setup model's registry requirements

- Run `make setup-model-registry`. (Do this only the first time; you don't need to repeat it on later runs.)

- It just creates the folder named **mnt** inside the home directory to save the metrics and artifacts from the mlflow_server service.

### 3. Start prefect server
### 2. Start prefect server

- Start the prefect server in **another terminal** to run the training pipeline. Start by `make prefect-server-start`.

### 4. Run the training pipeline
### 3. Run the training pipeline

- Run the training pipeline, which trains an **XGBoost** model and registers the best model of the experiment in the **mlflow model registry**.

- You can run by `make run-training-pipeline`.

### 4.1. Run the training pipeline with the data from s3.(Optional)
### 3.1. Run the training pipeline with the data from s3.(Optional)

- To run the training pipeline with the data downloaded from s3 (additionally, it will also upload the training, validation and test data to the s3 bucket), you can use `make run-training-pipeline-s3`.


### 5. Deploy the training pipeline in prefect
### 4. Deploy the training pipeline in prefect

- You need to create the workpool named **train-pool** by using `make create-workpool`.

- After that, you can deploy the training pipeline by using `make deploy-training-pipeline`.

- (This step is only needed once, provided it succeeds.)

### 6. Run the deployed training pipeline
### 5. Run the deployed training pipeline

- In order to run the deployed training pipeline, you need to start a worker in a **separate terminal** by `make start-worker`.

- After starting the worker, you can run the deployed training pipeline by using `make run-deployed-training-pipeline`.

### 6.1. Run the deployed training pipeline with s3 data
### 5.1. Run the deployed training pipeline with s3 data

- It has the same behaviour as **3.1.**. In order to run, use `make run-deployed-training-pipeline-s3`.

## Model Serving

- The trained model will be deployed as an HTTP service by using *flask* and *gunicorn*.
- **Note:** In order to deploy the model as a service, you need to run the **training pipeline** at least once so that there is a production model in the mlflow model registry. You also need to bring up the **mlflow_server** service to access this model.

### 1. Start the diabetes service

- You can start the diabetes-service by running `make start-diabetes-service`.

@@ -78,40 +182,41 @@ For model monitoring, the **diabetes-service** from **model serving** part and t

### 1. Preparing reference data

- You need to copy validation data named **valid.parquet** from the data folder to the **data** folder inside **monitoring**.
- You need to copy validation data named **valid.parquet** from the data folder to the **data** folder inside **monitoring**. Actually I have already copied for you.

- Then start diabetes-service by `make start-diabetes-service`. If this service is already up, you don't need to do this.

- Then start diabetes-service by `make start-diabetes-service`.
- After that, run `make prepare-reference` to prepare reference data.

### 2. Start all other services inside docker-compose.yml

- All the other services in the docker compose file need to be up; you can start them with `make start-all-services`.

### 3. Create the database named **production** in the postgresql service

- You need to create the database named **production** to save the prediction results for monitoring purposes.

- This prediction log will become the **current data** for checking the data drift. You can create it by using `make create-db`.
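
To make the idea of comparing **reference data** with **current data** concrete, here is a small sketch of one common drift score, the population stability index (PSI), for a single numeric column. This is an illustrative stand-in rather than the method the monitoring service uses, which this README does not spell out; the function name, the bin count and the usual rule of thumb (values above roughly 0.2 suggest noticeable drift) are assumptions.

```python
import math


def population_stability_index(reference, current, bins=10):
    """PSI between two numeric samples, using equal-width bins over the reference range."""
    low, high = min(reference), max(reference)
    width = (high - low) / bins or 1.0  # fall back to 1.0 if all reference values are equal

    def proportions(sample):
        counts = [0] * bins
        for value in sample:
            index = int((value - low) / width)
            counts[min(max(index, 0), bins - 1)] += 1  # clamp out-of-range values
        # A small floor avoids log(0) for empty bins.
        return [max(count / len(sample), 1e-6) for count in counts]

    ref_p, cur_p = proportions(reference), proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))


if __name__ == "__main__":
    reference = list(range(100))              # stand-in for a reference column
    shifted = [v + 50 for v in reference]     # stand-in for drifted current data
    print(population_stability_index(reference, reference))
    print(population_stability_index(reference, shifted))
```

Identical samples give a score of zero, and the score grows as the current distribution moves away from the reference.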

### 4. Send the simulation data to the monitoring api

- You have to send the **simulation data** to the monitoring api for the purpose of model monitoring. You can send it by using `make send-data-to-monitoring-api` in another terminal.
- You have to send the **simulation data** to the monitoring api for the purpose of model monitoring. You can send it by using `send-data-monitoring-api` in **another terminal** or inside a **screen** session.

- While sending the data, you can check the data inside the table named **prediction_log** inside the database.

### 5. Check the data drift and target drift

- You can check the data drift and target drift once there is a certain amount of data inside the table. You can check the data inside with the [**adminer** tool](http://localhost:8080).

- You can log in to adminer by using
- **postgresql** for system
- **db** for server
- **admin** for username
- **example** for password
- **production** for database
<br><br>

- Then go to the [streamlit service](http://localhost:8501) and you will see the streamlit UI. Then click the **Data Drift** button to check the report about **data drift**.

- Then click the **Target Drift** button to check the report about **target drift**.

@@ -134,18 +239,28 @@ For model monitoring, the **diabetes-service** from **model serving** part and t

- If you want to check for specific day's data, run `pipenv run python monitoring/send_alerts.py -d <day> -m <month> -y <year>`. You can replace <day>, <month> and <year> as the date you want to check.

E.g., the command `pipenv run python monitoring/send_alerts.py -d 9 -m 8 -y 2023` runs the data drift check for **9 August 2023**.
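
The `-d`, `-m` and `-y` flags can be read as a small date-selection step. The sketch below shows one way such flags could be turned into a date with `argparse`; it is not the code of `monitoring/send_alerts.py`. The function name, the long option names and the fallback to yesterday for any flag that is left out are assumptions made for illustration.

```python
import argparse
from datetime import date, timedelta


def resolve_check_date(argv=None, today=None):
    """Return the date to check: the -d/-m/-y values, or yesterday by default."""
    parser = argparse.ArgumentParser(description="Pick the day to check for drift")
    parser.add_argument("-d", "--day", type=int)
    parser.add_argument("-m", "--month", type=int)
    parser.add_argument("-y", "--year", type=int)
    args = parser.parse_args(argv)

    today = today or date.today()
    yesterday = today - timedelta(days=1)
    # Any part that is not supplied falls back to yesterday's value (an assumption).
    return date(
        args.year or yesterday.year,
        args.month or yesterday.month,
        args.day or yesterday.day,
    )


# Same arguments as the example command above.
print(resolve_check_date(["-d", "9", "-m", "8", "-y", "2023"]))  # prints 2023-08-09
```

Passing the argument list explicitly keeps the function easy to test; a real script would usually let `argparse` read the command line instead.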

### 6.3. Running the deployed monitoring pipeline

- First, you need to deploy monitoring pipeline with `make deploy-monitoring-pipeline`.

- In order to run the monitoring pipeline, you need to start a worker by `make start-worker`.

- You can run the deployed workflow by `make run-deployed-monitoring-pipeline`. And it will check the data drift for yesterday.

- For specific day, run `pipenv run prefect deployment run -p day=<day> -p month=<month> -p year=<year> send-alert/deploy_monitor`. You can replace it as you like.

## Stopping and restarting the services

### 1. Stop all the running services

You have now set up the whole mlops pipeline. You can stop the services by running ```make stop-all-services```.

### 2. Restart all the services

You can easily restart all these services next time by running ```make start-all-services```. You don't need to repeat all the steps above.

## Testing

### 1. Check the unit test.
@@ -154,7 +269,7 @@ Eg. This command, `pipenv run python monitoring/send_alerts.py -d 9 -m 8 -y 2023

### 2. Check the integration test.

- You can run the integration test by `make run-integration-test`.
- In order to run the integration test, you need to bring down all docker services first. You can do this by `make stop-all-services`.

### 3. Check the quality of the code by linting tools.
@@ -163,7 +278,7 @@ Eg. This command, `pipenv run python monitoring/send_alerts.py -d 9 -m 8 -y 2023

## Services

All these services except prefect can be started with **docker compose** by running `make start-all-services`.

| Service | Port | Interface | Description |
| --- | --- | --- | --- |
@@ -173,4 +288,4 @@ All these service except prefect can be started by using **docker compose** by `
| monitoring_service | 5020 | 127.0.0.1 | Monitoring Service (use the prediction service above and save the result) |
| monitoring_db | 5432 | 127.0.0.1 | Postgresql Database |
| monitoring_adminer | 8080 | 127.0.0.1 | Adminer Tools (to check inside database) |
| streamlit_service | 8501 | 127.0.0.1 | Streamlit web service to visualize the data and target drift |
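
After starting the services, a quick way to see which of them are reachable is to try a TCP connection to each port. The sketch below covers only the rows visible in this part of the table and uses the standard `socket` module; the helper name `is_port_open` is hypothetical, and an open port only shows that something is listening, not that the service is healthy.

```python
import socket

# Ports copied from the rows visible in the table above.
SERVICES = {
    "monitoring_service": 5020,
    "monitoring_db": 5432,
    "monitoring_adminer": 8080,
    "streamlit_service": 8501,
}


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    for name, port in SERVICES.items():
        state = "up" if is_port_open("127.0.0.1", port) else "down"
        print(f"{name:20s} {port:5d} {state}")
```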
7 changes: 3 additions & 4 deletions integration_test/run.sh
@@ -2,7 +2,7 @@
if [[ -z "${GITHUB_ACTIONS}" ]]; then
diabetes_service=diabetes_service
else
diabetes_service=diabetes_test_service
fi

echo "Mlflow server and prediction service starting"
@@ -17,11 +17,10 @@ ERROR_CODE=$?

if [ ${ERROR_CODE} != 0 ]; then
docker compose logs ${diabetes_service}
docker compose down ${diabetes_service}
docker compose down ${mlflow_server} ${diabetes_service}
exit ${ERROR_CODE}
fi

docker compose down ${diabetes_service}
docker compose down ${mlflow_server} ${diabetes_service}

echo "Yayy!! Integration test passed"
