feature: new MLFlow module
RoberVega committed Nov 3, 2023
1 parent a74e420 commit 351efd2
Showing 30 changed files with 11,325 additions and 0 deletions.
69 changes: 69 additions & 0 deletions Module05- MLFlow/MLFlow.md
@@ -0,0 +1,69 @@
#### This module is based on the DataTalksClub MLOps Course, which is open source.

# 5. Experiment tracking and model management


* [Slides](https://drive.google.com/file/d/1YtkAtOQS3wvY7yts_nosVlXrLQBq5q37/view?usp=sharing)


## 5.1 Experiment tracking intro

<a href="https://www.youtube.com/watch?v=MiA7LQin9c8&list=PL3MmuxUbc_hIUISrluw_A7wDSmfOhErJK">
<img src="images/thumbnail-2-01.jpg">
</a>


## 5.2 Getting started with MLflow

<a href="https://www.youtube.com/watch?v=cESCQE9J3ZE&list=PL3MmuxUbc_hIUISrluw_A7wDSmfOhErJK">
<img src="images/thumbnail-2-02.jpg">
</a>

Note: in the videos, Cristian uses Jupyter in VS Code and runs everything locally.

But if you set up a VM in the previous module, you can keep using it
and access the usual Jupyter from your browser. There's no significant
difference between using Jupyter with or without VS Code.


## 5.3 Experiment tracking with MLflow

<a href="https://www.youtube.com/watch?v=iaJz-T7VWec&list=PL3MmuxUbc_hIUISrluw_A7wDSmfOhErJK">
<img src="images/thumbnail-2-03.jpg">
</a>



## 5.4 Model management

<a href="https://www.youtube.com/watch?v=OVUPIX88q88&list=PL3MmuxUbc_hIUISrluw_A7wDSmfOhErJK">
<img src="images/thumbnail-2-04.jpg">
</a>



## 5.5 Model registry

<a href="https://www.youtube.com/watch?v=TKHU7HAvGH8&list=PL3MmuxUbc_hIUISrluw_A7wDSmfOhErJK">
<img src="images/thumbnail-2-05.jpg">
</a>


## 5.6 MLflow in practice

<a href="https://www.youtube.com/watch?v=1ykg4YmbFVA&list=PL3MmuxUbc_hIUISrluw_A7wDSmfOhErJK">
<img src="images/thumbnail-2-06.jpg">
</a>


## 5.7 MLflow: benefits, limitations and alternatives

<a href="https://www.youtube.com/watch?v=Lugy1JPsBRY&list=PL3MmuxUbc_hIUISrluw_A7wDSmfOhErJK">
<img src="images/thumbnail-2-07.jpg">
</a>


## 5.8 Assignment / Homework

More information in the [`assignment`](assignment) folder.

153 changes: 153 additions & 0 deletions Module05- MLFlow/assignment/assignment.md
@@ -0,0 +1,153 @@
## Homework

The goal of this homework is to get familiar with tools like MLflow for experiment tracking and
model management.


## Q1. Install the package

To get started with MLflow you'll need to install the appropriate Python package.

For this, we recommend creating a separate Python environment, for example a [conda environment](https://docs.conda.io/projects/conda/en/latest/user-guide/getting-started.html#managing-envs),
and then installing the package there with `pip` or `conda`.
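
For example, a minimal setup might look like this (the environment name `exp-tracking-env` is just an illustration):

```
conda create -n exp-tracking-env python=3.9
conda activate exp-tracking-env
pip install mlflow
```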

Once you've installed the package, run `mlflow --version` and check the output.

What's the version that you have?


## Q2. Download and preprocess the data

We'll use the Green Taxi Trip Records dataset to predict the tip amount for each trip.

Download the data for January, February and March 2022 in parquet format from [here](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page).

Use the script `preprocess_data.py` located in this `assignment` folder to preprocess the data.

The script will:

* load the data from the folder `<TAXI_DATA_FOLDER>` (the folder where you have downloaded the data),
* fit a `DictVectorizer` on the training set (January 2022 data),
* save the preprocessed datasets and the `DictVectorizer` to disk.

Your task is to download the datasets and then execute this command:

```
python preprocess_data.py --raw_data_path <TAXI_DATA_FOLDER> --dest_path ./output
```

Tip: go to the `Module05- MLFlow/assignment/` folder before executing the command, and change `<TAXI_DATA_FOLDER>` to the location where you saved the data.
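
Once the script has run, one quick way to check the answer is to inspect the output folder (assuming you used `./output` as the destination, as in the command above):

```
ls -lh ./output/dv.pkl
```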

What's the size of the saved `DictVectorizer` file?

* 54 kB
* 154 kB
* 54 MB
* 154 MB


## Q3. Train a model with autolog

We will train a `RandomForestRegressor` (from Scikit-Learn) on the taxi dataset.

We have prepared the training script `train.py` for this exercise; it can also be found in the `assignment` folder.

The script will:

* load the datasets produced by the previous step,
* train the model on the training set,
* calculate the RMSE score on the validation set.

Your task is to modify the script to enable **autologging** with MLflow, execute the script and then launch the MLflow UI to check that the experiment run was properly tracked.

Tip 1: don't forget to wrap the training code with a `with mlflow.start_run():` statement as we showed in the videos.

Tip 2: don't modify the hyperparameters of the model to make sure that the training will finish quickly.
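
As a rough, self-contained sketch of the autologging pattern (toy data; not the homework script itself):

```
import mlflow
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

mlflow.sklearn.autolog()  # records hyperparameters (incl. max_depth), metrics and the model

X, y = make_regression(n_samples=100, n_features=5, random_state=0)

with mlflow.start_run():
    rf = RandomForestRegressor(random_state=0)
    rf.fit(X, y)
```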

What is the value of the `max_depth` parameter?

* 4
* 6
* 8
* 10


## Launch the tracking server locally for MLflow

Now we want to manage the entire lifecycle of our ML model. In this step, you'll need to launch a tracking server, which also gives us access to the model registry.

In the case of MLflow, you need to:

* launch the tracking server on your local machine,
* select a SQLite db for the backend store and a folder called `artifacts` for the artifacts store (an example command is shown below).
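
For example, such a server could be launched like this (the database filename and artifacts folder mirror the scripts in this module):

```
mlflow server --backend-store-uri sqlite:///mlflow.db --default-artifact-root ./artifacts
```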

You should keep the tracking server running to work on the next three exercises that use the server.


## Q4. Tune model hyperparameters

Now let's try to reduce the validation error by tuning the hyperparameters of the `RandomForestRegressor` using `optuna`.
We have prepared the script `hpo.py` for this exercise.

Your task is to modify the script `hpo.py` so that the validation RMSE is logged to the tracking server for each run of the hyperparameter optimization (you will need to add a few lines of code to the `objective` function), and then run the script without passing any parameters.

After that, open the MLflow UI and explore the runs from the experiment called `random-forest-hyperopt` to answer the question below.

Note: Don't use autologging for this exercise.

The idea is to log just the information you need to answer the question below (a minimal sketch follows the list), including:

* the list of hyperparameters that are passed to the `objective` function during the optimization,
* the RMSE obtained on the validation set (February 2022 data).
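
A minimal, runnable sketch of that logging pattern (the parameter and RMSE values below are dummy placeholders, not answers; see the full `hpo.py` further down in this commit for the real version):

```
import mlflow

mlflow.set_tracking_uri("sqlite:///mlflow.db")
mlflow.set_experiment("random-forest-hyperopt")

# Inside objective(trial), each trial would get its own run:
with mlflow.start_run():
    mlflow.log_params({"max_depth": 5, "n_estimators": 20})  # dummy trial params
    mlflow.log_metric("rmse", 9.99)                          # dummy validation RMSE
```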

What's the best validation RMSE that you got?

* 1.85
* 2.15
* 2.45
* 2.85


## Q5. Promote the best model to the model registry

The results from the hyperparameter optimization are quite good. So, we can assume that we are ready to test some of these models in production.
In this exercise, you'll promote the best model to the model registry. We have prepared a script called `register_model.py`, which will check the results from the previous step and select the top 5 runs.
After that, it will calculate the RMSE of those models on the test set (March 2022 data) and save the results to a new experiment called `random-forest-best-models`.

Your task is to update the script `register_model.py` so that it selects the model with the lowest RMSE on the test set and registers it to the model registry.

Tips for MLflow (see the sketch after this list):

* you can use the method `search_runs` from the `MlflowClient` to get the model with the lowest RMSE,
* to register the model you can use the method `mlflow.register_model` and you will need to pass the right `model_uri` in the form of a string that looks like this: `"runs:/<RUN_ID>/model"`, and the name of the model (make sure to choose a good one!).
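
Putting both tips together, a sketch might look like this (the metric key `test_rmse` and the model name `nyc-taxi-regressor` are assumptions for illustration):

```
import mlflow
from mlflow.entities import ViewType
from mlflow.tracking import MlflowClient

TRACKING_URI = "sqlite:///mlflow.db"
mlflow.set_tracking_uri(TRACKING_URI)
client = MlflowClient(tracking_uri=TRACKING_URI)

# Select the run with the lowest test RMSE from the best-models experiment
experiment = client.get_experiment_by_name("random-forest-best-models")
best_run = client.search_runs(
    experiment_ids=[experiment.experiment_id],
    run_view_type=ViewType.ACTIVE_ONLY,
    max_results=1,
    order_by=["metrics.test_rmse ASC"],
)[0]

# Register the corresponding model
mlflow.register_model(
    model_uri=f"runs:/{best_run.info.run_id}/model",
    name="nyc-taxi-regressor",  # example name -- pick your own
)
```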

What is the test RMSE of the best model?

* 1.885
* 2.185
* 2.555
* 2.955


## Q6. Model metadata

Now explore your best model in the model registry using the MLflow UI. What information does the model registry contain about each model?

* Version number
* Source experiment
* Model signature
* All the above answers are correct


## Submit the results

* Submit your results here: coming soon
* You can submit your solution multiple times; in that case, only the last submission will be used
* If your answer doesn't match the options exactly, select the closest one


## Deadline

The deadline for submitting is 1 June 2023 (Thursday), 23:00 CEST (Berlin time).

After that, the form will be closed.
73 changes: 73 additions & 0 deletions Module05- MLFlow/assignment/hpo.py
@@ -0,0 +1,73 @@
import os
import pickle
import click
import mlflow
import optuna

from optuna.samplers import TPESampler
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

mlflow.set_tracking_uri("sqlite:///mlflow.db")
mlflow.set_experiment("random-forest-hyperopt")


def load_pickle(filename):
    with open(filename, "rb") as f_in:
        return pickle.load(f_in)


@click.command()
@click.option(
    "--data_path",
    default="../output",
    help="Location where the processed NYC taxi trip data was saved"
)
@click.option(
    "--num_trials",
    default=10,
    help="The number of parameter evaluations for the optimizer to explore"
)
def run_optimization(data_path: str, num_trials: int):

    X_train, y_train = load_pickle(os.path.join(data_path, "train.pkl"))
    X_val, y_val = load_pickle(os.path.join(data_path, "val.pkl"))

    print("dataframes read.")

    def objective(trial):
        # Search space for the random forest hyperparameters
        params = {
            'n_estimators': trial.suggest_int('n_estimators', 10, 50, step=1),
            'max_depth': trial.suggest_int('max_depth', 1, 20, step=1),
            'min_samples_split': trial.suggest_int('min_samples_split', 2, 10, step=1),
            'min_samples_leaf': trial.suggest_int('min_samples_leaf', 1, 4, step=1),
            'random_state': 42,
            'n_jobs': -1
        }
        with mlflow.start_run():
            print("started mlflow run.")
            # Log the tag and parameters manually (the assignment asks not to use autologging here)
            mlflow.set_tag("model", "randomforest")
            mlflow.log_params(params)

            # Instantiate, train and evaluate the model
            rf = RandomForestRegressor(**params)
            rf.fit(X_train, y_train)
            y_pred = rf.predict(X_val)
            rmse = mean_squared_error(y_val, y_pred, squared=False)

            mlflow.log_metric("rmse", rmse)
            print(f"Logged rmse of {rmse}")

        return rmse

    sampler = TPESampler(seed=42)
    study = optuna.create_study(direction="minimize", sampler=sampler)
    study.optimize(objective, n_trials=num_trials)


if __name__ == '__main__':
    run_optimization()
84 changes: 84 additions & 0 deletions Module05- MLFlow/assignment/preprocess_data.py
@@ -0,0 +1,84 @@
import os
import pickle
import click
import pandas as pd

from sklearn.feature_extraction import DictVectorizer


def dump_pickle(obj, filename: str):
    with open(filename, "wb") as f_out:
        return pickle.dump(obj, f_out)


def read_dataframe(filename: str):
    df = pd.read_parquet(filename)

    # Compute trip duration in minutes and keep trips between 1 and 60 minutes
    df['duration'] = df['lpep_dropoff_datetime'] - df['lpep_pickup_datetime']
    df.duration = df.duration.apply(lambda td: td.total_seconds() / 60)
    df = df[(df.duration >= 1) & (df.duration <= 60)]

    categorical = ['PULocationID', 'DOLocationID']
    df[categorical] = df[categorical].astype(str)

    return df


def preprocess(df: pd.DataFrame, dv: DictVectorizer, fit_dv: bool = False):
    # Combine pickup and dropoff location IDs into a single categorical feature
    df['PU_DO'] = df['PULocationID'] + '_' + df['DOLocationID']
    categorical = ['PU_DO']
    numerical = ['trip_distance']
    dicts = df[categorical + numerical].to_dict(orient='records')
    if fit_dv:
        X = dv.fit_transform(dicts)
    else:
        X = dv.transform(dicts)
    return X, dv


@click.command()
@click.option(
    "--raw_data_path",
    help="Location where the raw NYC taxi trip data was saved"
)
@click.option(
    "--dest_path",
    help="Location where the resulting files will be saved"
)
def run_data_prep(raw_data_path: str, dest_path: str, dataset: str = "green"):
    # Load parquet files
    df_train = read_dataframe(
        os.path.join(raw_data_path, f"{dataset}_tripdata_2022-01.parquet")
    )
    df_val = read_dataframe(
        os.path.join(raw_data_path, f"{dataset}_tripdata_2022-02.parquet")
    )
    df_test = read_dataframe(
        os.path.join(raw_data_path, f"{dataset}_tripdata_2022-03.parquet")
    )

    # Extract the target
    target = 'tip_amount'
    y_train = df_train[target].values
    y_val = df_val[target].values
    y_test = df_test[target].values

    # Fit the DictVectorizer and preprocess data
    dv = DictVectorizer()
    X_train, dv = preprocess(df_train, dv, fit_dv=True)
    X_val, _ = preprocess(df_val, dv, fit_dv=False)
    X_test, _ = preprocess(df_test, dv, fit_dv=False)

    # Create dest_path folder unless it already exists
    os.makedirs(dest_path, exist_ok=True)

    # Save DictVectorizer and datasets
    dump_pickle(dv, os.path.join(dest_path, "dv.pkl"))
    dump_pickle((X_train, y_train), os.path.join(dest_path, "train.pkl"))
    dump_pickle((X_val, y_val), os.path.join(dest_path, "val.pkl"))
    dump_pickle((X_test, y_test), os.path.join(dest_path, "test.pkl"))


if __name__ == '__main__':  # pragma: no cover
    run_data_prep()

