Ivanrs297/awesome-mlops-end-to-end


🚲 Mi Bici Trip Duration Prediction Project

This project focuses on predicting trip durations for the "Mi Bici" public bike-sharing system in Guadalajara, Mexico. We leverage machine learning models to analyze ride patterns and provide real-time trip duration predictions.

🔗 Data Source: Mi Bici Open Data
📅 Training Data: 2024 trip records
📅 Testing Data: 2025 trip records


📌 Project Overview

This project integrates:

  • MLflow for experiment tracking & model versioning.
  • FastAPI for real-time trip duration predictions.
  • Automated retraining when model performance degrades.

📂 Project Structure

.
├───api
│   ├───logs                  # Stores logs from API requests
│   ├───saved_inputs          # Stores API input data for retraining
│   │
│   │   app.py                # FastAPI application
│   │   README.md             # API documentation
│   │   ss_api_test.JPG       # Screenshot of Postman test
│
├───configs
│   │   model1.yaml           # Configuration file for model settings
│
├───data
│   ├───external              # External data sources (if applicable)
│   ├───interim               # Intermediate processed data
│   ├───processed
│   │   ├───2024              # Processed dataset for training
│   │   │   combined_2024.csv  # Combined processed data
│   │   │   test.csv          # Test dataset
│   │   │   train.csv         # Training dataset
│   │
│   └───raw
│       ├───2024              # Raw dataset from Mi Bici 2024
│       │   datos_abiertos_2024_01.csv
│       │   datos_abiertos_2024_02.csv
│       │   datos_abiertos_2024_03.csv
│       │   ...
│       │   datos_abiertos_2024_12.csv
│       │
│       │   nomenclatura_2024_12.csv  # Data dictionary
│
├───docs                      # Project documentation
│
├───MLFlow
│   ├───mlartifacts           # Stores MLflow artifacts
│   ├───mlruns                # MLflow experiment tracking
│   │
│   │   infer.py              # Model inference script
│   │   README.md             # MLflow documentation
│   │   requirements.txt      # Dependencies for MLflow
│   │   retrain_model.py      # Automated retraining script
│   │   ss_mlfow_dashboard.JPG # Screenshot of MLflow dashboard
│   │   train.py              # Model training script
│
├───models
│   │   lr_mae-286.9239_2025-01-29.pkl  # Latest linear regression model
│   │   random_forest_mae-273.4929_2025-01-29.pkl  # Latest random forest model
│
├───notebooks
│   │   1.IRA_data_preprocessing.ipynb   # Data cleaning notebook
│   │   2.IRA_data_vizualization.ipynb   # Exploratory Data Analysis (EDA)
│   │   3.IRA_modeling.ipynb             # Model training and evaluation
│   │   4.IRA_retraining.ipynb           # Model retraining analysis
│
├───references                 # References and additional documentation
│
├───reports
│   ├───figures
│   │   umap_2d_K3.png        # Visualization of UMAP embeddings
│
└───src
    ├───data
    │   │   build_features.py # Feature engineering
    │   │   cleaning.py       # Data cleaning functions
    │   │   ingestion.py      # Data loading
    │   │   labeling.py       # Labeling for supervised learning
    │   │   splitting.py      # Train-test split
    │   │   validation.py     # Data validation
    │
    ├───models
    │   ├───model1
    │   │   │   dataloader.py  # Loads data for training
    │   │   │   hyperparameters_tuning.py # Hyperparameter tuning
    │   │   │   model.py       # Model definition
    │   │   │   predict.py     # Inference script
    │   │   │   preprocessing.py # Preprocessing functions
    │   │   │   train.py       # Training script
    │
    ├───visualization
    │   │   evaluation.py     # Model evaluation scripts
    │   │   exploration.py    # Data exploration and visualization
    │
    │   __init__.py

.env  
.gitignore  
LICENSE  
Makefile  
README.md  
requirements.txt  

1️⃣ Model Training & Logging (MLflow)

  • Uses Linear Regression to predict trip durations.
  • Trained on 2024 Mi Bici data.
  • Logs:
    • Hyperparameters
    • Performance metrics (MAE, RMSE)
    • Model artifacts & schema in MLflow.
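
The two metrics listed above are easy to compute by hand, which helps when sanity-checking the values that show up in MLflow. The following is a minimal sketch in plain Python, assuming durations in seconds; the helper names `mae` and `rmse` are illustrative and not taken from the repository, which may well use scikit-learn's metric functions instead.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error: the average of |actual - predicted|."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error: penalizes large misses more than MAE does."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_sq = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mean_sq)


# Made-up trip durations in seconds, for illustration only.
actual = [600.0, 900.0, 300.0]
predicted = [660.0, 840.0, 420.0]

print(mae(actual, predicted))   # 80.0
print(rmse(actual, predicted))  # square root of 7200, roughly 84.85
```

Because RMSE squares each error, a single badly mispredicted long trip moves it much more than it moves MAE, which is one reason to track both.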

2️⃣ Real-time Predictions (FastAPI)

  • Deploys the trained model as a REST API.
  • Logs every request & prediction for monitoring.
  • Saves input data (saved_inputs/prediction_inputs.csv) for future retraining.
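
Saving inputs for retraining can be as simple as appending one CSV row per request. The sketch below uses only the standard library and assumes the six request fields shown in the prediction examples of this README; the function name `save_prediction_input` and the column order are hypothetical, and the real `app.py` may store additional columns.

```python
import csv
import tempfile
from pathlib import Path

# Assumed column set, mirroring the JSON body used in the prediction examples.
FIELDS = [
    "Year_of_Birth",
    "Gender",
    "Origin_Id",
    "Destination_Id",
    "Start_Hour",
    "Start_DayOfWeek",
]


def save_prediction_input(payload, csv_path):
    """Append one request payload to the CSV, writing the header on first use."""
    path = Path(csv_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    write_header = not path.exists() or path.stat().st_size == 0
    # A missing field raises KeyError, so malformed rows never reach the file.
    row = {name: payload[name] for name in FIELDS}
    with path.open("a", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS)
        if write_header:
            writer.writeheader()
        writer.writerow(row)


# Demo in a temporary directory so nothing is left behind on disk.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "saved_inputs" / "prediction_inputs.csv"
    sample = {"Year_of_Birth": 1995, "Gender": 1, "Origin_Id": 10,
              "Destination_Id": 50, "Start_Hour": 14, "Start_DayOfWeek": 2}
    save_prediction_input(sample, target)
    save_prediction_input(sample, target)
    print(target.read_text(encoding="utf-8"))
```

In a FastAPI handler, a helper like this would be called right after the model returns its prediction, so that a failed write does not block the response path unless you want it to.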

3️⃣ Automated Model Retraining

  • Monitors model performance on 2025 Mi Bici test data.
  • Retrains the model if MAE increases beyond a threshold.
  • Uses both stored API inputs & new Mi Bici data for retraining.
  • Registers new models in MLflow if performance improves.
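
The retraining rules above boil down to two comparisons. Here is a minimal sketch of that decision logic, assuming a relative tolerance of 10% over the baseline MAE; the tolerance value and the function names `should_retrain` and `should_register` are assumptions for illustration, not the settings used in `retrain_model.py`. The baseline of 286.9239 is the MAE embedded in the linear regression model filename listed in the project structure.

```python
def should_retrain(baseline_mae, current_mae, tolerance=0.10):
    """Return True when the current MAE exceeds the baseline by more than `tolerance`.

    `tolerance` is relative: 0.10 means "retrain once MAE is over 10% worse".
    """
    if baseline_mae <= 0:
        raise ValueError("baseline_mae must be positive")
    return current_mae > baseline_mae * (1.0 + tolerance)


def should_register(candidate_mae, current_mae):
    """Register the retrained model only if it beats the model in service."""
    return candidate_mae < current_mae


baseline = 286.9239  # MAE from the saved linear regression model's filename

print(should_retrain(baseline, 300.0))  # False: within the 10% tolerance
print(should_retrain(baseline, 320.0))  # True: more than 10% worse
print(should_register(280.0, 320.0))    # True: the candidate improves on 320.0
```

A relative threshold keeps the rule meaningful if the target's scale changes; an absolute threshold in seconds would work equally well if that is easier to reason about.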

🛠 Technologies Used

  • Python (ML model + API)
  • scikit-learn (Linear Regression + preprocessing)
  • FastAPI (Real-time model inference)
  • MLflow (Model tracking & versioning)
  • pandas & NumPy (Data processing)
  • joblib (Model serialization)
  • Logging & CSV storage (For monitoring & retraining)

🚀 How It Works

1️⃣ Train the Model (train.py):

  • Loads Mi Bici 2024 dataset → Preprocesses → Trains model → Logs in MLflow.

2️⃣ Deploy API (api.py):

  • Loads MLflow model → Serves predictions via FastAPI.

3️⃣ Log Predictions for Retraining:

  • API logs input requests → Saves them for future training.

4️⃣ Monitor & Retrain (retrain_model.py):

  • Uses Mi Bici 2025 test data to check performance.
  • If MAE degrades, merges new inputs + original data.
  • Retrains & registers new model in MLflow.

📈 Data Overview

Mi Bici Dataset (2024-2025)

| Column Name | Description |
|---|---|
| Trip_Id | Unique trip identifier |
| User_Id | Unique user identifier |
| Gender | Gender of the user |
| Year_of_Birth | Year of birth of the user |
| Trip_Start | Start timestamp of the trip |
| Trip_End | End timestamp of the trip |
| Origin_Id | Origin bike station ID |
| Destination_Id | Destination bike station ID |
| Trip_Duration | Total trip duration in seconds |
| Start_Hour | Hour of day when the trip started |
| Start_DayOfWeek | Day of the week (Monday=0, Sunday=6) |
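
The last three columns can be derived from the two timestamps. The sketch below shows one way to do that with the standard library; the timestamp format `%Y-%m-%d %H:%M:%S` and the function name `derive_trip_features` are assumptions, so adjust the format string to match the raw Mi Bici files.

```python
from datetime import datetime


def derive_trip_features(trip_start, trip_end, fmt="%Y-%m-%d %H:%M:%S"):
    """Compute Trip_Duration, Start_Hour and Start_DayOfWeek from two timestamps."""
    start = datetime.strptime(trip_start, fmt)
    end = datetime.strptime(trip_end, fmt)
    if end < start:
        raise ValueError("Trip_End is earlier than Trip_Start")
    return {
        "Trip_Duration": int((end - start).total_seconds()),  # seconds
        "Start_Hour": start.hour,                              # 0-23
        "Start_DayOfWeek": start.weekday(),                    # Monday=0, Sunday=6
    }


# 3 January 2024 was a Wednesday, so Start_DayOfWeek is 2.
print(derive_trip_features("2024-01-03 14:05:00", "2024-01-03 14:17:30"))
# {'Trip_Duration': 750, 'Start_Hour': 14, 'Start_DayOfWeek': 2}
```

`datetime.weekday()` already uses the Monday=0 convention from the table, so no remapping is needed.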

📌 Running the Project

1️⃣ Install Dependencies

pip install -r requirements.txt

2️⃣ Train & Log the Model

python scripts/train.py

This will train the model using Mi Bici 2024 data and log it in MLflow.

3️⃣ Run the FastAPI Server

uvicorn scripts.api:app --host 0.0.0.0 --port 8000 --reload

The API will be available at http://127.0.0.1:8000.

4️⃣ Make a Prediction

Using PowerShell

$headers = @{
    "Content-Type" = "application/json"
}

$body = @{
    "Year_of_Birth" = 1995
    "Gender" = 1
    "Origin_Id" = 10
    "Destination_Id" = 50
    "Start_Hour" = 14
    "Start_DayOfWeek" = 2
} | ConvertTo-Json -Depth 10

$response = Invoke-RestMethod -Uri "http://127.0.0.1:8000/predict" -Method Post -Headers $headers -Body $body
Write-Output $response

Using Curl

curl -X POST "http://127.0.0.1:8000/predict" \
     -H "Content-Type: application/json" \
     -d '{"Year_of_Birth": 1995, "Gender": 1, "Origin_Id": 10, "Destination_Id": 50, "Start_Hour": 14, "Start_DayOfWeek": 2}'
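
Using Python

For callers that prefer Python, the same request can be sent with the standard library alone. This is a minimal sketch mirroring the curl call above; the helper names `build_predict_request` and `predict` are illustrative, and the shape of the JSON response is not documented here, so `predict` simply returns whatever the server sends back.

```python
import json
import urllib.request

API_URL = "http://127.0.0.1:8000/predict"


def build_predict_request(payload, url=API_URL):
    """Build a POST request carrying the payload as a JSON body."""
    data = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(payload, url=API_URL, timeout=10):
    """Send the request and return the parsed JSON response (server must be running)."""
    request = build_predict_request(payload, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


payload = {
    "Year_of_Birth": 1995,
    "Gender": 1,
    "Origin_Id": 10,
    "Destination_Id": 50,
    "Start_Hour": 14,
    "Start_DayOfWeek": 2,
}

# Building the request does not touch the network, so this is safe to run offline.
request = build_predict_request(payload)
print(request.get_method(), request.full_url)
```

With the FastAPI server running, `print(predict(payload))` sends the request and prints the server's reply.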

5️⃣ Monitor MLflow

mlflow ui

Check experiment logs & model versions at http://127.0.0.1:5000.

MLflow Dashboard

6️⃣ Schedule Automated Retraining

Schedule retrain_model.py to run every 12 hours (or as needed) using:

  • Linux (Cron Job)
    crontab -e
    Add:
    0 */12 * * * /usr/bin/python3 /path/to/retrain_model.py
  • Windows (Task Scheduler)
    • Set Trigger: Run every 12 hours.
    • Set Action: Run:
      python C:\path\to\retrain_model.py
