This project focuses on predicting trip durations for the "Mi Bici" public bike-sharing system in Guadalajara, Mexico. We leverage machine learning models to analyze ride patterns and provide real-time trip duration predictions.
- Data Source: Mi Bici Open Data
- Training Data: 2024 trip records
- Testing Data: 2025 trip records
This project integrates:
- MLflow for experiment tracking & model versioning.
- FastAPI for real-time trip duration predictions.
- Automated retraining when model performance degrades.
.
├── api
│   ├── logs                                        # Stores logs from API requests
│   ├── saved_inputs                                # Stores API input data for retraining
│   ├── app.py                                      # FastAPI application
│   ├── README.md                                   # API documentation
│   └── ss_api_test.JPG                             # Screenshot of Postman test
├── configs
│   └── model1.yaml                                 # Configuration file for model settings
├── data
│   ├── external                                    # External data sources (if applicable)
│   ├── interim                                     # Intermediate processed data
│   ├── processed
│   │   └── 2024                                    # Processed dataset for training
│   │       ├── combined_2024.csv                   # Combined processed data
│   │       ├── test.csv                            # Test dataset
│   │       └── train.csv                           # Training dataset
│   └── raw
│       ├── 2024                                    # Raw dataset from Mi Bici 2024
│       │   ├── datos_abiertos_2024_01.csv
│       │   ├── datos_abiertos_2024_02.csv
│       │   ├── datos_abiertos_2024_03.csv
│       │   ├── ...
│       │   └── datos_abiertos_2024_12.csv
│       └── nomenclatura_2024_12.csv                # Data dictionary
├── docs                                            # Project documentation
├── MLFlow
│   ├── mlartifacts                                 # Stores MLflow artifacts
│   ├── mlruns                                      # MLflow experiment tracking
│   ├── infer.py                                    # Model inference script
│   ├── README.md                                   # MLflow documentation
│   ├── requirements.txt                            # Dependencies for MLflow
│   ├── retrain_model.py                            # Automated retraining script
│   ├── ss_mlfow_dashboard.JPG                      # Screenshot of MLflow dashboard
│   └── train.py                                    # Model training script
├── models
│   ├── lr_mae-286.9239_2025-01-29.pkl              # Latest linear regression model
│   └── random_forest_mae-273.4929_2025-01-29.pkl   # Latest random forest model
├── notebooks
│   ├── 1.IRA_data_preprocessing.ipynb              # Data cleaning notebook
│   ├── 2.IRA_data_vizualization.ipynb              # Exploratory Data Analysis (EDA)
│   ├── 3.IRA_modeling.ipynb                        # Model training and evaluation
│   └── 4.IRA_retraining.ipynb                      # Model retraining analysis
├── references                                      # References and additional documentation
├── reports
│   └── figures
│       └── umap_2d_K3.png                          # Visualization of UMAP embeddings
├── src
│   ├── data
│   │   ├── build_features.py                       # Feature engineering
│   │   ├── cleaning.py                             # Data cleaning functions
│   │   ├── ingestion.py                            # Data loading
│   │   ├── labeling.py                             # Labeling for supervised learning
│   │   ├── splitting.py                            # Train-test split
│   │   └── validation.py                           # Data validation
│   ├── models
│   │   └── model1
│   │       ├── dataloader.py                       # Loads data for training
│   │       ├── hyperparameters_tuning.py           # Hyperparameter tuning
│   │       ├── model.py                            # Model definition
│   │       ├── predict.py                          # Inference script
│   │       ├── preprocessing.py                    # Preprocessing functions
│   │       └── train.py                            # Training script
│   ├── visualization
│   │   ├── evaluation.py                           # Model evaluation scripts
│   │   └── exploration.py                          # Data exploration and visualization
│   └── __init__.py
├── .env
├── .gitignore
├── LICENSE
├── Makefile
├── README.md
└── requirements.txt
- Uses Linear Regression to predict trip durations.
- Trained on 2024 Mi Bici data.
- Logs:
  - Hyperparameters
  - Performance metrics (MAE, RMSE)
  - Model artifacts & schema in MLflow.
- Deploys the trained model as a REST API.
- Logs every request & prediction for monitoring.
- Saves input data (`saved_inputs/prediction_inputs.csv`) for future retraining.
- Monitors model performance on 2025 Mi Bici test data.
- Retrains the model if MAE increases beyond a threshold.
- Uses both stored API inputs & new Mi Bici data for retraining.
- Registers new models in MLflow if performance improves.
- Python (ML model + API)
- scikit-learn (Linear Regression + preprocessing)
- FastAPI (Real-time model inference)
- MLflow (Model tracking & versioning)
- pandas & NumPy (Data processing)
- joblib (Model serialization)
- Logging & CSV storage (For monitoring & retraining)
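
The CSV storage mentioned above could look roughly like the following. This is a minimal sketch using only the standard library, not the project's actual implementation: the function name `append_prediction` and the columns `predicted_duration` and `logged_at` are hypothetical, and the feature names are taken from the request body used in the API examples in this README.

```python
import csv
from datetime import datetime, timezone
from pathlib import Path

# Feature names as they appear in the /predict request body in this README.
FEATURES = [
    "Year_of_Birth", "Gender", "Origin_Id",
    "Destination_Id", "Start_Hour", "Start_DayOfWeek",
]
# Hypothetical extra columns, chosen for illustration only.
COLUMNS = [*FEATURES, "predicted_duration", "logged_at"]


def append_prediction(csv_path, features, prediction):
    """Append one request and its prediction to a CSV file.

    The header row is written only when the file is new or empty.
    """
    path = Path(csv_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    write_header = not path.exists() or path.stat().st_size == 0

    row = {name: features[name] for name in FEATURES}
    row["predicted_duration"] = prediction
    row["logged_at"] = datetime.now(timezone.utc).isoformat()

    with path.open("a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=COLUMNS)
        if write_header:
            writer.writeheader()
        writer.writerow(row)
```

An API handler could call `append_prediction("saved_inputs/prediction_inputs.csv", payload, prediction)` after each request, where `payload` is the parsed request body and `prediction` is the model output.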
1. Train the Model (`train.py`):
   - Loads Mi Bici 2024 dataset → Preprocesses → Trains model → Logs in MLflow.
2. Deploy API (`api.py`):
   - Loads MLflow model → Serves predictions via FastAPI.
3. Log Predictions for Retraining:
   - API logs input requests → Saves them for future training.
4. Monitor & Retrain (`retrain_model.py`):
   - Uses Mi Bici 2025 test data to check performance.
   - If MAE degrades, merges new inputs + original data.
   - Retrains & registers new model in MLflow.
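
The monitoring decision in the last step can be sketched as below. This is an illustration under stated assumptions rather than the logic in `retrain_model.py`: the README does not give the threshold, so the 10% relative tolerance is an assumed value, and the function names and sample durations are hypothetical.

```python
def mean_absolute_error(actual, predicted):
    """Mean absolute error, in the unit of the target (seconds for Trip_Duration)."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def should_retrain(baseline_mae, current_mae, tolerance=0.10):
    """Return True when the current MAE exceeds the baseline by more than `tolerance`.

    `tolerance` is relative (0.10 means 10%); the value is an assumption.
    """
    return current_mae > baseline_mae * (1.0 + tolerance)


def should_register(candidate_mae, current_mae):
    """Register a retrained model only if it improves on the current one."""
    return candidate_mae < current_mae


# Illustrative durations in seconds, not real Mi Bici data.
actual = [600.0, 900.0, 300.0]
predicted = [650.0, 800.0, 350.0]
current_mae = mean_absolute_error(actual, predicted)

if should_retrain(baseline_mae=60.0, current_mae=current_mae):
    print(f"MAE degraded to {current_mae:.2f}; retraining is warranted")
```

Keeping the retrain check and the registration check separate mirrors the two conditions described in this README: retraining is triggered by degradation, while registration depends on the retrained model actually performing better.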
Column Name | Description |
---|---|
Trip_Id | Unique trip identifier |
User_Id | Unique user identifier |
Gender | Gender of the user |
Year_of_Birth | Year of birth of the user |
Trip_Start | Start timestamp of the trip |
Trip_End | End timestamp of the trip |
Origin_Id | Origin bike station ID |
Destination_Id | Destination bike station ID |
Trip_Duration | Total trip duration in seconds |
Start_Hour | Hour of day when the trip started |
Start_DayOfWeek | Day of the week (Monday=0, Sunday=6) |
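
The three derived columns in the table (`Trip_Duration`, `Start_Hour`, `Start_DayOfWeek`) can be computed from the two timestamps. The sketch below assumes the timestamps are strings in `YYYY-MM-DD HH:MM:SS` format, which the README does not confirm, and the function name `derive_trip_features` is hypothetical.

```python
from datetime import datetime

# Assumed timestamp layout; adjust to match the raw Mi Bici CSV files.
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"


def derive_trip_features(trip_start, trip_end):
    """Compute the derived columns for a single trip."""
    start = datetime.strptime(trip_start, TIMESTAMP_FORMAT)
    end = datetime.strptime(trip_end, TIMESTAMP_FORMAT)
    return {
        "Trip_Duration": int((end - start).total_seconds()),  # seconds
        "Start_Hour": start.hour,                              # 0-23
        "Start_DayOfWeek": start.weekday(),                    # Monday=0, Sunday=6
    }


features = derive_trip_features("2024-01-03 14:05:00", "2024-01-03 14:17:30")
print(features)
# {'Trip_Duration': 750, 'Start_Hour': 14, 'Start_DayOfWeek': 2}
```

`datetime.weekday()` already uses the Monday=0 convention listed in the table, so no remapping is needed.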
pip install -r requirements.txt
python scripts/train.py
This will train the model using Mi Bici 2024 data and log it in MLflow.
uvicorn scripts.api:app --host 0.0.0.0 --port 8000 --reload
The API will be available at http://127.0.0.1:8000.
$headers = @{
"Content-Type" = "application/json"
}
$body = @{
"Year_of_Birth" = 1995
"Gender" = 1
"Origin_Id" = 10
"Destination_Id" = 50
"Start_Hour" = 14
"Start_DayOfWeek" = 2
} | ConvertTo-Json -Depth 10
$response = Invoke-RestMethod -Uri "http://127.0.0.1:8000/predict" -Method Post -Headers $headers -Body $body
Write-Output $response
curl -X POST "http://127.0.0.1:8000/predict" \
-H "Content-Type: application/json" \
-d '{"Year_of_Birth": 1995, "Gender": 1, "Origin_Id": 10, "Destination_Id": 50, "Start_Hour": 14, "Start_DayOfWeek": 2}'
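
The same request can be sent from Python. This is a minimal standard-library sketch that mirrors the curl call above; the helper names `build_predict_request` and `predict` are hypothetical, and the shape of the JSON response is not documented in this README, so the result is returned as parsed without assuming any field names.

```python
import json
import urllib.request

API_URL = "http://127.0.0.1:8000/predict"

payload = {
    "Year_of_Birth": 1995,
    "Gender": 1,
    "Origin_Id": 10,
    "Destination_Id": 50,
    "Start_Hour": 14,
    "Start_DayOfWeek": 2,
}


def build_predict_request(data, url=API_URL):
    """Build a JSON POST request without sending it."""
    body = json.dumps(data).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(data, url=API_URL, timeout=10):
    """Send the request and return the parsed JSON response."""
    request = build_predict_request(data, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the API running locally, `predict(payload)` returns whatever JSON the `/predict` endpoint responds with.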
mlflow ui
Check experiment logs & model versions at http://127.0.0.1:5000.
Schedule `retrain_model.py` to run every 12 hours (or as needed) using:
- Linux (Cron Job)
  - Open the crontab with `crontab -e`.
  - Add: `0 */12 * * * /usr/bin/python3 /path/to/retrain_model.py`
- Windows (Task Scheduler)
  - Set Trigger: Run every 12 hours.
  - Set Action: Run `python C:\path\to\retrain_model.py`