The goals of the project:
- Train a model that predicts the temperature in Moscow for the next 24 hours in 3-hour increments (8 predictions).
- Automate the temperature forecasting process based on new data.
Before starting the process locally, you need to create a database connection with the following settings (they are set as defaults in the project):
- Host - localhost
- Database - postgres
- Port - 5432
- Username - username
- Password - qwerty
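To verify the connection, a minimal check can be run with psycopg2 (the library choice is an assumption; the project itself may connect through SQLAlchemy or an Airflow connection):

```python
# Minimal connectivity check with the default project settings.
# psycopg2 is an assumption -- the project may connect differently.
import psycopg2

conn = psycopg2.connect(
    host="localhost",
    port=5432,
    dbname="postgres",
    user="username",
    password="qwerty",
)
with conn.cursor() as cur:
    cur.execute("SELECT version();")
    print(cur.fetchone())
conn.close()
```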
pgAdmin4:
Make sure that you have a user username with the Superuser role.
Dbeaver:
DBeaver can also be used with the same connection settings to inspect the tables.
To start the process locally, follow the steps below:
- Download the Airflow folder locally.
- Make sure that Docker is running and the Docker engine has sufficient memory allocated.
Before running Airflow, prepare the environment by executing the following steps:
- If you are working on Linux, specify the AIRFLOW_UID by running the command:
echo -e "AIRFLOW_UID=$(id -u)" > .env
- Perform the database migration and create the initial user account by running the command:
docker compose up airflow-init
The created user account will have the login airflow and the password airflow.
- Build the Docker containers and start Airflow:
docker compose up --build -d
The project consists of the following parts:
- ETL – downloads historical temperature data for Moscow.
The data is downloaded from https://rp5.ru/Weather_archive_in_Moscow and loaded into PostgreSQL. When a new row (new temperature reading) is added, the Predict DAG is triggered.
- Train model – model training pipeline. Optional, since the model has already been trained.
- Predict – predicts the temperature in Moscow for the next 24 hours. Triggered when a new temperature row is added in the ETL step.
Apache Airflow is used as the orchestrator.
ETL and Predict are loaded as DAGs in Apache Airflow.
Train model is located in a separate folder.
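For illustration, a minimal sketch of how the ETL DAG can hand off to the prediction DAG with Airflow's TriggerDagRunOperator (only the two DAG ids come from this project; the task id, schedule, and start date are assumptions):

```python
# Sketch: how Weather_ETL can hand off to Weather_prediction.
import pendulum
from airflow import DAG
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

with DAG(
    dag_id="Weather_ETL",
    start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
    schedule="0 */3 * * *",  # every 3 hours (Airflow 2.4+; older versions use schedule_interval)
    catchup=False,
) as dag:
    # In the real DAG this task would run only when a new row was inserted.
    trigger_predict = TriggerDagRunOperator(
        task_id="trigger_weather_prediction",
        trigger_dag_id="Weather_prediction",
    )
```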
ETL:
- Init browser – set options for the driver and specify the save path for downloaded files. Open the browser and return the driver (see the sketch after this list).
- Download archive – go to https://rp5.ru/Weather_archive_in_Moscow, enter the weather station (27612) and the date range, and download the archive. Returns the path to the archive.
- Unzip archive – unzip the archive and return the path to the Excel file.
- Preprocess data – read the Excel file and return a dataframe of historical data.
If there is no 'weather' table in the database (first run or the table was deleted):
- Create db table – create the 'weather' table and load the historical data.
- Update db table – load new data into the 'weather' table, if there is any.
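A minimal sketch of the Init browser step, assuming Selenium with Chrome (the browser choice, option names, and download path are assumptions):

```python
# Sketch of "Init browser": configure a download directory and return the driver.
# Chrome and the /tmp path are assumptions.
from selenium import webdriver

def init_browser(download_dir: str = "/tmp/weather_archive"):
    options = webdriver.ChromeOptions()
    options.add_argument("--headless")  # run without a visible window
    options.add_experimental_option(
        "prefs", {"download.default_directory": download_dir}
    )
    return webdriver.Chrome(options=options)
```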
Train model:
- Start – run in a Docker container or locally in an IDE.
- Load raw data – get the data required for training the model from the database. By default, all rows available in the database are retrieved. To change the data used, set the 'date_from' and 'date_to' fields in CONFIG.py; use them for reproducible experiments.
- Preprocess data – fill NaN values, create features, create targets (24 hours ahead as 8 columns in 3-hour increments), and split the data into train/val/test sets.
- Tune model – optional stage. By default, hyperparameters are already defined in CONFIG.py. If you want to tune hyperparameters yourself, uncomment the part with tune_model; after tuning, the tuned hyperparameters will be used when training the model.
- Train model – train the model, calculate MAE, save the model to the output folder, and log training information to MLflow (a condensed sketch follows this list).
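A condensed sketch of target creation and training, assuming 3-hourly rows and a 'temp' column (these names are assumptions; the project's actual code lives in the train model folder):

```python
# Sketch: build the 8 targets and train the model, logging to MLflow.
import mlflow
import pandas as pd
from sklearn.metrics import mean_absolute_error
from xgboost import XGBRegressor

def make_targets(df: pd.DataFrame) -> pd.DataFrame:
    # One row per 3 hours assumed, so 8 shifts cover the next 24 hours.
    for i in range(1, 9):
        df[f"temp_{3 * i}"] = df["temp"].shift(-i)
    return df.dropna()

def train(X_train, y_train, X_test, y_test, params: dict) -> XGBRegressor:
    with mlflow.start_run():
        # Recent xgboost versions handle multi-output targets (8 columns) natively.
        model = XGBRegressor(**params)
        model.fit(X_train, y_train)
        mae = mean_absolute_error(y_test, model.predict(X_test))
        mlflow.log_params(params)
        mlflow.log_metric("MAE", mae)
    return model
```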
Predict:
- Start – event-based. Launched after the ETL DAG has completed and a new data row has been added to the 'weather' table.
If there is no 'weather_predictions' table in the database (first run or the table was deleted):
- Create db table – create the 'weather_predictions' table.
- Load model – load the trained model.
- Load raw data – get the data from the 'weather' table that is required to predict the temperature for the next 24 hours.
- Preprocess data – fill NaN values, if any, and create the dataframe necessary for prediction.
- Predict – predict the weather for the next 24 hours.
- Postprocess data – convert the prediction results into a dataframe.
- Insert to db table – write the predicted values into the 'weather_predictions' table (see the sketch after this list).
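A minimal sketch of the load-predict-insert path (the model path, feature handling, and connection string are assumptions):

```python
# Sketch: load the trained model, predict from the latest row, insert the result.
import pickle
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://username:qwerty@localhost:5432/postgres")

with open("output/model.pkl", "rb") as f:  # hypothetical model path
    model = pickle.load(f)

latest = pd.read_sql("SELECT * FROM weather ORDER BY datetime DESC LIMIT 1", engine)
X = latest.drop(columns=["datetime"])      # real code would build the full feature set
preds = model.predict(X)[0]                # 8 values, one per 3-hour step

row = {"datetime": latest["datetime"].iloc[0]}
row.update({f"pred_temp_{3 * (i + 1)}": p for i, p in enumerate(preds)})
pd.DataFrame([row]).to_sql("weather_predictions", engine, if_exists="append", index=False)
```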
After the 'Installation' step is completed, follow the next steps:
- Access the Airflow web interface in your browser at http://localhost:8080.
- Log in with username airflow and password airflow.
- Turn on the Weather_ETL DAG and wait until it finishes. It will create the weather table in PostgreSQL with historical weather data.
- Turn on the Weather_prediction DAG; it will be triggered by Weather_ETL. The DAG will create the weather_predictions table, which will hold the predictions for the next 24 hours.
The Weather_ETL DAG will be triggered every 3 hours and will check whether new historical data has appeared. If new data appears, Weather_ETL will trigger Weather_prediction to make new predictions.
When you are finished working and want to clean up your environment, run:
docker compose down --volumes --rmi all
If you want to train the model, follow these steps:
- Download the train model folder locally.
- In the folder, create a virtual environment:
python3 -m venv env
- Activate the created virtual environment:
source env/bin/activate
- Install the required libraries from requirements.txt:
pip install -r requirements.txt
- Run the MLflow UI:
mlflow ui
- Access the MLflow web interface in your browser at http://127.0.0.1:5000.
- Uncomment 'params = tune_model(file_dirs, CONFIG)' (optional, for tuning your model).
- Run main.py.
After main.py finishes running, you will find the results in MLflow. The model will be in your-project-folder/output.
Feel free to create new features in preprocess_data/feature_engineering, tune your model, or change the data used (CONFIG['date_from'], CONFIG['date_to']); a sketch of possible features follows below.
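For example, lag and calendar features could be added along these lines (a hypothetical sketch; the column names and window sizes are assumptions, not the project's actual feature set):

```python
# Hypothetical additions to preprocess_data/feature_engineering:
# lag, rolling, and calendar features.
import pandas as pd

def add_features(df: pd.DataFrame) -> pd.DataFrame:
    df["temp_lag_1"] = df["temp"].shift(1)            # previous observation
    df["temp_roll_8"] = df["temp"].rolling(8).mean()  # ~24 h mean for 3-hourly rows
    df["hour"] = df["datetime"].dt.hour               # calendar features
    df["month"] = df["datetime"].dt.month
    return df
```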
The following columns were selected as the best columns for weather forecasting; their weights are presented below.
Chosen model: XGBRegressor
MAE: 1.008 (over 8200 test-set datetimes)
The screenshot below shows a table where:
datetime - the date and time from which the predictions were made
temp_X - the actual temperature X hours after datetime
pred_temp_X - the predicted temperature X hours after datetime
MAE_X - the error of that prediction
MAE - the mean error of all predictions from datetime (8 predictions, one every 3 hours)
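For reference, the error columns can be derived from the actual and predicted columns like this (a sketch following the column naming above):

```python
# Sketch: derive the per-horizon and per-row error columns shown above.
import pandas as pd

def add_error_columns(df: pd.DataFrame) -> pd.DataFrame:
    horizons = range(3, 25, 3)  # 3, 6, ..., 24 hours
    for h in horizons:
        df[f"MAE_{h}"] = (df[f"temp_{h}"] - df[f"pred_temp_{h}"]).abs()
    df["MAE"] = df[[f"MAE_{h}" for h in horizons]].mean(axis=1)
    return df
```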