The goals of the project:
- Train a model that predicts the temperature in Moscow for the next 24 hours in 3-hour increments (8 predictions).
- Automate the temperature forecasting process based on new data.
Before starting the process locally, you need to create a database connection with the following settings (they are set as defaults in the project):
- Host - localhost
- Database - postgres
- Port - 5432
- Username - username
- Password - qwerty
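To verify the connection, a minimal check can be run with psycopg2 (the library choice is an assumption; the project itself may connect through SQLAlchemy or an Airflow connection):

```python
# Minimal connectivity check with the default project settings.
# psycopg2 is an assumption -- the project may connect differently.
import psycopg2

conn = psycopg2.connect(
    host="localhost",
    port=5432,
    dbname="postgres",
    user="username",
    password="qwerty",
)
with conn.cursor() as cur:
    cur.execute("SELECT version();")
    print(cur.fetchone())
conn.close()
```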
pgAdmin4:
Make sure that you have a user username with the Superuser role.
Dbeaver:
DBeaver can also be used with the same connection settings to inspect the tables.
To start the process locally, follow the steps below:
- Download the Airflow folder locally.
- Make sure that Docker is running and the Docker engine has sufficient memory allocated.
Before running Airflow, prepare the environment by executing the following steps:
- If you are working on Linux, specify the AIRFLOW_UID by running the command:
echo -e "AIRFLOW_UID=$(id -u)" > .env
- Perform the database migration and create the initial user account by running the command:
docker compose up airflow-init
The created user account will have the login airflow and the password airflow.
- Build the Docker containers and start Airflow:
docker compose up --build -d
The project consists of the following parts:
- ETL – downloads historical temperature data for Moscow.
The data is downloaded from https://rp5.ru/Weather_archive_in_Moscow and loaded into PostgreSQL. When a new row (new temperature reading) is added, the Predict DAG is triggered.
- Train model – model training pipeline. Optional, since the model has already been trained.
- Predict – predicts the temperature in Moscow for the next 24 hours. Triggered when a new temperature row is added in the ETL step.
Apache Airflow is used as the orchestrator.
ETL and Predict are loaded as DAGs in Apache Airflow.
Train model is located in a separate folder.
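For illustration, a minimal sketch of how the ETL DAG can hand off to the prediction DAG with Airflow's TriggerDagRunOperator (only the two DAG ids come from this project; the task id, schedule, and start date are assumptions):

```python
# Sketch: how Weather_ETL can hand off to Weather_prediction.
import pendulum
from airflow import DAG
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

with DAG(
    dag_id="Weather_ETL",
    start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
    schedule="0 */3 * * *",  # every 3 hours (Airflow 2.4+; older versions use schedule_interval)
    catchup=False,
) as dag:
    # In the real DAG this task would run only when a new row was inserted.
    trigger_predict = TriggerDagRunOperator(
        task_id="trigger_weather_prediction",
        trigger_dag_id="Weather_prediction",
    )
```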
ETL:
- Init browser – set options for the driver and specify the save path for downloaded files. Open the browser and return the driver (see the sketch after this list).
- Download archive – go to https://rp5.ru/Weather_archive_in_Moscow, enter the weather station (27612) and the date range, and download the archive. Returns the path to the archive.
- Unzip archive – unzip the archive and return the path to the Excel file.
- Preprocess data – read the Excel file and return a dataframe of historical data.
If there is no 'weather' table in the database (first run or the table was deleted):
- Create db table – create the 'weather' table and load the historical data.
- Update db table – load new data into the 'weather' table, if there is any.
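A minimal sketch of the Init browser step, assuming Selenium with Chrome (the browser choice, option names, and download path are assumptions):

```python
# Sketch of "Init browser": configure a download directory and return the driver.
# Chrome and the /tmp path are assumptions.
from selenium import webdriver

def init_browser(download_dir: str = "/tmp/weather_archive"):
    options = webdriver.ChromeOptions()
    options.add_argument("--headless")  # run without a visible window
    options.add_experimental_option(
        "prefs", {"download.default_directory": download_dir}
    )
    return webdriver.Chrome(options=options)
```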
Train model:
- Start – run in a Docker container or locally in an IDE.
- Load raw data – get the data required for training the model from the database. By default, all rows available in the database are retrieved. To change the data used, set the 'date_from' and 'date_to' fields in CONFIG.py; use them for reproducible experiments.
- Preprocess data – fill NaN values, create features, create targets (24 hours ahead as 8 columns in 3-hour increments), and split the data into train/val/test sets.
- Tune model – optional stage. By default, hyperparameters are already defined in CONFIG.py. If you want to tune hyperparameters yourself, uncomment the part with tune_model; after tuning, the tuned hyperparameters will be used when training the model.
- Train model – train the model, calculate MAE, save the model to the output folder, and log training information to MLflow (a condensed sketch follows this list).
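A condensed sketch of target creation and training, assuming 3-hourly rows and a 'temp' column (these names are assumptions; the project's actual code lives in the train model folder):

```python
# Sketch: build the 8 targets and train the model, logging to MLflow.
import mlflow
import pandas as pd
from sklearn.metrics import mean_absolute_error
from xgboost import XGBRegressor

def make_targets(df: pd.DataFrame) -> pd.DataFrame:
    # One row per 3 hours assumed, so 8 shifts cover the next 24 hours.
    for i in range(1, 9):
        df[f"temp_{3 * i}"] = df["temp"].shift(-i)
    return df.dropna()

def train(X_train, y_train, X_test, y_test, params: dict) -> XGBRegressor:
    with mlflow.start_run():
        # Recent xgboost versions handle multi-output targets (8 columns) natively.
        model = XGBRegressor(**params)
        model.fit(X_train, y_train)
        mae = mean_absolute_error(y_test, model.predict(X_test))
        mlflow.log_params(params)
        mlflow.log_metric("MAE", mae)
    return model
```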
Predict:
- Start – event-based. Launched after the ETL DAG has completed and a new data row has been added to the 'weather' table.
If there is no 'weather_predictions' table in the database (first run or the table was deleted):
- Create db table – create the 'weather_predictions' table.
- Load model – load the trained model.
- Load raw data – get the data from the 'weather' table that is required to predict the temperature for the next 24 hours.
- Preprocess data – fill NaN values, if any, and create the dataframe necessary for prediction.
- Predict – predict the weather for the next 24 hours.
- Postprocess data – convert the prediction results into a dataframe.
- Insert to db table – write the predicted values into the 'weather_predictions' table (see the sketch after this list).
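A minimal sketch of the load-predict-insert path (the model path, feature handling, and connection string are assumptions):

```python
# Sketch: load the trained model, predict from the latest row, insert the result.
import pickle
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://username:qwerty@localhost:5432/postgres")

with open("output/model.pkl", "rb") as f:  # hypothetical model path
    model = pickle.load(f)

latest = pd.read_sql("SELECT * FROM weather ORDER BY datetime DESC LIMIT 1", engine)
X = latest.drop(columns=["datetime"])      # real code would build the full feature set
preds = model.predict(X)[0]                # 8 values, one per 3-hour step

row = {"datetime": latest["datetime"].iloc[0]}
row.update({f"pred_temp_{3 * (i + 1)}": p for i, p in enumerate(preds)})
pd.DataFrame([row]).to_sql("weather_predictions", engine, if_exists="append", index=False)
```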
After the 'Installation' step is completed, follow the next steps:
- Access the Airflow web interface in your browser at http://localhost:8080.
- Log in with username airflow and password airflow.
- Turn on the Weather_ETL DAG and wait until it finishes. It will create the weather table in PostgreSQL with historical weather data.
- Turn on the Weather_prediction DAG; it will be triggered by Weather_ETL. The DAG will create the weather_predictions table, which will hold the predictions for the next 24 hours.
The Weather_ETL DAG will be triggered every 3 hours and will check whether new historical data has appeared. If new data appears, Weather_ETL will trigger Weather_prediction to make new predictions.
When you are finished working and want to clean up your environment, run:
docker compose down --volumes --rmi all
If you want to train the model, follow these steps:
- Download the train model folder locally.
- In the folder, create a virtual environment:
python3 -m venv env
- Activate the created virtual environment:
source env/bin/activate
- Install the required libraries from requirements.txt:
pip install -r requirements.txt
- Run the MLflow UI:
mlflow ui
- Access the MLflow web interface in your browser at http://127.0.0.1:5000.
- Uncomment 'params = tune_model(file_dirs, CONFIG)' (optional, for tuning your model).
- Run main.py.
After main.py finishes running, you will find the results in MLflow. The model will be in your-project-folder/output.
Feel free to create new features in preprocess_data/feature_engineering, tune your model, or change the data used (CONFIG['date_from'], CONFIG['date_to']); a sketch of possible features follows below.
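For example, lag and calendar features could be added along these lines (a hypothetical sketch; the column names and window sizes are assumptions, not the project's actual feature set):

```python
# Hypothetical additions to preprocess_data/feature_engineering:
# lag, rolling, and calendar features.
import pandas as pd

def add_features(df: pd.DataFrame) -> pd.DataFrame:
    df["temp_lag_1"] = df["temp"].shift(1)            # previous observation
    df["temp_roll_8"] = df["temp"].rolling(8).mean()  # ~24 h mean for 3-hourly rows
    df["hour"] = df["datetime"].dt.hour               # calendar features
    df["month"] = df["datetime"].dt.month
    return df
```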
The following columns were selected as the best columns for weather forecasting; their weights are presented below.
Chosen model: XGBRegressor
MAE: 1.008 (over 8200 test-set datetimes)
The screenshot below shows a table where:
datetime - the date and time from which the predictions were made
temp_X - the actual temperature X hours after datetime
pred_temp_X - the predicted temperature X hours after datetime
MAE_X - the error of that prediction
MAE - the mean error of all predictions from datetime (8 predictions, one every 3 hours)
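For reference, the error columns can be derived from the actual and predicted columns like this (a sketch following the column naming above):

```python
# Sketch: derive the per-horizon and per-row error columns shown above.
import pandas as pd

def add_error_columns(df: pd.DataFrame) -> pd.DataFrame:
    horizons = range(3, 25, 3)  # 3, 6, ..., 24 hours
    for h in horizons:
        df[f"MAE_{h}"] = (df[f"temp_{h}"] - df[f"pred_temp_{h}"]).abs()
    df["MAE"] = df[[f"MAE_{h}" for h in horizons]].mean(axis=1)
    return df
```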