This repository contains a Python script that implements a data processing and machine learning pipeline for predicting which patients are likely to have a stroke. The pipeline covers data loading, preprocessing, feature engineering, model training, evaluation, and email notifications. Its main features:
- Data loading from a CSV file
- Data cleaning and preprocessing
- Handling of missing values and outliers
- Standardizing numerical features
- Label encoding for categorical features
- RandomForestClassifier model for classification
- Evaluation metrics including accuracy, precision, recall, F1 score, and ROC AUC
- Email notifications for pipeline execution results or errors
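The preprocessing and evaluation steps listed above can be illustrated with a small, dependency-free sketch. This is not the script's own code: the column names and values are made up, and the choices of median imputation and IQR-based clipping are assumptions, since the list does not say which strategies the script uses. The standardization and label encoding follow the conventions of scikit-learn's `StandardScaler` (population standard deviation) and `LabelEncoder` (sorted class order), which the script may or may not use.

```python
from statistics import mean, median, pstdev, quantiles


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def clip_outliers_iqr(values, k=1.5):
    """Clip values to the range [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [min(max(v, low), high) for v in values]


def standardize(values):
    """Scale to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def label_encode(labels):
    """Map each distinct label to an integer, in sorted label order."""
    mapping = {label: code for code, label in enumerate(sorted(set(labels)))}
    return [mapping[label] for label in labels], mapping


def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for 0/1 labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical columns, made up for illustration only.
bmi = [22.5, None, 31.0, 27.4, 95.0, 24.1]
smoking_status = ["never", "smokes", "formerly", "never", "smokes", "never"]

bmi_clipped = clip_outliers_iqr(impute_median(bmi))
bmi_clean = standardize(bmi_clipped)
smoking_codes, smoking_mapping = label_encode(smoking_status)

# Hypothetical true labels and model predictions.
metrics = binary_metrics([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 1, 0])
```

Note that a label encoder assigns integers by sorted order, which implies an ordering the categories may not really have. Tree-based models such as RandomForestClassifier are generally tolerant of this.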
- Requires Python 3.x
- Clone the repository:
git clone https://github.com/moe94z/Heart_Stroke_Prediction.git
cd Heart_Stroke_Prediction
- Install the required packages listed in requirements.txt, which contains all the necessary libraries:
pip install -r requirements.txt
- Run the pipeline in the background, writing both its output and its errors to errors.log:
python3 Prod_Heart_Stroke_Prediction.py > errors.log 2>&1 &
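Because the command above runs the script in the background, the email notifications are the most direct way to learn how a run ended. How the script actually sends mail is not shown in this README. The following is only a minimal sketch using the standard library, assuming a plain SMTP relay; the function names, addresses, host, and port are all hypothetical.

```python
import smtplib
from email.message import EmailMessage


def build_notification(succeeded, details, sender, recipient):
    """Build a plain-text status email for one pipeline run."""
    msg = EmailMessage()
    status = "succeeded" if succeeded else "FAILED"
    msg["Subject"] = f"Stroke prediction pipeline {status}"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content(details)
    return msg


def send_notification(msg, host="localhost", port=25):
    """Send the message through an SMTP relay (host and port are assumptions)."""
    with smtplib.SMTP(host, port) as server:
        server.send_message(msg)


# Build (but do not send) an example failure notification.
message = build_notification(
    succeeded=False,
    details="ValueError: could not convert string to float",
    sender="pipeline@example.com",
    recipient="team@example.com",
)
print(message["Subject"])
```

In a real run, `send_notification(message)` would be called from a `try`/`except` around the pipeline's main function, so that both successful results and errors produce an email.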
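- To schedule the pipeline with Airflow, define a DAG such as:
from airflow import DAG
# Airflow 2.x import path; Airflow 1.10 used airflow.operators.bash_operator instead.
from airflow.operators.bash import BashOperator
from datetime import datetime, timedelta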
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2024, 1, 1),
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

dag = DAG(
    'stroke_pipeline',
    default_args=default_args,
    description='A pipeline for stroke prediction',
    schedule_interval=timedelta(days=1),
)
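
# Single task: run the pipeline script. Both stdout and stderr are redirected
# so that errors actually reach errors.log.
t1 = BashOperator(
    task_id='run_pipeline',
    bash_command='python3 /local/environment/prod/stroke_prediction_pipeline.py > /local/environment/prod/errors.log 2>&1',
    dag=dag,
)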
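- Place the DAG file in Airflow's dags folder, initialize the Airflow metadata database, then start the webserver and the scheduler (both are long-running processes, so run each in its own terminal):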
airflow webserver -p 8080
airflow scheduler