Stroke: Risk prediction based on machine learning, exploratory data analysis (EDA), and statistical analysis
This repository contains a comprehensive analysis of stroke prediction using the Stroke Prediction Dataset. The analysis includes exploratory data analysis (EDA), statistical testing, and machine learning predictions. The work is implemented in R, with RMarkdown for documentation and Flexdashboard for interactive visualization and insights. View it here
- Data Preprocessing:
  - Handling missing values (NA) through imputation
  - Renaming columns for clarity, removing redundant columns, and organizing data into categorical and numerical variables
  - Converting categorical variables to factors for analysis
- Exploratory Data Analysis (EDA):
  - Visualizations and statistical summaries to understand variable distributions and relationships
- Statistical Testing:
  - Evaluation of associations between variables using appropriate statistical tests
- Machine Learning Models (implementation and assessment):
  - Random Forest
  - Logistic Regression
  - Gradient Boosting
- Interactive Dashboard:
  - A Flexdashboard for interactive data exploration and visualization. View Dashboard Here.
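The preprocessing steps listed above can be sketched in base R roughly as follows. This is only an illustration, not the code from TESIS.Rmd: the toy data frame, the column names (`id`, `gender`, `age`, `bmi`, `stroke`), the `"N/A"` marker for missing values, the choice of median imputation, and the helper name `preprocess_stroke` are all assumptions made for the example.

```r
# Toy stand-in for stroke.csv. Column names and the "N/A" marker are
# assumptions for this sketch and may differ from the real analysis.
stroke_raw <- data.frame(
  id     = 1:6,
  gender = c("Male", "Female", "Female", "Male", "Female", "Male"),
  age    = c(67, 61, 80, 49, 79, 81),
  bmi    = c("36.6", "N/A", "32.5", "34.4", "24", "N/A"),
  stroke = c(1, 1, 1, 0, 0, 0),
  stringsAsFactors = FALSE
)

# Hypothetical helper: drop a redundant column, impute missing values,
# and convert categorical variables to factors.
preprocess_stroke <- function(df) {
  df$id <- NULL                          # identifier carries no predictive information
  df$bmi[df$bmi == "N/A"] <- NA          # turn the text marker into a real NA
  df$bmi <- as.numeric(df$bmi)
  # Median imputation is one simple option (assumed here, not confirmed by the repo)
  df$bmi[is.na(df$bmi)] <- median(df$bmi, na.rm = TRUE)
  df$gender <- factor(df$gender)
  df$stroke <- factor(df$stroke, levels = c(0, 1), labels = c("no", "yes"))
  df
}

stroke_clean <- preprocess_stroke(stroke_raw)
str(stroke_clean)
```

With the real file, `stroke_raw` would instead come from something like `read.csv("stroke.csv")`, assuming the file sits in the working directory.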
- TESIS.Rmd: Main RMarkdown file with the full analysis, including data preprocessing, EDA, statistical tests, and model evaluations.
- DashboardTFM.Rmd: RMarkdown file that generates the Flexdashboard, providing an interactive interface for data exploration.
- stroke.csv: The dataset, which you can download from Kaggle.
To run the analysis or explore the dashboard locally:
- Clone this repository:
  git clone https://github.com/santi-souza/stroke-eda-ml.git
- Open the files:
  Open TESIS.Rmd or DashboardTFM.Rmd in RStudio to view the analysis or render the Flexdashboard.
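As an alternative to the Knit button in RStudio, the two files can probably also be rendered from the R console. The sketch below assumes that the `rmarkdown` package (and `flexdashboard` for the dashboard) is installed, that the working directory is the repository root, and that stroke.csv has been placed wherever the Rmd files expect it. The helper `render_if_present` is hypothetical and only guards against a missing file.

```r
# Hypothetical helper: render an R Markdown file only if it exists,
# so a wrong working directory produces a message instead of an error.
render_if_present <- function(path) {
  if (!file.exists(path)) {
    message("File not found: ", path)
    return(invisible(NULL))
  }
  rmarkdown::render(path)  # returns the path of the rendered output file
}

render_if_present("TESIS.Rmd")         # full analysis report
render_if_present("DashboardTFM.Rmd")  # Flexdashboard
```

If the dashboard turns out to declare `runtime: shiny` in its header, `rmarkdown::run("DashboardTFM.Rmd")` would be the call to use instead of `render`.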
- Data Preprocessing:
  Addressed missing values with imputation, clarified column names, removed unnecessary columns, and separated data into categorical (converted to factors) and numerical variables.
- Exploratory Data Analysis (EDA):
  Generated visualizations and statistical summaries to identify trends and patterns in the data.
- Statistical Testing:
  Performed hypothesis testing to assess relationships between various features and stroke occurrence.
- Machine Learning:
  Built and evaluated predictive models to identify key stroke risk factors.
- Interactive Dashboard:
  Developed an interactive Flexdashboard to enable users to explore the data and model results.
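As a rough illustration of the statistical testing and machine learning steps, the sketch below runs a chi-squared test of association and fits a logistic regression using only base R. It is not the analysis from TESIS.Rmd: the data are synthetic, the variable names (`age`, `hypertension`, `stroke`) and the coefficients used to simulate the outcome are assumptions, and only logistic regression is shown because Random Forest and Gradient Boosting need additional packages.

```r
set.seed(42)  # make the synthetic data reproducible

# Synthetic data: stroke risk is simulated to rise with age and hypertension.
# The coefficients below are made up for illustration only.
n <- 500
age <- round(runif(n, 20, 90))
hypertension <- rbinom(n, 1, 0.2)
p_stroke <- plogis(-6 + 0.06 * age + 0.8 * hypertension)
stroke <- rbinom(n, 1, p_stroke)

toy <- data.frame(
  age          = age,
  hypertension = factor(hypertension),
  stroke       = factor(stroke)
)

# Statistical testing: association between two categorical variables
tab <- table(toy$hypertension, toy$stroke)
test <- chisq.test(tab)
print(test$p.value)

# Machine learning: logistic regression for stroke occurrence
fit <- glm(stroke ~ age + hypertension, data = toy, family = binomial)
print(summary(fit)$coefficients)

# Predicted stroke probability for each row of the toy data
pred <- predict(fit, type = "response")
head(pred)
```

In a real evaluation the model would be fitted on a training split and assessed on held-out data. The sketch skips that step to stay short.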
Contributions are welcome! If you have suggestions for improvement or find any issues, feel free to open an issue or submit a pull request.