Stroke: Statistical analysis of risk factors and creation of predictive models using machine learning

santi-souza/stroke-eda-ml

Stroke: Risk prediction based on machine learning, exploratory data analysis (EDA), and statistical analysis

Overview

This repository contains a comprehensive analysis of stroke prediction using the Stroke Prediction Dataset. The analysis includes exploratory data analysis (EDA), statistical testing, and machine learning predictions. The work is implemented in R, with RMarkdown for documentation and Flexdashboard for interactive visualization and insights. View it here

A preview of the dashboard:

[Image: dash3]

Key Features

  • Data Preprocessing:

    • Handling missing values (NA) through imputation
    • Renaming columns for clarity, removing redundant columns, and organizing data into categorical and numerical variables
    • Converting categorical variables to factors for analysis
  • Exploratory Data Analysis (EDA):

    • Visualizations and statistical summaries to understand variable distributions and relationships
  • Statistical Testing:

    • Evaluation of associations between variables using appropriate statistical tests
  • Machine Learning Models (implementation and assessment):

    • Random Forest
    • Logistic Regression
    • Gradient Boosting
  • Interactive Dashboard:

    • Flexdashboard for exploring the data and model results

Contents

  • TESIS.Rmd
    Main RMarkdown file with the full analysis, including data preprocessing, EDA, statistical tests, and model evaluations.

  • DashboardTFM.Rmd
    RMarkdown file that generates the Flexdashboard, providing an interactive interface for data exploration.

  • stroke.csv
    The Stroke Prediction Dataset. You can download it from Kaggle.

Setup and Usage

To run the analysis or explore the dashboard locally:

  1. Clone this repository:

    git clone https://github.com/santi-souza/stroke-eda-ml.git

  2. Open the files:

Open TESIS.Rmd or DashboardTFM.Rmd in RStudio to view the analysis or render the Flexdashboard.
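Rendering can also be scripted from the R console instead of using the Knit button in RStudio. The following is a minimal sketch, assuming the `rmarkdown` package is installed and the working directory is the cloned repository. `render_if_present` is a hypothetical helper written for this example, not part of the repository. If the dashboard uses a Shiny runtime, it would need `rmarkdown::run()` rather than `rmarkdown::render()`.

```r
# Hypothetical helper: render an RMarkdown file only if the file and the
# rmarkdown package are both available. Returns the output path or NULL.
render_if_present <- function(path) {
  if (!file.exists(path)) {
    message("File not found: ", path)
    return(NULL)
  }
  if (!requireNamespace("rmarkdown", quietly = TRUE)) {
    message("Package 'rmarkdown' is not installed")
    return(NULL)
  }
  rmarkdown::render(path)
}

report    <- render_if_present("TESIS.Rmd")         # full analysis
dashboard <- render_if_present("DashboardTFM.Rmd")  # Flexdashboard
```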

Analysis Highlights

  1. Data Preprocessing:

Addressed missing values with imputation, clarified column names, removed unnecessary columns, and separated data into categorical (converted to factors) and numerical variables.
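The preprocessing steps described above can be sketched in base R as follows. This is an illustration, not the code from TESIS.Rmd: the column names (`id`, `bmi`, `Residence_type`), the `"N/A"` missing-value marker, and the choice of median imputation are assumptions, and the data frame is made-up toy data standing in for stroke.csv.

```r
# Toy data standing in for stroke.csv (values are made up for illustration).
stroke_raw <- data.frame(
  id = 1:6,
  gender = c("Male", "Female", "Female", "Male", "Female", "Male"),
  age = c(67, 61, 80, 49, 79, 81),
  Residence_type = c("Urban", "Rural", "Rural", "Urban", "Rural", "Urban"),
  bmi = c("36.6", "N/A", "32.5", "34.4", "24", "N/A"),
  stroke = c(1, 1, 0, 0, 1, 0),
  stringsAsFactors = FALSE
)

preprocess_stroke <- function(df) {
  # Turn the assumed "N/A" marker into real NA values, then make bmi numeric
  df$bmi[df$bmi == "N/A"] <- NA
  df$bmi <- as.numeric(df$bmi)

  # Median imputation (an assumption; the README does not name the method)
  df$bmi[is.na(df$bmi)] <- median(df$bmi, na.rm = TRUE)

  # Rename for consistency and drop the identifier column
  names(df)[names(df) == "Residence_type"] <- "residence_type"
  df$id <- NULL

  # Convert categorical variables to factors
  cat_cols <- c("gender", "residence_type", "stroke")
  df[cat_cols] <- lapply(df[cat_cols], factor)
  df
}

stroke_clean <- preprocess_stroke(stroke_raw)
str(stroke_clean)
```

With the real file, passing `na.strings = "N/A"` to `read.csv()` would handle the first step at load time.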

  2. Exploratory Data Analysis (EDA):

Generated visualizations and statistical summaries to identify trends and patterns in the data.
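As a rough illustration of this kind of summary, the sketch below computes the outcome balance and a per-group mean, then draws one boxplot, all in base R. The data are made-up toy values and the column names `age` and `stroke` are assumptions. The actual plots and summaries in TESIS.Rmd may differ.

```r
# Toy data (made-up values) with the assumed columns age and stroke
toy <- data.frame(
  age = c(67, 61, 80, 49, 79, 81, 35, 42, 58, 23),
  stroke = factor(c(1, 1, 1, 0, 1, 0, 0, 0, 0, 0),
                  levels = c(0, 1), labels = c("no", "yes"))
)

# Balance of the outcome variable, as proportions
stroke_rate <- prop.table(table(toy$stroke))

# Mean age within each outcome group
age_by_stroke <- aggregate(age ~ stroke, data = toy, FUN = mean)

print(stroke_rate)
print(age_by_stroke)

# Distribution of age split by outcome
boxplot(age ~ stroke, data = toy,
        xlab = "Stroke", ylab = "Age (years)",
        main = "Age by stroke outcome (toy data)")
```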

  3. Statistical Testing:

Performed hypothesis testing to assess relationships between various features and stroke occurrence.
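The README does not name the tests used, so the following is only a sketch of two common choices: an exact test for a categorical feature against the outcome, and a rank-based test for a numerical feature. The toy values and the column names `hypertension`, `age`, and `stroke` are assumptions.

```r
# Toy data (made-up values)
toy <- data.frame(
  age = c(67, 61, 80, 49, 79, 81, 35, 42, 58, 23),
  hypertension = factor(c(1, 0, 1, 0, 1, 1, 0, 0, 0, 0)),
  stroke = factor(c(1, 1, 1, 0, 1, 0, 0, 0, 0, 0))
)

# Categorical vs. categorical: Fisher's exact test suits small counts;
# chisq.test() is the usual choice when expected cell counts are large enough.
cat_test <- fisher.test(table(toy$hypertension, toy$stroke))

# Numerical vs. binary outcome: Wilcoxon rank-sum test,
# which does not assume normally distributed ages.
num_test <- wilcox.test(age ~ stroke, data = toy)

print(cat_test$p.value)
print(num_test$p.value)
```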

  4. Machine Learning:

Built and evaluated Random Forest, Logistic Regression, and Gradient Boosting models to predict stroke and identify key risk factors.
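A minimal sketch of the fit-and-evaluate loop for one of the listed model types, logistic regression, using only base R. The data are simulated rather than taken from stroke.csv, and the 70/30 split and 0.5 threshold are assumptions, not the settings used in TESIS.Rmd. Random Forest and Gradient Boosting need add-on packages (for example `randomForest` or `gbm`), which the README does not name, so they are left out here.

```r
set.seed(42)

# Simulated data (not the real dataset): stroke risk rises with age
n <- 500
sim <- data.frame(
  age = round(runif(n, 20, 90)),
  hypertension = rbinom(n, 1, 0.2)
)
sim$stroke <- rbinom(n, 1, plogis(-5 + 0.07 * sim$age + 0.8 * sim$hypertension))

# 70/30 train/test split
train_idx <- sample(n, size = round(0.7 * n))
train <- sim[train_idx, ]
test  <- sim[-train_idx, ]

# Logistic regression on the training set
fit <- glm(stroke ~ age + hypertension, data = train, family = binomial)

# Predicted probabilities on held-out data, classified at a 0.5 threshold
prob <- predict(fit, newdata = test, type = "response")
pred <- as.integer(prob > 0.5)

# Confusion matrix and accuracy on the test set
confusion <- table(predicted = pred, actual = test$stroke)
accuracy  <- mean(pred == test$stroke)

print(summary(fit)$coefficients)
print(confusion)
print(accuracy)
```

When the outcome is rare, accuracy alone can look good for a model that never predicts the positive class, so the confusion matrix is worth reading alongside it.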

  5. Interactive Dashboard:

Developed an interactive Flexdashboard to enable users to explore the data and model results.

Contributing

Contributions are welcome! To suggest improvements or report problems, feel free to open an issue or submit a pull request.