Cardiovascular disease (CVD) is a major cause of death worldwide. It is a group of diseases that includes heart disease, stroke, and other vascular diseases affecting the heart and blood vessels. The Framingham Heart Study, one of the most extensive epidemiological studies of CVD risk factors, is an ongoing cardiovascular study of the residents of the town of Framingham, Massachusetts.
The objective of this project is to develop a classification model that predicts a given person's risk of Coronary Heart Disease (CHD) within 10 years. It also aims to identify an ideal set of pre-processing methods for the data available in this context before applying machine learning analyses.
The Framingham dataset consists of medical, behavioural, and demographic data on 3390 residents of the town of Framingham, Massachusetts. Each feature is considered a possible predictor of Coronary Heart Disease within the next 10 years. The following are the features recorded for each resident:
- id: Personal identification number (Unique)
Demographic:
- sex: Male or Female (Nominal)
- age: Age of the patient (Continuous)
- education: no information provided (Ordinal assumed)
Behavioral:
- is_smoking: Whether or not the patient is a current smoker (Nominal)
- cigsPerDay: Number of cigarettes smoked by the person per day on average (Continuous)
Medical information:
- BPMeds: Whether or not the patient is on blood pressure medication (Nominal)
- prevalentStroke: Whether or not the patient previously had a stroke (Nominal)
- prevalentHyp: Whether or not the patient was hypertensive (Nominal)
- diabetes: Whether or not the patient has diabetes (Nominal)
- totChol: Total cholesterol level in mg/dL (Continuous)
- sysBP: Systolic blood pressure in mmHg (Continuous)
- diaBP: Diastolic blood pressure in mmHg (Continuous)
- BMI: Body Mass Index (Continuous)
- heartRate: Heart rate (Continuous)
- glucose: Glucose level in mg/dL (Continuous); assumed to be fasting glucose
Target variable to predict:
- TenYearCHD: 10-year risk of coronary heart disease (Nominal)
Before model implementation, the data goes through a set of pre-processing steps. Five datasets were generated on the basis of different pre-processing methods: one main dataset and four further iterations, used to check the sensitivity of model prediction power to the choice of pre-processing method.
Missing values were imputed realistically so as to introduce minimal bias into the dataset. For example, among the missing values in BPMeds, a binary feature, only those records were imputed with 1 (indicating the patient takes medication) whose systolic or diastolic blood pressure exceeds the optimum level, with the optimum defined on the basis of diabetes status according to the NCBI.
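A minimal sketch of this rule-based imputation, assuming pandas and illustrative blood-pressure cut-offs (140/90 mmHg for non-diabetic and 130/80 mmHg for diabetic patients); the exact thresholds used in the project may differ:

```python
import numpy as np
import pandas as pd


def impute_bpmeds(df: pd.DataFrame) -> pd.DataFrame:
    """Fill missing BPMeds values from blood pressure and diabetes status.

    Records above the assumed optimum systolic/diastolic level are imputed
    as 1 (on medication); the remaining missing records are imputed as 0.
    """
    missing = df["BPMeds"].isna()

    # Optimum thresholds depend on diabetes status (illustrative values)
    sys_limit = np.where(df["diabetes"] == 1, 130, 140)
    dia_limit = np.where(df["diabetes"] == 1, 80, 90)

    above_optimum = (df["sysBP"] > sys_limit) | (df["diaBP"] > dia_limit)

    df.loc[missing & above_optimum, "BPMeds"] = 1
    df.loc[missing & ~above_optimum, "BPMeds"] = 0
    return df
```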
To reduce dimensionality and multicollinearity within the dataset, certain features were deemed redundant and others were combined or transformed into new features, forming a final feature set on which the rest of the analysis is performed. A chi-squared test of independence was performed on the categorical variables to identify redundant features, i.e. those with low dependence on the target variable. The final set of features chosen for this dataset was: age, cigsPerDay, BPMeds, totChol, BMI, heartRate, MAP, and diabetes_grade.
- Mean Arterial Pressure (MAP) was created by combining the systolic and diastolic blood pressures
- Glucose levels were categorised to create diabetes_grade with four classes, to handle the extreme outliers in the feature (see the sketch after this list)
- education, is_smoking, prevalentStroke, prevalentHyp, and diabetes were deemed redundant after thorough analysis of the features
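A minimal sketch of these steps, assuming pandas and scipy; the glucose cut-offs used to bin diabetes_grade (100, 126, 200 mg/dL) are illustrative assumptions, not necessarily the exact bins used in the project:

```python
import pandas as pd
from scipy.stats import chi2_contingency


def chi2_pvalue(df: pd.DataFrame, feature: str, target: str = "TenYearCHD") -> float:
    """Chi-squared test of independence between a categorical feature and the target.

    A high p-value suggests weak dependence on the target, making the feature
    a candidate for removal.
    """
    contingency = pd.crosstab(df[feature], df[target])
    _, p_value, _, _ = chi2_contingency(contingency)
    return p_value


def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    """Create MAP and diabetes_grade, then keep the final feature set."""
    # Mean Arterial Pressure: MAP = diaBP + (sysBP - diaBP) / 3
    df["MAP"] = df["diaBP"] + (df["sysBP"] - df["diaBP"]) / 3

    # Bin glucose into four grades (assumes glucose has already been imputed)
    df["diabetes_grade"] = pd.cut(
        df["glucose"],
        bins=[0, 100, 126, 200, float("inf")],
        labels=[0, 1, 2, 3],
    ).astype(int)

    keep = ["age", "cigsPerDay", "BPMeds", "totChol", "BMI",
            "heartRate", "MAP", "diabetes_grade", "TenYearCHD"]
    return df[keep]
```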
While all of the data lay within the range of possible and realistic values for each feature, some data points indicated extremely rare cases that are also medical emergencies. These "outliers" could hamper the model's predictive power. Hence, to avoid losing data, the outliers were capped at maximum and minimum values through Winsorising.
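A minimal sketch of the capping step, assuming quantile-based Winsorising at the 5th and 95th percentiles; the actual limits used in the project may differ:

```python
import pandas as pd


def winsorise(df: pd.DataFrame, columns: list[str],
              lower: float = 0.05, upper: float = 0.95) -> pd.DataFrame:
    """Cap extreme values at the given quantiles instead of dropping them."""
    for col in columns:
        lo, hi = df[col].quantile([lower, upper])
        df[col] = df[col].clip(lower=lo, upper=hi)
    return df


# Example: cap the continuous features of the main dataset
# continuous = ["cigsPerDay", "totChol", "BMI", "heartRate", "MAP"]
# df = winsorise(df, continuous)
```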
The various hypotheses in the above pre-processing steps were also tested by creating a new dataset whenever an assumption was made. In all, four new datasets were created, and each was treated as an iteration to be compared with the original dataset (a sketch of the alternative imputers follows the list below). The iterations were:
- Dropping all data with missing values
- Using KNN Imputer for imputing the missing values of glucose
- Using Regression Imputer for imputing the missing values of glucose
- Trimming of the outliers instead of Winsorising
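A minimal sketch of the alternative glucose imputers from these iterations, using scikit-learn's KNNImputer and IterativeImputer (standing in for the regression imputer); the neighbour count, estimator, and column list are assumptions:

```python
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import KNNImputer, IterativeImputer
from sklearn.linear_model import LinearRegression

# df: raw Framingham dataframe; glucose is estimated from the other numeric features
num_cols = ["age", "cigsPerDay", "totChol", "sysBP", "diaBP",
            "BMI", "heartRate", "glucose"]

# Iteration: KNN imputation (k = 5 is an assumption)
df_knn = df.copy()
df_knn[num_cols] = KNNImputer(n_neighbors=5).fit_transform(df_knn[num_cols])

# Iteration: regression-based imputation
df_reg = df.copy()
reg_imputer = IterativeImputer(estimator=LinearRegression(), random_state=42)
df_reg[num_cols] = reg_imputer.fit_transform(df_reg[num_cols])
```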
After all of the above feature engineering, the dataset was split into train and test sets. An 80-20 split ratio was chosen because the dataset is relatively small, so more training data is left for the models to learn from.
Since the classes of the target variable were heavily imbalanced, SMOTE was used to balance them. Undersampling and Random Oversampling were avoided to prevent loss of data and overfitting of the models, respectively.
Finally, the datasets were scaled using the MinMaxScaler fitted to the training set.
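A minimal sketch of this split-resample-scale pipeline, assuming scikit-learn and imbalanced-learn; the stratified split and random seeds are assumptions:

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from imblearn.over_sampling import SMOTE

# X, y: the final engineered features and the TenYearCHD target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Balance the minority (CHD-positive) class on the training set only
X_train, y_train = SMOTE(random_state=42).fit_resample(X_train, y_train)

# Scale with a MinMaxScaler fitted on the training set, applied to both sets
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```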
Seven models were implemented on the scaled data. GridSearchCV was used to tune the hyperparameters, with RepeatedStratifiedKFold for cross-validation (a sketch of the tuning setup follows the list). The models were:
- Logistic Regression
- Naive Bayes
- Decision Tree
- K-Nearest Neighbours
- Support Vector Machine
- Random Forest
- XGBoost
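A minimal sketch of the tuning setup for one of the models (the Decision Tree), assuming scikit-learn; the parameter grid and fold counts are illustrative assumptions, not the exact values used in the project:

```python
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Hypothetical grid; the actual search spaces differed per model
param_grid = {
    "max_depth": [3, 5, 7, 10, None],
    "min_samples_split": [2, 5, 10],
    "criterion": ["gini", "entropy"],
}

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=42)
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    scoring="recall",   # recall used as the tuning metric (see below)
    cv=cv,
    n_jobs=-1,
)

# X_train, y_train from the split/SMOTE/scaling step above
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```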
The Recall Score was chosen as the evaluation metric for comparison between models. In the context of cardiovascular risk prediction, it is important to identify individuals who are at high risk so that appropriate interventions can be taken to prevent or manage their risk. False negatives (i.e., individuals who are at high risk but are not correctly identified as such) can lead to missed opportunities for prevention or treatment, and may result in adverse health outcomes. Therefore, a high recall rate is desirable in cardiovascular risk prediction models.
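For reference, recall (sensitivity) is the fraction of actual positives that are correctly identified: Recall = TP / (TP + FN), where TP is the number of true positives and FN the number of false negatives; every false negative directly lowers recall.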
The results for the original dataset were:
| Model | Train Recall (%) | Test Recall (%) |
|---|---|---|
| Logistic Regression | 66.999 | 69.607 |
| Naive Bayes | 51.584 | 52.941 |
| Decision Tree | 90.187 | 87.255 |
| KNN | 80.373 | 71.568 |
| SVM | 70.777 | 73.529 |
| Random Forest | 78.983 | 67.647 |
| XGBoost | 80.199 | 77.451 |
On the basis of a comparison of Test Recalls, the Decision Tree model performed best on the original dataset. For each iteration, the maximum Test Recall was lower than this value, so the original dataset was chosen as the best pre-processing approach for this project, and the Decision Tree as the best-performing model.
The predictions from this model were explained, both locally (for each data point) and globally, using SHapley Additive exPlanations (SHAP).
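A minimal sketch of the SHAP analysis, assuming the shap library's TreeExplainer applied to the fitted Decision Tree; variable names and plot choices are illustrative:

```python
import shap

# best_model: the tuned Decision Tree from the grid search;
# X_test_df: scaled test features kept as a pandas DataFrame for readable plots
explainer = shap.TreeExplainer(best_model)
shap_values = explainer.shap_values(X_test_df)

# Global explanation: which features drive predictions across the whole test set
shap.summary_plot(shap_values, X_test_df)

# Local explanation: per-feature contributions for a single patient
# (with older shap versions, classifier outputs are indexed per class)
shap.force_plot(explainer.expected_value[1], shap_values[1][0], X_test_df.iloc[0])
```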
Overall, the Decision Tree proved to be the most effective model for cardiovascular risk prediction among the models considered in this context. This project demonstrates the importance of data pre-processing, feature engineering, and model selection in developing a successful classification model for cardiovascular risk prediction.
- The primary challenge in this project arose right at the onset: understanding the context of the problem, the features, and how they all relate to the risk of heart disease. Understanding features like blood pressure, diabetes, cholesterol, and BMI, and how they are correlated, required discussions with people in the medical profession.
- Understanding the context helped overcome the second challenge of handling missing values and outliers. The missing values needed to be imputed realistically, while the outliers needed to be analysed to determine whether they fall within the possible range of values for each feature. Several methods and combinations were tried to arrive at an optimum set of methodologies.
- The final challenge was tuning the hyperparameters for model implementation. Since models like Random Forest take a long time to train, an ideal set of parameter values needed to be found to achieve maximum recall.
Python libraries were used for handling and manipulating data, for visualisation, and for hypothesis testing, pre-processing, and model training.