Cardiovascular disease (CVD) is a major cause of death worldwide. It is a group of diseases that includes heart disease, stroke, and other vascular diseases affecting the heart and blood vessels. The Framingham Heart Study, one of the most extensive epidemiological studies of CVD risk factors, is an ongoing cardiovascular study of the residents of the town of Framingham, Massachusetts.
The objective of this project is to develop a classification model that predicts a given person's risk of Coronary Heart Disease (CHD) within 10 years. It also aims to identify an ideal set of pre-processing methods for the data available in this context before applying machine learning analyses.
The Framingham dataset consists of medical, behavioural, and demographic data on 3390 residents of the town of Framingham, Massachusetts. Each feature is considered a possible predictor of Coronary Heart Disease within the next 10 years. The following are the features recorded for each resident:
- id: Personal identification number (Unique)
Demographic:
- sex: Male or Female (Nominal)
- age: Age of the patient (Continuous)
- education: no information provided (Ordinal assumed)
Behavioral:
- is_smoking: Whether or not the patient is a current smoker (Nominal)
- cigsPerDay: Number of cigarettes smoked by the person per day on average (Continuous)
Medical information:
- BPMeds: Whether or not the patient is on blood pressure medication (Nominal)
- prevalentStroke: Whether or not the patient previously had a stroke (Nominal)
- prevalentHyp: Whether or not the patient was hypertensive (Nominal)
- diabetes: Whether or not the patient has diabetes (Nominal)
- totChol: Total cholesterol level in mg/dL (Continuous)
- sysBP: Systolic blood pressure in mmHg (Continuous)
- diaBP: Diastolic blood pressure in mmHg (Continuous)
- BMI: Body Mass Index (Continuous)
- heartRate: Heart rate (Continuous)
- glucose: Glucose level in mg/dL (Continuous); assumed to be fasting glucose
Target variable to predict:
- TenYearCHD: 10-year risk of coronary heart disease (Nominal)
Before model implementation, the data goes through a set of pre-processing steps. Five datasets were generated on the basis of different pre-processing methods: one main dataset and four further iterations, used to check the sensitivity of model prediction power to the choice of pre-processing method.
Missing values were imputed realistically so as to introduce minimal bias into the dataset. For example, among the missing values in BPMeds, a binary feature, only those records were imputed with 1 (indicating the patient takes medication) whose systolic or diastolic blood pressure exceeds the optimum level, with the optimum defined on the basis of diabetes status according to the NCBI.
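A minimal sketch of this rule-based imputation, assuming pandas and illustrative blood-pressure cut-offs (140/90 mmHg for non-diabetic and 130/80 mmHg for diabetic patients); the exact thresholds used in the project may differ:

```python
import numpy as np
import pandas as pd


def impute_bpmeds(df: pd.DataFrame) -> pd.DataFrame:
    """Fill missing BPMeds values from blood pressure and diabetes status.

    Records above the assumed optimum systolic/diastolic level are imputed
    as 1 (on medication); the remaining missing records are imputed as 0.
    """
    missing = df["BPMeds"].isna()

    # Optimum thresholds depend on diabetes status (illustrative values)
    sys_limit = np.where(df["diabetes"] == 1, 130, 140)
    dia_limit = np.where(df["diabetes"] == 1, 80, 90)

    above_optimum = (df["sysBP"] > sys_limit) | (df["diaBP"] > dia_limit)

    df.loc[missing & above_optimum, "BPMeds"] = 1
    df.loc[missing & ~above_optimum, "BPMeds"] = 0
    return df
```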
To reduce dimensionality and multicollinearity within the dataset, certain features were deemed redundant and others were combined or transformed into new features, forming a final feature set on which the rest of the analysis is performed. A chi-squared test of independence was performed on the categorical variables to identify redundant features, i.e. those with low dependence on the target variable. The final set of features chosen for this dataset was: age, cigsPerDay, BPMeds, totChol, BMI, heartRate, MAP, and diabetes_grade.
- Mean Arterial Pressure (MAP) was created by combining the systolic and diastolic blood pressures
- Glucose levels were categorised to create diabetes_grade with four classes, to handle the extreme outliers in the feature (see the sketch after this list)
- education, is_smoking, prevalentStroke, prevalentHyp, and diabetes were deemed redundant after thorough analysis of the features
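A minimal sketch of these steps, assuming pandas and scipy; the glucose cut-offs used to bin diabetes_grade (100, 126, 200 mg/dL) are illustrative assumptions, not necessarily the exact bins used in the project:

```python
import pandas as pd
from scipy.stats import chi2_contingency


def chi2_pvalue(df: pd.DataFrame, feature: str, target: str = "TenYearCHD") -> float:
    """Chi-squared test of independence between a categorical feature and the target.

    A high p-value suggests weak dependence on the target, making the feature
    a candidate for removal.
    """
    contingency = pd.crosstab(df[feature], df[target])
    _, p_value, _, _ = chi2_contingency(contingency)
    return p_value


def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    """Create MAP and diabetes_grade, then keep the final feature set."""
    # Mean Arterial Pressure: MAP = diaBP + (sysBP - diaBP) / 3
    df["MAP"] = df["diaBP"] + (df["sysBP"] - df["diaBP"]) / 3

    # Bin glucose into four grades (assumes glucose has already been imputed)
    df["diabetes_grade"] = pd.cut(
        df["glucose"],
        bins=[0, 100, 126, 200, float("inf")],
        labels=[0, 1, 2, 3],
    ).astype(int)

    keep = ["age", "cigsPerDay", "BPMeds", "totChol", "BMI",
            "heartRate", "MAP", "diabetes_grade", "TenYearCHD"]
    return df[keep]
```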
While all of the data lay within the range of possible and realistic values for each feature, some data points indicated extremely rare cases that are also medical emergencies. These "outliers" could hamper the model's predictive power. Hence, to avoid losing data, the outliers were capped at maximum and minimum values through Winsorising.
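A minimal sketch of the capping step, assuming quantile-based Winsorising at the 5th and 95th percentiles; the actual limits used in the project may differ:

```python
import pandas as pd


def winsorise(df: pd.DataFrame, columns: list[str],
              lower: float = 0.05, upper: float = 0.95) -> pd.DataFrame:
    """Cap extreme values at the given quantiles instead of dropping them."""
    for col in columns:
        lo, hi = df[col].quantile([lower, upper])
        df[col] = df[col].clip(lower=lo, upper=hi)
    return df


# Example: cap the continuous features of the main dataset
# continuous = ["cigsPerDay", "totChol", "BMI", "heartRate", "MAP"]
# df = winsorise(df, continuous)
```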
The various hypotheses in the above pre-processing steps were also tested by creating a new dataset whenever an assumption was made. In all, four new datasets were created, and each was treated as an iteration to be compared with the original dataset (a sketch of the alternative imputers follows the list below). The iterations were:
- Dropping all data with missing values
- Using KNN Imputer for imputing the missing values of glucose
- Using Regression Imputer for imputing the missing values of glucose
- Trimming of the outliers instead of Winsorising
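A minimal sketch of the alternative glucose imputers from these iterations, using scikit-learn's KNNImputer and IterativeImputer (standing in for the regression imputer); the neighbour count, estimator, and column list are assumptions:

```python
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import KNNImputer, IterativeImputer
from sklearn.linear_model import LinearRegression

# df: raw Framingham dataframe; glucose is estimated from the other numeric features
num_cols = ["age", "cigsPerDay", "totChol", "sysBP", "diaBP",
            "BMI", "heartRate", "glucose"]

# Iteration: KNN imputation (k = 5 is an assumption)
df_knn = df.copy()
df_knn[num_cols] = KNNImputer(n_neighbors=5).fit_transform(df_knn[num_cols])

# Iteration: regression-based imputation
df_reg = df.copy()
reg_imputer = IterativeImputer(estimator=LinearRegression(), random_state=42)
df_reg[num_cols] = reg_imputer.fit_transform(df_reg[num_cols])
```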
After all of the above feature engineering, the dataset was split into train and test sets. An 80-20 split ratio was chosen because the dataset is relatively small, so more training data is left for the models to learn from.
Since the classes of the target variable were heavily imbalanced, SMOTE was used to balance them. Undersampling and Random Oversampling were avoided to prevent loss of data and overfitting of the models, respectively.
Finally, the datasets were scaled using the MinMaxScaler fitted to the training set.
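A minimal sketch of this split-resample-scale pipeline, assuming scikit-learn and imbalanced-learn; the stratified split and random seeds are assumptions:

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from imblearn.over_sampling import SMOTE

# X, y: the final engineered features and the TenYearCHD target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Balance the minority (CHD-positive) class on the training set only
X_train, y_train = SMOTE(random_state=42).fit_resample(X_train, y_train)

# Scale with a MinMaxScaler fitted on the training set, applied to both sets
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```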
Seven models were implemented on the scaled data. GridSearchCV was used to tune the hyperparameters, with RepeatedStratifiedKFold for cross-validation (a sketch of the tuning setup follows the list). The models were:
- Logistic Regression
- Naive Bayes
- Decision Tree
- K-Nearest Neighbours
- Support Vector Machine
- Random Forest
- XGBoost
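A minimal sketch of the tuning setup for one of the models (the Decision Tree), assuming scikit-learn; the parameter grid and fold counts are illustrative assumptions, not the exact values used in the project:

```python
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Hypothetical grid; the actual search spaces differed per model
param_grid = {
    "max_depth": [3, 5, 7, 10, None],
    "min_samples_split": [2, 5, 10],
    "criterion": ["gini", "entropy"],
}

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=42)
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    scoring="recall",   # recall used as the tuning metric (see below)
    cv=cv,
    n_jobs=-1,
)

# X_train, y_train from the split/SMOTE/scaling step above
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```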
The Recall Score was chosen as the evaluation metric for comparison between models. In the context of cardiovascular risk prediction, it is important to identify individuals who are at high risk so that appropriate interventions can be taken to prevent or manage their risk. False negatives (i.e., individuals who are at high risk but are not correctly identified as such) can lead to missed opportunities for prevention or treatment, and may result in adverse health outcomes. Therefore, a high recall rate is desirable in cardiovascular risk prediction models.
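For reference, recall (sensitivity) is the fraction of actual positives that are correctly identified: Recall = TP / (TP + FN), where TP is the number of true positives and FN the number of false negatives; every false negative directly lowers recall.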
The results for the original dataset were:
| Model | Train Recall (%) | Test Recall (%) |
|---|---|---|
| Logistic Regression | 66.999 | 69.607 |
| Naive Bayes | 51.584 | 52.941 |
| Decision Tree | 90.187 | 87.255 |
| KNN | 80.373 | 71.568 |
| SVM | 70.777 | 73.529 |
| Random Forest | 78.983 | 67.647 |
| XGBoost | 80.199 | 77.451 |
On the basis of a comparison of Test Recalls, the Decision Tree model performed best on the original dataset. For each iteration, the maximum Test Recall was lower than this value, so the original dataset was chosen as the best pre-processing approach for this project, and the Decision Tree as the best-performing model.
The predictions from this model were explained, both locally (for each data point) and globally, using SHapley Additive exPlanations (SHAP).
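A minimal sketch of the SHAP analysis, assuming the shap library's TreeExplainer applied to the fitted Decision Tree; variable names and plot choices are illustrative:

```python
import shap

# best_model: the tuned Decision Tree from the grid search;
# X_test_df: scaled test features kept as a pandas DataFrame for readable plots
explainer = shap.TreeExplainer(best_model)
shap_values = explainer.shap_values(X_test_df)

# Global explanation: which features drive predictions across the whole test set
shap.summary_plot(shap_values, X_test_df)

# Local explanation: per-feature contributions for a single patient
# (with older shap versions, classifier outputs are indexed per class)
shap.force_plot(explainer.expected_value[1], shap_values[1][0], X_test_df.iloc[0])
```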
Overall, the Decision Tree proved to be the most effective model for cardiovascular risk prediction among the models considered in this context. This project demonstrates the importance of data pre-processing, feature engineering, and model selection in developing a successful classification model for cardiovascular risk prediction.
- The primary challenge in this project arose right at the onset: understanding the context of the problem, the features, and how they all relate to the risk of heart disease. Understanding features like blood pressure, diabetes, cholesterol, and BMI, and how they are correlated, required discussions with people in the medical profession.
- Understanding the context helped overcome the second challenge of handling missing values and outliers. The missing values needed to be imputed realistically, while the outliers needed to be analysed to determine whether they fall within the possible range of values for each feature. Several methods and combinations were tried to arrive at an optimum set of methodologies.
- The final challenge was tuning the hyperparameters for model implementation. Since models like Random Forest take a long time to train, an ideal set of parameter values needed to be found to achieve maximum recall.
Python libraries were used for handling and manipulating data, for visualisation, and for hypothesis testing, pre-processing, and model training.