Filip Schjerven 2024-01-24
Dissemination and external validation of hypertension risk models are important, but the literature rarely accommodates them [1]. This is especially true for machine learning models, which are often too complex to describe in a traditional appendix. A recent systematic review found that only 4 of 13 studies that developed risk models using machine learning made the developed model available, and only in cases where the model could be presented graphically [1].
We used machine learning to predict the risk of hypertension 11 years later using data from 17 852 HUNT Study participants, and we provide the models developed with the XGBoost, Random Forest, and Elastic regression methods [2-6]. Details on data flow, model development, and data accessibility can be found in the associated publication [2].
In total, we provide 13 risk models:
- an XGBoost model,
- an elastic regression model,
- a Random Forest model,
- a smaller logistic regression model used as a reference,
- a ‘high normal BP’ decision rule,
- an adaptation of the externally developed Framingham risk model to the HUNT Study data,
- an adaptation of the Framingham risk model recalibrated on the HUNT Study data, and
- six more externally developed risk models that were applicable to our data, developed from Chinese, Iranian, and Korean populations.
Two more modelling methods were applied in our study: the K-Nearest Neighbour and SVM methods. These models are not provided here because they cannot be detached from the development data, which we do not have permission to share. Data can be obtained upon approval from REK and the HUNT Research Centre. For more information, see www.ntnu.edu/hunt/data.
library(caret) # version 6.0.93
library(xgboost) # version 1.7.5.1
library(RRF) # version 1.9.4
library(randomForest) # version 4.7.1.1
library(recipes) # version 1.0.1
load("git_models.rds")
load("git_prep.rds")
source("helper_functions.r")
source("auxiliary_models.r")
We include some example data to illustrate the input expected by the different models. All variables and levels are represented. See the development article for information on the variables and how they were recorded and constructed. Note that the example data are simulated, i.e., not real data, and are included only to demonstrate how the resources in this repository may be used.
load("git_example_data.rds")
head(example_data, 2)
## # A tibble: 2 × 19
## Sex Height BMI SysBP DiaBP SeChol SeHDLChol SeTrig SeGluNonFast SeCreaCorr
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 fema… 163 25 130 81 4.8 1.2 1.35 5.2 61
## 2 male 181 23.3 135 71 8.2 0.7 0.8 4.7 65
## # ℹ 9 more variables: GFREstStag <fct>, Age <dbl>, FamCVDHist <fct>,
## # SmokingStatus <fct>, Education <fct>, FamHypHist <fct>, HypOutcome <fct>,
## # PhysAct <fct>, LoveStat <fct>
Variable | Type | Levels |
---|---|---|
Sex | factor | female, male |
Height | numeric | - |
BMI | numeric | - |
SysBP | numeric | - |
DiaBP | numeric | - |
SeChol | numeric | - |
SeHDLChol | numeric | - |
SeTrig | numeric | - |
SeGluNonFast | numeric | - |
SeCreaCorr | numeric | - |
GFREstStag | factor | stage 1, stage 2-5 |
Age | numeric | - |
FamCVDHist | factor | no, yes |
SmokingStatus | factor | never, formerly, currently |
Education | factor | secondary, upper_secondary, high_school, college, post_graduate |
FamHypHist | factor | no, yes |
PhysAct | factor | low, mid, high |
LoveStat | factor | married, never married, divorced |
HypOutcome | ordered factor | Not hypertension < Hypertensive |
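Before passing your own data to the models, it can help to confirm that the factor columns use the levels listed in the table above. The following is a minimal sketch in base R, not part of this repository: `expected_levels` and `find_level_mismatches()` are hypothetical names, the participant is made up, and only a subset of the predictor factors is covered.

```r
# Expected factor levels, copied from the variable table above (subset only).
expected_levels <- list(
  Sex           = c("female", "male"),
  GFREstStag    = c("stage 1", "stage 2-5"),
  FamCVDHist    = c("no", "yes"),
  SmokingStatus = c("never", "formerly", "currently"),
  Education     = c("secondary", "upper_secondary", "high_school",
                    "college", "post_graduate"),
  FamHypHist    = c("no", "yes"),
  PhysAct       = c("low", "mid", "high")
)

# Hypothetical helper: return the names of columns that are missing,
# are not factors, or whose levels differ from the expected ones.
find_level_mismatches <- function(data, expected = expected_levels) {
  bad <- vapply(names(expected), function(v) {
    !(v %in% names(data)) ||
      !is.factor(data[[v]]) ||
      !identical(levels(data[[v]]), expected[[v]])
  }, logical(1))
  names(expected)[bad]
}

# Made-up participant, not real data.
new_data <- data.frame(
  Sex           = factor("female", levels = expected_levels$Sex),
  SmokingStatus = factor("never", levels = expected_levels$SmokingStatus),
  PhysAct       = factor("high") # levels not set explicitly, so it is flagged
)

# Lists the columns that still need attention before prediction.
mismatches <- find_level_mismatches(new_data)
print(mismatches)
```

Setting the levels explicitly with `factor(x, levels = ...)` matters because a factor built from the observed values alone only contains the levels that happen to occur in the new data.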
The ‘prep’ object found in ‘git_prep.rds’ contains two pre-processing functions, which, paired with the ‘dummies_to_categorical()’ function, provide all the preprocessing needed for the included ML models. The steps needed for the various models are as follows:
Model | Preprocessing needed |
---|---|
Elastic regression | Standardization (numerical)* + Dummification (categorical) |
XGBoost | Standardization (numerical)* |
Random Forest | Standardization (numerical)* |
Decision rule | - |
Logistic regression reference model | - |
Framingham risk model | - |
Recalibrated Framingham risk model | - |
Chinese clinical risk model | - |
Chinese clinical risk model, from individuals without diabetes at baseline | - |
KoGES model | - |
TLGS model | - |
CAVAS model | - |
F-CAVAS model | - |
* Standardization to the mean and standard deviation of the training set, see the development article. These are already included in the ‘prep’ object.
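To make the two preprocessing steps concrete, here is a minimal base R sketch of what standardization to training-set statistics and dummification mean. It is an illustration under stated assumptions, not the code behind the ‘prep’ object: the data are made up, `standardize()` is a hypothetical helper, and the treatment-coded dummy layout produced by `model.matrix()` is assumed rather than confirmed to match the repository's encoding.

```r
# Made-up training and new data; the real training statistics live in 'prep'.
train    <- data.frame(SysBP = c(120, 130, 140), BMI = c(22, 25, 28))
new_data <- data.frame(SysBP = c(130, 150),      BMI = c(25, 31))

# Estimate the centering and scaling constants on the training set only.
train_mean <- vapply(train, mean, numeric(1))
train_sd   <- vapply(train, sd, numeric(1))

# Hypothetical helper: apply the stored training constants to new data.
# The constants are never re-estimated on the new data.
standardize <- function(data, mu, sigma) {
  cols <- names(mu)
  data[cols] <- Map(function(x, m, s) (x - m) / s, data[cols], mu, sigma)
  data
}
new_std <- standardize(new_data, train_mean, train_sd)
print(new_std)

# Dummification: one 0/1 column per non-reference level (treatment coding).
smoking <- factor(c("never", "currently"),
                  levels = c("never", "formerly", "currently"))
dummies <- model.matrix(~ smoking)[, -1, drop = FALSE]
print(dummies)
```

Reusing the training-set mean and standard deviation keeps new observations on the scale the models were fitted on, which is why these constants are shipped with the preprocessing object rather than recomputed.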
The following example shows how to predict on the example data with each type of model:
# Elastic regression
prepped_example_data <- bake(prep, example_data)
model <- models$elastic
y_hat <- predict(model, newdata=prepped_example_data, type="prob")
# XGBoost or Random Forest
prepped_example_data <- bake(prep, example_data) %>% dummies_to_categorical()
model <- models$xgb # or models$rf
y_hat <- predict(model, newdata=prepped_example_data, type="prob")
# Auxiliary models: note the use of 'example_data', i.e., the unprepped data
# External models: chinese_risk_model, chinese_risk_model_no_diabetes, koges, tlgs, cavas, f_cavas, framingham
# Other: logreg_reference, high_normal_bp_rule, recalibrated_framingham
model <- logreg_reference
y_hat <- predict(model, newdata=example_data)
head(y_hat, n=2)
## Normotensive Hypertensive
## 1 0.7570686 0.2429314
## 2 0.8176177 0.1823823
To encourage the use of existing hypertension risk models, we searched the literature for risk models that suited our time frame and whose variables could be adapted to those available in the HUNT Study data. We found seven models from five articles: the Framingham risk model [7], the Chinese risk models [8], the KoGES model [9], the TLGS model [10], and the CAVAS and F-CAVAS models [11]. We include these models here to support reproduction and further external validation. The only features they need are age, systolic BP, diastolic BP, BMI, sex, smoking status, and family history of hypertension. A recalibrated version of the Framingham risk model is also included. A meta-analysis of external validations of the Framingham risk model by other researchers can be found in [1].
Details on the adaptations made to the risk models are described in the development article [2].
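The recalibration actually applied to the Framingham risk model is described in the development article and is not reproduced here. Purely as a generic illustration, the sketch below shows logistic recalibration, one common way to adapt an external model to a new population: the intercept and slope are re-estimated on the logit of the original predictions. All data are simulated, and the names `p_orig`, `recal_fit`, and `p_recal` are hypothetical.

```r
set.seed(42)
n <- 2000

# Predictions from a hypothetical external model (simulated, not real data).
p_orig <- runif(n, 0.05, 0.95)

# Simulated outcomes in a new population where the external model is
# miscalibrated: the true log-odds are -1 + 0.8 * logit(p_orig).
y <- rbinom(n, 1, plogis(-1 + 0.8 * qlogis(p_orig)))

# Logistic recalibration: regress the outcome on the logit of the original
# predictions, re-estimating only an intercept and a slope.
recal_fit <- glm(y ~ qlogis(p_orig), family = binomial())
p_recal   <- predict(recal_fit, type = "response")

# Calibration-in-the-large before and after recalibration.
print(c(observed = mean(y), original = mean(p_orig), recalibrated = mean(p_recal)))
print(coef(recal_fit))
```

Because only two parameters are re-estimated, the ranking of individuals by predicted risk is preserved, while the average predicted risk is aligned with the observed event rate in the new data.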
We include a formatting function that makes it easy to use the performance metrics of the caret package alongside custom metrics. The scaled Brier score and the Integrated Calibration Index are included as custom functions [12-14]. See, e.g., https://topepo.github.io/caret/measuring-performance.html for more performance metrics. Note that because the example data are only for demonstration, the performance measures are shown for illustrative purposes only and do not reflect real performance.
preds <- format_predictions(outcome=prepped_example_data$HypOutcome, predicted_probs=y_hat)
# Area under the receiver operating characteristic curve (ROC), sensitivity, and specificity.
# twoClassSummary throws an error for tibble input, hence the conversion to a data frame.
twoClassSummary(as.data.frame(preds), lev = levels(preds$obs))
## ROC Sens Spec
## 0.5191888 0.6938776 0.3962264
# Scaled Brier score, Integrated Calibration Index, and area under the ROC curve
totalSummary(preds, lev = levels(preds$obs))
## sBrier Score ICI AUC
## -0.09612304 0.09939779 0.51918881
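For readers who want to see what the two custom metrics measure, here is a minimal base R sketch following the definitions in references 12 and 14. It is not the implementation behind `totalSummary()`: the function names `scaled_brier()` and `ici()` are hypothetical, the data are simulated, the maximum Brier score is assumed to be computed from the observed prevalence, and the smoother is a default `loess()` fit.

```r
# Scaled Brier score: 1 - Brier / Brier_max, where Brier_max is the Brier
# score of a non-informative model that always predicts the prevalence.
scaled_brier <- function(y, p) {
  brier      <- mean((p - y)^2)
  prevalence <- mean(y)
  brier_max  <- prevalence * (1 - prevalence)
  1 - brier / brier_max
}

# Integrated Calibration Index: mean absolute difference between the
# predicted probabilities and a loess-smoothed estimate of the observed
# event rate at those probabilities.
ici <- function(y, p) {
  smoothed <- predict(loess(y ~ p))
  mean(abs(smoothed - p))
}

# Simulated, well-calibrated predictions (not real data).
set.seed(1)
p <- runif(500)
y <- rbinom(500, 1, p)

sbrier_value <- scaled_brier(y, p)
ici_value    <- ici(y, p)
print(c(sBrier = sbrier_value, ICI = ici_value))
```

A scaled Brier score of 1 corresponds to perfect predictions and 0 to always predicting the prevalence; a negative value, as in the simulated example output above, means the predictions score worse than that non-informative baseline. An ICI of 0 indicates perfect calibration.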
1. Schjerven, F. E., Lindseth, F. & Steinsland, I. Prognostic risk models for incident hypertension: A PRISMA systematic review and meta-analysis. PLOS ONE 19, e0294148 (2024).
2. Schjerven, F. E., Ingeström, E. M. L., Steinsland, I. & Lindseth, F. Development of risk models of incident hypertension using machine learning on the HUNT study data. Scientific Reports 14, 5609 (2024).
3. Åsvold, B. O. et al. Cohort Profile Update: The HUNT Study, Norway. International Journal of Epidemiology (2022) doi:10.1093/ije/dyac095.
4. Chen, T. et al. Xgboost: Extreme Gradient Boosting. (2021).
5. Breiman, L. Random Forests. Machine Learning 45, 5–32 (2001).
6. Zou, H. & Hastie, T. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67, 301–320 (2005).
7. Parikh, N. I. et al. A risk score for predicting near-term incidence of hypertension: The Framingham Heart Study. Annals of internal medicine 148, 102–110 (2008).
8. Chien, K.-L. et al. Prediction models for the risk of new-onset hypertension in ethnic Chinese in Taiwan. Journal of human hypertension 25, 294–303 (2011).
9. Lim, N.-K., Lee, J.-W. & Park, H.-Y. Validation of the Korean Genome Epidemiology Study Risk Score to Predict Incident Hypertension in a Large Nationwide Korean Cohort. Circulation journal : official journal of the Japanese Circulation Society 80, 1578–1582 (2016).
10. Koohi, F. et al. Validation of the Framingham hypertension risk score in a middle eastern population: Tehran lipid and glucose study (TLGS). BMC Public Health 21, (2021).
11. Namgung, H. K. et al. Development and validation of hypertension prediction models: The Korean Genome and Epidemiology Study_cardiovascular Disease Association Study (KoGES_CAVAS). Journal of human hypertension (2022) doi:10.1038/s41371-021-00645-x.
12. Austin, P. C. & Steyerberg, E. W. The Integrated Calibration Index (ICI) and related metrics for quantifying the calibration of logistic regression models. Statistics in Medicine 38, 4051–4065 (2019).
13. Brier, G. W. Verification of forecasts expressed in terms of probability. Monthly weather review 78, 1–3 (1950).
14. Steyerberg, E. W. et al. Assessing the performance of prediction models: A framework for traditional and novel measures. Epidemiology (Cambridge, Mass.) 21, 128–138 (2010).