This study investigates the causal relationships between various diagnostic measurements (such as number of pregnancies, skin thickness, and age) and the onset of diabetes. A predictive model is developed to forecast the onset of diabetes based on a range of health and external factors. The investigation employs several machine learning (ML) algorithms to model these relationships. Our study explores factors beyond glucose levels that may contribute to the diagnosis of diabetes. For instance, we demonstrate that BMI is a significant variable that plays a major role in the onset of diabetes. The findings reveal the multifaceted nature of diabetes and highlight the importance of comprehensive health care. Future research may benefit from broadening the study population beyond women of Pima Indian heritage and incorporating additional external factors such as lifestyle and genetics.
Diabetes is a chronic medical condition that affects the metabolic process of converting food into energy. Over time, uncontrolled diabetes can cause severe complications such as cardiovascular disease and vision impairment. High glucose levels can also cause vascular degradation, leading to cardiovascular collapse, and nerve malfunction, inducing short- and long-term problems in the eyes, feet, heart, and kidneys. While there is no cure, current treatments focus on managing blood sugar levels with insulin.
ML algorithms can find patterns among several independent variables to estimate the likelihood of developing diabetes. For our dataset, the most accurate model was trained using the Decision Tree algorithm, a supervised learning algorithm that can be used for both regression and classification problems. It predicts the likelihood of developing diabetes from simple decision rules inferred from the training data. We compared its prediction accuracy with that of models trained using the Logistic Regression and Random Forest algorithms. We attribute the higher accuracy of the Decision Tree model to the specifics of the training dataset, because, in general, the Random Forest algorithm demonstrates the highest accuracy. Accuracy could be improved further by training a Random Forest model with more trees, although this requires more computational power.
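The comparison described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the exact pipeline used in this study: the helper `compare_models`, the 75/25 split, the random seed, and the default of 100 trees are all hypothetical choices, and the data are a synthetic stand-in with eight features generated by scikit-learn rather than the study's dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def compare_models(X, y, n_trees=100, seed=0):
    """Fit three classifiers on one split and return their test accuracies.

    Hypothetical helper: the split ratio, seed, and tree count are
    assumptions made for illustration only.
    """
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed, stratify=y
    )
    models = {
        "decision_tree": DecisionTreeClassifier(random_state=seed),
        "logistic_regression": LogisticRegression(max_iter=1000),
        "random_forest": RandomForestClassifier(
            n_estimators=n_trees, random_state=seed
        ),
    }
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        scores[name] = accuracy_score(y_test, model.predict(X_test))
    return scores


# Synthetic stand-in data (not the study's dataset): 8 features, binary label.
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=5, random_state=0
)
scores = compare_models(X, y)
for name, acc in scores.items():
    print(f"{name}: {acc:.3f}")
```

Raising `n_trees` is one way to explore the trade-off noted above between Random Forest accuracy and computation time. The ranking obtained on synthetic data says nothing about the ranking on the real dataset.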
Our primary dataset for the prediction model was obtained from Kaggle and originates from the National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK). All patients are women at least 21 years old of Pima Indian heritage. The variables of the dataset are the number of pregnancies, glucose level, blood pressure, skin thickness, insulin level, body mass index (BMI), diabetes pedigree function, and age. We used the following Python libraries: numpy, matplotlib, pandas, seaborn, and sklearn.
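As a rough sketch of how such a dataset might be loaded with pandas, the snippet below separates the eight variables listed above from the diagnosis label. The column names follow the commonly distributed Kaggle CSV and are an assumption here, as are the hypothetical helper `load_diabetes_data` and the label column `Outcome`. The three rows are invented values used only to make the example runnable; they are not records from the dataset.

```python
import io

import pandas as pd

# Assumed column names (as in the commonly distributed Kaggle CSV).
FEATURES = [
    "Pregnancies",
    "Glucose",
    "BloodPressure",
    "SkinThickness",
    "Insulin",
    "BMI",
    "DiabetesPedigreeFunction",
    "Age",
]
TARGET = "Outcome"  # assumed label column: 1 = diabetes, 0 = no diabetes


def load_diabetes_data(source):
    """Read a CSV (path or file-like object) and split features from label."""
    df = pd.read_csv(source)
    missing = [c for c in FEATURES + [TARGET] if c not in df.columns]
    if missing:
        raise ValueError(f"missing expected columns: {missing}")
    return df[FEATURES], df[TARGET]


# Invented rows for illustration only; replace with the path to the real CSV.
sample_csv = io.StringIO(
    "Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,"
    "DiabetesPedigreeFunction,Age,Outcome\n"
    "2,120,70,25,80,28.5,0.45,33,0\n"
    "5,165,82,30,150,35.1,0.62,47,1\n"
    "1,95,64,20,60,23.9,0.30,24,0\n"
)
X, y = load_diabetes_data(sample_csv)
print(X.shape)
print(y.value_counts().to_dict())
```

Checking for the expected columns up front makes a mismatch between the assumed and actual file layout fail early, before any model is trained.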