GitHub - adityapat96/thyroid: Machine Learning Predictions for Thyroid Data Analysis

This project makes predictions on a thyroid dataset by preprocessing the data and then training machine learning models. It has two objectives: to predict whether a patient is likely to undergo thyroid surgery, and to predict whether a patient has hyperthyroidism or hypothyroidism. The results are discussed below, together with possibilities for future work on the predictions.

Dataset:

The dataset used for this project can be found at: https://www.kaggle.com/datasets/emmanuelfwerr/thyroid-disease-data.

Data Preprocessing:

The following steps were performed for data preprocessing:

Encoded the Sex column by replacing "t" with "1" and "f" with "0". Rows with Sex equal to "-1" were dropped.
Encoded the other binary columns the same way, replacing "t" with "1" and "f" with "0".
Changed the data type of the binary columns to "int."
Replaced "?" with "0" in the blood test measurement columns.
Dropped rows with age less than 18 or greater than 98.
Cleaned the "target" column by mapping its values to thyroid diagnoses and renamed it to "diagnosis_name."
Divided the data into age groups and calculated average values of the 5 blood tests for each group.
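
The steps above can be sketched in pandas roughly as follows. This is a minimal illustration on a toy table, not the code from the notebooks: the column names (`sex`, `on_thyroxine`, `TSH`, `T3`, `target`), the diagnosis code mapping, and the age-group boundaries are all assumptions made for the example.

```python
import pandas as pd

# Toy rows standing in for thyroid.csv; every column name here is an assumption.
raw = pd.DataFrame({
    "age": [25, 17, 60, 45, 120, 33],
    "sex": ["t", "f", "-1", "f", "t", "t"],
    "on_thyroxine": ["f", "t", "f", "f", "t", "f"],
    "TSH": ["1.2", "?", "0.4", "3.1", "2.0", "?"],
    "T3": ["2.0", "1.8", "?", "2.2", "1.9", "2.4"],
    "target": ["-", "A", "F", "-", "A", "F"],
})

BINARY_COLS = ["sex", "on_thyroxine"]
BLOOD_COLS = ["TSH", "T3"]
# Hypothetical mapping from target codes to diagnosis names.
DIAGNOSIS = {"-": "negative", "A": "hyperthyroid", "F": "hypothyroid"}


def preprocess(df):
    # Drop rows with the "-1" Sex marker and rows with age outside 18..98.
    df = df[(df["sex"] != "-1") & df["age"].between(18, 98)].copy()
    # Encode "t"/"f" flags as integers 1/0.
    for col in BINARY_COLS:
        df[col] = df[col].map({"t": 1, "f": 0}).astype(int)
    # Replace the "?" placeholder with "0", then convert to numbers.
    for col in BLOOD_COLS:
        df[col] = pd.to_numeric(df[col].replace("?", "0"))
    # Map target codes to names and keep only the renamed column.
    df["diagnosis_name"] = df["target"].map(DIAGNOSIS)
    return df.drop(columns="target")


clean = preprocess(raw)

# Bucket ages into groups (boundaries are illustrative) and average each blood test.
clean["age_group"] = pd.cut(
    clean["age"], bins=[17, 40, 65, 98], labels=["18-40", "41-65", "66-98"]
)
group_means = clean.groupby("age_group", observed=True)[BLOOD_COLS].mean()
print(group_means)
```

Note that because "?" is replaced with "0" before averaging, missing measurements count as zeros and pull the group averages down.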

First Prediction: Thyroid Surgery

Models were run using Random Oversampling and Random Undersampling techniques. The following models were employed:

Random Oversample:

Logistic Regression
Random Forest
Cross Validation on Random Forest
Neural Networks

Random Undersample:

Logistic Regression
Random Forest
Neural Networks

Models trained on the oversampled data generally performed better (around 98%) than models trained on the undersampled data.
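
As a rough sketch of the two resampling strategies, the example below balances a synthetic imbalanced dataset by random oversampling and random undersampling, then fits Logistic Regression and Random Forest on each. The data, the `surgery` label name, the `random_resample` helper, and all hyperparameters are assumptions for illustration; the notebooks may use a different library or setup, and the neural network is omitted here.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic imbalanced data (about 10% positives) standing in for the surgery label.
X, y = make_classification(
    n_samples=1000, n_features=8, weights=[0.9, 0.1], random_state=0
)
data = pd.DataFrame(X, columns=[f"f{i}" for i in range(8)])
data["surgery"] = y
features = [c for c in data.columns if c != "surgery"]

# Split first, so the test set keeps the original class balance.
train, test = train_test_split(
    data, test_size=0.25, stratify=data["surgery"], random_state=0
)


def random_resample(df, label, mode, seed=0):
    """Hypothetical helper: balance classes by duplicating minority rows
    (mode="over") or by dropping majority rows (mode="under")."""
    counts = df[label].value_counts()
    target = counts.max() if mode == "over" else counts.min()
    parts = []
    for cls, n in counts.items():
        rows = df[df[label] == cls]
        if mode == "over" and n < target:
            extra = rows.sample(n=target - n, replace=True, random_state=seed)
            rows = pd.concat([rows, extra])
        elif mode == "under" and n > target:
            rows = rows.sample(n=target, replace=False, random_state=seed)
        parts.append(rows)
    # Shuffle so the classes are interleaved.
    return pd.concat(parts).sample(frac=1, random_state=seed)


over = random_resample(train, "surgery", "over")
under = random_resample(train, "surgery", "under")

scores = {}
for name, sample in [("over", over), ("under", under)]:
    for model in (
        LogisticRegression(max_iter=1000),
        RandomForestClassifier(n_estimators=50, random_state=0),
    ):
        model.fit(sample[features], sample["surgery"])
        scores[(name, type(model).__name__)] = model.score(
            test[features], test["surgery"]
        )

# 5-fold cross-validation of Random Forest on the oversampled training data.
cv_scores = cross_val_score(
    RandomForestClassifier(n_estimators=50, random_state=0),
    over[features], over["surgery"], cv=5,
)
print(scores)
print(cv_scores.mean())
```

One design point worth keeping in mind: when cross-validation runs on data that has already been oversampled, duplicated rows can land in both the training and validation folds, so those scores tend to be optimistic compared with scores on an untouched test set.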

Second Prediction: Hyperthyroidism vs Hypothyroidism

The following models were run:

Decision Tree Classifier
Random Forest
Logistic Regression

For overall thyroid diagnosis, Decision Tree and Random Forest achieved around 99% accuracy, while Logistic Regression had slightly lower accuracy. Similarly, for hypothyroidism conditions, Decision Tree and Random Forest achieved around 99% accuracy, with Logistic Regression performing slightly lower.
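
A comparison of this kind could look like the sketch below, which fits the three classifiers on the same split and reports test accuracy for each. The three-class synthetic data stands in for the diagnosis labels, and the split ratio and hyperparameters are assumptions, so the printed numbers will not match the results reported above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class data standing in for the diagnosis labels.
X, y = make_classification(
    n_samples=900, n_features=10, n_informative=5, n_classes=3, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

# Fit each model on the same training split and score it on the same test split.
accuracy = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    accuracy[name] = accuracy_score(y_test, model.predict(X_test))

for name, value in accuracy.items():
    print(f"{name}: {value:.3f}")
```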

Conclusion:

Models trained on oversampled data generally performed better than models trained on undersampled data, which is attributed to the dataset's imbalanced nature.
The overall diagnosis prediction achieved high accuracy (around 99% for the tree-based models), with Logistic Regression performing slightly lower.
The results may be influenced by the limited features in the original dataset and the prevalence of binary labels.
Future work can focus on additional features related to dates, more categorical features, patient-level data, and exploring the relationship between specific blood tests and hyperthyroidism/hypothyroidism.

Future Work/Research:

Incorporate additional features related to dates for a more comprehensive analysis.
Introduce more categorical features rather than relying heavily on binary labels.
Shift the focus to patient-level data to gain deeper insights into individual cases.
Investigate the relationship between specific blood tests and their significance in machine learning models for predicting hyperthyroidism or hypothyroidism.

Repository files:

First_prediction.ipynb
Preliminary_Data.ipynb
README.md
Second_prediction.ipynb
thyroid.csv