This repository contains the code and results of a comprehensive study on the classification of breast cancer using machine learning methods, as part of a diploma thesis project. The data used in the study were extracted from fine-needle aspiration (FNA) samples, and a total of 13 machine learning algorithms were employed to classify the samples as benign or malignant.
The repository is organized into the following folders:
- Breast Cancer Testing: Contains all the files from the early stages of the project where various experiments and tests were conducted.
- Cross Validation (Wrong): Contains the first completed attempt of the study, but this was an incorrect approach. The results of this attempt are not representative due to data leakage during cross-validation.
- Nested Cross Validation: This is the final folder, in which nested cross-validation is used to optimize the parameters of the algorithms, resulting in unbiased results. In the inner loop, the score is approximately maximized by fitting a model to each training set, and then directly maximized in selecting (hyper)parameters over the validation set. In the outer loop, the metrics are estimated by averaging test set scores over several dataset splits.
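The inner and outer loops described above can be sketched with scikit-learn as follows. This is a minimal illustration, not the thesis code: the SVM estimator, the parameter grid, and the random seeds are assumptions, and the data are loaded from scikit-learn's bundled copy of WDBC rather than from the files in this repository. The fold counts (10 outer, 5 inner) follow the description in this README.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# scikit-learn's bundled WDBC copy encodes malignant as 0 and benign as 1;
# flip the labels so that malignant is the positive class for the F1-score.
X, y = load_breast_cancer(return_X_y=True)
y = 1 - y

# Assumed estimator and grid, chosen only for illustration.
# Scaling lives inside the pipeline so it is refit on every training fold,
# which avoids leaking statistics from held-out data.
model = make_pipeline(StandardScaler(), SVC())
param_grid = {"svc__C": [0.1, 1, 10], "svc__gamma": ["scale", 0.01]}

inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)   # tuning
outer_cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)  # evaluation

# Inner loop: pick hyperparameters on the validation folds of each training set.
search = GridSearchCV(model, param_grid, scoring="f1", cv=inner_cv)

# Outer loop: score the tuned model on test folds it never saw during tuning.
nested_scores = cross_val_score(search, X, y, scoring="f1", cv=outer_cv)
print(f"Nested CV F1: {nested_scores.mean():.3f} +/- {nested_scores.std():.3f}")
```

Passing the `GridSearchCV` object itself to `cross_val_score` is what makes the procedure nested: the hyperparameter search is repeated from scratch inside each outer training fold.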
The dataset used for this project was the Wisconsin Diagnostic Breast Cancer (WDBC) dataset, which contains 569 samples and 30 features.
The features were calculated from images of the fine-needle aspiration (FNA) samples. Because the dataset is small, 10-fold cross-validation was used to evaluate the performance of the algorithms, and a nested 5-fold cross-validation was used to tune their hyperparameters.
The thesis emphasizes reducing the number of features while maintaining high accuracy. Three feature sets were used: one with all features, one with a subset of seven features, and one with features extracted using Principal Component Analysis (PCA). The results provide insight into which machine learning methods are most effective and efficient for breast cancer classification, and into how reducing the number of features affects the performance of the algorithms.
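As a rough sketch of the PCA-based feature set, the example below places scaling and PCA inside a pipeline so that both are fitted only on the training folds. The number of components is not stated here, so keeping enough components to explain 95% of the variance is an assumption, as are the choice of Linear Discriminant Analysis as the classifier and the use of scikit-learn's bundled copy of WDBC.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Bundled WDBC copy: malignant is 0, benign is 1. Flip so malignant is positive.
X, y = load_breast_cancer(return_X_y=True)
y = 1 - y

# Assumption: retain the components explaining 95% of the variance.
pca_model = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95),
    LinearDiscriminantAnalysis(),
)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
pca_scores = cross_val_score(pca_model, X, y, scoring="f1", cv=cv)

# Fit once on all data only to report how many components were kept.
pca_model.fit(X, y)
n_kept = pca_model.named_steps["pca"].n_components_
print(f"{n_kept} of {X.shape[1]} dimensions kept, mean F1 = {pca_scores.mean():.3f}")
```

Fitting PCA on the full dataset before splitting would let test-fold information shape the components, which is the kind of leakage the nested cross-validation setup is meant to avoid.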
The study employed 13 machine learning algorithms: Gaussian Naive Bayes, Linear and Quadratic Discriminant Analysis, Ridge Classifier, k-Nearest Neighbors, Support Vector Machines, Decision Tree, Random Forest, Gradient Tree Boosting, AdaBoost, XGBoost, Stochastic Gradient Descent, and Multi-Layer Perceptron. The performance of each algorithm was evaluated using the F1-score, the harmonic mean of precision and recall, as the primary metric. Accuracy, precision, and recall were also reported.
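The four metrics can be collected in a single cross-validation run, as in the sketch below. It uses Gaussian Naive Bayes, one of the algorithms listed above, on scikit-learn's bundled copy of WDBC; the random seed and the decision to treat malignant as the positive class are assumptions made for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.naive_bayes import GaussianNB

# Bundled WDBC copy: malignant is 0, benign is 1. Flip so malignant is positive,
# which makes precision, recall, and F1 refer to the malignant class.
X, y = load_breast_cancer(return_X_y=True)
y = 1 - y

metrics = ["f1", "accuracy", "precision", "recall"]
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# cross_validate returns one array of per-fold scores for each requested metric,
# stored under the key "test_<metric name>".
results = cross_validate(GaussianNB(), X, y, scoring=metrics, cv=cv)

for name in metrics:
    fold_scores = results[f"test_{name}"]
    print(f"{name:>9}: {fold_scores.mean():.3f} +/- {fold_scores.std():.3f}")
```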
The code in this repository was developed using Python 3.x and the following packages:
- Pandas
- Scikit-Learn
- NumPy
- SciPy
- Matplotlib
- Seaborn
- Clone the repository to your local machine.
- Install the required packages using `pip install -r requirements.txt`.
- Open the Jupyter Notebook file in the repository and run the code cells.
- The results of the study and the performance of the algorithms can be found in the notebook.