Breast Cancer Classification with Machine Learning Methods

Introduction

This repository contains the code and results of a comprehensive study on the classification of breast cancer using machine learning methods, as part of a diploma thesis project. The data used in the study were extracted from fine-needle aspiration (FNA) samples, and a total of 13 machine learning algorithms were employed to classify the samples as benign or malignant.

Folders & Files

Breast Cancer Testing: Contains all the files from the early stages of the project where various experiments and tests were conducted.
Cross Validation (Wrong): Contains the first completed attempt of the study. The approach was incorrect: data leakage during cross-validation makes its results unrepresentative.
Nested Cross Validation: Contains the final version of the study, which uses nested cross-validation to tune the hyperparameters of the algorithms and obtain unbiased performance estimates. In the inner loop, a model is fitted to each training set and the hyperparameters that maximize the score on the validation set are selected. In the outer loop, the metrics are estimated by averaging test set scores over several dataset splits.
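
As a rough illustration of the nested scheme described above, here is a minimal sketch using scikit-learn. It is not the code from the notebooks: the SVM, the `param_grid` values, and the random seeds are assumptions chosen for the example, and the data comes from scikit-learn's bundled copy of WDBC. The 10 outer and 5 inner folds follow the numbers given in the Data section. Scaling is placed inside the pipeline so that it is re-fitted on each training fold, which is one common way to avoid the kind of leakage mentioned for the earlier attempt.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Preprocessing lives inside the pipeline, so it only ever sees training folds.
model = make_pipeline(StandardScaler(), SVC())
param_grid = {"svc__C": [0.1, 1, 10]}  # illustrative grid, not the thesis grid

inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# Inner loop: choose hyperparameters on validation folds of the training data.
search = GridSearchCV(model, param_grid, cv=inner_cv, scoring="f1")

# Outer loop: score each tuned model on a test fold the search never saw.
outer_scores = cross_val_score(search, X, y, cv=outer_cv, scoring="f1")
print(f"nested CV F1: {outer_scores.mean():.3f} +/- {outer_scores.std():.3f}")
```

Passing the `GridSearchCV` object itself to `cross_val_score` is what makes the procedure nested: the search is repeated from scratch inside every outer training split.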

Data

The dataset used for this project was the Wisconsin Diagnostic Breast Cancer (WDBC) dataset, which contains 569 samples and 30 features.
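
For a quick look at the data, scikit-learn ships a copy of WDBC. The sketch below loads it and checks the dimensions quoted above. Whether the notebooks load the data this way or from a file is not stated here, so treat the loader as an assumption.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X, y = data.data, data.target

n_samples, n_features = X.shape
class_names = [str(name) for name in data.target_names]
class_counts = np.bincount(y)  # index 0 = malignant, index 1 = benign

print(n_samples, n_features)                          # 569 30
print(dict(zip(class_names, class_counts.tolist())))  # {'malignant': 212, 'benign': 357}
```

Note that in this copy the label 0 means malignant and 1 means benign, which matters when choosing the positive class for precision, recall, and F1.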

The features are calculated from images of the fine-needle aspiration (FNA) samples. Because the dataset is small, 10-fold cross-validation was used to evaluate the performance of the algorithms, with a nested 5-fold cross-validation for tuning their hyperparameters.

Feature Selection & Dimensionality Reduction

The thesis emphasizes reducing the number of features while maintaining high accuracy. Three feature sets were used: all 30 features, a subset of seven features, and features extracted with Principal Component Analysis (PCA). The results show which machine learning methods are most effective and efficient for breast cancer classification, and how reducing the number of features affects the performance of the algorithms.
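
The comparison of the three feature sets can be sketched as follows. Several parts of this example are assumptions rather than the method used in the thesis: this README does not say which seven features were kept or how they were chosen, so `SelectKBest` with an ANOVA F-test stands in for that step; the number of PCA components (7) and the k-Nearest Neighbors classifier are likewise picked only for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

feature_sets = {
    "all 30 features": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    # Stand-in selection step: the thesis subset of seven features is not specified here.
    "7 selected features": make_pipeline(
        StandardScaler(), SelectKBest(f_classif, k=7), KNeighborsClassifier()
    ),
    # Hypothetical number of components, chosen to match the subset size.
    "7 PCA components": make_pipeline(
        StandardScaler(), PCA(n_components=7), KNeighborsClassifier()
    ),
}

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
results = {
    name: cross_val_score(pipe, X, y, cv=cv, scoring="f1").mean()
    for name, pipe in feature_sets.items()
}
for name, score in results.items():
    print(f"{name}: mean F1 = {score:.3f}")
```

Keeping the selection and PCA steps inside the pipeline means they are fitted on the training folds only, so the reduced feature sets are evaluated without leaking information from the test folds.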

Algorithms

The study employed 13 machine learning algorithms: Gaussian Naive Bayes, Linear and Quadratic Discriminant Analysis, Ridge Classifier, k-Nearest Neighbors, Support Vector Machines, Decision Tree, Random Forest, Gradient Tree Boosting, AdaBoost, XGBoost, Stochastic Gradient Descent, and Multi-Layer Perceptron. The performance of each algorithm was evaluated using the F1-score, the harmonic mean of precision and recall, as the primary metric. Accuracy, precision, and recall were also reported.
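
A minimal sketch of how several of these classifiers could be scored on the four metrics at once is shown below. It uses default hyperparameters and only three of the 13 algorithms, so its numbers are not the thesis results. Treating malignant as the positive class (by flipping scikit-learn's labels) is an assumption; this README does not state which class was used as positive.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)
y = 1 - y  # assumption: make malignant (0 in scikit-learn's copy) the positive class

models = {
    "Gaussian Naive Bayes": GaussianNB(),
    "Linear Discriminant Analysis": LinearDiscriminantAnalysis(),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}
scoring = ["f1", "accuracy", "precision", "recall"]
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

summary = {}
for name, model in models.items():
    scores = cross_validate(model, X, y, cv=cv, scoring=scoring)
    # cross_validate returns one array per metric under the key "test_<metric>".
    summary[name] = {metric: scores[f"test_{metric}"].mean() for metric in scoring}
    print(name, {metric: round(value, 3) for metric, value in summary[name].items()})
```

For reference, F1 = 2 * precision * recall / (precision + recall), so a model only scores well if both precision and recall are high.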

Requirements

The code in this repository was developed using Python 3.x and the following packages:

Pandas
Scikit-Learn
NumPy
SciPy
Matplotlib
Seaborn

Usage

Clone the repository to your local machine.
Install the required packages with: pip install -r requirements.txt
Open the Jupyter Notebook file in the repository and run the code cells.
The results of the study and the performance of the algorithms can be found in the notebook.

LazarosPan/Breast-Cancer-Classification-with-Machine-Learning-Methods
