My postgraduate studies project: A classifier to determine whether a mushroom is poisonous or edible
In this project, a classifier was built to determine whether a mushroom is poisonous based on its external characteristics. The DALEX package was also used to interpret the classifier and to explain the prediction for a selected observation.
The data was donated to the UCI Machine Learning Repository in April 2021 (http://archive.ics.uci.edu/ml/datasets/Secondary+Mushroom+Dataset) and contains 61069 hypothetical mushroom observations that were randomly simulated from the popular Mushroom Data Set (1987), which contains information on 173 fungal species. The dataset was created as part of the thesis work of D. Heider and G. Hattab at the University of Marburg.
The dataset consists of a class target variable, which indicates whether the mushroom is poisonous or edible, and 20 predictor variables.
- Label encoding: The dataset consists mostly of categorical variables. Because most machine learning algorithms expect numerical inputs, the categorical variables had to be encoded. Label Encoding was chosen over One-Hot Encoding because the qualitative variables have a rather large number of classes, so One-Hot Encoding would generate a very large number of predictors. Encoding was performed with scikit-learn.
- DALEX: The project was an opportunity for me to try out the DALEX library, which includes methods for the exploration, explanation and visualization of machine learning models. Interest in XAI (Explainable Artificial Intelligence) and in the development of new XAI methods has been growing recently, driven by the increasing demand for interpretation of "black box" models. In my work, I visualized the differences between 3 models and showed how each model handles the classification of a selected observation (example histogram for the Logistic Regression model below).
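The label encoding trade-off described above can be illustrated with a minimal sketch. It is not the project's actual code: the toy DataFrame, its column names and its category codes are assumptions made for illustration, and `LabelEncoder` is applied column by column as one plausible way to do label encoding with scikit-learn.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for a few categorical columns; names and values are illustrative only.
df = pd.DataFrame({
    "class": ["p", "e", "e", "p", "e"],
    "cap-shape": ["x", "f", "s", "x", "b"],
    "cap-color": ["n", "w", "n", "y", "w"],
})

encoded = df.copy()
encoders = {}
for column in encoded.columns:
    encoder = LabelEncoder()
    # Each category is replaced by an integer code (categories are sorted first).
    encoded[column] = encoder.fit_transform(encoded[column])
    encoders[column] = encoder  # kept so codes can be mapped back to labels

# Label encoding keeps one column per variable.
print("label-encoded columns:", encoded.shape[1])

# One-hot encoding creates one column per category of every predictor.
one_hot = pd.get_dummies(df.drop(columns="class"))
print("one-hot predictor columns:", one_hot.shape[1])
```

Note that the integer codes impose an arbitrary (alphabetical) order on the categories, which a linear model such as logistic regression will treat as numeric. scikit-learn also provides `OrdinalEncoder`, which performs the same kind of encoding for several feature columns at once.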
The project is divided into the following parts:
- Introduction
- Data description
- Preparing data for modeling
3.1. Importing Libraries and Dataset
3.2. EDA
3.2.1. Continuous Variables
3.2.2. Qualitative variables
3.3. Data Cleaning (NAs)
3.3.1. Label Encoding
3.3.2. Checking correlation between variables
3.4. Split Dataset to Train and Test sets
- Modeling
4.1.
4.1.1. Logistic regression
4.1.2. Support vector machine (SVM)
4.1.3. Random Forest
4.2. Cross-validation
4.2.1. Logistic regression
4.2.2. Support Vector Machine (SVM)
4.2.3. Random Forest
4.3. Summary
4.4. Interpretation of models with DALEX
- Conclusions
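As a rough illustration of the modeling steps in the outline (train/test split, the three models, cross-validation and a DALEX explanation), here is a minimal sketch. It makes several assumptions: the data is synthetic (generated with `make_classification` as a stand-in for the encoded mushroom data), the feature names, split ratio, fold count and hyperparameters are hypothetical, and `dalex` is treated as an optional dependency so the sketch still runs without it.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the encoded data: 20 predictors and a binary target.
features, target = make_classification(
    n_samples=400, n_features=20, n_informative=8, random_state=0
)
X = pd.DataFrame(features, columns=[f"feature_{i}" for i in range(20)])
y = pd.Series(target, name="class")

# Hold out 25% of the rows for testing, keeping the class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    test_accuracy = model.score(X_test, y_test)
    # 5-fold cross-validation on the training set (works on clones of the model).
    cv_mean = cross_val_score(model, X_train, y_train, cv=5).mean()
    results[name] = (test_accuracy, cv_mean)
    print(f"{name}: test accuracy={test_accuracy:.3f}, CV mean={cv_mean:.3f}")

# Explain one test observation with DALEX if the package is available.
try:
    import dalex as dx
except ImportError:
    dx = None
    print("dalex is not installed; skipping the explanation step")

if dx is not None:
    observation = X_test.iloc[[0]]
    # Only the models exposing predict_proba are explained in this sketch.
    for name in ("Logistic Regression", "Random Forest"):
        explainer = dx.Explainer(
            models[name], X_train, y_train, label=name, verbose=False
        )
        breakdown = explainer.predict_parts(observation, type="break_down")
        print(breakdown.result[["variable", "contribution"]].head())
```

Comparing the hold-out accuracy with the cross-validated mean gives a quick check of whether a single split is representative. The break-down table lists how much each variable contributes to the prediction for the chosen observation, and calling `breakdown.plot()` draws it as a chart.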
Python libraries:
NumPy
Pandas
Matplotlib.pyplot
Seaborn
scikit-learn
DALEX