Analyzing a dataset on features for water, and trying to classify it as potable (drinkable) or not-potable. Utilized binary classification techniques to make predictions.

bwoody13/water-potability

Water Potability Classification

A binary classification problem that determines if water is drinkable based upon numerical features such as pH levels.

Approach

I performed exploratory data analysis to better understand the data and to inform pre-processing decisions.

Once the data was prepared, I tried various supervised learning models, tuning hyper-parameters with cross-validation and trying different evaluation metrics.

After these models were trained and tuned, I evaluated their performance on the test set.
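As a rough illustration of the tuning step, here is a minimal sketch of a cross-validated grid search with scikit-learn. The synthetic data, the parameter grid, and the `f1` scoring choice are assumptions made for this example and are not the settings used in this repository.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the water data (assumption: 9 numeric features).
X, y = make_classification(n_samples=300, n_features=9, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Scaling lives inside the pipeline so each CV fold is scaled on its own training part.
pipe = make_pipeline(StandardScaler(), SVC())

# Hypothetical grid; make_pipeline names the SVC step "svc".
param_grid = {"svc__C": [0.1, 1, 10], "svc__gamma": ["scale", 0.1]}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="f1")
search.fit(X_train, y_train)

print(search.best_params_, round(search.best_score_, 3))
```

Swapping the `scoring` argument (for example `"accuracy"` or `"roc_auc"`) is one way to try different evaluation metrics with the same search.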
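The held-out evaluation could look something like the sketch below. It uses synthetic data and a single KNN model as placeholders; the metrics shown are common choices for binary classification and are an assumption, not necessarily the ones reported here.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the water data (assumption).
X, y = make_classification(n_samples=300, n_features=9, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1
)

# Fit on the training split only; the test split is touched once, at the end.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

test_accuracy = accuracy_score(y_test, y_pred)
test_f1 = f1_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)  # rows: true class, columns: predicted class

print(test_accuracy, test_f1)
print(cm)
```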

Insights

This dataset proved to be a very difficult classification task: it was very hard to push performance above the low to mid-60s percent range.

The most effective single classifiers ended up being SVM and KNN models. However, the best performance came from custom voting and stacking ensembles that combined these single models with built-in ensemble methods such as Random Forests and AdaBoost.
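A minimal sketch of how such voting and stacking ensembles can be assembled in scikit-learn follows. The base-model settings, the hard-voting choice, the logistic-regression meta-learner, and the synthetic data are all assumptions for illustration; the actual ensembles in this repository may be configured differently.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    RandomForestClassifier,
    StackingClassifier,
    VotingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the water data (assumption).
X, y = make_classification(n_samples=300, n_features=9, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Base learners: scaled SVM and KNN plus two built-in ensembles.
base_models = [
    ("svm", make_pipeline(StandardScaler(), SVC())),
    ("knn", make_pipeline(StandardScaler(), KNeighborsClassifier())),
    ("rf", RandomForestClassifier(n_estimators=25, random_state=0)),
    ("ada", AdaBoostClassifier(n_estimators=25, random_state=0)),
]

# Hard voting: each base model casts one vote per sample.
voting = VotingClassifier(estimators=base_models, voting="hard")
voting.fit(X_train, y_train)
voting_acc = voting.score(X_test, y_test)

# Stacking: a meta-learner is trained on cross-validated base-model outputs.
stacking = StackingClassifier(
    estimators=base_models, final_estimator=LogisticRegression(), cv=5
)
stacking.fit(X_train, y_train)
stacking_acc = stacking.score(X_test, y_test)

print(voting_acc, stacking_acc)
```

Hard voting is used here because `SVC` does not expose class probabilities unless it is built with `probability=True`, which soft voting would require.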

Final results and remarks

Ensembling multiple models was the most effective approach. Different imputation methods helped with certain models, but overall I believe the hardest parts of this classification task are the domain knowledge needed to construct better features and the small number of examples. This was a great learning experience and allowed me to get familiar with many applications of ML, namely those within scikit-learn. I also gained experience analyzing the data to better inform the decisions made about each model.
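To show what comparing imputation methods can look like, here is a small sketch that cross-validates the same KNN classifier behind two different imputers. The imputers chosen (median and KNN-based), the 10% missing rate, and the synthetic data are assumptions for this example only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the water data, with roughly 10% of values knocked out.
X, y = make_classification(n_samples=300, n_features=9, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.1] = np.nan

# Candidate imputers to compare (hypothetical choices).
imputers = {
    "median": SimpleImputer(strategy="median"),
    "knn": KNNImputer(n_neighbors=5),
}

scores = {}
for name, imputer in imputers.items():
    # The imputer sits inside the pipeline so it is refit on each training fold.
    model = make_pipeline(imputer, StandardScaler(), KNeighborsClassifier())
    scores[name] = cross_val_score(model, X, y, cv=5).mean()

print(scores)
```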
