A binary classification problem: predicting whether water is drinkable based on numerical features such as pH level.
I performed exploratory data analysis (EDA) to better understand the data and to inform pre-processing decisions.
Once the data was prepared, I tried various supervised learning models, tuning hyperparameters with cross-validation and comparing different evaluation metrics.
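A minimal sketch of what cross-validated hyperparameter tuning can look like in scikit-learn. The synthetic data from `make_classification`, the SVM pipeline, the parameter grid, and the `f1` scoring choice are all illustrative assumptions, not the settings used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the water data (assumption: 8 numeric features, binary target).
X, y = make_classification(n_samples=300, n_features=8, n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Scaling lives inside the pipeline so it is re-fit on each training fold only.
pipe = Pipeline([("scale", StandardScaler()), ("clf", SVC())])

# Hypothetical grid; the "clf__" prefix routes each parameter to the SVC step.
param_grid = {"clf__C": [0.1, 1, 10], "clf__gamma": ["scale", 0.1]}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(pipe, param_grid, scoring="f1", cv=cv)
search.fit(X_train, y_train)

print("best params:", search.best_params_)
print("mean CV score:", search.best_score_)

# The held-out test set is scored only once, after tuning is finished.
test_score = search.score(X_test, y_test)
print("test score:", test_score)
```

Changing the `scoring` argument (for example to `"accuracy"` or `"roc_auc"`) is one way to compare evaluation metrics without touching the rest of the search.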
Once these supervised learning methods were trained and tuned, I evaluated their performance on the test set.
This dataset proved to be a very difficult classification task: it was hard to score above the low-to-mid 60s in percentage terms.
The most effective single classifiers ended up being SVM and KNN models. However, the best performance came from custom voting and stacking ensembles that combined these single models with built-in ensemble methods such as Random Forest and AdaBoost.
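As a rough illustration of voting and stacking over these model families, the sketch below uses scikit-learn's built-in `VotingClassifier` and `StackingClassifier` rather than custom implementations. The synthetic data, the default hyperparameters, and the `LogisticRegression` meta-learner are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    RandomForestClassifier,
    StackingClassifier,
    VotingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the water data (assumption).
X, y = make_classification(n_samples=300, n_features=8, n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# SVM and KNN are distance-based, so they get a scaler; the tree ensembles do not need one.
base_models = [
    ("svm", make_pipeline(StandardScaler(), SVC())),
    ("knn", make_pipeline(StandardScaler(), KNeighborsClassifier())),
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("ada", AdaBoostClassifier(n_estimators=50, random_state=0)),
]

# Hard voting: each base model casts one vote and the majority class wins.
voting = VotingClassifier(estimators=base_models, voting="hard")

# Stacking: a meta-learner is trained on cross-validated outputs of the base models.
stacking = StackingClassifier(
    estimators=base_models, final_estimator=LogisticRegression(), cv=5
)

scores = {}
for name, model in [("voting", voting), ("stacking", stacking)]:
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # mean accuracy on the test split

print(scores)
```

Both ensembles clone the base estimators when fitted, so the same `base_models` list can safely be shared between them.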
Overall, ensembling multiple models was the most effective approach. Different imputation methods helped with certain models, but I believe the hardest parts of this classification task are the lack of domain knowledge needed to construct better features and the small number of examples. This was a great learning experience that let me get familiar with many applications of ML, particularly those available in scikit-learn. I also gained experience analyzing the data to better inform the decisions made for each model.
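For reference, comparing imputation strategies for a single model could be sketched as follows. The randomly injected missing values, the three imputers shown, and the KNN classifier are assumptions for illustration, not a record of what was actually tried here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with roughly 10% of values knocked out at random (assumption).
X, y = make_classification(n_samples=300, n_features=8, n_informative=4, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.1] = np.nan

imputers = {
    "mean": SimpleImputer(strategy="mean"),
    "median": SimpleImputer(strategy="median"),
    "knn": KNNImputer(n_neighbors=5),
}

results = {}
for name, imputer in imputers.items():
    # The imputer sits inside the pipeline so it is fit on each training fold only.
    pipe = make_pipeline(imputer, StandardScaler(), KNeighborsClassifier())
    results[name] = cross_val_score(pipe, X, y, cv=5).mean()

print(results)
```

Swapping `KNeighborsClassifier` for another estimator in the pipeline shows whether a given imputation strategy helps that particular model.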