As part of the Machine Learning Study Group from the recworks meet-a-mentor community, we work on a Kaggle ML project.
We use Jupyter Notebook, Python and some libraries (Pandas, NumPy, Matplotlib, and Scikit-learn) to solve ML problems on public datasets. We have started with the following dataset from Kaggle:
https://www.kaggle.com/c/house-prices-advanced-regression-techniques
This project is the joint effort of a study group of enthusiastic people from different professional backgrounds. Anyone is welcome to join and contribute to the discussions.
The best way to join the project is by attending our regular sessions. Also, feel free to fork the repository and check what we have done. We are very happy to receive any comments and suggestions for improvement.
The repository contains two notebooks.
The first notebook contains:
- 1 - Define the problem
- 2 - Load data and display info
- 3 - Prepare data
  - [Identify features]
    - Separate numerical from categorical features
    - Separate nominal and ordinal (from categorical features)
  - [Clean data]
    - Remove numerical features with missing values
    - Remove categorical features with missing values
    - Drop outliers in numerical values # WIP
  - [Transform]
    - Transform categorical values # TODO
  - [Identify features]
- 4 - Feature selection
  - [Select features using random forest classifier]
- 5 - Spot check algorithms
  - [Split dataset]
  - [Train on multiple algorithms]
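The steps outlined above can be sketched roughly as follows. This is a minimal illustration, not the code from the notebook: the small synthetic DataFrame is a hypothetical stand-in for the Kaggle data (only the column names are borrowed from it), and the choice of models is an assumption. Because `SalePrice` is a continuous target, the sketch uses `RandomForestRegressor` for the feature-selection step rather than a classifier.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Hypothetical stand-in data (NOT the real Kaggle dataset).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "LotArea": rng.normal(10000, 2000, n),
    "GrLivArea": rng.normal(1500, 300, n),
    "LotFrontage": rng.normal(70, 10, n),
    "MiscVal": rng.normal(0, 1, n),  # pure noise, unrelated to the target
    "Neighborhood": rng.choice(["A", "B", "C"], n),
    "Alley": rng.choice(["Grvl", "Pave"], n),
})
df["SalePrice"] = (50 * df["GrLivArea"] + 5 * df["LotArea"]
                   + rng.normal(0, 5000, n))
df.loc[:9, "LotFrontage"] = np.nan  # inject missing values
df.loc[:49, "Alley"] = np.nan

# 3 - Prepare data: separate numerical from categorical features.
target = "SalePrice"
features = df.drop(columns=target)
numerical = features.select_dtypes(include="number").columns.tolist()
categorical = features.select_dtypes(exclude="number").columns.tolist()

# Clean data: remove every feature that has missing values.
complete = features.dropna(axis="columns")
num_kept = [c for c in numerical if c in complete.columns]

# 4 - Feature selection: rank numerical features by forest importance.
X = complete[num_kept]
y = df[target]
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
importances = pd.Series(forest.feature_importances_, index=num_kept)
selected = importances.sort_values(ascending=False).index[:2].tolist()

# 5 - Spot check algorithms: split the dataset, then cross-validate
# several models on the training part.
X_train, X_test, y_train, y_test = train_test_split(
    X[selected], y, test_size=0.2, random_state=0)
models = {
    "linear": LinearRegression(),
    "tree": DecisionTreeRegressor(random_state=0),
    "forest": RandomForestRegressor(n_estimators=100, random_state=0),
}
scores = {}
for name, model in models.items():
    cv = cross_val_score(model, X_train, y_train, cv=5, scoring="r2")
    scores[name] = cv.mean()
    print(f"{name}: mean R^2 = {scores[name]:.3f}")
```

Dropping whole columns with missing values mirrors the outline; imputing them instead would be an alternative worth comparing once the categorical transform is in place.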
The second notebook is a reference for some techniques to detect and remove outlier samples from the dataset.
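As one possible illustration of outlier removal, the sketch below applies the interquartile range (IQR) rule to a single numerical column. The IQR rule is an assumption chosen because it is a common technique, not necessarily one the notebook uses; the helper name `drop_iqr_outliers` and the sample values are hypothetical.

```python
import pandas as pd


def drop_iqr_outliers(df, column, k=1.5):
    """Keep rows whose `column` value lies within k * IQR of the quartiles."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return df[df[column].between(lower, upper)]


# Made-up sample: the last house has an unusually large living area.
houses = pd.DataFrame({
    "GrLivArea": [1200, 1350, 1500, 1600, 1750, 1800, 5600],
    "SalePrice": [140000, 155000, 170000, 180000, 200000, 210000, 160000],
})

cleaned = drop_iqr_outliers(houses, "GrLivArea")
print(len(houses), "rows before,", len(cleaned), "rows after")
```

The multiplier `k=1.5` is the conventional default; a larger value removes fewer samples, which may be preferable on a small dataset.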