- Find the best machine learning model and params for a given dataset.
Find the Kaggle competition with your cohort name, i.e. diamonds-databcn0722, https://www.kaggle.com/competitions/diamonds-databcn0722/overview
-
- Processing/cleaning the dataset: this should be later modularized in functions.
-
-
Train a model (fit & predict) with the data in
train.csv
. This file DOES contain a y (price).- Do train, test, split on
train.csv
if necessary. - Choose the best model regarding the metrics. In this case, the lowest RMSE (error).
2.1. Export the model: we don't want to invest time/RAM resources on training the model again in the future.
- Do train, test, split on
-
-
- Apply the same cleaning to
test.csv
. This files does NOT contain a y (no price column).
- Apply the same cleaning to
-
- We'll apply the already trained model from step 2 to the
test.csv
file. With this we'll generate a new column with the predicted values.
- We'll apply the already trained model from step 2 to the
-
- Generate a
submission.csv
file with only two columns: the ID of the diamond & the predicted price (y).
- Generate a
In other words: use train.csv
to generate and save a model. Use test.csv
to predict new values. Then generate a DF with ID & predicted.
-
Jupyter notebooks where you show the process you followed to get to your submissions.
-
A slide (.ppt, ipynb, etc) with a summary of the metrics you obtained and the rationale behind it.
- Why do those params work better than others?
- Check often for the df.shape & len of the things your working with
- Do make sure you ONLY have two columns. When saving the file, you might save an "extra column" (index). So make sure you don't include it. There should only be two columns: id & price
- Take advantage of the daily submissions. Try at least one today!