This repository presents an example of predicting car prices from each car's age and technical characteristics. A baseline linear regression model is compared with 2 models selected by AutoML tools (H2O and TPOT).
About the dataset.
This dataset is a reduced version of the Kaggle dataset available at https://www.kaggle.com/datasets/klkwak/toyotacorollacsv
Each row of the table describes one car sale: the properties of the car and the transaction amount.
The 9 features are:
- 5 numeric: Age, KM, HP, CC (cubic capacity), Weight;
- 3 categorical, encoded as numbers: MetColor, Automatic, Doors;
- 1 categorical, encoded as a string: FuelType (values: Petrol, Diesel, CNG).
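Because FuelType is stored as a string, it has to be converted to numbers before most regression models can use it. The snippet below is a minimal sketch of one common option, one-hot encoding, written in plain Python. The helper name `one_hot_fuel_type` and the output column names are assumptions made for illustration; the notebooks in this repository may encode the column differently.

```python
# The three FuelType values listed above.
FUEL_TYPES = ["Petrol", "Diesel", "CNG"]


def one_hot_fuel_type(value):
    """Map a FuelType string to three 0/1 indicator columns (hypothetical helper)."""
    if value not in FUEL_TYPES:
        raise ValueError(f"unknown FuelType: {value!r}")
    # Exactly one indicator is 1; the other two are 0.
    return {f"FuelType_{name}": int(value == name) for name in FUEL_TYPES}


encoded = one_hot_fuel_type("Diesel")
print(encoded)
```

With pandas, the same transformation is usually done on a whole column with `pandas.get_dummies`.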
About the problem.
Model accuracy is evaluated by its RMSE (root-mean-square error) on the test set.
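For reference, RMSE is the square root of the mean squared difference between actual and predicted prices, so it is expressed in the same unit as the price. The function below is a minimal sketch in plain Python; the prices in the usage example are made-up values, not rows from the dataset.

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-square error between two equally long sequences of numbers."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("at least one observation is required")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up prices, used only to show the call.
actual = [10000, 12000, 9000]
predicted = [11000, 12000, 7000]
print(rmse(actual, predicted))  # about 1291
```

Because the errors are squared before averaging, one large miss (here 2000) weighs more than several small ones.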
Selected models and AutoML instruments.
- Baseline: linear regression on the 3 features 'Age', 'HP', 'Weight' (lin_reg_crossval.ipynb).
- Model selected using H2O (h2o_regression.ipynb).
- Model selected using TPOT (TPOT_regression.ipynb, exported file: TPOT_regression.py): a RandomForestRegressor.
EDA and the baseline model were run locally in the environment specified in environment.yml.
AutoML model selection was performed on Google Colab servers.
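To illustrate the baseline idea, here is a minimal sketch of a cross-validated linear regression on three features. It does not reproduce lin_reg_crossval.ipynb: it uses only NumPy, the function name `kfold_rmse` is hypothetical, and the data are synthetic stand-ins for 'Age', 'HP' and 'Weight' generated from an invented linear relation, so the printed numbers say nothing about the real dataset.

```python
import numpy as np


def kfold_rmse(X, y, n_splits=5, seed=0):
    """Return the test RMSE of an ordinary least-squares fit on each fold."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_splits)
    scores = []
    for k in range(n_splits):
        test_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(n_splits) if j != k])
        # Prepend an intercept column, then solve the least-squares problem.
        A_train = np.column_stack([np.ones(len(train_idx)), X[train_idx]])
        coef, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)
        A_test = np.column_stack([np.ones(len(test_idx)), X[test_idx]])
        residuals = y[test_idx] - A_test @ coef
        scores.append(float(np.sqrt(np.mean(residuals ** 2))))
    return scores


# Synthetic stand-ins for 'Age', 'HP', 'Weight' (NOT the real data).
data_rng = np.random.default_rng(42)
n = 200
age = data_rng.uniform(1, 80, n)
hp = data_rng.uniform(60, 190, n)
weight = data_rng.uniform(1000, 1600, n)
X = np.column_stack([age, hp, weight])
# Invented linear relation plus noise, used only to exercise the code.
price = 3000 - 120 * age + 30 * hp + 15 * weight + data_rng.normal(0, 1000, n)

fold_rmse = kfold_rmse(X, price)
print("RMSE per fold:", [round(s) for s in fold_rmse])
print("mean RMSE:", round(float(np.mean(fold_rmse))))
```

With scikit-learn, the same procedure is usually written with `LinearRegression` and `cross_val_score`; the NumPy version is used here only to keep the sketch free of extra dependencies.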
Results.
- RMSE of the baseline model: 1592$
- RMSE of the model selected by H2O in 1 minute: 1122$
- RMSE of the model selected by TPOT in 30 seconds (RandomForestRegressor): 1195$
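The three RMSE values can be compared as relative reductions with respect to the baseline. The short calculation below uses only the figures reported above; the variable names are illustrative.

```python
baseline_rmse = 1592
automl_rmse = {"H2O": 1122, "TPOT": 1195}

# Percentage by which each AutoML model lowers the baseline RMSE.
reductions = {
    name: 100 * (baseline_rmse - value) / baseline_rmse
    for name, value in automl_rmse.items()
}

for name, pct in reductions.items():
    print(f"{name}: {pct:.1f}% lower RMSE than the baseline")
# H2O: 29.5% lower RMSE than the baseline
# TPOT: 24.9% lower RMSE than the baseline
```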
Feedback and additional questions.
All questions about the source code should be addressed to its author, Alexandre Aksenov:
- GitHub: Alexandre-aksenov
- Email: [email protected]