04. Tree models

Tabular data

Feature preprocessing

Tree-based models

  • Decision tree
  • Random Forest
  • Extra trees
  • Adaboost
  • Gradient Boosting
  • XGBoost
  • LightGBM
  • CatBoost

Non-tree-based models

  • Linear Models
  • Neural Networks
  • K-Nearest Neighbors
  • Support Vector Machines
Categorical
  • Ordinal
  • Label encoding
  • Frequency encoding
  • One hot encoding
  • Embedding

Numerical
  • Nothing
  • MinMaxScaler
  • StandardScaler
  • Skewed?
    • np.log(1+x)
    • np.sqrt(x+2/3)
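A minimal sketch of these options with pandas, NumPy, and scikit-learn (the column names and toy values are made up for illustration; embeddings are omitted since they require training a neural network):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, MinMaxScaler, StandardScaler

# Toy frame with made-up columns, just to illustrate each option
df = pd.DataFrame({
    "city":  ["madrid", "paris", "madrid", "rome"],  # categorical
    "price": [100.0, 250.0, 80.0, 3000.0],           # numerical, skewed
})

# Categorical: label encoding (one integer per category)
df["city_label"] = LabelEncoder().fit_transform(df["city"])

# Categorical: frequency encoding (how often each category appears)
df["city_freq"] = df["city"].map(df["city"].value_counts(normalize=True))

# Categorical: one hot encoding (one binary column per category)
df = df.join(pd.get_dummies(df["city"], prefix="city"))

# Numerical: scaling (only needed for non-tree models)
df["price_minmax"] = MinMaxScaler().fit_transform(df[["price"]]).ravel()
df["price_std"] = StandardScaler().fit_transform(df[["price"]]).ravel()

# Numerical, skewed: compress the long tail
df["price_log"] = np.log1p(df["price"])          # np.log(1 + x)
df["price_sqrt"] = np.sqrt(df["price"] + 2 / 3)  # np.sqrt(x + 2/3)
```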

Models

| Model             | Comment                               | Library                     | More info |
|-------------------|---------------------------------------|-----------------------------|-----------|
| Decision Tree     | Simple and explainable.               | Sklearn                     |           |
| Linear models     | Simple and explainable.               | Sklearn                     |           |
| Random Forest     | Good starting point (tree ensemble).  | Sklearn                     |           |
| Gradient Boosting | Usually the best (tree ensemble).     | XGBoost, LightGBM, CatBoost |           |
| Neural Network    | Good if you have a lot of data.       | Fast.ai v2                  | blog      |

We have two approaches to tabular modelling: decision tree ensembles, and neural networks. And we have mentioned two different decision tree ensembles: random forests, and gradient boosting. Each is very effective, but each also has compromises:

Random Forest

Random forests are the easiest to train, because they are extremely resilient to hyperparameter choices and require very little preprocessing. They are very fast to train, and should not overfit if you have enough trees. But they can be a little less accurate, especially when extrapolation is required, such as predicting future time periods.
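A minimal sketch of that baseline with scikit-learn, using the built-in California housing data as a stand-in for your own table; the defaults need almost no tuning:

```python
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Any tabular dataset works; California housing is just a convenient built-in
X, y = fetch_california_housing(return_X_y=True)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=42)

# Defaults are already sensible; more trees rarely hurts, only costs time
model = RandomForestRegressor(n_estimators=200, n_jobs=-1, random_state=42)
model.fit(X_train, y_train)

print("validation R^2:", model.score(X_valid, y_valid))
```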

Gradient Boosting

In theory they are just as fast to train as random forests, but in practice you will have to try many different hyperparameter settings. They can overfit. At inference time they are slower, because the trees cannot operate in parallel. But they are often a little more accurate than random forests.
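A sketch of the hyperparameter-heavy workflow with XGBoost (one of the libraries from the table above), assuming a recent xgboost version (≥ 1.6) where `early_stopping_rounds` is a constructor argument; the values shown are only illustrative starting points, not tuned choices:

```python
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

X, y = fetch_california_housing(return_X_y=True)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=42)

# These values are only a starting point; in practice you tune them
model = XGBRegressor(
    n_estimators=1000,         # upper bound; early stopping picks the real number
    learning_rate=0.05,        # lower = slower to train but usually more accurate
    max_depth=6,               # main lever against overfitting
    subsample=0.8,             # row sampling, another regulariser
    early_stopping_rounds=50,  # stop when the validation score stalls
)
model.fit(X_train, y_train, eval_set=[(X_valid, y_valid)], verbose=False)

print("validation R^2:", model.score(X_valid, y_valid))
```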

Neural Network

Neural networks take the longest time to train and require extra preprocessing, such as normalisation; the same normalisation must then be applied at inference time as well. They can provide great results and extrapolate well, but only if you are careful with your hyperparameters and take care to avoid overfitting.
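A minimal sketch of the normalisation point, using scikit-learn's `MLPRegressor` as a simple stand-in for fast.ai: wrapping the scaler and the network in one pipeline guarantees the training-time normalisation statistics are reused at inference time:

```python
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = fetch_california_housing(return_X_y=True)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=42)

# The pipeline stores the scaler's fitted statistics, so the exact same
# normalisation is reapplied automatically when predicting on new data
model = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=500, random_state=42),
)
model.fit(X_train, y_train)

print("validation R^2:", model.score(X_valid, y_valid))
```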

Conclusion

We suggest starting your analysis with a random forest. This will give you a strong baseline, and you can be confident that it's a reasonable starting point. You can then use that model for feature selection and partial dependence analysis, to get a better understanding of your data.
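For example, with scikit-learn you can run both analyses directly on that baseline forest; a sketch (ideally you would compute the importances on a held-out validation set, and `matplotlib` is assumed for the plot):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import PartialDependenceDisplay, permutation_importance

data = fetch_california_housing()
X, y, names = data.data, data.target, data.feature_names

model = RandomForestRegressor(n_estimators=100, n_jobs=-1, random_state=42)
model.fit(X, y)

# Permutation importance: how much the score drops when a feature is shuffled
imp = permutation_importance(model, X, y, n_repeats=5, random_state=42)
for name, score in sorted(zip(names, imp.importances_mean), key=lambda t: -t[1]):
    print(f"{name:12s} {score:.3f}")

# Partial dependence: the model's average prediction as one feature varies
PartialDependenceDisplay.from_estimator(model, X, features=[0, 1], feature_names=names)
plt.show()
```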

From that foundation, you can try Gradient Boosting and Neural Nets, and if they give you significantly better results on your validation set in a reasonable amount of time, you can use them.