Bayesian hyperparameter tuning for `LGBMClassifier`, `LGBMRegressor`, `CatBoostClassifier` and `CatBoostRegressor` with a scikit-learn API
`gecs` is a tool that helps automate hyperparameter tuning for boosting classifiers and regressors, which can save significant time and computational resources during model building and optimization. GEC stands for Good Enough Classifier: the tuning is handled for you, so you can focus on other tasks such as feature engineering. And if you deploy 100 of them, you get 100 gecs.
The primary class in this package is `LightGEC`, which is derived from `LGBMClassifier`. Like its parent, `LightGEC` can be used to build and train gradient boosting models, with the added feature of automated Bayesian hyperparameter optimization. It can be imported from `gecs.lightgec` and used as a drop-in replacement for `LGBMClassifier`, with the same API.
By default, `LightGEC` optimizes `num_leaves`, `boosting_type`, `learning_rate`, `reg_alpha`, `reg_lambda`, `min_child_samples`, `min_child_weight`, `colsample_bytree`, `subsample_freq`, `subsample` and, optionally, `n_estimators`. Which hyperparameters to tune is fully customizable.
Installation requires `cmake`, which can be installed using `apt` on Linux or `brew` on macOS. Then you can install (100)gecs using pip:

```bash
pip install gecs
```
The `LightGEC` class provides the same API to the user as the `LGBMClassifier` class of `lightgbm`, and additionally:

- two additional parameters to the `fit` method:
  - `n_iter`: defines the number of hyperparameter combinations that the model should try. More iterations can lead to better model performance, but at the expense of computational resources
  - `fixed_hyperparameters`: allows the user to specify hyperparameters that the GEC should not optimize. By default, only `n_estimators` is fixed. Any of the `LGBMClassifier` init arguments can be fixed, and so can `subsample_freq` and `subsample`, but only jointly, by passing the value `bagging` (see the sketch after this list)
- the methods `serialize` and `deserialize`, which store the `LightGEC` state for the hyperparameter optimization process, but not the fitted `LGBMClassifier` parameters, to a JSON file. To store the boosted tree model itself, you have to provide your own serialization or use `pickle`
- the methods `freeze` and `unfreeze`, which turn the `LightGEC` functionally into an `LGBMClassifier` and back
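A call that runs 50 optimization iterations while keeping `n_estimators` and the joint bagging parameters fixed might look like the following sketch. The list format of `fixed_hyperparameters` is an assumption based on the description above; check the package documentation for the exact signature.

```python
from sklearn.datasets import load_iris

from gecs.lightgec import LightGEC

X, y = load_iris(return_X_y=True)

gec = LightGEC()
# n_iter: number of hyperparameter combinations to try.
# "bagging" fixes subsample_freq and subsample jointly
# (list-of-names format is assumed, not confirmed).
gec.fit(X, y, n_iter=50, fixed_hyperparameters=["n_estimators", "bagging"])
```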
The default use of `LightGEC` looks like this:
```python
from sklearn.datasets import load_iris
from gecs.lightgec import LightGEC  # LGBMClassifier with hyperparameter optimization
from gecs.lightger import LightGER  # LGBMRegressor with hyperparameter optimization
from gecs.catgec import CatGEC  # CatBoostClassifier with hyperparameter optimization
from gecs.catger import CatGER  # CatBoostRegressor with hyperparameter optimization

X, y = load_iris(return_X_y=True)

# fit and infer GEC
gec = LightGEC()
gec.fit(X, y)
yhat = gec.predict(X)

# manage GEC state
path = "./gec.json"
gec.serialize(path)  # stores gec data and settings, but not underlying LGBMClassifier attributes
gec2 = LightGEC.deserialize(path, X, y)  # X and y are necessary to fit the underlying LGBMClassifier
gec.freeze()  # freeze GEC so that it behaves like an LGBMClassifier
gec.unfreeze()  # unfreeze to enable GEC hyperparameter optimization

# benchmark against LGBMClassifier
from lightgbm import LGBMClassifier
from sklearn.model_selection import cross_val_score
import numpy as np

clf = LGBMClassifier()
lgbm_score = np.mean(cross_val_score(clf, X, y))

gec.freeze()
gec_score = np.mean(cross_val_score(gec, X, y))

print(f"{gec_score = }, {lgbm_score = }")
assert gec_score > lgbm_score, "GEC doesn't outperform LGBMClassifier"

# check which hyperparameter combinations were tried
gec.tried_hyperparameters()
```
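Note that `serialize` persists only the optimization state, not the trained trees. To persist the fitted model itself, the standard library's `pickle` can be used, as mentioned above. A minimal sketch, assuming the fitted `LightGEC` object is picklable like its `LGBMClassifier` parent:

```python
import pickle

# persist the fitted model, including the boosted trees
with open("gec.pkl", "wb") as f:
    pickle.dump(gec, f)

# restore it later for inference
with open("gec.pkl", "rb") as f:
    gec_restored = pickle.load(f)

yhat = gec_restored.predict(X)
```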
If you want to contribute, please reach out and I'll design a process around it.
License: MIT
You can find my contact information on my website: https://leonluithlen.eu