enh: Post-Improving Boosting Models #1315
Labels
- architecture: (re)design of existing or new framework subsystem
- composer: related to GP-composition algorithm
- enhancement: new feature or request
Summary
TLDR: Improve support for boosting models (XGBoost, LightGBM, CatBoost) after their refactoring: add them to the initial assumptions and presets, introduce dedicated evolutionary mutations, allow fitting on data with NaNs, and enable GPU acceleration.
Motivation
The motivation for refactoring the boosting models is described in #1155, #1209, and #1264.
Results of testing on OpenML are available here. During development, further ideas for improvement arose.
Guide-level explanation
1. Updating initial assumptions with boosting models
Add more pipelines containing boosting models to the initial assumptions, and update the presets to use boosting models. Note that boosting models support several strategies that can be used in different pipelines and presets; more information about the strategies is available in the documentation of each boosting framework.
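As a rough illustration of what such initial assumptions could look like, here is a minimal sketch using FEDOT's `PipelineBuilder`; the particular operation names and the idea of joining two boosting branches are assumptions for illustration, not a fixed proposal.

```python
from fedot.core.pipelines.pipeline_builder import PipelineBuilder

# Sketch: two candidate initial assumptions built around boosting models.
# A simple linear pipeline: scaling followed by XGBoost.
xgboost_assumption = PipelineBuilder().add_node('scaling').add_node('xgboost').build()

# A two-branch pipeline with XGBoost and CatBoost joined by a LightGBM meta-model.
ensemble_assumption = (
    PipelineBuilder()
    .add_branch('xgboost', 'catboost')
    .join_branches('lgbm')
    .build()
)

initial_assumptions = [xgboost_assumption, ensemble_assumption]
```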
2. Evolutionary Mutations for Boosting Models
With the updated parameters, it becomes possible to add new mutations for pipelines with boosting models (a sketch of such a mutation follows the list below):
- Boosting strategy mutation.
- Using-category mutation: toggle `enable_categorical` (default: `True`).
- Change `early_stopping_rounds` to change the fitting time in the population.
- Improving the metric of the XGBoost model mutation: `max_depth`, `min_child_weight`, `gamma`, `lambda`.
- Improving robustness to noise of the XGBoost model mutation: `subsample`, `colsample_bytree`, `colsample_bylevel`, `colsample_bynode` by some step; the `dart` strategy method.
- Improving the metric of the LightGBM model mutation: `learning_rate`, `max_bin`, `num_iterations`, `num_leaves`; the `dart` strategy method.
- Improving robustness to overfitting of the LightGBM model mutation: `max_bin` and `num_leaves`, `min_data_in_leaf` and `min_sum_hessian_in_leaf`, `bagging_fraction` and `bagging_freq`, `feature_fraction`, `lambda_l1`, `lambda_l2`, `min_gain_to_split` and `extra_trees`, `max_depth`.
- Improving robustness to overfitting of the CatBoost model mutation: `l2_leaf_reg`, `colsample_bylevel`, `subsample`, `max_depth`, `iterations`.
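As a minimal sketch of how one of these mutations might work, the function below perturbs the robustness-related XGBoost parameters; the function name, step sizes, parameter ranges, and probabilities are all illustrative assumptions, not part of the proposal itself.

```python
import random

def xgboost_noise_robustness_mutation(params: dict) -> dict:
    """Sketch of an 'improving robustness to noise' mutation for XGBoost.

    Randomly nudges one of the subsampling parameters by a small step and
    occasionally switches the booster to the 'dart' strategy. All ranges,
    steps, and probabilities here are illustrative assumptions.
    """
    params = dict(params)  # do not mutate the original individual
    target = random.choice(
        ['subsample', 'colsample_bytree', 'colsample_bylevel', 'colsample_bynode']
    )
    step = random.choice([-0.1, 0.1])
    # Keep the sampling ratio within an assumed sensible range.
    params[target] = min(1.0, max(0.3, params.get(target, 1.0) + step))
    if random.random() < 0.2:
        params['booster'] = 'dart'
    return params

mutated = xgboost_noise_robustness_mutation({'subsample': 0.9})
```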
3. Allow boosting models to use data with NaNs.
One of the advantages of boosting methods is their native handling of NaNs in the data. Implementing this feature in the current version requires refactoring the preprocessing responsible for filling missing values, so that the imputation step can be skipped for boosting models, as illustrated below.
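A small illustration of why the imputation step could be skipped for boosting models (a sketch using the plain XGBoost sklearn wrapper, not the framework's preprocessing):

```python
import numpy as np
from xgboost import XGBClassifier

# Data with missing values; no imputation is applied.
X = np.array([[1.0, np.nan],
              [2.0, 3.0],
              [np.nan, 1.5],
              [4.0, 2.0]])
y = np.array([0, 1, 0, 1])

# XGBoost treats NaN as "missing" and learns a default split direction
# for it, so the model fits without a preceding filling step.
model = XGBClassifier(n_estimators=10)
model.fit(X, y)
print(model.predict(X))
```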
4. Allow boosting models to use a GPU.
A GPU can be used to accelerate the fitting of boosting models, so it would be great to add such an option.
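For reference, each of the three libraries already exposes a switch for GPU training (a sketch; the exact flag names depend on the library version, e.g. the `device` parameter replaced `tree_method='gpu_hist'` in XGBoost 2.0):

```python
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

# XGBoost >= 2.0: select the device explicitly.
xgb_gpu = XGBClassifier(tree_method='hist', device='cuda')

# LightGBM: requires a GPU-enabled build of the library.
lgbm_gpu = LGBMClassifier(device='gpu')

# CatBoost: GPU training is selected via task_type.
cat_gpu = CatBoostClassifier(task_type='GPU')
```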
Unresolved Questions
Is it possible to continue #1005? The main idea was to implement an approach used in other AutoML frameworks. The main problems, which were faced but not solved, are that the pipeline generation approach differs from other frameworks, that training the base models has to be parallelized over extracted samples, and that embedding such an approach into the composition process is quite time-consuming and resource-intensive. However, it guarantees more stable and accurate models. This approach could be applied after composing: for example, if a boosting model is found in the final pipeline, try to retrain it using this method.
P.S.
I also note that adding a weighted model and models from the k-nearest-neighbors family as meta-models in an ensemble would help diversify pipelines for classification and regression.
Also note that the current method for detecting categorical features is not perfect.