[GPU] lightgbm.basic.LightGBMError: Check failed: (best_split_info.left_count) > (0), lightgbm.basic.LightGBMError: Check failed: (best_split_info.right_count) > (0) #6469

Open
Oct4Pie opened this issue May 31, 2024 · 2 comments

Oct4Pie commented May 31, 2024

Description

When using LightGBM with GPU training, an error is encountered during the training process. The error specifically occurs when LightGBM attempts to split the data into leaf nodes, resulting in a split where one of the resulting nodes has zero data points. This error does not occur when using CPU training.

Reproducible example

import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def generate_synthetic_data(n_samples=10000, n_features=50):
    np.random.seed(42)
    X = np.random.rand(n_samples, n_features)
    y = np.sum(X, axis=1) + np.random.randn(n_samples) * 0.1
    return X, y


def check_data_variability(X_train, y_train):
    X_train_df = pd.DataFrame(X_train)
    y_train_series = pd.Series(y_train)

    print("X_train Feature Variability:")
    print(X_train_df.describe().transpose())
    print("\nNumber of unique values in each feature:")
    print(X_train_df.nunique())

    print("\ny_train Target Variability:")
    print(y_train_series.describe())
    print("Number of unique values in target:", y_train_series.nunique())


def initialize_gpu_model():
    params = {
        "boosting_type": "gbdt",
        "objective": "regression",
        "metric": "rmse",
        "learning_rate": 0.01,
        # "num_leaves": 15,
        # "max_depth": 5,
        # "min_child_samples": 1,
        # "min_child_weight": 1e-3,  # Align with min_child_samples
        # "min_split_gain": 0.1,
        "n_estimators": 10000,
        # "subsample": 0.1,
        # "subsample_freq": 1,
        # "colsample_bytree": 0.1,
        # "reg_alpha": 0.1,
        # "reg_lambda": 0.1,
        "verbose": 100,
        "device": "gpu",
    }
    model = lgb.LGBMRegressor(**params)
    print(model.get_params())
    return model


def main():
    X, y = generate_synthetic_data()
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    check_data_variability(X_train, y_train)

    model = initialize_gpu_model()

    model.fit(
        X_train,
        y_train,
        eval_set=[(X_test, y_test)],
        eval_metric="rmse",
    )
    y_pred = model.predict(X_test)

    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    print(f"RMSE: {rmse}")


if __name__ == "__main__":
    main()

Environment info

lightgbm versions 4.2.0 and 4.3.0

Command(s) you used to install LightGBM

$ sh build-python.sh install --gpu

from the release tag branches

$ cmake -DUSE_GPU=ON

for lib_lightgbm.dylib

macOS 14.4.1 (23E224)
Apple Silicon M1
Tested with python versions 3.10, 3.11, and 3.12, with and without conda
cmake version 3.29.3

Additional Comments

The error only occurs with GPU training (device: "gpu").
The same parameters work fine when device: "cpu" is used.
Adjusting parameters like num_leaves, min_child_samples, max_depth, etc., to more conservative values did not resolve the issue.
Also, calling generate_synthetic_data(n_samples, n_features) with n_samples below ~2000 does not trigger the issue; it only occurs once the input becomes large. Subsampling therefore avoids the error, but in my tests it significantly hurt model performance and boosting.
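The size dependence described above could be narrowed down with a bisection over the sample count. The following is only a sketch: `smallest_failing_size` and `gpu_fit_fails` are hypothetical helper names, the search assumes that failure is monotonic in the number of samples (the issue does not confirm this), and the probe uses a reduced `n_estimators` on the assumption that the check fails early in training.

```python
def smallest_failing_size(fails, lo, hi):
    """Return the smallest n in (lo, hi] for which fails(n) is True.

    Assumes fails(lo) is False, fails(hi) is True, and that failure is
    monotonic in n (an assumption, not something the issue confirms).
    """
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if fails(mid):
            hi = mid  # mid fails: the threshold is at or below mid
        else:
            lo = mid  # mid passes: the threshold is above mid
    return hi


def gpu_fit_fails(n_samples, n_features=50, n_estimators=200):
    """Hypothetical probe: True if GPU training raises LightGBMError."""
    # Imported lazily so the bisection helper works without LightGBM installed.
    import numpy as np
    import lightgbm as lgb

    # Same shape of synthetic data as the reproducible example.
    rng = np.random.default_rng(42)
    X = rng.random((n_samples, n_features))
    y = X.sum(axis=1) + rng.standard_normal(n_samples) * 0.1

    model = lgb.LGBMRegressor(
        objective="regression",
        learning_rate=0.01,
        n_estimators=n_estimators,  # reduced from 10000 to keep probes short
        device="gpu",
        verbose=-1,
    )
    try:
        model.fit(X, y)
    except lgb.basic.LightGBMError:
        return True
    return False
```

On an affected machine this might be called as `smallest_failing_size(gpu_fit_fails, lo, hi)`, where `lo` is a row count known to train cleanly and `hi` one known to fail; both bounds have to be established by hand first.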

@jameslamb changed the title from "lightgbm.basic.LightGBMError: Check failed: (best_split_info.left_count) > (0), lightgbm.basic.LightGBMError: Check failed: (best_split_info.right_count) > (0)" to "[GPU] lightgbm.basic.LightGBMError: Check failed: (best_split_info.left_count) > (0), lightgbm.basic.LightGBMError: Check failed: (best_split_info.right_count) > (0)" on Jun 10, 2024
@jameslamb added the bug label on Jun 10, 2024
@MascotGGG

I hit the same issue in LightGBM 4.5.0. How can this be solved? Do you have any ideas or suggestions, @jameslamb?

Oct4Pie commented Dec 11, 2024

@MascotGGG I looked at the source code. The check seems to fail when leaf nodes are being processed into the main tree.
I was able to get past it with a specific parameter (I don't remember which, as I have moved on from that project), but that then caused a segmentation fault due to some null pointers.
Stick with CPU training; in this case it is faster than GPU training on Apple Silicon anyway.
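For anyone who wants to keep trying the GPU path while following the advice above, one option is to fall back to CPU when GPU training raises. This is a minimal sketch, not part of LightGBM: `fit_with_device_fallback` and `make_model` are hypothetical names. It can only catch Python-level exceptions such as the "Check failed" error; a segmentation fault like the one mentioned above would still kill the process.

```python
def fit_with_device_fallback(make_model, X, y, devices=("gpu", "cpu"),
                             catch=(Exception,)):
    """Try each device in order; return (fitted_model, device_used).

    make_model(device) must return an unfitted estimator exposing fit(X, y).
    Only exception types listed in `catch` trigger a fallback.
    """
    last_error = None
    for device in devices:
        model = make_model(device)
        try:
            model.fit(X, y)
        except catch as exc:
            # Remember the failure and move on to the next device.
            print(f"Training on {device!r} failed: {exc}")
            last_error = exc
            continue
        return model, device
    raise RuntimeError("training failed on every device") from last_error
```

With LightGBM this could be used as `fit_with_device_fallback(lambda d: lgb.LGBMRegressor(device=d), X_train, y_train, catch=(lgb.basic.LightGBMError,))`, so that only the error from the issue title triggers the fallback.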
