Is predict(..., pred_contrib=True) thread safe? #5482

Open
zyxue opened this issue Sep 13, 2022 · 4 comments

@zyxue
Contributor

zyxue commented Sep 13, 2022

Description

I'm using the model behind a gRPC prediction service. The service predicts one example at a time.

I can reproduce the bug locally like this:

from concurrent.futures import ThreadPoolExecutor

# model is a trained lightgbm.Booster and df holds its feature columns
# (see the full reproducer below).
data = [df.loc[:1][model.feature_name()]] * 1_000

def _predict(df_one_row):
    return model.predict(df_one_row, pred_contrib=True)

with ThreadPoolExecutor(max_workers=32) as exc:
    exc.map(_predict, data)

The error is like

free(): invalid next size (normal)

or

double free or corruption (!prev)

or

malloc(): corrupted top size

depending on the run.

Reproducible example

The following code produces a different error, but it is also related to threading:

from concurrent.futures import ThreadPoolExecutor

import sklearn.datasets
import lightgbm

df = (
    sklearn.datasets.load_iris(as_frame=True)["frame"]
    .sample(99, random_state=123)
    .rename(
        columns={
            "sepal length (cm)": "sepal_length",
            "sepal width (cm)": "sepal_width",
            "petal length (cm)": "petal_length",
            "petal width (cm)": "petal_width",
        }
    )
    .assign(sepal_length_cat=lambda df: (df.sepal_length > 1).astype(str).astype('category'))
    .reset_index(drop=True)
)

X, y = df.drop(columns="target"), df["target"]

regressor = lightgbm.LGBMRegressor(n_estimators=100, max_depth=7, objective="mse").fit(
    X, y
)

model = regressor.booster_

print(f'{df.dtypes=:}')

data = [df.loc[:1][model.feature_name()]] * 10_000


def _predict(df_one_row):
    return model.predict(df_one_row, pred_contrib=True)


with ThreadPoolExecutor(max_workers=32) as exc:
    exc.map(_predict, data)

The error is like

corrupted size vs. prev_size
corrupted size vs. prev_size

or

python3: malloc.c:2379: sysmalloc: Assertion `(old_top == initial_top (av) && old_size == 0) || ((unsigned long) (old_size) >= MINSIZE && prev_inuse (old_top) && ((unsigned long) old_end & (pagesize - 1)) == 0)' failed.

Environment info

LightGBM version or commit hash:

lightgbm==3.3.2

Command(s) you used to install LightGBM

I'm using lightgbm inside a monorepo with Bazel, but I think under the hood it's equivalent to python -m pip install lightgbm

Additional Comments

@shuttie
Contributor

shuttie commented Jun 28, 2024

This seems to be an issue with the pred_contrib=True handling in the C library itself: in the lightgbm4j library, concurrent prediction crashes the whole JVM process because the call is not thread-safe. See metarank/lightgbm4j#88 for details and a reproducer.

@AndreyOrb

I found a bug in Predictor related to incorrect usage of omp_get_thread_num.

int tid = omp_get_thread_num();

When the calling threads are not OpenMP worker threads (i.e. the call happens outside an OpenMP parallel region), omp_get_thread_num() returns tid=0 in every one of them.
https://stackoverflow.com/questions/68484180/is-it-safe-to-use-omp-get-thread-num-to-index-a-global-vector
https://stackoverflow.com/questions/4087852/omp-set-num-threads-always-returns-0-and-im-unable-to-get-thread-num-with-omp-ge
https://stackoverflow.com/questions/43952078/openmp-always-working-on-the-same-thread

So, different threads (predict_fun_ lambdas) can write to the same memory block.
There's another issue mentioning the same problem: #3751

I think the same problem exists in all 4 prediction types.
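
A minimal sketch of this failure mode (illustrative only, not LightGBM source; the names below are made up): when plain std::threads call omp_get_thread_num() outside an OpenMP parallel region, every call returns 0, so all threads end up writing to the same "per-thread" buffer.

#include <omp.h>

#include <cstdio>
#include <thread>
#include <vector>

// Per-thread scratch buffers, indexed by omp_get_thread_num() -- the same
// pattern Predictor uses for predict_buf_.
static std::vector<std::vector<double>> buffers(8, std::vector<double>(4, 0.0));

void fill_buffer(double value) {
  // Outside an OpenMP parallel region this always returns 0, even when the
  // caller is one of many application threads.
  int tid = omp_get_thread_num();
  std::printf("tid=%d\n", tid);
  for (double& x : buffers[tid]) {
    x = value;  // data race: every caller writes into buffers[0]
  }
}

int main() {
  std::thread t1(fill_buffer, 1.0);
  std::thread t2(fill_buffer, 2.0);
  t1.join();
  t2.join();
  return 0;
}

Both threads print tid=0, which matches the behavior described in the Stack Overflow links above.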

@jameslamb
Collaborator

Thanks @AndreyOrb. Are you interested in proposing a fix? We'd welcome it!

It is probably worth looking at these other places, to see if they're safe:

git grep omp_get_thread_num

@AndreyOrb

I have reviewed the issue carefully.
I actually identified 2 problems.

  1. Improper Predictor lifecycle management:

    For each Predict request, a Predictor is created in

    Predictor CreatePredictor(int start_iteration, int num_iteration, int predict_type, int ncol, const Config& config) const {

    The Predictor class has predict_buf_, used for storing features:

    std::vector<std::vector<double, Common::AlignmentAllocator<double, kAlignedSize>>> predict_buf_;

    Its constructor resizes this internal buffer according to the number of features:

    predict_buf_.resize(

    After the resize, the prediction function is constructed; it captures the address of the predict_buf_ vector:

    predict_fun_ = [=](const std::vector<std::pair<int, double>>& features,

    All good, but when the Predictor object is returned, a temporary object is created, and then move semantics (if available) or copy semantics are used to initialize the predictor variable:

    auto predictor = CreatePredictor(start_iteration, num_iteration, predict_type, ncol, config);

    In both cases, the original address is invalidated. When the function runs, it dereferences a predict_buf_ address that is no longer valid:

    CopyToPredictBuffer(predict_buf_[tid].data(), features);

    I tried refactoring the code to use move semantics, but that does not help either, because the original address is still lost.
    Because the function is created inside the constructor, the only option is to keep the original Predictor object alive.
    So I changed CreatePredictor's return type to std::shared_ptr, which solved the problem (see the first sketch after this list).

  2. There is a single Boosting object, which is essentially the model.
    Again, a new Predictor object is created for each Predict request.
    In its constructor, the GBDT's InitPredict is called:

    boosting->InitPredict(start_iteration, num_iteration, predict_contrib);

    The GBDT object is the Boosting itself, as far as I understand, and it is unique to each model.
    InitPredict changes the GBDT's internal attributes, so running 2 concurrent Predict requests is not thread-safe.
    To overcome the issue, there's a lock:

    SHARED_LOCK(mutex_);

    #define SHARED_LOCK(mtx) \

    But this lock probably does not lock as expected, because I still end up passing through it in several threads at once:
    [screenshot showing several threads inside the locked section]

    Anyway, it's fine in my case, because all of my Predict requests have the same parameters (int start_iteration, int num_iteration, bool is_pred_contrib).
    The problem is here:

    if (is_pred_contrib) {

    This code runs RecomputeMaxDepth on each of the model's trees. When I hit this line from several threads, memory gets corrupted and the whole application crashes. Moreover, I don't think this code should recompute the max tree depth on every contrib request; please correct me if I'm wrong. Maybe training needs such a recomputation.

    I've added a lock and a proper condition (double-checked locking) to recompute the max tree depth only once (see the second sketch after this list).
    That has helped: I tested the fixed code by running 200k requests that call Predict (regular & contrib) from 20 parallel threads, with no crashes.
    Will send the PR shortly.
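
A minimal sketch of point 1 above (illustrative only; the class is stripped down and is not the actual LightGBM code): the [=] lambda captures the raw this pointer, so a copied or moved Predictor carries a callable that still writes into the original object's predict_buf_. If that original object was a temporary that has already been destroyed, the write lands in freed memory, which fits the malloc/free corruption errors reported here.

#include <cassert>

#include <functional>
#include <vector>

struct Predictor {
  std::vector<double> predict_buf_;
  std::function<void(double)> predict_fun_;

  Predictor() : predict_buf_(4, 0.0) {
    // [=] captures the raw `this` pointer, not a copy of predict_buf_.
    predict_fun_ = [=](double v) { predict_buf_[0] = v; };
  }
};

int main() {
  Predictor a;
  Predictor b = a;       // default copy: b.predict_fun_ still captures &a
  b.predict_fun_(42.0);  // writes into a.predict_buf_, not b.predict_buf_
  assert(a.predict_buf_[0] == 42.0);
  assert(b.predict_buf_[0] == 0.0);
  // If `a` had been a destroyed temporary (as in the CreatePredictor path
  // described above), the same call would write into freed memory.
  return 0;
}

Returning std::shared_ptr<Predictor> avoids this because the object whose this pointer the lambda captured stays alive for as long as the callable is used.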
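
And a minimal sketch of the double-checked-locking fix described in point 2 (again illustrative, with simplified names rather than the real GBDT code): the max-depth recomputation runs at most once, no matter how many threads call InitPredict concurrently.

#include <atomic>
#include <mutex>
#include <vector>

struct Tree {
  void RecomputeMaxDepth() { /* walk the nodes and refresh the cached depth */ }
};

class GBDT {
 public:
  void InitPredict(bool is_pred_contrib) {
    if (is_pred_contrib && !max_depth_recomputed_.load(std::memory_order_acquire)) {
      std::lock_guard<std::mutex> lock(init_mutex_);
      // Second check under the lock: another thread may have finished already.
      if (!max_depth_recomputed_.load(std::memory_order_relaxed)) {
        for (auto& tree : models_) {
          tree.RecomputeMaxDepth();
        }
        max_depth_recomputed_.store(true, std::memory_order_release);
      }
    }
  }

 private:
  std::vector<Tree> models_;
  std::mutex init_mutex_;
  std::atomic<bool> max_depth_recomputed_{false};
};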

AndreyOrb added a commit to AndreyOrb/LightGBM that referenced this issue Jan 8, 2025
2) Fixed Boosting trees initialization

microsoft#5482