Is predict(..., pred_contrib=True) thread safe? #5482

Open
zyxue opened this issue Sep 13, 2022 · 4 comments

@zyxue
Contributor

zyxue commented Sep 13, 2022

Description

I'm using the model behind a gRPC prediction service. The service predicts one example at a time.

I can reproduce the bug locally like this:

from concurrent.futures import ThreadPoolExecutor

# model is a trained lightgbm.Booster and df holds its feature columns
# (see the full reproducer below).
data = [df.loc[:1][model.feature_name()]] * 1_000

def _predict(df_one_row):
    return model.predict(df_one_row, pred_contrib=True)

with ThreadPoolExecutor(max_workers=32) as exc:
    exc.map(_predict, data)

The error is like

free(): invalid next size (normal)

or

double free or corruption (!prev)

or

malloc(): corrupted top size

depending on the run.

Reproducible example

The following code produces a different error, but it is also related to threading:

from concurrent.futures import ThreadPoolExecutor

import sklearn.datasets
import lightgbm

df = (
    sklearn.datasets.load_iris(as_frame=True)["frame"]
    .sample(99, random_state=123)
    .rename(
        columns={
            "sepal length (cm)": "sepal_length",
            "sepal width (cm)": "sepal_width",
            "petal length (cm)": "petal_length",
            "petal width (cm)": "petal_width",
        }
    )
    .assign(sepal_length_cat=lambda df: (df.sepal_length > 1).astype(str).astype('category'))
    .reset_index(drop=True)
)

X, y = df.drop(columns="target"), df["target"]

regressor = lightgbm.LGBMRegressor(n_estimators=100, max_depth=7, objective="mse").fit(
    X, y
)

model = regressor.booster_

print(f'{df.dtypes=:}')

data = [df.loc[:1][model.feature_name()]] * 10_000


def _predict(df_one_row):
    return model.predict(df_one_row, pred_contrib=True)


with ThreadPoolExecutor(max_workers=32) as exc:
    exc.map(_predict, data)

The error is like

corrupted size vs. prev_size
corrupted size vs. prev_size

or

python3: malloc.c:2379: sysmalloc: Assertion `(old_top == initial_top (av) && old_size == 0) || ((unsigned long) (old_size) >= MINSIZE && prev_inuse (old_top) && ((unsigned long) old_end & (pagesize - 1)) == 0)' failed.

Environment info

LightGBM version or commit hash:

lightgbm==3.3.2

Command(s) you used to install LightGBM

I'm using lightgbm inside a monorepo with Bazel, but I think under the hood it's equivalent to python -m pip install lightgbm

Additional Comments

@shuttie
Contributor

shuttie commented Jun 28, 2024

This seems to be an issue with the pred_contrib=True handling in the C library itself: in the lightgbm4j library, concurrent prediction crashes the whole JVM process because the call is not thread-safe. See metarank/lightgbm4j#88 for details and a reproducer.

@AndreyOrb

I found a bug in Predictor related to incorrect usage of omp_get_thread_num.

int tid = omp_get_thread_num();

When the calling threads are not OpenMP worker threads (i.e. the call happens outside an OpenMP parallel region), omp_get_thread_num() returns tid=0 in every one of them.
https://stackoverflow.com/questions/68484180/is-it-safe-to-use-omp-get-thread-num-to-index-a-global-vector
https://stackoverflow.com/questions/4087852/omp-set-num-threads-always-returns-0-and-im-unable-to-get-thread-num-with-omp-ge
https://stackoverflow.com/questions/43952078/openmp-always-working-on-the-same-thread

So, different threads (predict_fun_ lambdas) can write to the same memory block.
There's another issue mentioning the same problem: #3751

I think the same problem exists in all 4 prediction types.
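
A minimal sketch of this failure mode (illustrative only, not LightGBM source; the names below are made up): when plain std::threads call omp_get_thread_num() outside an OpenMP parallel region, every call returns 0, so all threads end up writing to the same "per-thread" buffer.

#include <omp.h>

#include <cstdio>
#include <thread>
#include <vector>

// Per-thread scratch buffers, indexed by omp_get_thread_num() -- the same
// pattern Predictor uses for predict_buf_.
static std::vector<std::vector<double>> buffers(8, std::vector<double>(4, 0.0));

void fill_buffer(double value) {
  // Outside an OpenMP parallel region this always returns 0, even when the
  // caller is one of many application threads.
  int tid = omp_get_thread_num();
  std::printf("tid=%d\n", tid);
  for (double& x : buffers[tid]) {
    x = value;  // data race: every caller writes into buffers[0]
  }
}

int main() {
  std::thread t1(fill_buffer, 1.0);
  std::thread t2(fill_buffer, 2.0);
  t1.join();
  t2.join();
  return 0;
}

Both threads print tid=0, which matches the behavior described in the Stack Overflow links above.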

@jameslamb
Collaborator

Thanks @AndreyOrb. Are you interested in proposing a fix? We'd welcome it!

It is probably worth looking at these other places, to see if they're safe:

git grep omp_get_thread_num

@AndreyOrb

I have reviewed the issue carefully.
I actually identified 2 problems.

  1. Improper Predictor lifecycle management:

    For each Predict request, a Predictor is created in

    Predictor CreatePredictor(int start_iteration, int num_iteration, int predict_type, int ncol, const Config& config) const {

    The Predictor class has predict_buf_, used for storing features:

    std::vector<std::vector<double, Common::AlignmentAllocator<double, kAlignedSize>>> predict_buf_;

    Its constructor resizes this internal buffer according to the number of features:

    predict_buf_.resize(

    After the resize, the prediction function is constructed; it captures the address of the predict_buf_ vector:

    predict_fun_ = [=](const std::vector<std::pair<int, double>>& features,

    All good, but when the Predictor object is returned, a temporary object is created, and then move semantics (if available) or copy semantics are used to initialize the predictor variable:

    auto predictor = CreatePredictor(start_iteration, num_iteration, predict_type, ncol, config);

    In both cases, the original address is invalidated. When the function runs, it dereferences a predict_buf_ address that is no longer valid:

    CopyToPredictBuffer(predict_buf_[tid].data(), features);

    I tried refactoring the code to use move semantics, but that does not help either, because the original address is still lost.
    Because the function is created inside the constructor, the only option is to keep the original Predictor object alive.
    So I changed CreatePredictor's return type to std::shared_ptr, which solved the problem (see the first sketch after this list).

  2. There is a single Boosting object, which is essentially the model.
    Again, a new Predictor object is created for each Predict request.
    In its constructor, the GBDT's InitPredict is called:

    boosting->InitPredict(start_iteration, num_iteration, predict_contrib);

    The GBDT object is the Boosting itself, as far as I understand, and it is unique to each model.
    InitPredict changes the GBDT's internal attributes, so running 2 concurrent Predict requests is not thread-safe.
    To overcome the issue, there's a lock:

    SHARED_LOCK(mutex_);

    #define SHARED_LOCK(mtx) \

    But this lock probably does not lock as expected, because I still end up passing through it in several threads at once:
    [screenshot showing several threads inside the locked section]

    Anyway, it's fine in my case, because all of my Predict requests have the same parameters (int start_iteration, int num_iteration, bool is_pred_contrib).
    The problem is here:

    if (is_pred_contrib) {

    This code runs RecomputeMaxDepth on each of the model's trees. When I hit this line from several threads, memory gets corrupted and the whole application crashes. Moreover, I don't think this code should recompute the max tree depth on every contrib request; please correct me if I'm wrong. Maybe training needs such a recomputation.

    I've added a lock and a proper condition (double-checked locking) to recompute the max tree depth only once (see the second sketch after this list).
    That has helped: I tested the fixed code by running 200k requests that call Predict (regular & contrib) from 20 parallel threads, with no crashes.
    Will send the PR shortly.
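
A minimal sketch of point 1 above (illustrative only; the class is stripped down and is not the actual LightGBM code): the [=] lambda captures the raw this pointer, so a copied or moved Predictor carries a callable that still writes into the original object's predict_buf_. If that original object was a temporary that has already been destroyed, the write lands in freed memory, which fits the malloc/free corruption errors reported here.

#include <cassert>

#include <functional>
#include <vector>

struct Predictor {
  std::vector<double> predict_buf_;
  std::function<void(double)> predict_fun_;

  Predictor() : predict_buf_(4, 0.0) {
    // [=] captures the raw `this` pointer, not a copy of predict_buf_.
    predict_fun_ = [=](double v) { predict_buf_[0] = v; };
  }
};

int main() {
  Predictor a;
  Predictor b = a;       // default copy: b.predict_fun_ still captures &a
  b.predict_fun_(42.0);  // writes into a.predict_buf_, not b.predict_buf_
  assert(a.predict_buf_[0] == 42.0);
  assert(b.predict_buf_[0] == 0.0);
  // If `a` had been a destroyed temporary (as in the CreatePredictor path
  // described above), the same call would write into freed memory.
  return 0;
}

Returning std::shared_ptr<Predictor> avoids this because the object whose this pointer the lambda captured stays alive for as long as the callable is used.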
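
And a minimal sketch of the double-checked-locking fix described in point 2 (again illustrative, with simplified names rather than the real GBDT code): the max-depth recomputation runs at most once, no matter how many threads call InitPredict concurrently.

#include <atomic>
#include <mutex>
#include <vector>

struct Tree {
  void RecomputeMaxDepth() { /* walk the nodes and refresh the cached depth */ }
};

class GBDT {
 public:
  void InitPredict(bool is_pred_contrib) {
    if (is_pred_contrib && !max_depth_recomputed_.load(std::memory_order_acquire)) {
      std::lock_guard<std::mutex> lock(init_mutex_);
      // Second check under the lock: another thread may have finished already.
      if (!max_depth_recomputed_.load(std::memory_order_relaxed)) {
        for (auto& tree : models_) {
          tree.RecomputeMaxDepth();
        }
        max_depth_recomputed_.store(true, std::memory_order_release);
      }
    }
  }

 private:
  std::vector<Tree> models_;
  std::mutex init_mutex_;
  std::atomic<bool> max_depth_recomputed_{false};
};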

AndreyOrb added a commit to AndreyOrb/LightGBM that referenced this issue Jan 8, 2025
2) Fixed Boosting trees initialization

microsoft#5482