feat: Parameterize TensorRT allocation strategy #109

Closed · wants to merge 4 commits

18 changes: 18 additions & 0 deletions README.md
@@ -99,3 +99,21 @@ but the listed CMake argument can be used to override.
* triton-inference-server/backend: -DTRITON_BACKEND_REPO_TAG=[tag]
* triton-inference-server/core: -DTRITON_CORE_REPO_TAG=[tag]
* triton-inference-server/common: -DTRITON_COMMON_REPO_TAG=[tag]

## Parameters

Triton exposes several parameters that control how the TensorRT backend executes a model.
They are set in the `parameters` section of the model's `config.pbtxt` file.

### execution_context_allocation_strategy

Controls the device memory allocation behavior of IExecutionContext. An IExecutionContext requires a block of device memory for its internal activation tensors during inference, and this parameter selects how that memory is managed. Supported values are "STATIC" (the default; memory is allocated once at context creation, sized for the largest optimization profile) and "ON_PROFILE_CHANGE" (memory is reallocated whenever the active optimization profile changes).

```
parameters: {
  key: "execution_context_allocation_strategy"
  value: {
    string_value: "STATIC"
  }
}
```
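
To defer the allocation until an optimization profile is selected, and reallocate it whenever the active profile changes, the same parameter can instead be set to "ON_PROFILE_CHANGE":

```
parameters: {
  key: "execution_context_allocation_strategy"
  value: {
    string_value: "ON_PROFILE_CHANGE"
  }
}
```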
39 changes: 14 additions & 25 deletions src/instance_state.cc
@@ -1,4 +1,4 @@
// Copyright 2022-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
// Copyright 2022-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions
@@ -1693,19 +1693,6 @@ ModelInstanceState::InitIOIndexMap()
TRITONSERVER_Error*
ModelInstanceState::InitOptimizationProfiles()
{
// TRT sets the optimization profile index to be 0 implicitly with
// the first context creation. As currently triton supports one
// context per engine, in order to set the specified profile_index,
// another context is created and the previous context is destroyed.
std::shared_ptr<nvinfer1::IExecutionContext> default_trt_context(
engine_->createExecutionContext());
if (default_trt_context == nullptr) {
return TRITONSERVER_ErrorNew(
TRITONSERVER_ERROR_INTERNAL,
(std::string("unable to create TensorRT context: ") +
model_state_->GetTensorRTLogger().LastErrorMsg())
.c_str());
}
std::vector<std::pair<std::string, int>> profile_name_index;
// No optimization profile is set for this TensorRT plan
if (ProfileNames().empty()) {
@@ -1736,17 +1723,19 @@ ModelInstanceState::InitOptimizationProfiles()
.c_str());
continue;
}
if (profile_index == 0) {
res.first->second.context_ = std::move(default_trt_context);
} else {
res.first->second.context_.reset(engine_->createExecutionContext());
if (res.first->second.context_ == nullptr) {
return TRITONSERVER_ErrorNew(
TRITONSERVER_ERROR_INTERNAL,
(std::string("unable to create TensorRT context: ") +
model_state_->GetTensorRTLogger().LastErrorMsg())
.c_str());
}

// Create a new execution context for the profile
res.first->second.context_.reset(
engine_->createExecutionContext(model_state_->AllocationStrategy()));
Contributor:
I'm concerned the tests in triton-inference-server/server#8150 aren't comprehensive enough if they didn't catch the allocation strategy not being passed in. Is there a simple test that can be added to confirm the correct allocation strategy is being used?

Contributor Author:
I tried various plan models from the /data/inferenceserver model repositories. The problem is that our models' allocation sizes are too small to show the difference in the [MemUsageChange] log line.

I0417 21:48:34.404149 20014 tensorrt.cc:297] "TRITONBACKEND_ModelInstanceInitialize: plan_float32_float32_float32-4-32_0_0 (GPU device 0)"
I0417 21:48:34.407117 20014 logging.cc:46] "Loaded engine size: 0 MiB"
I0417 21:48:34.410343 20014 logging.cc:46] "[MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 0 (MiB)"
I0417 21:48:34.410353 20014 logging.cc:46] "Switching optimization profile from: 0 to 6. Please ensure there are no enqueued operations pending in this context prior to switching profiles"

If we use a custom model built for A100, we would need to either add a new script to gen_qa_model_repository or restrict the test to run only on an A100 GPU.

Contributor Author:
Actually, I loaded all plan models from /data/inferenceserver and none shows a non-zero allocation size in the [MemUsageChange] line (even the large models). I discussed this with Anmol Gupta, and it might be specific to how their test works. I guess it is not easy to prove that the new allocation strategy is passed to engine_->createExecutionContext(model_state_->AllocationStrategy()).

Anmol Gupta has confirmed the change works. See #108 (comment). Can we merge the existing test triton-inference-server/server#8150 into main?
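
One way to make the difference observable outside of Triton is a small standalone program that deserializes a plan and creates an execution context with each strategy, comparing free device memory before and after creation: with kSTATIC the activation memory is reserved at context creation, while with kON_PROFILE_CHANGE it is deferred until a profile is selected. This is only a sketch under assumptions (a local `model.plan` whose activation memory is large enough to register, and a TensorRT version whose `createExecutionContext()` accepts an `ExecutionContextAllocationStrategy` argument), not part of this PR or its test plan:

```cpp
// Hypothetical standalone check (not part of this PR): compare free device
// memory after creating an IExecutionContext with each allocation strategy.
// Assumes a serialized engine "model.plan" in the working directory.
#include <NvInfer.h>
#include <cuda_runtime_api.h>

#include <cstddef>
#include <fstream>
#include <iostream>
#include <iterator>
#include <memory>
#include <vector>

namespace {

class Logger : public nvinfer1::ILogger {
  void log(Severity severity, const char* msg) noexcept override
  {
    if (severity <= Severity::kWARNING) {
      std::cerr << msg << std::endl;
    }
  }
};

double FreeDeviceMiB()
{
  size_t free_bytes = 0, total_bytes = 0;
  cudaMemGetInfo(&free_bytes, &total_bytes);
  return static_cast<double>(free_bytes) / (1024.0 * 1024.0);
}

}  // namespace

int main()
{
  Logger logger;
  std::ifstream file("model.plan", std::ios::binary);
  std::vector<char> blob(
      (std::istreambuf_iterator<char>(file)), std::istreambuf_iterator<char>());

  std::unique_ptr<nvinfer1::IRuntime> runtime(
      nvinfer1::createInferRuntime(logger));
  std::unique_ptr<nvinfer1::ICudaEngine> engine(
      runtime->deserializeCudaEngine(blob.data(), blob.size()));
  if (engine == nullptr) {
    std::cerr << "failed to deserialize model.plan" << std::endl;
    return 1;
  }

  for (auto strategy :
       {nvinfer1::ExecutionContextAllocationStrategy::kSTATIC,
        nvinfer1::ExecutionContextAllocationStrategy::kON_PROFILE_CHANGE}) {
    const bool is_static =
        (strategy == nvinfer1::ExecutionContextAllocationStrategy::kSTATIC);
    const double before = FreeDeviceMiB();
    // With kSTATIC the activation memory is reserved here; with
    // kON_PROFILE_CHANGE it is deferred until a profile is selected.
    std::unique_ptr<nvinfer1::IExecutionContext> context(
        engine->createExecutionContext(strategy));
    const double after = FreeDeviceMiB();
    std::cout << (is_static ? "STATIC" : "ON_PROFILE_CHANGE")
              << ": context creation consumed " << (before - after) << " MiB"
              << std::endl;
  }
  return 0;
}
```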

Contributor @rmccorm4, Apr 18, 2025:
Sure, we can go with the manual confirmation from Anmol for now, as they were the requester and we don't want to delay getting this in.

Contributor Author:
Thanks. I have asked Anmol to try generating a sample TRT model using our gen_qa_model_repository script and provided him with instructions. He will try it later and let me know whether it works.

Contributor @rmccorm4, Apr 18, 2025:
In case Anmol doesn't have a simple one, something like `trtexec --onnx model.onnx --saveEngine model.plan` on the Densenet ONNX model we use for the quickstart might work.

if (res.first->second.context_ == nullptr) {
return TRITONSERVER_ErrorNew(
TRITONSERVER_ERROR_INTERNAL,
(std::string("unable to create TensorRT context: ") +
model_state_->GetTensorRTLogger().LastErrorMsg())
.c_str());
}

if (profile_index != 0) {
if (!res.first->second.context_->setOptimizationProfileAsync(
profile_index, stream_)) {
return TRITONSERVER_ErrorNew(
42 changes: 40 additions & 2 deletions src/model_state.cc
@@ -1,4 +1,4 @@
// Copyright 2022-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
// Copyright 2022-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions
@@ -142,7 +142,8 @@ ModelState::Create(TRITONBACKEND_Model* triton_model, ModelState** state)
}

ModelState::ModelState(TRITONBACKEND_Model* triton_model)
: TensorRTModel(triton_model), engine_sharing_(true)
: TensorRTModel(triton_model), engine_sharing_(true),
alloc_strategy_(nvinfer1::ExecutionContextAllocationStrategy::kSTATIC)
{
// Obtain backend configuration
TRITONBACKEND_Backend* backend;
@@ -288,6 +289,43 @@ ModelState::ValidateModelConfig()
TRITONSERVER_Error*
ModelState::ParseParameters()
{
triton::common::TritonJson::Value params;
bool status = ModelConfig().Find("parameters", &params);
if (status) {
// If 'execution_context_allocation_strategy' is not present in
// 'parameters', will use the default strategy "STATIC".
std::string alloc_strategy;
TRITONSERVER_Error* err = GetParameterValue(
params, "execution_context_allocation_strategy", &alloc_strategy);
if (err != nullptr) {
if (TRITONSERVER_ErrorCode(err) != TRITONSERVER_ERROR_NOT_FOUND) {
return err;
} else {
TRITONSERVER_ErrorDelete(err);
}
} else {
// 'execution_context_allocation_strategy' is present in model config
// parameters.
if (alloc_strategy == "STATIC") {
alloc_strategy_ = nvinfer1::ExecutionContextAllocationStrategy::kSTATIC;
} else if (alloc_strategy == "ON_PROFILE_CHANGE") {
alloc_strategy_ =
nvinfer1::ExecutionContextAllocationStrategy::kON_PROFILE_CHANGE;
} else {
return TRITONSERVER_ErrorNew(
TRITONSERVER_ERROR_INVALID_ARG,
("Invalid value for 'execution_context_allocation_strategy': '" +
alloc_strategy + "' for model instance '" + Name() +
"'. Supported values are 'STATIC' and 'ON_PROFILE_CHANGE'.")
.c_str());
}
LOG_MESSAGE(
TRITONSERVER_LOG_INFO,
("'execution_context_allocation_strategy' set to '" + alloc_strategy +
"' for model instance '" + Name() + "'")
.c_str());
}
}
return nullptr; // success
}

9 changes: 8 additions & 1 deletion src/model_state.h
@@ -1,4 +1,4 @@
// Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
// Copyright 2022-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions
@@ -88,6 +88,11 @@ class ModelState : public TensorRTModel {

TensorRTLogger& GetTensorRTLogger() { return tensorrt_logger_; }

nvinfer1::ExecutionContextAllocationStrategy AllocationStrategy() const
{
return alloc_strategy_;
}

private:
ModelState(TRITONBACKEND_Model* triton_model);

@@ -140,6 +145,8 @@ class ModelState : public TensorRTModel {

// Whether the backend should support version-compatible TensorRT models.
static inline bool is_version_compatible_{false};

nvinfer1::ExecutionContextAllocationStrategy alloc_strategy_;
};

