
Commit b7868dd

Update TensorRT-LLM (NVIDIA#2413)
1 parent: f6821ee

366 files changed (+109008, -7479 lines)


Diff for: .gitignore

+3-3
@@ -29,15 +29,15 @@ dump*/
 config.json
 /*.svg
 cpp/cmake-build-*
-cpp/.ccache/
+cpp/.ccache
 tensorrt_llm/bin
 tensorrt_llm/libs
 tensorrt_llm/bindings.*.so
 tensorrt_llm/bindings.pyi
-tensorrt_llm/bindings/*.pyi
+tensorrt_llm/bindings/**/*.pyi
 *docs/cpp_docs*
 *docs/source/_cpp_gen*
-docs/source/llm-api
+docs/source/llm-api/*.rst
 docs/source/llm-api-examples/llm_*.rst
 *.swp

Diff for: README.md

+15-7
@@ -11,21 +11,29 @@ TensorRT-LLM
 [![version](https://img.shields.io/badge/release-0.15.0.dev-green)](./tensorrt_llm/version.py)
 [![license](https://img.shields.io/badge/license-Apache%202-blue)](./LICENSE)
 
-[Architecture](./docs/source/architecture/overview.md)   |   [Results](./docs/source/performance/perf-overview.md)   |   [Examples](./examples/)   |   [Documentation](./docs/source/)
+[Architecture](./docs/source/architecture/overview.md)   |   [Results](./docs/source/performance/perf-overview.md)   |   [Examples](./examples/)   |   [Documentation](./docs/source/)   |   [Roadmap](https://docs.google.com/presentation/d/1gycPmtdh7uUcH6laOvW65Dbp9F1McUkGDIcAyjicBZs/edit?usp=sharing)
 
 ---
 <div align="left">
 
 ## Latest News
+
+* [2024/11/02] 🌟🌟🌟 NVIDIA and LlamaIndex Developer Contest
+🙌 Enter for a chance to win prizes including an NVIDIA® GeForce RTX™ 4080 SUPER GPU, DLI credits, and more🙌
+[➡️ link](https://developer.nvidia.com/llamaindex-developer-contest)
+<div align="center">
+<img src="docs/source/media/image-11-02-2024.png" width="50%">
+<div align="left">
+
+* [2024/10/28] 🏎️🏎️🏎️ NVIDIA GH200 Superchip Accelerates Inference by 2x in Multiturn Interactions with Llama Models
+[➡️ link](https://developer.nvidia.com/blog/nvidia-gh200-superchip-accelerates-inference-by-2x-in-multiturn-interactions-with-llama-models/)
+
 * [2024/10/22] New 📝 Step-by-step instructions on how to
 ✅ Optimize LLMs with NVIDIA TensorRT-LLM,
 ✅ Deploy the optimized models with Triton Inference Server,
 ✅ Autoscale LLMs deployment in a Kubernetes environment.
 🙌 Technical Deep Dive:
 [➡️ link](https://nvda.ws/3YgI8UT)
-<div align="center">
-<img src="docs/source/media/image-10-22-2024.png" width="50%">
-<div align="left">
 
 * [2024/10/07] 🚀🚀🚀Optimizing Microsoft Bing Visual Search with NVIDIA Accelerated Libraries
 [➡️ link](https://developer.nvidia.com/blog/optimizing-microsoft-bing-visual-search-with-nvidia-accelerated-libraries/)

@@ -45,6 +53,9 @@ TensorRT-LLM
 * [2024/09/04] 🏎️🏎️🏎️ Best Practices for Tuning TensorRT-LLM for Optimal Serving with BentoML
 [➡️ link](https://www.bentoml.com/blog/tuning-tensor-rt-llm-for-optimal-serving-with-bentoml)
 
+<details close>
+<summary>Previous News</summary>
+
 * [2024/08/20] 🏎️SDXL with #TensorRT Model Optimizer ⏱️⚡ 🏁 cache diffusion 🏁 quantization aware training 🏁 QLoRA 🏁 #Python 3.12
 [➡️ link](https://developer.nvidia.com/blog/nvidia-tensorrt-model-optimizer-v0-15-boosts-inference-performance-and-expands-model-support/)

@@ -71,9 +82,6 @@ TensorRT-LLM
 * [2024/07/02] Let the @MistralAI MoE tokens fly 📈 🚀 #Mixtral 8x7B with NVIDIA #TensorRT #LLM on #H100.
 [➡️ Tech blog](https://developer.nvidia.com/blog/achieving-high-mixtral-8x7b-performance-with-nvidia-h100-tensor-core-gpus-and-tensorrt-llm?ncid=so-twit-928467)
 
-<details close>
-<summary>Previous News</summary>
-
 * [2024/06/24] Enhanced with NVIDIA #TensorRT #LLM, @upstage.ai’s solar-10.7B-instruct is ready to power your developer projects through our API catalog 🏎️. ✨[➡️ link](https://build.nvidia.com/upstage/solar-10_7b-instruct?snippet_tab=Try )
 
 * [2024/06/18] CYMI: 🤩 Stable Diffusion 3 dropped last week 🎊 🏎️ Speed up your SD3 with #TensorRT INT8 Quantization[➡️ link](https://build.nvidia.com/upstage/solar-10_7b-instruct?snippet_tab=Try )

Diff for: benchmarks/README.md

+1-1
@@ -7,6 +7,6 @@ There are currently three workflows to benchmark TensorRT-LLM:
   - The recommended workflow that uses TensorRT-LLM C++ API and can take advantage of the latest features of TensorRT-LLM.
 * [Python benchmarks](./python)
   - The Python benchmarking scripts can only benchmark the Python runtime, which do not support the latest features, such as in-flight batching.
-* [The Python benchmarking suite](https://nvidia.github.io/TensorRT-LLM/performance/perf-benchmarking.html)
+* [The Python benchmarking suite](../docs/source/performance/perf-benchmarking.md)
   - This benchmarker is native to TensorRT-LLM and is a Python benchmarker for reproducing and testing the performance of TensorRT-LLM.
   - _NOTE_: This benchmarking suite is a current work in progress and is prone to large changes.

Diff for: benchmarks/cpp/gptManagerBenchmark.cpp

+13-1
@@ -147,6 +147,7 @@ struct BenchmarkParams
     std::optional<float> freeGpuMemoryFraction{std::nullopt};
     std::optional<float> crossKvCacheFraction{std::nullopt};
     bool enableTrtOverlap{false};
+    bool enableBatchSizeTuning{false};
     bool enableBlockReuse{false};
     bool enableChunkedContext{false};
     bool streaming{false};

@@ -879,8 +880,9 @@ class ExecutorServer
         , mShutdown(false)
         , mLogIterationData(logIterationData)
     {
+        texec::DynamicBatchConfig dynamicBatchConfig(benchmarkParams.enableBatchSizeTuning);
+        texec::SchedulerConfig schedulerConfig(capacitySchedulerPolicy, std::nullopt, dynamicBatchConfig);
 
-        texec::SchedulerConfig schedulerConfig(capacitySchedulerPolicy);
         texec::KvCacheConfig kvCacheConfig(benchmarkParams.enableBlockReuse, benchmarkParams.maxTokensInPagedKvCache,
             benchmarkParams.maxAttentionWindowVec, benchmarkParams.sinkTokenLength,
             benchmarkParams.freeGpuMemoryFraction, benchmarkParams.kvHostCacheSize, benchmarkParams.kvOnboardBlocks,

@@ -1971,6 +1973,8 @@ int main(int argc, char* argv[])
         "max_num_tokens", "The max runtime number of tokens per batch when benchmarking", cxxopts::value<int>());
     options.add_options()("enable_trt_overlap", "Overlap TRT context preparation and execution",
         cxxopts::value<bool>()->default_value("false"));
+    options.add_options()(
+        "enable_batch_size_tuning", "Dynamic tuning of batch size", cxxopts::value<bool>()->default_value("false"));
     options.add_options()("enable_exp_delays", "Enables exponential delay distr to mimic real world request arrival",
         cxxopts::value<bool>()->default_value("false"));
     options.add_options()("streaming", "Operate in streaming mode", cxxopts::value<bool>()->default_value("false"));

@@ -2152,6 +2156,9 @@ int main(int argc, char* argv[])
     // Argument: Enable TRT overlap
     benchmarkParams.enableTrtOverlap = result["enable_trt_overlap"].as<bool>();
 
+    // Argument: Enable dynamic tuning of batch size
+    benchmarkParams.enableBatchSizeTuning = result["enable_batch_size_tuning"].as<bool>();
+
     // Argument: Enable KV cache reuse
     benchmarkParams.enableBlockReuse = result["enable_kv_cache_reuse"].as<bool>();

@@ -2190,6 +2197,11 @@ int main(int argc, char* argv[])
     // Argument: Enable batch stats output
     bool logIterationData = result["log_iteration_data"].as<bool>();
 
+    if (logIterationData)
+    {
+        TLLM_LOG_WARNING("Setting log_iteration_data to true adds overheads and may result in lower perf");
+    }
+
     // Argument: Enable chunked context
     benchmarkParams.enableChunkedContext = result["enable_chunked_context"].as<bool>();
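The new --enable_batch_size_tuning flag is the substantive change in this file: it is parsed with cxxopts and threaded into the executor's scheduler through texec::DynamicBatchConfig. The standalone sketch below is not part of the commit; it only illustrates that flow. The executor header path and the CapacitySchedulerPolicy value are assumptions, while the DynamicBatchConfig and SchedulerConfig constructor calls mirror the diff above.

#include <cxxopts.hpp>
#include <optional>

#include "tensorrt_llm/executor/executor.h" // assumed location of the texec:: types

namespace texec = tensorrt_llm::executor;

int main(int argc, char* argv[])
{
    cxxopts::Options options("batch_size_tuning_sketch", "Sketch of the new benchmark flag");
    options.add_options()(
        "enable_batch_size_tuning", "Dynamic tuning of batch size", cxxopts::value<bool>()->default_value("false"));
    auto result = options.parse(argc, argv);

    // Same parsing pattern as in gptManagerBenchmark.cpp above.
    bool const enableBatchSizeTuning = result["enable_batch_size_tuning"].as<bool>();

    // The flag reaches the scheduler via DynamicBatchConfig, mirroring ExecutorServer's constructor.
    texec::DynamicBatchConfig dynamicBatchConfig(enableBatchSizeTuning);
    texec::SchedulerConfig schedulerConfig(
        texec::CapacitySchedulerPolicy::kGUARANTEED_NO_EVICT, std::nullopt, dynamicBatchConfig);

    // schedulerConfig would then be set on the ExecutorConfig used to build the executor (not shown).
    return 0;
}

On the benchmark itself the option is simply passed on the command line, e.g. --enable_batch_size_tuning=true, alongside the existing scheduler options.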

Diff for: benchmarks/cpp/utils/prepare_real_data.py

-2
@@ -231,8 +231,6 @@ def dataset(root_args, **kwargs):
         }, root_args.output)
     else:
         print_dataset(
-            task_ids,
             input_ids,
             output_lens,
-            tokenizer=None,
         )

Diff for: benchmarks/python/all_reduce.py

+1-1
@@ -41,7 +41,7 @@ def allreduce_benchmark(dtype: str,
     torch.cuda.set_device(local_rank)
     cudart.cudaSetDevice(local_rank)
 
-    mapping = Mapping(world_size, rank, gpus_per_node, world_size)
+    mapping = Mapping(world_size, rank, gpus_per_node, tp_size=world_size)
 
     if world_size == 1:
         raise RuntimeError("Benchmark must run with mpi_world_size > 1")

Diff for: benchmarks/python/enc_dec_benchmark.py

+2-2
@@ -93,7 +93,7 @@ def read_config(component):
 
         cross_attention = pretrained_config[
             "architecture"] == "DecoderModel"
-        skip_cross_qkv = pretrained_config.get('skip_cross_qkv', False)
+        skip_cross_kv = pretrained_config.get('skip_cross_kv', False)
         has_position_embedding = pretrained_config[
             "has_position_embedding"]
         has_token_type_embedding = hasattr(pretrained_config,

@@ -138,7 +138,7 @@ def read_config(component):
             lora_target_modules=lora_config.get('lora_target_modules'),
             trtllm_modules_to_hf_modules=lora_config.get(
                 'trtllm_modules_to_hf_modules'),
-            skip_cross_qkv=skip_cross_qkv,
+            skip_cross_kv=skip_cross_kv,
         )
 
         # additional info for benchmark

Diff for: new file

+55
@@ -0,0 +1,55 @@
+/*
+ * Copyright (c) 2022-2024, NVIDIA CORPORATION. All rights reserved.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#pragma once
+
+#include "tensorrt_llm/runtime/cudaEvent.h"
+#include <atomic>
+#include <condition_variable>
+#include <mutex>
+#include <vector>
+
+namespace tensorrt_llm::batch_manager
+{
+
+// Use to track progress of context phase in dist-serving
+class ContextProgress
+{
+public:
+    ContextProgress(int numLayers);
+
+    void recordEvent(int layerIdx, cudaStream_t stream);
+
+    void wait(int layerIdx);
+
+    int getNumLayers() const
+    {
+        return mCudaEvents.size();
+    }
+
+    cudaEvent_t getEvent(int layerIdx)
+    {
+        return mCudaEvents.at(layerIdx).get();
+    }
+
+private:
+    std::mutex mMutex;
+    std::condition_variable mConditionVariable;
+    std::unique_ptr<std::atomic_bool[]> mCudaEventsRecorded;
+    std::vector<runtime::CudaEvent> mCudaEvents;
+};
+
+} // namespace tensorrt_llm::batch_manager
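
The new header above only declares the interface of ContextProgress; the sketch below is not from the commit and shows one plausible way it could be used in disaggregated serving. runContextLayer and sendKvCacheForLayer are hypothetical placeholders, and the two-thread layout is an assumption: the context-phase thread records a CUDA event per layer, while a transfer thread waits per layer before shipping that layer's KV cache.

#include <cuda_runtime_api.h>
// #include the ContextProgress header introduced above (path omitted here)

using tensorrt_llm::batch_manager::ContextProgress;

// Hypothetical placeholders, not TensorRT-LLM APIs:
void runContextLayer(int /*layerIdx*/, cudaStream_t /*stream*/) {}
void sendKvCacheForLayer(int /*layerIdx*/) {}

// Producer side: enqueue each layer's context-phase work, then mark the layer done on the stream.
void runContextPhase(ContextProgress& progress, cudaStream_t stream)
{
    for (int layer = 0; layer < progress.getNumLayers(); ++layer)
    {
        runContextLayer(layer, stream);
        progress.recordEvent(layer, stream);
    }
}

// Consumer side: block per layer until it has been recorded as done, then transfer its KV cache.
void streamKvCache(ContextProgress& progress)
{
    for (int layer = 0; layer < progress.getNumLayers(); ++layer)
    {
        progress.wait(layer);
        sendKvCacheForLayer(layer);
    }
}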
Diff for: new file

+164
@@ -0,0 +1,164 @@
+/*
+ * Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#pragma once
+
+#include "tensorrt_llm/common/mpiUtils.h"
+#include "tensorrt_llm/runtime/eagleBuffers.h"
+#include "tensorrt_llm/runtime/explicitDraftTokensBuffers.h"
+#include "tensorrt_llm/runtime/iTensor.h"
+#include "tensorrt_llm/runtime/lookaheadBuffers.h"
+#include "tensorrt_llm/runtime/modelConfig.h"
+#include "tensorrt_llm/runtime/worldConfig.h"
+
+#include <optional>
+#include <vector>
+
+namespace tensorrt_llm::runtime
+{
+class TllmRuntime;
+} // namespace tensorrt_llm::runtime
+
+namespace tensorrt_llm::batch_manager
+{
+
+class DecoderStepAsyncSend
+{
+public:
+    using BufferPtr = runtime::IBuffer::SharedPtr;
+
+    DecoderStepAsyncSend(std::shared_ptr<mpi::MpiComm> const& commSession, BufferPtr const& newOutputTokensHost,
+        BufferPtr const& finished, BufferPtr const& sequenceLengthsHost, BufferPtr const& cumLogProbsHost,
+        BufferPtr const& logProbsHost, BufferPtr const& cacheIndirectionOutput, BufferPtr const& acceptedCumSum,
+        BufferPtr const& packedPaths, BufferPtr const& finishReasonsHost, int peer);
+
+    ~DecoderStepAsyncSend();
+
+private:
+    std::shared_ptr<mpi::MpiRequest> mRequest1;
+    std::shared_ptr<mpi::MpiRequest> mRequest2;
+    std::shared_ptr<mpi::MpiRequest> mRequest3;
+    std::shared_ptr<mpi::MpiRequest> mRequest4;
+    std::shared_ptr<mpi::MpiRequest> mRequest5;
+    std::shared_ptr<mpi::MpiRequest> mRequest6;
+    std::shared_ptr<mpi::MpiRequest> mRequest7;
+    std::shared_ptr<mpi::MpiRequest> mRequest8;
+    std::shared_ptr<mpi::MpiRequest> mRequest9;
+};
+
+class DecoderSlotAsyncSend
+{
+public:
+    using TensorPtr = runtime::ITensor::SharedPtr;
+
+    DecoderSlotAsyncSend(std::shared_ptr<mpi::MpiComm> const& commSession, TensorPtr const& outputIdsView,
+        TensorPtr const& sequenceLengthView, TensorPtr const& cumLogProbsView, TensorPtr const& logProbsView,
+        bool returnLogProbs, int peer);
+
+    ~DecoderSlotAsyncSend();
+
+private:
+    std::shared_ptr<mpi::MpiRequest> mRequest1;
+    std::shared_ptr<mpi::MpiRequest> mRequest2;
+    std::shared_ptr<mpi::MpiRequest> mRequest3;
+    std::shared_ptr<mpi::MpiRequest> mRequest4;
+};
+
+class DecoderBuffers
+{
+public:
+    using SizeType32 = runtime::SizeType32;
+    using TensorPtr = runtime::ITensor::SharedPtr;
+
+    std::vector<TensorPtr> logits;
+    TensorPtr slotOutputIds;     // [mMaxNumRequests, beamWidth, maxSeqLen], outputIds of all batch slots
+    TensorPtr slotOutputIdsHost; // [beamWidth, maxSeqLen], outputIds of single batch slot
+    TensorPtr cacheIndirectionInput;
+    TensorPtr cacheIndirectionOutput;
+    TensorPtr sequenceLengths;     // [mMaxNumRequests]
+    TensorPtr sequenceLengthsHost; // [mMaxNumRequests] pinned host tensor
+    TensorPtr finished;            // [mMaxNumRequests] pinned host tensor
+    TensorPtr newOutputTokens;     // [maxTokensPerStep, mMaxNumRequests, beamWidth]
+    TensorPtr newOutputTokensHost; // [maxTokensPerStep, mMaxNumRequests, beamWidth]
+    TensorPtr cumLogProbs;         // [mMaxNumRequests, beamWidth]
+    TensorPtr cumLogProbsHost;     // [mMaxNumRequests, beamWidth]
+    TensorPtr logProbs;            // [mMaxNumRequests, beamWidth, maxSeqLen]
+    TensorPtr logProbsHost;        // [mMaxNumRequests, beamWidth, maxSeqLen]
+    TensorPtr finishReasonsHost;   // [mMaxNumRequests, beamWidth]
+
+    class DraftBuffers
+    {
+    public:
+        TensorPtr nextDraftTokensDevice;        // [mMaxNumRequests, maxTokensPerStep-1]
+        TensorPtr nextDraftTokensHost;          // [mMaxNumRequests, maxTokensPerStep-1]
+        TensorPtr prevDraftTokensLengthsDevice; // [mMaxNumRequests]
+        TensorPtr prevDraftTokensLengthsHost;   // [mMaxNumRequests]
+        TensorPtr nextDraftTokensLengthsDevice; // [mMaxNumRequests]
+        TensorPtr nextDraftTokensLengthsHost;   // [mMaxNumRequests]
+        TensorPtr acceptedLengthsCumSumDevice;  // [mMaxNumRequests+1]
+        TensorPtr acceptedPackedPathsDevice;    // [mMaxNumRequests * maxAcceptedTokens]
+        std::vector<std::vector<runtime::ITensor::SharedPtr>>
+            predictedDraftLogits;               // [mMaxNumRequests][mMaxNumHeads][maxDraftTokens + 1, vocabSize]
+
+        void create(SizeType32 maxNumSequences, SizeType32 maxTokensPerStep, runtime::TllmRuntime const& runtime,
+            runtime::ModelConfig const& modelConfig);
+    };
+
+    DraftBuffers draftBuffers;
+    runtime::ExplicitDraftTokensBuffers::Inputs explicitDraftTokensBuffers;
+    runtime::EagleBuffers::Inputs eagleBuffers;
+    std::optional<runtime::LookaheadDecodingBuffers> lookaheadBuffers;
+
+    DecoderBuffers(SizeType32 maxNumSequences, SizeType32 maxBeamWidth, SizeType32 maxAttentionWindow,
+        SizeType32 maxSeqLen, SizeType32 maxTokensPerStep, runtime::TllmRuntime const& runtime,
+        runtime::ModelConfig const& modelConfig, runtime::WorldConfig const& worldConfig);
+
+    std::unique_ptr<DecoderStepAsyncSend> asyncSend(std::shared_ptr<mpi::MpiComm> const& commSession,
+        bool returnLogProbs, SizeType32 maxBeamWidth, bool useMedusa, int peer);
+
+    void recv(std::shared_ptr<mpi::MpiComm> const& commSession, bool returnLogProbs, SizeType32 maxBeamWidth,
+        bool useMedusa, int peer);
+};
+
+class SlotDecoderBuffers
+{
+public:
+    using SizeType32 = runtime::SizeType32;
+    using TensorPtr = runtime::ITensor::SharedPtr;
+
+    TensorPtr outputIds;           // [beamWidth, maxSeqLen], outputIds of single batch slot
+    TensorPtr outputIdsHost;       // [beamWidth, maxSeqLen], outputIds of single batch slot
+    TensorPtr sequenceLengthsHost; // [beamWidth]
+    TensorPtr cumLogProbs;         // [beamWidth]
+    TensorPtr cumLogProbsHost;     // [beamWidth]
+    TensorPtr logProbs;            // [beamWidth, maxSeqLen]
+    TensorPtr logProbsHost;        // [beamWidth, maxSeqLen]
+    TensorPtr finishReasonsHost;   // [beamWidth]
+
+    SlotDecoderBuffers(SizeType32 maxBeamWidth, SizeType32 maxSeqLen, runtime::TllmRuntime const& runtime);
+
+    static std::unique_ptr<DecoderSlotAsyncSend> asyncSend(std::shared_ptr<mpi::MpiComm> const& commSession,
+        TensorPtr const& outputIdsView, TensorPtr const& sequenceLengthView, TensorPtr const& cumLogProbsView,
+        TensorPtr const& logProbsView, bool returnLogProbs, int peer);
+
+    std::unique_ptr<DecoderSlotAsyncSend> asyncSend(std::shared_ptr<mpi::MpiComm> const& commSession,
+        TensorPtr const& sequenceLengthView, bool returnLogProbs, int peer);
+
+    void recv(std::shared_ptr<mpi::MpiComm> const& commSession, TensorPtr const& sequenceLengthView,
+        bool returnLogProbs, int peer);
+};
+
+} // namespace tensorrt_llm::batch_manager
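
The asyncSend/recv pair on DecoderBuffers suggests a point-to-point exchange of decoder outputs between ranks over MPI. The sketch below is not from the commit and only pairs up the calls declared in the header; commSession, the sender/peer bookkeeping, and the assumption that destroying the returned DecoderStepAsyncSend handle completes the sends are all assumptions.

// #include the decoder-buffers header introduced above (path omitted here)
#include <memory>

namespace tb = tensorrt_llm::batch_manager;
namespace tr = tensorrt_llm::runtime;

void exchangeDecoderOutputs(tb::DecoderBuffers& decoderBuffers,
    std::shared_ptr<tensorrt_llm::mpi::MpiComm> const& commSession, bool isSendingRank, int peer,
    bool returnLogProbs, tr::SizeType32 maxBeamWidth, bool useMedusa)
{
    if (isSendingRank)
    {
        // Non-blocking sends of the host-side decoder outputs to the peer rank.
        auto handle = decoderBuffers.asyncSend(commSession, returnLogProbs, maxBeamWidth, useMedusa, peer);
        // ... other work could overlap here ...
        handle.reset(); // assumption: the handle's destructor waits on the outstanding MPI requests
    }
    else
    {
        // Matching receive on the other side, with the same argument set.
        decoderBuffers.recv(commSession, returnLogProbs, maxBeamWidth, useMedusa, peer);
    }
}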
