Update TensorRT-LLM (NVIDIA#524)
kaiyux authored Dec 1, 2023
1 parent 711a28d commit 71f60f6
Showing 464 changed files with 2,098,071 additions and 6,789 deletions.
4 changes: 4 additions & 0 deletions .gitignore
@@ -18,6 +18,10 @@ venv/
.hypothesis/
.idea/
cpp/cmake-build-*
cpp/.ccache/
tensorrt_llm/libs
tensorrt_llm/bindings.pyi
tensorrt_llm/bindings/*.pyi

# Testing
.coverage.*
1 change: 0 additions & 1 deletion .gitmodules
@@ -1,7 +1,6 @@
[submodule "3rdparty/cutlass"]
path = 3rdparty/cutlass
url = https://github.com/NVIDIA/cutlass.git
branch = v2.10.0
[submodule "3rdparty/json"]
path = 3rdparty/json
url = https://github.com/nlohmann/json.git
6 changes: 2 additions & 4 deletions .pre-commit-config.yaml
@@ -15,7 +15,7 @@ repos:
rev: v4.1.0
hooks:
- id: check-added-large-files
exclude: 'cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/cubin'
exclude: 'cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/'
- id: check-merge-conflict
- id: check-symlinks
- id: detect-private-key
@@ -33,9 +33,7 @@ repos:
- id: clang-format
types_or: [c++, c, cuda]
exclude: |
(?x)^(
cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/cubin/.*
)$
(?x)^(.*cubin.cpp$ | .*fmha_cubin.h)$
- repo: https://github.com/cheshirekow/cmake-format-precommit
rev: v0.6.10
hooks:
124 changes: 118 additions & 6 deletions README.md
@@ -36,7 +36,6 @@ H200 FP8 achieves 11,819 tok/s on Llama2-13B on a single GPU, and is up to 1.9x
[2023/10/4 - Perplexity](https://blog.perplexity.ai/blog/introducing-pplx-api) ;
[2023/9/27 - CloudFlare](https://www.cloudflare.com/press-releases/2023/cloudflare-powers-hyper-local-ai-inference-with-nvidia/);


## Table of Contents

- [TensorRT-LLM Overview](#tensorrt-llm-overview)
@@ -186,7 +185,8 @@ TensorRT-LLM is rigorously tested on the following GPUs:

* [H100](https://www.nvidia.com/en-us/data-center/h100/)
* [L40S](https://www.nvidia.com/en-us/data-center/l40s/)
* [A100](https://www.nvidia.com/en-us/data-center/a100/)/[A30](https://www.nvidia.com/en-us/data-center/products/a30-gpu/)
* [A100](https://www.nvidia.com/en-us/data-center/a100/)
* [A30](https://www.nvidia.com/en-us/data-center/products/a30-gpu/)
* [V100](https://www.nvidia.com/en-us/data-center/v100/) (experimental)

If a GPU is not listed above, it is important to note that TensorRT-LLM is
@@ -254,14 +254,18 @@ The list of supported models is:
* [LLaMA-v2](examples/llama)
* [Mistral](examples/llama)
* [MPT](examples/mpt)
* [mT5](examples/enc_dec)
* [OPT](examples/opt)
* [Qwen](examples/qwen)
* [Replit Code](examples/mpt)
* [SantaCoder](examples/gpt)
* [StarCoder](examples/gpt)
* [T5](examples/enc_dec)

Note: [Encoder-Decoder](examples/enc_dec/) provides general encoder-decoder
support that covers many encoder-decoder models such as T5, Flan-T5, etc. We
spell out the exact model names in the list above so that users can find
specific models more easily.

## Performance

@@ -325,7 +329,11 @@ enable plugins, for example: `--use_gpt_attention_plugin`.

* MPI + Slurm

TensorRT-LLM is an
[MPI](https://en.wikipedia.org/wiki/Message_Passing_Interface)-aware package
that uses [`mpi4py`](https://mpi4py.readthedocs.io/en/stable/). If you are
running scripts in a [Slurm](https://slurm.schedmd.com/) environment, you might
encounter interference:
```
--------------------------------------------------------------------------
PMI2_Init failed to initialize. Return code: 14
@@ -347,19 +355,123 @@ SLURM, depending upon the SLURM version you are using:
Please configure as appropriate and try again.
--------------------------------------------------------------------------
```
As a rule of thumb, if you are running TensorRT-LLM interactively on a Slurm
node, prefix your commands with `mpirun -n 1` to run TensorRT-LLM in a
dedicated MPI environment, not the one provided by your Slurm allocation.

For example: `mpirun -n 1 python3 examples/gpt/build.py ...`
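
The snippet below is a minimal sketch of that workflow on an interactive Slurm allocation. The `salloc` flags and the script arguments are illustrative only; they will differ per cluster and per model, so adapt them to your environment.

```
# Illustrative only: request a single-GPU interactive allocation (flags depend on your cluster).
salloc -N 1 --gres=gpu:1 --time=01:00:00

# Inside the allocation, launch TensorRT-LLM commands under their own MPI
# environment rather than the PMI environment provided by Slurm.
mpirun -n 1 python3 examples/gpt/build.py --dtype float16
mpirun -n 1 python3 examples/gpt/run.py --max_output_len 8
```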

## Release notes

* TensorRT-LLM requires TensorRT 9.1.0.4 and 23.08 containers.
* TensorRT-LLM requires TensorRT 9.2 and 23.10 containers.

### Change Log

#### Version 0.6.0

* Models
* ChatGLM3
* InternLM (contributed by @wangruohui)
* Mistral 7B (developed in collaboration with Mistral.AI)
* MQA/GQA support to MPT (and GPT) models (contributed by @bheilbrun)
* Qwen (contributed by @Tlntin and @zhaohb)
* Replit Code V-1.5 3B (external contribution)
* T5, mT5, Flan-T5 (Python runtime only)

* Features
* Add runtime statistics related to active requests and KV cache
utilization from the batch manager (see
the [batch manager](docs/source/batch_manager.md) documentation)
* Add `sequence_length` tensor to support proper lengths in beam-search
(when beam-width > 1 - see
[tensorrt_llm/batch_manager/GptManager.h](cpp/include/tensorrt_llm/batch_manager/GptManager.h))
* BF16 support for encoder-decoder models (Python runtime - see
[examples/enc_dec](examples/enc_dec/README.md))
* Improvements to memory utilization (CPU and GPU - including memory
leaks)
* Improved error reporting and memory consumption
* Improved support for stop and bad words
* INT8 SmoothQuant and INT8 KV Cache support for the Baichuan models (see
[examples/baichuan](examples/baichuan/README.md))
* INT4 AWQ Tensor Parallelism support and INT8 KV cache + AWQ/weight-only
support for the GPT-J model (see [examples/gptj](examples/gptj/README.md))
* INT4 AWQ support for the Falcon models
(see [examples/falcon](examples/falcon/README.md))
* LoRA support (functional preview only - limited to the Python runtime,
only QKV support and not optimized in terms of runtime performance) for
the GPT model (see the
[Run LoRA with the Nemo checkpoint](examples/gpt/README.md#Run-LoRA-with-the-Nemo-checkpoint)
in the GPT example)
* Multi-GPU support for encoder-decoder models (Python runtime - see
[examples/enc_dec](examples/enc_dec/README.md))
* New heuristic for launching the Multi-block Masked MHA kernel (similar
to FlashDecoding - see
[decoderMaskedMultiheadAttentionLaunch.h](cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderMaskedMultiheadAttentionLaunch.h))
* Prompt-Tuning support for GPT and LLaMA models (see the
[Prompt-tuning](examples/gpt/README.md#Prompt-tuning) Section in the GPT example)
* Performance optimizations in various CUDA kernels
* Possibility to exclude input tokens from the output (see `excludeInputInOutput` in
[`GptManager`](cpp/include/tensorrt_llm/batch_manager/GptManager.h))
* Python binding for the C++ runtime (GptSession - see [`pybind`](cpp/tensorrt_llm/pybind))
* Support for different micro batch sizes for context and generation
phases with pipeline parallelism (see `GptSession::Config::ctxMicroBatchSize` and
`GptSession::Config::genMicroBatchSize` in
[tensorrt_llm/runtime/gptSession.h](cpp/include/tensorrt_llm/runtime/gptSession.h))
* Support for "remove input padding" for encoder-decoder models (see
[examples/enc_dec](examples/enc_dec/README.md))
* Support for context and generation logits (see `mComputeContextLogits` and
`mComputeGenerationLogits` in
[tensorrt_llm/runtime/gptModelConfig.h](cpp/include/tensorrt_llm/runtime/gptModelConfig.h))
* Support for `logProbs` and `cumLogProbs` (see `"output_log_probs"` and
`"cum_log_probs"` in [`GptManager`](cpp/include/tensorrt_llm/batch_manager/GptManager.h))
* Update to CUTLASS 3.x

* Bug fixes
* Fix for ChatGLM2 #93 and #138
* Fix tensor names error "RuntimeError: Tensor names
(`host_max_kv_cache_length`) in engine are not the same as expected in
the main branch" #369
* Fix weights split issue in BLOOM when `world_size = 2` ("array split
does not result in an equal division") #374
  * Fix SmoothQuant multi-GPU failure when tensor parallelism is 2 #267
* Fix a crash in GenerationSession if stream keyword argument is not None
#202
  * Fix a typo when calling the PyNVML API ([BUG] code bug #410)
* Fix bugs related to the improper management of the `end_id` for various
models [C++ and Python]
* Fix memory leaks [C++ code and Python models]
  * Fix the std::bad_alloc error when running the gptManagerBenchmark -- issue
    gptManagerBenchmark std::bad_alloc error #66
* Fix a bug in pipeline parallelism when beam-width > 1
* Fix a bug with Llama GPTQ due to improper support of GQA
* Fix issue #88
* Fix an issue with the Huggingface Transformers version #16
* Fix link jump in windows readme.md #30 - by @yuanlehome
* Fix typo in batchScheduler.h #56 - by @eltociear
* Fix typo #58 - by @RichardScottOZ
* Fix Multi-block MMHA: Difference between `max_batch_size` in the engine
builder and `max_num_sequences` in TrtGptModelOptionalParams? #65
* Fix the log message to be more accurate on KV cache #224
* Fix Windows release wheel installation: Failed to install the release
wheel for Windows using pip #261
* Fix missing torch dependencies: [BUG] The batch_manage.a choice error
in --cpp-only when torch's cxx_abi version is different with gcc #151
* Fix linking error during compiling google-test & benchmarks #277
* Fix logits dtype for Baichuan and ChatGLM: segmentation fault caused by
the lack of bfloat16 #335
* Minor bug fixes

#### Version 0.5.0

* TensorRT-LLM v0.5.0 is the first public release.

### Known Issues

* The hang reported in issue
[#149](https://github.com/triton-inference-server/tensorrtllm_backend/issues/149)
has not been reproduced by the TensorRT-LLM team. If it is caused by a bug
  in TensorRT-LLM, that bug may be present in that release.

### Report Issues

You can use GitHub issues to report issues with TensorRT-LLM.
6 changes: 5 additions & 1 deletion benchmarks/cpp/README.md
@@ -18,9 +18,13 @@ instead, and be sure to set DLL paths as specified in

### 2. Launch C++ benchmarking (Fixed BatchSize/InputLen/OutputLen)

#### Prepare TensorRT-LLM engine(s)

Before you launch C++ benchmarking, please make sure that you have already built engine(s) using the TensorRT-LLM API; the C++ benchmarking code cannot generate engine(s) for you.

You can use the [`build.py`](../python/build.py) script to build the engine(s). Alternatively, if you have already benchmarked the Python runtime, you can reuse the engine(s) built by that benchmarking code; please see the [Python benchmark documentation](../python/README.md).
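
A minimal sketch of an engine-build invocation is shown below. The flag names are assumptions and vary between versions, so check `python3 build.py --help` for the options available in your checkout.

```
# Illustrative only: build a small GPT engine for benchmarking.
# Flag names are assumptions -- verify them with `python3 build.py --help`.
cd benchmarks/python
python3 build.py --model gpt_350m --mode plugin --output_dir ./engines/gpt_350m
```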

#### Launch benchmarking

For detailed usage, you can do the following
```
