Update TensorRT-LLM (NVIDIA#524)
kaiyux authored Dec 1, 2023
1 parent 711a28d commit 71f60f6
Showing 464 changed files with 2,098,071 additions and 6,789 deletions.
4 changes: 4 additions & 0 deletions .gitignore
@@ -18,6 +18,10 @@ venv/
.hypothesis/
.idea/
cpp/cmake-build-*
cpp/.ccache/
tensorrt_llm/libs
tensorrt_llm/bindings.pyi
tensorrt_llm/bindings/*.pyi

# Testing
.coverage.*
1 change: 0 additions & 1 deletion .gitmodules
@@ -1,7 +1,6 @@
[submodule "3rdparty/cutlass"]
path = 3rdparty/cutlass
url = https://github.com/NVIDIA/cutlass.git
branch = v2.10.0
[submodule "3rdparty/json"]
path = 3rdparty/json
url = https://github.com/nlohmann/json.git
6 changes: 2 additions & 4 deletions .pre-commit-config.yaml
@@ -15,7 +15,7 @@ repos:
rev: v4.1.0
hooks:
- id: check-added-large-files
exclude: 'cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/cubin'
exclude: 'cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/'
- id: check-merge-conflict
- id: check-symlinks
- id: detect-private-key
@@ -33,9 +33,7 @@ repos:
- id: clang-format
types_or: [c++, c, cuda]
exclude: |
(?x)^(
cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/cubin/.*
)$
(?x)^(.*cubin.cpp$ | .*fmha_cubin.h)$
- repo: https://github.com/cheshirekow/cmake-format-precommit
rev: v0.6.10
hooks:
124 changes: 118 additions & 6 deletions README.md
@@ -36,7 +36,6 @@ H200 FP8 achieves 11,819 tok/s on Llama2-13B on a single GPU, and is up to 1.9x
[2023/10/4 - Perplexity](https://blog.perplexity.ai/blog/introducing-pplx-api) ;
[2023/9/27 - CloudFlare](https://www.cloudflare.com/press-releases/2023/cloudflare-powers-hyper-local-ai-inference-with-nvidia/);


## Table of Contents

- [TensorRT-LLM Overview](#tensorrt-llm-overview)
@@ -186,7 +185,8 @@ TensorRT-LLM is rigorously tested on the following GPUs:

* [H100](https://www.nvidia.com/en-us/data-center/h100/)
* [L40S](https://www.nvidia.com/en-us/data-center/l40s/)
* [A100](https://www.nvidia.com/en-us/data-center/a100/)/[A30](https://www.nvidia.com/en-us/data-center/products/a30-gpu/)
* [A100](https://www.nvidia.com/en-us/data-center/a100/)
* [A30](https://www.nvidia.com/en-us/data-center/products/a30-gpu/)
* [V100](https://www.nvidia.com/en-us/data-center/v100/) (experimental)

If a GPU is not listed above, it is important to note that TensorRT-LLM is
@@ -254,14 +254,18 @@ The list of supported models is:
* [LLaMA-v2](examples/llama)
* [Mistral](examples/llama)
* [MPT](examples/mpt)
* [mT5](examples/enc_dec)
* [OPT](examples/opt)
* [Qwen](examples/qwen)
* [Replit Code](examples/mpt)
* [SantaCoder](examples/gpt)
* [StarCoder](examples/gpt)
* [T5](examples/enc_dec)

Note: [Encoder-Decoder](examples/enc_dec/) provides general encoder-decoder
support that covers many encoder-decoder models such as T5, Flan-T5, etc. We
spell out the exact model names in the list above so that users can find
specific models more easily.

## Performance

@@ -325,7 +329,11 @@ enable plugins, for example: `--use_gpt_attention_plugin`.

* MPI + Slurm

TensorRT-LLM is an
[MPI](https://en.wikipedia.org/wiki/Message_Passing_Interface)-aware package
that uses [`mpi4py`](https://mpi4py.readthedocs.io/en/stable/). If you are
running scripts in a [Slurm](https://slurm.schedmd.com/) environment, you might
encounter interference:
```
--------------------------------------------------------------------------
PMI2_Init failed to initialize. Return code: 14
@@ -347,19 +355,123 @@ SLURM, depending upon the SLURM version you are using:
Please configure as appropriate and try again.
--------------------------------------------------------------------------
```
As a rule of thumb, if you are running TensorRT-LLM interactively on a Slurm
node, prefix your commands with `mpirun -n 1` to run TensorRT-LLM in a
dedicated MPI environment, not the one provided by your Slurm allocation.

For example: `mpirun -n 1 python3 examples/gpt/build.py ...`
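
The snippet below is a minimal sketch of that workflow on an interactive Slurm allocation. The `salloc` flags and the script arguments are illustrative only; they will differ per cluster and per model, so adapt them to your environment.

```
# Illustrative only: request a single-GPU interactive allocation (flags depend on your cluster).
salloc -N 1 --gres=gpu:1 --time=01:00:00

# Inside the allocation, launch TensorRT-LLM commands under their own MPI
# environment rather than the PMI environment provided by Slurm.
mpirun -n 1 python3 examples/gpt/build.py --dtype float16
mpirun -n 1 python3 examples/gpt/run.py --max_output_len 8
```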

## Release notes

* TensorRT-LLM requires TensorRT 9.1.0.4 and 23.08 containers.
* TensorRT-LLM requires TensorRT 9.2 and 23.10 containers.

### Change Log

#### Version 0.6.0

* Models
* ChatGLM3
* InternLM (contributed by @wangruohui)
* Mistral 7B (developed in collaboration with Mistral.AI)
* MQA/GQA support to MPT (and GPT) models (contributed by @bheilbrun)
* Qwen (contributed by @Tlntin and @zhaohb)
* Replit Code V-1.5 3B (external contribution)
* T5, mT5, Flan-T5 (Python runtime only)

* Features
* Add runtime statistics related to active requests and KV cache
utilization from the batch manager (see
the [batch manager](docs/source/batch_manager.md) documentation)
* Add `sequence_length` tensor to support proper lengths in beam-search
(when beam-width > 1 - see
[tensorrt_llm/batch_manager/GptManager.h](cpp/include/tensorrt_llm/batch_manager/GptManager.h))
* BF16 support for encoder-decoder models (Python runtime - see
[examples/enc_dec](examples/enc_dec/README.md))
* Improvements to memory utilization (CPU and GPU - including memory
leaks)
* Improved error reporting and memory consumption
* Improved support for stop and bad words
* INT8 SmoothQuant and INT8 KV Cache support for the Baichuan models (see
[examples/baichuan](examples/baichuan/README.md))
* INT4 AWQ Tensor Parallelism support and INT8 KV cache + AWQ/weight-only
support for the GPT-J model (see [examples/gptj](examples/gptj/README.md))
* INT4 AWQ support for the Falcon models
(see [examples/falcon](examples/falcon/README.md))
* LoRA support (functional preview only - limited to the Python runtime,
only QKV support and not optimized in terms of runtime performance) for
the GPT model (see the
[Run LoRA with the Nemo checkpoint](examples/gpt/README.md#Run-LoRA-with-the-Nemo-checkpoint)
in the GPT example)
* Multi-GPU support for encoder-decoder models (Python runtime - see
[examples/enc_dec](examples/enc_dec/README.md))
* New heuristic for launching the Multi-block Masked MHA kernel (similar
to FlashDecoding - see
[decoderMaskedMultiheadAttentionLaunch.h](cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderMaskedMultiheadAttentionLaunch.h))
* Prompt-Tuning support for GPT and LLaMA models (see the
[Prompt-tuning](examples/gpt/README.md#Prompt-tuning) Section in the GPT example)
* Performance optimizations in various CUDA kernels
* Possibility to exclude input tokens from the output (see `excludeInputInOutput` in
[`GptManager`](cpp/include/tensorrt_llm/batch_manager/GptManager.h))
* Python binding for the C++ runtime (GptSession - see [`pybind`](cpp/tensorrt_llm/pybind))
* Support for different micro batch sizes for context and generation
phases with pipeline parallelism (see `GptSession::Config::ctxMicroBatchSize` and
`GptSession::Config::genMicroBatchSize` in
[tensorrt_llm/runtime/gptSession.h](cpp/include/tensorrt_llm/runtime/gptSession.h))
* Support for "remove input padding" for encoder-decoder models (see
[examples/enc_dec](examples/enc_dec/README.md))
* Support for context and generation logits (see `mComputeContextLogits` and
`mComputeGenerationLogits` in
[tensorrt_llm/runtime/gptModelConfig.h](cpp/include/tensorrt_llm/runtime/gptModelConfig.h))
* Support for `logProbs` and `cumLogProbs` (see `"output_log_probs"` and
`"cum_log_probs"` in [`GptManager`](cpp/include/tensorrt_llm/batch_manager/GptManager.h))
* Update to CUTLASS 3.x

* Bug fixes
* Fix for ChatGLM2 #93 and #138
* Fix tensor names error "RuntimeError: Tensor names
(`host_max_kv_cache_length`) in engine are not the same as expected in
the main branch" #369
* Fix weights split issue in BLOOM when `world_size = 2` ("array split
does not result in an equal division") #374
  * Fix SmoothQuant multi-GPU failure when tensor parallelism is 2 #267
* Fix a crash in GenerationSession if stream keyword argument is not None
#202
  * Fix a typo when calling the PyNVML API ([BUG] code bug #410)
* Fix bugs related to the improper management of the `end_id` for various
models [C++ and Python]
* Fix memory leaks [C++ code and Python models]
  * Fix the std::bad_alloc error when running the gptManagerBenchmark -- issue
    gptManagerBenchmark std::bad_alloc error #66
* Fix a bug in pipeline parallelism when beam-width > 1
* Fix a bug with Llama GPTQ due to improper support of GQA
* Fix issue #88
* Fix an issue with the Huggingface Transformers version #16
* Fix link jump in windows readme.md #30 - by @yuanlehome
* Fix typo in batchScheduler.h #56 - by @eltociear
* Fix typo #58 - by @RichardScottOZ
* Fix Multi-block MMHA: Difference between `max_batch_size` in the engine
builder and `max_num_sequences` in TrtGptModelOptionalParams? #65
* Fix the log message to be more accurate on KV cache #224
* Fix Windows release wheel installation: Failed to install the release
wheel for Windows using pip #261
* Fix missing torch dependencies: [BUG] The batch_manage.a choice error
in --cpp-only when torch's cxx_abi version is different with gcc #151
* Fix linking error during compiling google-test & benchmarks #277
* Fix logits dtype for Baichuan and ChatGLM: segmentation fault caused by
the lack of bfloat16 #335
* Minor bug fixes

#### Version 0.5.0

* TensorRT-LLM v0.5.0 is the first public release.

### Known Issues

* The hang reported in issue
[#149](https://github.com/triton-inference-server/tensorrtllm_backend/issues/149)
has not been reproduced by the TensorRT-LLM team. If it is caused by a bug
  in TensorRT-LLM, that bug may be present in that release.

### Report Issues

You can use GitHub issues to report issues with TensorRT-LLM.
6 changes: 5 additions & 1 deletion benchmarks/cpp/README.md
@@ -18,9 +18,13 @@ instead, and be sure to set DLL paths as specified in

### 2. Launch C++ benchmarking (Fixed BatchSize/InputLen/OutputLen)

#### Prepare TensorRT-LLM engine(s)

Before you launch C++ benchmarking, please make sure that you have already built engine(s) using the TensorRT-LLM API; the C++ benchmarking code cannot generate engine(s) for you.

You can use the [`build.py`](../python/build.py) script to build the engine(s). Alternatively, if you have already benchmarked the Python runtime, you can reuse the engine(s) built by that benchmarking code; please see the [Python benchmark documentation](../python/README.md).
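
A minimal sketch of an engine-build invocation is shown below. The flag names are assumptions and vary between versions, so check `python3 build.py --help` for the options available in your checkout.

```
# Illustrative only: build a small GPT engine for benchmarking.
# Flag names are assumptions -- verify them with `python3 build.py --help`.
cd benchmarks/python
python3 build.py --model gpt_350m --mode plugin --output_dir ./engines/gpt_350m
```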

#### Launch benchmarking

For detailed usage, you can do the following
```
