Update TensorRT-LLM (NVIDIA#1098)
* Update TensorRT-LLM

* update submodule

* Remove unused binaries
kaiyux authored Feb 18, 2024
1 parent 0ab9d17 commit 0f041b7
Showing 231 changed files with 11,413 additions and 4,628 deletions.
2 changes: 1 addition & 1 deletion 3rdparty/cutlass
Submodule cutlass updated 246 files
18 changes: 16 additions & 2 deletions benchmarks/cpp/README.md
Run a preprocessing script to prepare/generate the dataset into a JSON that gptManagerBenchmark can consume.

This tool can be used in 2 different modes of traffic generation.

##### 1 – Dataset

“Prompt”, “Instruction” (optional) and “Answer” are specified as sentences in a JSON file.

```
python3 prepare_dataset.py \
--max-input-len 300
```
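The dataset mode described above reads records with the “Prompt”, optional “Instruction”, and “Answer” fields. As a minimal sketch (the field names follow the description above; the authoritative schema is whatever `prepare_dataset.py` actually consumes), such a file could be written like this:

```python
import json

# Hypothetical records using the field names described above
# ("Prompt", "Instruction" optional, "Answer"); this is an
# illustrative sketch, not the official dataset schema.
records = [
    {
        "Prompt": "Summarize the following text.",
        "Instruction": "Be brief.",
        "Answer": "A short summary.",
    },
    {
        "Prompt": "Translate to French: hello",
        "Answer": "bonjour",
    },
]

with open("dataset.json", "w") as f:
    json.dump(records, f, indent=2)
```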

##### 2 – Normal token length distribution

This mode allows the user to generate normal token length distributions with a mean and std deviation specified.
For example, setting mean=100 and std dev=10 would generate requests where 95.4% of values are in <80,120> range following the normal probability distribution. Setting std dev=0 will generate all requests with the same mean number of tokens.
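The sampling behavior described above (including the degenerate std dev = 0 case) can be sketched in Python; this is an illustration of the distribution, not the benchmark's actual generator:

```python
import random

def sample_token_lengths(mean, stdev, n, seed=0):
    """Sample per-request token counts from a normal distribution.

    stdev == 0 degenerates to every request getting exactly `mean`
    tokens, matching the behavior described above. Sampled values
    are rounded to whole tokens and clamped to at least 1.
    """
    rng = random.Random(seed)
    return [max(1, round(rng.gauss(mean, stdev))) for _ in range(n)]

lengths = sample_token_lengths(mean=100, stdev=10, n=10_000)
# With mean=100 and std dev=10, roughly 95.4% of samples fall in
# [80, 120], the +/- 2 sigma band of the normal distribution.
within = sum(80 <= x <= 120 for x in lengths) / len(lengths)
```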
```
mpirun -n 2 ./benchmarks/gptManagerBenchmark \
    --dataset ../../benchmarks/cpp/preprocessed_dataset.json \
    --max_num_samples 500
```

To emulate `gptSessionBenchmark` static batching, you can use the `--static_emulated_batch_size` and `--static_emulated_timeout` arguments.
Given a `static_emulated_batch_size` of `n`, the server waits for `n` requests to arrive before submitting them to the batch manager at once. If the `static_emulated_timeout` (in ms) is reached before `n` requests are collected, the batch is submitted early with however many requests have arrived.
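The batch-or-timeout policy just described can be sketched as follows; this is an illustrative Python model of the logic, not the benchmark's C++ implementation:

```python
import queue
import time

def collect_static_batch(request_queue, batch_size, timeout_ms):
    """Collect up to `batch_size` requests, submitting early if
    `timeout_ms` elapses first -- a sketch of the emulated static
    batching policy described above.
    """
    batch = []
    deadline = time.monotonic() + timeout_ms / 1000.0
    while len(batch) < batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # timeout reached: submit the partial batch
        try:
            batch.append(request_queue.get(timeout=remaining))
        except queue.Empty:
            break  # timed out while waiting for the next request
    return batch

q = queue.Queue()
for i in range(3):
    q.put(f"req-{i}")
# Only 3 requests arrive before the 50 ms timeout, so a partial
# batch of 3 is submitted instead of waiting for all 32.
batch = collect_static_batch(q, batch_size=32, timeout_ms=50)
```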

Take GPT-350M as an example for a single GPU with static batching:
```
./benchmarks/gptManagerBenchmark \
--model gpt \
--engine_dir ../../examples/gpt/trt_engine/gpt2/fp16/1-gpu/ \
--type IFB \
--static_emulated_batch_size 32 \
--static_emulated_timeout 100 \
--dataset ../../benchmarks/cpp/preprocessed_dataset.json
```
Binary file removed benchmarks/cpp/bertBenchmark
