System Info
A100 80GB PCIe/ A100 Multi-Instance GPU (MIG)
Who can help?
No response
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
Download the Qwen2.5-0.5B model and build the engine with INT8 SmoothQuant:

```bash
cd examples/qwen
git clone https://huggingface.co/Qwen/Qwen2.5-0.5B
python ./convert_checkpoint.py --model_dir Qwen2.5-0.5B --output_dir ./tmp/sq0.5 --dtype float16 --smoothquant 0.5 --per_token --per_channel --int8_kv_cache
trtllm-build --checkpoint_dir ./tmp/sq0.5 --output_dir ./trt_engines --gemm_plugin float16 --gpt_attention_plugin float16 --max_input_len 384 --max_seq_len 385 --max_batch_size 7 --remove_input_padding enable --context_fmha enable --gather_generation_logits
```
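As a sanity check, the build settings baked into the engine can be read back from the `config.json` that `trtllm-build` writes into the output directory. A minimal sketch (the exact key layout can differ across TensorRT-LLM versions, so treat the key names as assumptions):

```python
import json

# Path follows the --output_dir used in the trtllm-build command above.
with open("./trt_engines/config.json", "r", encoding="utf-8") as f:
    config = json.load(f)

# Dump the build section to confirm max_input_len / max_seq_len /
# max_batch_size and the quantization settings actually compiled in.
print(json.dumps(config.get("build_config", config), indent=2))
```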
Prepare the test data:

```bash
python ../../benchmarks/cpp/prepare_dataset.py --tokenizer=./Qwen2.5-0.5B/ --stdout token-norm-dist --num-requests=700 --input-mean=384 --output-mean=1 --input-stdev=0 --output-stdev=0 > ./output
```
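Since `--input-stdev=0` should produce exactly 384 input tokens per request, the generated file can be verified before benchmarking. A small sketch, assuming the same JSON-lines layout that `test_TTFT.py` reads below:

```python
import json

lengths = []
with open("./output", "r", encoding="utf-8") as f:
    for line in f:
        lengths.append(len(json.loads(line)["input_ids"]))

# Expect 700 requests of exactly 384 tokens each.
print("requests:", len(lengths))
print("min/max input length:", min(lengths), max(lengths))
```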
Measure the time to first token (TTFT):

```bash
python test_TTFT.py --batch_size 7 --engine_dir ./trt_engines/ --input_file output
```

The content of `test_TTFT.py` is as follows:
```python
import argparse
import json
import time

import numpy as np
import torch

from tensorrt_llm.runtime import ModelRunnerCpp


def main(batch_size, engine_dir, input_file):
    # Load the tokenized requests produced by prepare_dataset.py (JSON lines).
    input_ids_list = []
    with open(input_file, "r", encoding="utf-8") as f:
        for line in f:
            input_ids_list.append(json.loads(line)["input_ids"])

    # Split the requests into fixed-size batches.
    batches = [
        input_ids_list[i:i + batch_size]
        for i in range(0, len(input_ids_list), batch_size)
    ]

    runner_kwargs = dict(
        engine_dir=engine_dir,
        max_batch_size=batch_size,
        max_input_len=384,
        max_output_len=1,
    )
    runner = ModelRunnerCpp.from_dir(**runner_kwargs)

    # Time each batch end to end; with max_new_tokens=1 this approximates TTFT.
    time_costs = []
    for batch in batches:
        batch_input_ids = [torch.IntTensor(inp) for inp in batch]
        start = time.perf_counter()
        outputs = runner.generate(
            batch_input_ids=batch_input_ids,
            max_new_tokens=1,
            return_dict=True,
        )
        end = time.perf_counter()
        time_costs.append((end - start) * 1000)  # milliseconds

    print("P50 time cost: ", np.percentile(time_costs, 50))
    print("P95 time cost: ", np.percentile(time_costs, 95))


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Measure TTFT using TensorRT-LLM")
    parser.add_argument("--batch_size", type=int, required=True, help="Batch size for processing")
    parser.add_argument("--engine_dir", type=str, required=True, help="Directory of the engine")
    parser.add_argument("--input_file", type=str, required=True, help="Input file containing data")
    args = parser.parse_args()
    main(args.batch_size, args.engine_dir, args.input_file)
```
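One caveat on the measurement itself: the first timed batch also pays one-time warmup costs (CUDA context creation, memory-pool growth, kernel autotuning), which can skew the percentiles. A hedged variant runs one untimed warmup batch first; `runner` and `batches` are the same objects built in `test_TTFT.py` above:

```python
# Sketch of a warmup pass for test_TTFT.py (our suggestion, not required API).
# `runner` and `batches` are constructed exactly as in main() above.
warmup_ids = [torch.IntTensor(inp) for inp in batches[0]]
runner.generate(batch_input_ids=warmup_ids, max_new_tokens=1, return_dict=True)
torch.cuda.synchronize()  # ensure warmup work finishes before timing starts

# ...then run the timing loop over `batches` as before...
```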
Expected behavior
We expected that with a batch size of 7, an input length of 384, and a single output token, the engine's time to first token would be lower than what we measured. Do you have any recommendations or insights for improving it?
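For context on why we expected better, here is a rough back-of-envelope estimate (our own assumption, using the common ~2 * params * tokens FLOPs rule for prefill and A100 peak dense INT8 throughput), which puts the ideal compute time for one batch at a few milliseconds:

```python
# Back-of-envelope prefill estimate (an assumption, not a measured bound).
params = 0.5e9               # approximate Qwen2.5-0.5B parameter count
tokens = 7 * 384             # tokens per batch (batch_size * input_len)
flops = 2 * params * tokens  # ~2 FLOPs per parameter per token for prefill

a100_int8_peak = 624e12      # A100 peak dense INT8 tensor-core throughput
print(f"ideal compute time: {flops / a100_int8_peak * 1e3:.2f} ms")
# ~4.3 ms at 100% utilization; real kernels reach only a fraction of peak,
# so 14.7 ms may be reasonable, but we would like to confirm.
```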
actual behavior
A100:
P50 time cost: 14.66 ms
P95 time cost: 15.22 ms

A100 MIG:
P50 time cost: 81.50 ms
P95 time cost: 82.24 ms
additional notes
transformers version: 4.42.3
TensorRT-LLM version: 0.16.0.dev2024111900