System Info
A100 80GB PCIe/ A100 Multi-Instance GPU (MIG)
Who can help?
No response
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
Download the Qwen2.5-0.5B model and build the engine with INT8 SmoothQuant:

```bash
cd examples/qwen
git clone https://huggingface.co/Qwen/Qwen2.5-0.5B
python ./convert_checkpoint.py --model_dir Qwen2.5-0.5B --output_dir ./tmp/sq0.5 --dtype float16 --smoothquant 0.5 --per_token --per_channel --int8_kv_cache
trtllm-build --checkpoint_dir ./tmp/sq0.5 --output_dir ./trt_engines --gemm_plugin float16 --gpt_attention_plugin float16 --max_input_len 384 --max_seq_len 385 --max_batch_size 7 --remove_input_padding enable --context_fmha enable --gather_generation_logits
```
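As a sanity check, the build settings baked into the engine can be read back from the `config.json` that `trtllm-build` writes into the output directory. A minimal sketch (the exact key layout can differ across TensorRT-LLM versions, so treat the key names as assumptions):

```python
import json

# Path follows the --output_dir used in the trtllm-build command above.
with open("./trt_engines/config.json", "r", encoding="utf-8") as f:
    config = json.load(f)

# Dump the build section to confirm max_input_len / max_seq_len /
# max_batch_size and the quantization settings actually compiled in.
print(json.dumps(config.get("build_config", config), indent=2))
```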
Prepare the test data:

```bash
python ../../benchmarks/cpp/prepare_dataset.py --tokenizer=./Qwen2.5-0.5B/ --stdout token-norm-dist --num-requests=700 --input-mean=384 --output-mean=1 --input-stdev=0 --output-stdev=0 > ./output
```
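Since `--input-stdev=0` should produce exactly 384 input tokens per request, the generated file can be verified before benchmarking. A small sketch, assuming the same JSON-lines layout that `test_TTFT.py` reads below:

```python
import json

lengths = []
with open("./output", "r", encoding="utf-8") as f:
    for line in f:
        lengths.append(len(json.loads(line)["input_ids"]))

# Expect 700 requests of exactly 384 tokens each.
print("requests:", len(lengths))
print("min/max input length:", min(lengths), max(lengths))
```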
Measure the time to first token (TTFT):

```bash
python test_TTFT.py --batch_size 7 --engine_dir ./trt_engines/ --input_file output
```

The content of `test_TTFT.py` is as follows:
```python
import argparse
import json
import time

import numpy as np
import torch

from tensorrt_llm.runtime import ModelRunnerCpp


def main(batch_size, engine_dir, input_file):
    # Load the tokenized requests produced by prepare_dataset.py (JSON lines).
    input_ids_list = []
    with open(input_file, "r", encoding="utf-8") as f:
        for line in f:
            input_ids_list.append(json.loads(line)["input_ids"])

    # Split the requests into fixed-size batches.
    batches = [
        input_ids_list[i:i + batch_size]
        for i in range(0, len(input_ids_list), batch_size)
    ]

    runner_kwargs = dict(
        engine_dir=engine_dir,
        max_batch_size=batch_size,
        max_input_len=384,
        max_output_len=1,
    )
    runner = ModelRunnerCpp.from_dir(**runner_kwargs)

    # Time each batch end to end; with max_new_tokens=1 this approximates TTFT.
    time_costs = []
    for batch in batches:
        batch_input_ids = [torch.IntTensor(inp) for inp in batch]
        start = time.perf_counter()
        outputs = runner.generate(
            batch_input_ids=batch_input_ids,
            max_new_tokens=1,
            return_dict=True,
        )
        end = time.perf_counter()
        time_costs.append((end - start) * 1000)  # milliseconds

    print("P50 time cost: ", np.percentile(time_costs, 50))
    print("P95 time cost: ", np.percentile(time_costs, 95))


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Measure TTFT using TensorRT-LLM")
    parser.add_argument("--batch_size", type=int, required=True, help="Batch size for processing")
    parser.add_argument("--engine_dir", type=str, required=True, help="Directory of the engine")
    parser.add_argument("--input_file", type=str, required=True, help="Input file containing data")
    args = parser.parse_args()
    main(args.batch_size, args.engine_dir, args.input_file)
```
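One caveat on the measurement itself: the first timed batch also pays one-time warmup costs (CUDA context creation, memory-pool growth, kernel autotuning), which can skew the percentiles. A hedged variant runs one untimed warmup batch first; `runner` and `batches` are the same objects built in `test_TTFT.py` above:

```python
# Sketch of a warmup pass for test_TTFT.py (our suggestion, not required API).
# `runner` and `batches` are constructed exactly as in main() above.
warmup_ids = [torch.IntTensor(inp) for inp in batches[0]]
runner.generate(batch_input_ids=warmup_ids, max_new_tokens=1, return_dict=True)
torch.cuda.synchronize()  # ensure warmup work finishes before timing starts

# ...then run the timing loop over `batches` as before...
```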
Expected behavior
We expected that with a batch size of 7, an input length of 384, and a single output token, the engine's time to first token would be lower than what we measured. Do you have any recommendations or insights for improving it?
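For context on why we expected better, here is a rough back-of-envelope estimate (our own assumption, using the common ~2 * params * tokens FLOPs rule for prefill and A100 peak dense INT8 throughput), which puts the ideal compute time for one batch at a few milliseconds:

```python
# Back-of-envelope prefill estimate (an assumption, not a measured bound).
params = 0.5e9               # approximate Qwen2.5-0.5B parameter count
tokens = 7 * 384             # tokens per batch (batch_size * input_len)
flops = 2 * params * tokens  # ~2 FLOPs per parameter per token for prefill

a100_int8_peak = 624e12      # A100 peak dense INT8 tensor-core throughput
print(f"ideal compute time: {flops / a100_int8_peak * 1e3:.2f} ms")
# ~4.3 ms at 100% utilization; real kernels reach only a fraction of peak,
# so 14.7 ms may be reasonable, but we would like to confirm.
```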
actual behavior
A100:
P50 time cost: 14.66 ms
P95 time cost: 15.22 ms

A100 MIG:
P50 time cost: 81.50 ms
P95 time cost: 82.24 ms
additional notes
transformers version: 4.42.3
TensorRT-LLM version: 0.16.0.dev2024111900