[Performance] TTFT of qwen2.5 0.5B model #2598

Open
@ReginaZh

System Info

A100 80GB PCIe / A100 Multi-Instance GPU (MIG)

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Download the Qwen2.5 0.5B model and build the engine with INT8 SmoothQuant:

cd examples/qwen
git clone https://huggingface.co/Qwen/Qwen2.5-0.5B
python ./convert_checkpoint.py --model_dir Qwen2.5-0.5B --output_dir ./tmp/sq0.5 --dtype float16 --smoothquant 0.5 --per_token --per_channel --int8_kv_cache
trtllm-build --checkpoint_dir ./tmp/sq0.5 --output_dir ./trt_engines --gemm_plugin float16 --gpt_attention_plugin float16 --max_input_len 384 --max_seq_len 385 --max_batch_size 7 --remove_input_padding enable --context_fmha enable --gather_generation_logits
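
To sanity-check what actually went into the engine, the build settings can be read back from the engine directory. This is a minimal sketch; it assumes the usual TensorRT-LLM layout, where trtllm-build writes a config.json (with "build_config" and "pretrained_config" sections) next to the serialized engine:

import json

# Assumption: trtllm-build wrote ./trt_engines/config.json in the standard
# layout with "pretrained_config" and "build_config" sections.
with open("./trt_engines/config.json", "r") as f:
    cfg = json.load(f)

# The settings most relevant to TTFT at this batch/sequence shape.
print(cfg["build_config"]["max_batch_size"],
      cfg["build_config"]["max_input_len"],
      cfg["build_config"]["max_seq_len"])
print(cfg["pretrained_config"]["quantization"])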

Prepare test data

python ../../benchmarks/cpp/prepare_dataset.py --tokenizer=./Qwen2.5-0.5B/ --stdout token-norm-dist --num-requests=700 --input-mean=384 --output-mean=1 --input-stdev=0 --output-stdev=0 > ./output
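
The test script below reads one JSON record per line and uses its input_ids field, so the prepared file can be sanity-checked quickly. A hedged sketch; given --input-mean=384 and --input-stdev=0 above, every request should carry exactly 384 input tokens:

import json

# Verify the prepared dataset: 700 requests, each with 384 input tokens.
with open("./output", "r", encoding="utf-8") as f:
    lengths = [len(json.loads(line)["input_ids"]) for line in f]

print(len(lengths), "requests; distinct input lengths:", set(lengths))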

Test time to first token

python test_TTFT.py --batch_size 7 --engine_dir ./trt_engines/ --input_file output

The content of test_TTFT.py is as follows:

import argparse
import json
import time

import numpy as np
import torch

import tensorrt_llm
from tensorrt_llm.runtime import ModelRunnerCpp


def main(batch_size, engine_dir, input_file):
    # Each line of the prepared dataset is a JSON record with an "input_ids" field.
    input_ids_list = []
    with open(input_file, "r", encoding="utf-8") as f:
        for line in f:
            input_ids_list.append(json.loads(line)["input_ids"])

    # Split the requests into fixed-size batches (700 requests -> 100 batches of 7).
    batches = [input_ids_list[i:i + batch_size]
               for i in range(0, len(input_ids_list), batch_size)]

    runner = ModelRunnerCpp.from_dir(engine_dir=engine_dir,
                                     max_batch_size=batch_size,
                                     max_input_len=384,
                                     max_output_len=1)

    # Time each batch end to end; with max_new_tokens=1 this measures TTFT.
    time_costs = []
    for batch in batches:
        batch_input_ids = [torch.IntTensor(inp) for inp in batch]
        start = time.perf_counter()
        outputs = runner.generate(batch_input_ids=batch_input_ids,
                                  max_new_tokens=1,
                                  return_dict=True)
        end = time.perf_counter()
        time_costs.append((end - start) * 1000)  # milliseconds

    print("P50 time cost: ", np.percentile(time_costs, 50))
    print("P95 time cost: ", np.percentile(time_costs, 95))


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Measure TTFT with TensorRT-LLM")
    parser.add_argument("--batch_size", type=int, required=True, help="Batch size for processing")
    parser.add_argument("--engine_dir", type=str, required=True, help="Directory of the built engine")
    parser.add_argument("--input_file", type=str, required=True, help="Input file containing the prepared dataset")

    args = parser.parse_args()
    main(args.batch_size, args.engine_dir, args.input_file)
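
One caveat in the loop above: the first generate() call usually pays one-off warm-up costs (CUDA context creation, memory pool growth), which can distort P95 over only 100 batches. A hedged variant of the timing loop in main() that discards an untimed warm-up batch first:

# Run one untimed warm-up batch so one-off initialization cost does not
# leak into the latency percentiles.
warmup_ids = [torch.IntTensor(inp) for inp in batches[0]]
runner.generate(batch_input_ids=warmup_ids, max_new_tokens=1, return_dict=True)

time_costs = []
for batch in batches[1:]:
    batch_input_ids = [torch.IntTensor(inp) for inp in batch]
    start = time.perf_counter()
    runner.generate(batch_input_ids=batch_input_ids,
                    max_new_tokens=1,
                    return_dict=True)
    end = time.perf_counter()
    time_costs.append((end - start) * 1000)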

Expected behavior

We expected that, with a batch size of 7, an input length of 384, and a single output token, such a small model would achieve a lower time to first token than we observe. Do you have any recommendations or insights for improving it?

Actual behavior

A100:
P50 time cost: 14.66 ms
P95 time cost: 15.22 ms

A100 MIG:
P50 time cost: 81.50 ms
P95 time cost: 82.24 ms

Additional notes

transformers version: 4.42.3
TensorRT-LLM version: 0.16.0.dev2024111900
