
[Performance] TTFT of qwen2.5 0.5B model #2598

Open · 4 tasks
ReginaZh opened this issue Dec 20, 2024 · 0 comments
Labels
bug Something isn't working

ReginaZh (Contributor) commented Dec 20, 2024

System Info

A100 80GB PCIe / A100 Multi-Instance GPU (MIG)

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Download the Qwen2.5 0.5B model and build the engine with INT8 SmoothQuant:

cd examples/qwen
git clone https://huggingface.co/Qwen/Qwen2.5-0.5B
python ./convert_checkpoint.py --model_dir Qwen2.5-0.5B --output_dir ./tmp/sq0.5 --dtype float16 --smoothquant 0.5 --per_token --per_channel --int8_kv_cache
trtllm-build --checkpoint_dir ./tmp/sq0.5 --output_dir ./trt_engines --gemm_plugin float16 --gpt_attention_plugin float16 --max_input_len 384 --max_seq_len 385 --max_batch_size 7 --remove_input_padding enable --context_fmha enable --gather_generation_logits
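As a quick sanity check (editor's addition, not part of the original report), the build settings baked into the engine can be confirmed by reading the config.json that trtllm-build writes into the output directory; a minimal sketch, assuming the usual layout with a build_config section:

import json

# Sketch: print the build knobs relevant to this benchmark from the
# config.json that trtllm-build places next to the engine file.
with open("./trt_engines/config.json") as f:
    config = json.load(f)

build = config.get("build_config", {})
print("max_batch_size:", build.get("max_batch_size"))  # expected: 7
print("max_input_len:", build.get("max_input_len"))    # expected: 384
print("max_seq_len:", build.get("max_seq_len"))        # expected: 385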

Prepare test data:

python ../../benchmarks/cpp/prepare_dataset.py --tokenizer=./Qwen2.5-0.5B/ --stdout token-norm-dist --num-requests=700 --input-mean=384 --output-mean=1 --input-stdev=0 --output-stdev=0 > ./output
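For reference, each line of ./output is one JSON record, and the test script below only consumes its input_ids field. A quick way to peek at the first record (illustrative, not part of the original steps):

import json

# Sketch: inspect the first prepared request. With --input-mean=384 and
# --input-stdev=0, every request should carry exactly 384 input tokens.
with open("./output", "r", encoding="utf-8") as f:
    first = json.loads(f.readline())

print(sorted(first.keys()))     # available fields in one record
print(len(first["input_ids"]))  # expected: 384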

Test time to first token:

python test_TTFT.py --batch_size 7 --engine_dir ./trt_engines/ --input_file output

The content of test_TTFT.py is as follows:

import json
import time
import argparse
import numpy as np
import tensorrt_llm
from tensorrt_llm.runtime import ModelRunnerCpp
import torch

def main(batch_size, engine_dir, input_file):
    input_ids_list = []
    with open(input_file, "r", encoding="utf-8") as f:
        for line in f:  # one JSON record per line
            input_ids_list.append(json.loads(line)["input_ids"])

    batches = [input_ids_list[i:i + batch_size] for i in range(0, len(input_ids_list), batch_size)]

    runner_kwargs = dict(
        engine_dir=engine_dir,
        max_batch_size=batch_size,
        max_input_len=384,
        max_output_len=1,
    )
    runner = ModelRunnerCpp.from_dir(**runner_kwargs)

    time_costs = []
    for batch in batches:
        batch_input_ids = [torch.IntTensor(inp) for inp in batch]
        start = time.perf_counter()
        # A single generation step (max_new_tokens=1): the wall time of this
        # call is taken as the time to first token for the whole batch.
        outputs = runner.generate(
            batch_input_ids=batch_input_ids,
            max_new_tokens=1,
            return_dict=True,
        )
        end = time.perf_counter()
        time_costs.append((end - start) * 1000)  # milliseconds

    print("P50 time cost: ", np.percentile(time_costs, 50))
    print("P95 time cost: ", np.percentile(time_costs, 95))

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Generate data using TensorRT LLM")
    parser.add_argument("--batch_size", type=int, required=True, help="Batch size for processing")
    parser.add_argument("--engine_dir", type=str, required=True, help="Directory of the engine")
    parser.add_argument("--input_file", type=str, required=True, help="Input file containing data")

    args = parser.parse_args()
    main(args.batch_size, args.engine_dir, args.input_file)
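One measurement caveat (an editor's observation, not from the original report): the first timed batch also pays one-time warmup costs such as CUDA context setup and memory-pool allocation, which can inflate the tail percentiles when there are only 100 batches. A possible refinement, reusing the runner and batches from the script above, is to run a couple of untimed warmup batches before the timing loop:

# Sketch (editor's refinement): insert before the timing loop in main().
# Runs two untimed batches so one-time warmup costs do not land in the
# measured TTFT distribution.
for batch in batches[:2]:
    runner.generate(
        batch_input_ids=[torch.IntTensor(inp) for inp in batch],
        max_new_tokens=1,
        return_dict=True,
    )
torch.cuda.synchronize()  # ensure warmup work is finished before timing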

Expected behavior

We expect that with a batch size of 7, an input length of 384, and a single output token, the engine's time to first token should be lower than what we measured. Do you have any recommendations or insights?
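For context, a rough back-of-the-envelope bound (editor's estimate; the parameter count, the 2 ops-per-parameter-per-token rule of thumb, and the A100 INT8 peak are assumptions, not measurements) puts a floor under the prefill time:

# Sketch: compute-bound lower bound on prefill TTFT for this workload.
# Assumptions: ~0.49e9 parameters for Qwen2.5-0.5B, ~2 ops per parameter
# per token, ~624 TOPS dense INT8 peak on a full A100.
params = 0.49e9
tokens = 7 * 384                # batch_size * input length = 2688
ops = 2 * params * tokens       # ~2.6e12 operations for one prefill
peak = 624e12                   # ops per second
print(f"ideal prefill: {ops / peak * 1e3:.1f} ms")  # ~4.2 ms

The measured P50 on the full A100 is a few times above this idealized bound.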

Actual behavior

A100:
P50 time cost: 14.66 ms
P95 time cost: 15.22 ms

A100 MIG:
P50 time cost: 81.50 ms
P95 time cost: 82.24 ms

Additional notes

transformers version: 4.42.3
TensorRT-LLM version: 0.16.0.dev2024111900
