import argparse
import json
import time

import numpy as np
import torch

import tensorrt_llm
from tensorrt_llm.runtime import ModelRunnerCpp


def main(batch_size, engine_dir, input_file):
    # Load tokenized prompts: one JSON object per line, each with an "input_ids" field.
    input_ids_list = []
    with open(input_file, "r", encoding="utf-8") as f:
        for line in f:
            input_ids_list.append(json.loads(line)["input_ids"])

    # Split the prompts into fixed-size batches.
    batches = [input_ids_list[i:i + batch_size]
               for i in range(0, len(input_ids_list), batch_size)]

    runner_kwargs = dict(
        engine_dir=engine_dir,
        max_batch_size=batch_size,
        max_input_len=384,
        max_output_len=1,
    )
    runner = ModelRunnerCpp.from_dir(**runner_kwargs)

    # Generating a single new token per batch measures time to first token (TTFT).
    time_costs = []
    for batch in batches:
        batch_input_ids = [torch.IntTensor(inp) for inp in batch]
        start = time.perf_counter()
        outputs = runner.generate(
            batch_input_ids=batch_input_ids,
            max_new_tokens=1,
            return_dict=True,
        )
        end = time.perf_counter()
        time_costs.append((end - start) * 1000)  # milliseconds

    print("P50 time cost: ", np.percentile(time_costs, 50))
    print("P95 time cost: ", np.percentile(time_costs, 95))


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Measure time to first token using TensorRT-LLM")
    parser.add_argument("--batch_size", type=int, required=True, help="Batch size for processing")
    parser.add_argument("--engine_dir", type=str, required=True, help="Directory of the built engine")
    parser.add_argument("--input_file", type=str, required=True, help="JSONL file containing tokenized input_ids")
    args = parser.parse_args()
    main(args.batch_size, args.engine_dir, args.input_file)
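Assuming the script above is saved as test_TTFT.py, it would be invoked along the lines of python test_TTFT.py --batch_size 7 --engine_dir <engine_dir> --input_file <input_file> to produce the measurements reported below.

One caveat about the measurement: the timed loop includes the very first generate() call, which typically pays one-time initialization costs and can therefore inflate the P95 figure. A minimal sketch of an untimed warm-up pass that could be placed inside main() just before the timed loop (this warm-up logic is our addition, not part of the original script):

# Hypothetical warm-up pass: run one untimed batch so that one-time setup
# costs (memory pool allocation, kernel selection) do not skew the percentiles.
if batches:
    warmup_ids = [torch.IntTensor(inp) for inp in batches[0]]
    runner.generate(batch_input_ids=warmup_ids, max_new_tokens=1, return_dict=True)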
System Info
A100 80GB PCIe / A100 Multi-Instance GPU (MIG)
Who can help?
No response
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
1. Download the Qwen2.5 0.5B model and build the engine using INT8 SmoothQuant.
2. Prepare the test data (a sketch of this step follows below).
3. Measure the time to first token (TTFT).

The content of test_TTFT is the script shown at the top of this issue.
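For step 2, test_TTFT expects a JSONL file in which every line is a JSON object with an "input_ids" field. A minimal sketch of how such a file could be prepared, assuming the Qwen2.5-0.5B tokenizer from transformers (the prompts and the file name test_data.jsonl are placeholders, not from the original setup):

import json

from transformers import AutoTokenizer

# Hypothetical data-preparation step: tokenize benchmark prompts and write
# them out as JSONL, one {"input_ids": [...]} object per line.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")
prompts = ["An example benchmark prompt."]  # placeholder prompts

with open("test_data.jsonl", "w", encoding="utf-8") as f:
    for prompt in prompts:
        ids = tokenizer(prompt, truncation=True, max_length=384)["input_ids"]
        f.write(json.dumps({"input_ids": ids}) + "\n")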
Expected behavior
We expected faster inference with a batch size of 7, a sequence length of 384, and a single output token; in particular, we did not expect the MIG instance to be so much slower than the full A100. Do you have any recommendations or insights?
actual behavior
A100:
P50 time cost: 14.663105830550194ms
P95 time cost: 15.2217005379498ms
A100 MIG:
P50 time cost: 81.50463068708777ms
P95 time cost: 82.2397278137505ms
additional notes
transformers version: 4.42.3
TensorRT-LLM version: 0.16.0.dev2024111900