trt_llm v0.16 recently added support for qwen2vl. However, it uses the C++ runtime, and the C++ runtime seems to handle the whole inference: it takes the input and returns only the final output sequence.
    multi_responses = self.session.await_responses(request_ids)
    responses = [
        response for responses in multi_responses for response in responses
    ]
    return self._fill_output(responses, output_ids, end_id, return_dict,
                             output_sequence_lengths, output_log_probs,
                             output_cum_log_probs, batch_input_ids, [],
                             streaming, request_ids, False, max_new_tokens,
                             sampling_config, is_draft_target_model)
and in `_fill_output`:
    for response in responses:
        if response.has_error():
            raise RuntimeError(response.error_msg)
        result = response.result
        batch_idx = request_ids.index(response.request_id)
        if is_beam_search:
            for beam, output_tokens in enumerate(result.output_token_ids):
                fill_output_ids(output_tokens, batch_idx, beam)
        else:
            fill_output_ids(result.output_token_ids[0], batch_idx,
                            result.sequence_index)
    if output_sequence_lengths:
        sequence_lengths = [[len(token_ids) for token_ids in beams]
                            for beams in output_ids]
So how can I get accurate TTFT (time to first token) and TPOT (time per output token) separately?
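To make the two metrics concrete, here is a minimal sketch of how they could be computed from token arrival times. It assumes the tokens can be observed as they are produced, through some iterable that yields the newly generated token ids at each step. The `streaming` argument passed to `_fill_output` above suggests such a mode may exist, but whether it works for qwen2vl is not something the snippets confirm. `measure_ttft_tpot`, `LatencyStats`, and the `stream` argument are hypothetical names and not part of trt_llm.

```python
import time
from typing import Callable, Iterable, NamedTuple, Optional, Sequence


class LatencyStats(NamedTuple):
    ttft: float            # seconds from start of iteration to the first token
    tpot: Optional[float]  # mean seconds per token after the first, or None
    num_tokens: int        # total number of generated tokens observed


def measure_ttft_tpot(
    stream: Iterable[Sequence[int]],
    clock: Callable[[], float] = time.perf_counter,
) -> LatencyStats:
    """Time a token stream (hypothetical helper, not a trt_llm API).

    Assumption: `stream` is lazy, so the request is only submitted when
    iteration begins; otherwise take the start time before submission.
    """
    start = clock()
    first_token_time = None
    last_token_time = start
    num_tokens = 0

    for new_tokens in stream:
        if len(new_tokens) == 0:
            continue  # ignore empty updates
        now = clock()
        if first_token_time is None:
            first_token_time = now  # prefill ends when the first token arrives
        last_token_time = now
        num_tokens += len(new_tokens)

    if first_token_time is None:
        raise ValueError("stream produced no tokens")

    ttft = first_token_time - start
    tpot = None
    if num_tokens > 1:
        # Decode time spread over the tokens that follow the first one.
        tpot = (last_token_time - first_token_time) / (num_tokens - 1)
    return LatencyStats(ttft, tpot, num_tokens)
```

If a single update carries several tokens, the TPOT value is only an average over that batch of tokens. Without streaming, only the end-to-end latency is observable from Python, so the split between prefill and decode would have to come from timings inside the runtime itself.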
gujiewen changed the title from "[QST] How to get the prefill time and decode speed respectively when using C++ runtime" to "[QST] How to get the prefill latency and TPOT respectively when using C++ runtime" on Nov 26, 2024