[QST] How to get the prefill latency and TPOT respectively when using C++ runtime #2500

gujiewen · 2024-11-26T07:08:50Z

Recently, trt_llm v0.16 added support for qwen2vl. However, it uses the C++ runtime, and the C++ runtime seems to handle the whole inference: it takes the input and returns only the final output sequence.

        multi_responses = self.session.await_responses(request_ids)
        responses = [
            response for responses in multi_responses for response in responses
        ]

        return self._fill_output(responses, output_ids, end_id, return_dict,
                                 output_sequence_lengths, output_log_probs,
                                 output_cum_log_probs, batch_input_ids, [],
                                 streaming, request_ids, False, max_new_tokens,
                                 sampling_config, is_draft_target_model)

And in _fill_output:

        for response in responses:
            if response.has_error():
                raise RuntimeError(response.error_msg)

            result = response.result
            batch_idx = request_ids.index(response.request_id)
            if is_beam_search:
                for beam, output_tokens in enumerate(result.output_token_ids):
                    fill_output_ids(output_tokens, batch_idx, beam)
            else:
                fill_output_ids(result.output_token_ids[0], batch_idx,
                                result.sequence_index)

        if output_sequence_lengths:
            sequence_lengths = [[len(token_ids) for token_ids in beams]
                                for beams in output_ids]

So how can I get accurate TTFT and TPOT values separately?


hello-11 · 2024-12-10T05:51:32Z

@gujiewen If you want to get the prefill latency, you can set out_length=1, so that the measured time covers the prefill pass and a single generated token.
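As a rough illustration of that suggestion, the sketch below times a blocking generate call that is limited to one output token. It is not taken from TensorRT-LLM: `generate_fn` is a hypothetical callable that you would write yourself to wrap the runner's generate call, and `fake_generate` is only a stand-in so the example runs without a GPU.

```python
import time
from statistics import median


def time_generate(generate_fn, max_new_tokens, warmup=1, iters=5):
    """Return the median wall-clock time (seconds) of generate_fn(max_new_tokens).

    generate_fn is a hypothetical wrapper (an assumption, not a library API)
    that must block until the full output sequence has been returned.
    """
    # Warm-up runs are discarded so one-time setup cost is not measured.
    for _ in range(warmup):
        generate_fn(max_new_tokens)

    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        generate_fn(max_new_tokens)
        samples.append(time.perf_counter() - start)
    return median(samples)


def fake_generate(max_new_tokens):
    # Stand-in for a real model call: 20 ms "prefill" plus 5 ms per token.
    time.sleep(0.02 + 0.005 * max_new_tokens)


# With one output token, the measured time approximates the prefill latency.
prefill_latency = time_generate(fake_generate, max_new_tokens=1)
print(f"approx. prefill latency: {prefill_latency * 1000:.1f} ms")
```

The median is used instead of the mean so that a single slow iteration does not skew the result.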

gujiewen · 2024-12-13T09:53:13Z

> @gujiewen if you want to get the prefill latency, you can set the out_length=1.

How about the TPOT?
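One common way to estimate TPOT without per-token timestamps, sketched here under assumptions rather than as a confirmed TensorRT-LLM method, is to time two blocking runs on the same input: one that generates 1 token and one that generates N tokens. The extra time is then divided by the N - 1 extra tokens. The function name and the numbers below are hypothetical, and the estimate only holds if the second run really produces N tokens (no early stop at the end token).

```python
def estimate_tpot(latency_one_token, latency_n_tokens, n_tokens):
    """Estimate time per output token (seconds) from two end-to-end latencies.

    latency_one_token: wall-clock time of a run that generates 1 token,
                       which approximates the prefill latency (TTFT).
    latency_n_tokens:  wall-clock time of a run that generates n_tokens tokens.
    """
    if n_tokens < 2:
        raise ValueError("n_tokens must be at least 2")
    # The decode phase of the longer run produced n_tokens - 1 extra tokens.
    return (latency_n_tokens - latency_one_token) / (n_tokens - 1)


# Illustrative numbers only, not measurements from a real model.
tpot = estimate_tpot(latency_one_token=0.120, latency_n_tokens=1.390, n_tokens=128)
print(f"approx. TPOT: {tpot * 1000:.2f} ms")
```

It is worth checking the actual output length of the longer run before trusting the result, since a shorter-than-requested sequence would make the TPOT look smaller than it is.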

gujiewen changed the title ~~[QST] How to get the prefill time and decode speed respectively when using C++ runtime~~ [QST] How to get the prefill latency and TPOT respectively when using C++ runtime Nov 26, 2024

hello-11 added the question Further information is requested label Dec 10, 2024
