[QST] How to get the prefill latency and TPOT respectively when using the C++ runtime #2500

Open
gujiewen opened this issue Nov 26, 2024 · 2 comments
Labels
question Further information is requested

Comments

@gujiewen

TensorRT-LLM (trt_llm) v0.16 recently added support for Qwen2-VL. However, it uses the C++ runtime, and the C++ runtime seems to handle the whole inference: it takes the input and returns only the final output sequence.

        multi_responses = self.session.await_responses(request_ids)
        responses = [
            response for responses in multi_responses for response in responses
        ]

        return self._fill_output(responses, output_ids, end_id, return_dict,
                                 output_sequence_lengths, output_log_probs,
                                 output_cum_log_probs, batch_input_ids, [],
                                 streaming, request_ids, False, max_new_tokens,
                                 sampling_config, is_draft_target_model)

And in _fill_output:

        for response in responses:
            if response.has_error():
                raise RuntimeError(response.error_msg)

            result = response.result
            batch_idx = request_ids.index(response.request_id)
            if is_beam_search:
                for beam, output_tokens in enumerate(result.output_token_ids):
                    fill_output_ids(output_tokens, batch_idx, beam)
            else:
                fill_output_ids(result.output_token_ids[0], batch_idx,
                                result.sequence_index)

        if output_sequence_lengths:
            sequence_lengths = [[len(token_ids) for token_ids in beams]
                                for beams in output_ids]

So how can I get accurate TTFT and TPOT measurements separately?

@gujiewen changed the title from "[QST] How to get the prefill time and decode speed respectively when using C++ runtime" to "[QST] How to get the prefill latency and TPOT respectively when using C++ runtime" on Nov 26, 2024
@hello-11
Collaborator

@gujiewen If you want to get the prefill latency, you can set out_length=1, so that the measured time covers the prefill plus a single generated token.
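A minimal sketch of that measurement, not the project's official method. It assumes you wrap your runner call in a hypothetical callable `generate(max_new_tokens)` that blocks until the output is ready (the name and signature are illustrative, not part of TensorRT-LLM). With one output token, the wall-clock time approximates TTFT.

```python
import time
from statistics import median
from typing import Callable


def measure_prefill_latency(generate: Callable[[int], object],
                            warmup: int = 1,
                            runs: int = 5) -> float:
    """Return the median wall-clock seconds of generate(1).

    `generate` is a hypothetical wrapper around your own runner call; it
    must block until the response is complete. With a single output token
    the elapsed time is roughly prefill latency (TTFT).
    """
    # Warm-up runs are excluded so one-time setup cost does not skew timing.
    for _ in range(warmup):
        generate(1)

    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        generate(1)
        samples.append(time.perf_counter() - start)
    return median(samples)
```

If the wrapped call returns before the GPU work has finished, the timing will be too short, so make sure the wrapper waits for the result (for example by reading the output tokens) before returning.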

@hello-11 added the question label on Dec 10, 2024
@gujiewen
Author

@gujiewen if you want to get the prefill latency, you can set the out_length=1.

How about the TPOT?
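One possible way to estimate TPOT without per-token timestamps, offered as a sketch under assumptions rather than a confirmed answer: time one run with a single output token and one run with many, then divide the difference by the number of extra tokens. The callable `generate(max_new_tokens)` is hypothetical; here it is assumed to block until done and to return the number of tokens actually generated (excluding the prompt, and possibly fewer than requested if the end token is hit early).

```python
import time
from typing import Callable


def tpot_from_timings(t_first: float, t_total: float,
                      num_output_tokens: int) -> float:
    """Average seconds per output token after the first one.

    t_first: latency of a run that generates one token (approximate TTFT).
    t_total: latency of a run that generates num_output_tokens tokens.
    """
    if num_output_tokens < 2:
        raise ValueError("need at least 2 output tokens to estimate TPOT")
    return (t_total - t_first) / (num_output_tokens - 1)


def measure_tpot(generate: Callable[[int], int], max_new_tokens: int) -> float:
    """Estimate TPOT from two timed runs on the same prompt.

    `generate` is a hypothetical wrapper around your runner that returns
    the count of generated tokens.
    """
    start = time.perf_counter()
    generate(1)
    t_first = time.perf_counter() - start

    start = time.perf_counter()
    produced = generate(max_new_tokens)
    t_total = time.perf_counter() - start

    return tpot_from_timings(t_first, t_total, produced)
```

This gives an average over the decode phase, not a per-step distribution, and it is sensitive to run-to-run noise, so warming up and repeating the measurement is advisable.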
