[V1][Metrics] Add TTFT and TPOT histograms #12530
Compute these intervals on the API server side as EngineCoreOutputs are processed.

The OutputProcessor needs to track per-request arrival times and the timestamp of the most recent token generation.

Prefill intervals are calculated when the first EngineCoreOutput is received, indicating prefill completion.

The histogram buckets match the v0 implementation, but they should probably be improved in the future.

Signed-off-by: Mark McLoughlin <[email protected]>
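To make the bookkeeping described above concrete, here is a minimal sketch, not the code from this PR. The names `RequestTimes`, `IntervalTracker`, `add_request` and `on_output` are hypothetical, timestamps are passed in explicitly so the example is deterministic, and it assumes one TPOT observation per output received after the first.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class RequestTimes:
    """Per-request timing state (hypothetical name, not from the PR)."""
    arrival_time: float
    last_token_time: float = 0.0
    saw_first_output: bool = False


@dataclass
class IntervalTracker:
    """Collects TTFT and TPOT observations as outputs arrive (illustrative only)."""
    requests: Dict[str, RequestTimes] = field(default_factory=dict)
    ttft: List[float] = field(default_factory=list)
    tpot: List[float] = field(default_factory=list)

    def add_request(self, request_id: str, arrival_time: float) -> None:
        # Remember when the request arrived so TTFT can be computed later.
        self.requests[request_id] = RequestTimes(arrival_time=arrival_time)

    def on_output(self, request_id: str, now: float) -> None:
        times = self.requests[request_id]
        if not times.saw_first_output:
            # First output for this request: prefill has completed.
            self.ttft.append(now - times.arrival_time)
            times.saw_first_output = True
        else:
            # Later outputs: interval since the previous token generation.
            self.tpot.append(now - times.last_token_time)
        times.last_token_time = now


tracker = IntervalTracker()
tracker.add_request("req-0", arrival_time=10.0)
for ts in (10.5, 10.75, 11.0):
    tracker.on_output("req-0", now=ts)
```

In a real metrics pipeline the collected intervals would be observed into histograms rather than appended to lists; the lists here only keep the sketch self-contained.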
@@ -48,6 +52,8 @@ def update_from_output(self, output: "EngineCoreOutput",
            return

        num_new_generation_tokens = len(output.new_token_ids)
        now = time.time()
I have seen mixed use of time.time() and time.monotonic() - I think we should standardize on one or the other so we don't have a footgun down the road
Yes!
I had initially changed to time.monotonic() but I backed it out because it looks like arrival_time being wall-clock time is part of the public API, so not something I wanted to rush into.
I'll make sure to do it or capture it as a TODO/FIXME before I wrap this up 👍
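For context on the trade-off discussed here, a small sketch of the two clocks in the Python standard library. `measure_interval` is a hypothetical helper, not something in vLLM. `time.time()` returns wall-clock seconds since the epoch and can jump if the system clock is adjusted, while `time.monotonic()` never goes backwards but its values are only meaningful as differences.

```python
import time
from typing import Callable


def measure_interval(work: Callable[[], None]) -> float:
    """Return how long `work` took, using the monotonic clock."""
    start = time.monotonic()
    work()
    # Differences of monotonic readings are never negative.
    return time.monotonic() - start


# Wall-clock timestamp: suitable for recording *when* something happened.
arrival_time = time.time()

# Monotonic interval: suitable for measuring *how long* something took.
elapsed = measure_interval(lambda: time.sleep(0.01))
```

Mixing the two, for example subtracting a time.time() value from a time.monotonic() value, gives a meaningless result, which is presumably the footgun being referred to.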
Signed-off-by: Mark McLoughlin <[email protected]>
Follow on from #12516, part of #10582