feature: triton generate support #675
base: main
Conversation
This seems surprisingly lightweight and straightforward.
I think the only piece really missing is an update to the unit tests.
I would also like to set this up for e2e testing.
af8c734 to 1f50b6e
I'd really like to see an E2E test protecting this. Not sure if the CI is ready for that yet, but a high priority ticket should be created at a minimum.
@@ -137,7 +142,7 @@ def _check_conditional_args(
     if args.service_kind != "triton":
         if args.output_tokens_mean_deterministic:
             parser.error(
-                "The --output-tokens-mean-deterministic option is only supported with the Triton service-kind."
+                "The --output-tokens-mean-deterministic option is only supported with the kserve endpoint type."
Should the code be changed to check endpoint_type != kserve? I know that with the current code it is the same result, but it introduces an assumption (endpoint kserve -> service_kind triton) that could trip up a future developer.
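For illustration, a minimal sketch of the check the reviewer is suggesting (the `args.endpoint_type` attribute and surrounding structure are assumed from the diff above, not taken from the actual change):

```python
# Hypothetical sketch of the suggested validation, not code from this PR.
# Checking the user-facing endpoint type directly avoids baking in the
# "kserve endpoint implies triton service-kind" assumption.
if args.endpoint_type != "kserve":
    if args.output_tokens_mean_deterministic:
        parser.error(
            "The --output-tokens-mean-deterministic option is only "
            "supported with the kserve endpoint type."
        )
```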
        default="tensorrtllm",
        required=False,
-       help=f'When using the "triton" service-kind, '
+       help=f'When using the "kserve" endpoint type, '
Can the generate endpoint not use trtllm vs. vllm?
It can - I haven't added any different behavior for the different backends. Actually, it has only been tested against vllm at the moment, so this is a fair point ...
Let me move this back to draft - I plan to test trt-llm in the next week or so.
@@ -63,6 +63,14 @@ namespace openai {
void
ChatCompletionRequest::SendResponse(bool is_final, bool is_null)
{
  // if final response has already been sent
The classes in this file should be renamed since they aren't specific to Chat Completions.
I think "HTTP with SSE support" is in the end what it is ... not sure of the best name.
I'd really like to see the classes refactored. We shouldn't need two independent full http clients. Either one goes away, or we get a base class and then some really thin implementation classes on top. We already have stories for this (TMA-1644), so no big deal if this is ignored in this PR.
@@ -731,7 +748,10 @@ def _extract_openai_text_output(self, response: str) -> str:

    def _is_openai_empty_response(self, response: str) -> bool:
We should change the name of the function since it's no longer just OpenAI.
    )

    endpoint_group.add_argument(
        "--service-kind",
Finally one less CLI option 😄 Can we also update the README to reflect the changes in CLI options?
@nnshah1 what's blocking this PR from being marked ready for review?
It's out of date from some of our other work now and needs to be ported to the new repository at the bare minimum. I really like the additions here, so I would like to see them integrated soon as well.
This PR does two main things:
Add support for Triton's generate endpoint. This reuses the PA implementation of the OpenAI HTTP client, as it supports text in / text out and streaming. The format of the input message is similar to completions, but uses "text_input" and "text_output" instead of "prompt".
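As a rough illustration of the payload difference described above (the field values and the response fields shown are assumptions for illustration; only the "text_input"/"text_output" vs. "prompt" naming comes from this description):

```python
# Illustrative only - not the exact bodies genai-perf produces.

# OpenAI completions-style request body:
completions_request = {
    "model": "my_model",
    "prompt": "Hello, world",
    "stream": True,
}

# Triton generate-style request body: same text-in / text-out idea,
# but the input text travels under "text_input" ...
generate_request = {
    "text_input": "Hello, world",
    "stream": True,
}

# ... and the generated text is returned under "text_output", e.g.
# {"text_output": "Hello to you too!"}
```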
Remove the "service-kind" parameter from the top-level CLI. Service kind can be inferred from endpoint-type, and endpoint-type is clearer: it is tied to the API rather than the implementation. The service kinds "openAI" vs. "triton" were also not parallel, since "openAI" is an API and "triton" is a server. Because the PA implementation is tied to service-kind, this change is only at the genai-perf level; internally, service-kind is still present, it is just set based on endpoint-type (see the sketch below). To facilitate this, a new endpoint-type of kserve was added.
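A minimal sketch of that internal mapping, assuming a helper along these lines (the function name and the treatment of non-kserve endpoint types are assumptions; only the kserve -> triton relationship is stated in this PR):

```python
# Hypothetical sketch, not the actual genai-perf code.
def infer_service_kind(endpoint_type: str) -> str:
    """Derive the internal service-kind from the user-facing endpoint-type."""
    if endpoint_type == "kserve":
        # The kserve endpoint type implies the triton service kind.
        return "triton"
    # Other endpoint types are assumed here to use the OpenAI-style
    # HTTP client, i.e. the "openai" service kind.
    return "openai"
```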
Existing tests have been updated. No new tests were added - that could be done here or as a separate PR.
Note: most changes are in genai-perf, but a small change was added to PA to allow using the end of the request as a completion event even in streaming cases. Since generate doesn't include an explicit done message, we use the end of the request as the indication of done.
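To illustrate the streaming behavior that note describes, here is a rough client-side sketch (the URL follows Triton's generate_stream route; the rest is an assumption for illustration and is not the PA implementation):

```python
import requests

# Hypothetical illustration: stream from Triton's generate endpoint and
# treat the end of the HTTP request as the completion signal.
url = "http://localhost:8000/v2/models/my_model/generate_stream"
payload = {"text_input": "Hello, world", "stream": True}

with requests.post(url, json=payload, stream=True) as resp:
    for line in resp.iter_lines():
        if line:
            # Each SSE event carries a chunk with a "text_output" field.
            print(line.decode())

# There is no explicit "done" event: the stream simply ends, so the close
# of the request itself marks the response as complete.
```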