feature: triton generate support #675
base: main
Conversation
This seems surprisingly lightweight and straightforward.
I think the only piece really missing is an update to the unit tests.
I would also like to set this up for e2e testing.
af8c734 to 1f50b6e
I'd really like to see an E2E test protecting this. Not sure if the CI is ready for that yet, but a high priority ticket should be created at a minimum.
@@ -137,7 +142,7 @@ def _check_conditional_args(
     if args.service_kind != "triton":
         if args.output_tokens_mean_deterministic:
             parser.error(
-                "The --output-tokens-mean-deterministic option is only supported with the Triton service-kind."
+                "The --output-tokens-mean-deterministic option is only supported with the kserve endpoint type."
Should the code be changed to check endpoint_type != kserve? I know that with the current code it is the same result, but it introduces an assumption (endpoint kserve -> service_kind triton) that could trip up a future developer.
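For illustration, a minimal sketch of the check the reviewer is suggesting (the `args.endpoint_type` attribute and surrounding structure are assumed from the diff above, not taken from the actual change):

```python
# Hypothetical sketch of the suggested validation, not code from this PR.
# Checking the user-facing endpoint type directly avoids baking in the
# "kserve endpoint implies triton service-kind" assumption.
if args.endpoint_type != "kserve":
    if args.output_tokens_mean_deterministic:
        parser.error(
            "The --output-tokens-mean-deterministic option is only "
            "supported with the kserve endpoint type."
        )
```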
        default="tensorrtllm",
        required=False,
-       help=f'When using the "triton" service-kind, '
+       help=f'When using the "kserve" endpoint type, '
Can the generate endpoint not use trtllm vs. vllm?
It can - I haven't added any different behavior for the different backends. Actually, it has only been tested against vllm at the moment, so this is a fair point ...
Let me move this back to draft - I plan to test trt-llm in the next week or so.
@@ -63,6 +63,14 @@ namespace openai {
void
ChatCompletionRequest::SendResponse(bool is_final, bool is_null)
{
  // if final response has already been sent
The classes in this file should be renamed since they aren't specific to Chat Completions.
I think "HTTP with SSE support" is in the end what it is ... not sure of the best name.
I'd really like to see the classes refactored. We shouldn't need two independent full http clients. Either one goes away, or we get a base class and then some really thin implementation classes on top. We already have stories for this (TMA-1644), so no big deal if this is ignored in this PR.
@@ -731,7 +748,10 @@ def _extract_openai_text_output(self, response: str) -> str:

    def _is_openai_empty_response(self, response: str) -> bool:
We should change the name of the function since it's no longer just OpenAI.
    )

    endpoint_group.add_argument(
        "--service-kind",
Finally one less CLI option 😄 Can we also update the README to reflect the changes in CLI options?
@nnshah1 what's blocking this PR from being marked ready for review?
It's out of date from some of our other work now and needs to be ported to the new repository at the bare minimum. I really like the additions here, so I would like to see them integrated soon as well.
This PR does two main things:
Add support for Triton's generate endpoint. This reuses the PA implementation of the OpenAI HTTP client, as it supports text in / text out and streaming. The format of the input message is similar to completions, but uses "text_input" and "text_output" instead of "prompt".
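As a rough illustration of the payload difference described above (the field values and the response fields shown are assumptions for illustration; only the "text_input"/"text_output" vs. "prompt" naming comes from this description):

```python
# Illustrative only - not the exact bodies genai-perf produces.

# OpenAI completions-style request body:
completions_request = {
    "model": "my_model",
    "prompt": "Hello, world",
    "stream": True,
}

# Triton generate-style request body: same text-in / text-out idea,
# but the input text travels under "text_input" ...
generate_request = {
    "text_input": "Hello, world",
    "stream": True,
}

# ... and the generated text is returned under "text_output", e.g.
# {"text_output": "Hello to you too!"}
```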
Remove the "service-kind" parameter from the top-level CLI. Service kind can be inferred from endpoint-type, and endpoint-type is clearer: it is tied to the API rather than the implementation. The service kinds "openAI" vs. "triton" were also not parallel, since "openAI" is an API and "triton" is a server. Because the PA implementation is tied to service-kind, this change is only at the genai-perf level; internally, service-kind is still present, it is just set based on endpoint-type (see the sketch below). To facilitate this, a new endpoint-type of kserve was added.
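A minimal sketch of that internal mapping, assuming a helper along these lines (the function name and the treatment of non-kserve endpoint types are assumptions; only the kserve -> triton relationship is stated in this PR):

```python
# Hypothetical sketch, not the actual genai-perf code.
def infer_service_kind(endpoint_type: str) -> str:
    """Derive the internal service-kind from the user-facing endpoint-type."""
    if endpoint_type == "kserve":
        # The kserve endpoint type implies the triton service kind.
        return "triton"
    # Other endpoint types are assumed here to use the OpenAI-style
    # HTTP client, i.e. the "openai" service kind.
    return "openai"
```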
Existing tests have been updated. No new tests were added - that could be done here or as a separate PR.
Note: most changes are in genai-perf, but a small change was added to PA to allow using the end of the request as a completion event even in streaming cases. Since generate doesn't include an explicit done message, we use the end of the request as the indication of done.
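To illustrate the streaming behavior that note describes, here is a rough client-side sketch (the URL follows Triton's generate_stream route; the rest is an assumption for illustration and is not the PA implementation):

```python
import requests

# Hypothetical illustration: stream from Triton's generate endpoint and
# treat the end of the HTTP request as the completion signal.
url = "http://localhost:8000/v2/models/my_model/generate_stream"
payload = {"text_input": "Hello, world", "stream": True}

with requests.post(url, json=payload, stream=True) as resp:
    for line in resp.iter_lines():
        if line:
            # Each SSE event carries a chunk with a "text_output" field.
            print(line.decode())

# There is no explicit "done" event: the stream simply ends, so the close
# of the request itself marks the response as complete.
```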