
feat: Support sending additional outputs from vLLM inference #70

Open

kthui wants to merge 12 commits into main from jacky-vllm-additional-outputs

Conversation

@kthui (Contributor) commented on Nov 2, 2024

What does the PR do?

Add support for sending additional outputs from vLLM. At this step, the following three outputs are added (see the client-side sketch after this list):

  • finish reason
  • cumulative log probabilities
  • number of token ids
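
For illustration only, a minimal Triton gRPC client sketch that requests the three additional outputs. The tensor names (`text_input`, `stream`, `return_finish_reason`, `return_cumulative_logprob`, `return_num_output_tokens`) and output names follow the naming settled on later in this review, and the model name is a placeholder; this is an assumed sketch, not a verified API reference.

```python
# Hypothetical client sketch -- tensor/output names and the model name are
# assumptions based on this PR's discussion, not a verified reference.
import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient("localhost:8001")

def bool_input(name, value):
    # Build a single-element BOOL input tensor.
    tensor = grpcclient.InferInput(name, [1], "BOOL")
    tensor.set_data_from_numpy(np.array([value], dtype=bool))
    return tensor

text = grpcclient.InferInput("text_input", [1], "BYTES")
text.set_data_from_numpy(np.array(["What is Triton?".encode()], dtype=np.object_))

inputs = [
    text,
    bool_input("stream", False),
    bool_input("return_finish_reason", True),
    bool_input("return_cumulative_logprob", True),
    bool_input("return_num_output_tokens", True),
]

result = client.infer(model_name="vllm_model", inputs=inputs)
print(result.as_numpy("text_output"))
print(result.as_numpy("finish_reason"))
print(result.as_numpy("cumulative_logprob"))
print(result.as_numpy("num_output_tokens"))
```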

Checklist

  • PR title reflects the change and is of format <commit_type>: <Title>
  • Changes are described in the pull request.
  • Related issues are referenced.
  • Populated github labels field
  • Added test plan and verified test passes.
  • Verified that the PR passes existing CI.
  • Verified copyright is correct on all changed files.
  • Added succinct git squash message before merging ref.
  • All template sections are filled out.
  • Optional: Additional screenshots for behavior/output changes with before/after.

Commit Type:

Check the conventional commit type box here and add the label to the GitHub PR.

  • build
  • ci
  • docs
  • feat
  • fix
  • perf
  • refactor
  • revert
  • style
  • test

Related PRs:

N/A

Where should the reviewer start?

N/A

Test plan:

A new test is added with this PR that covers all combinations of the three additional output flags and verifies the outputs are valid for each combination (see the sketch below).

  • CI Pipeline ID: 20077450
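
A rough sketch of how such an exhaustive combination sweep could be structured; the flag names follow the `return_*` convention discussed later in this review and are assumptions, not the actual test code.

```python
# Hypothetical sketch of sweeping all combinations of the three flags; the
# flag names are assumptions from this PR's discussion, not the test file.
import itertools

flags = [
    "return_finish_reason",
    "return_cumulative_logprob",
    "return_num_output_tokens",
]
for values in itertools.product([False, True], repeat=len(flags)):
    request_flags = dict(zip(flags, values))
    # Build a request with these BOOL input tensors set, send it, and verify
    # that each requested output is present and well-formed.
    print(request_flags)
```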

Caveats:

N/A

Background

Outputs supported by vLLM in addition to text output: https://github.com/vllm-project/vllm/blob/v0.6.3.post1/vllm/outputs.py#L14-L40
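
For orientation, a simplified sketch of where the three additional outputs come from on vLLM's RequestOutput/CompletionOutput structure linked above (placeholder model; not the actual model.py code, and `cumulative_logprob` may be None depending on sampling parameters and vLLM version):

```python
# Rough sketch of the source fields for the three additional outputs,
# assuming the vLLM 0.6.3.post1 RequestOutput/CompletionOutput structure.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # placeholder model
request_output = llm.generate(["Hello"], SamplingParams(max_tokens=8))[0]

for completion in request_output.outputs:  # one CompletionOutput per sequence
    finish_reason = completion.finish_reason          # e.g. "length" or "stop"
    cumulative_logprob = completion.cumulative_logprob  # may be None
    num_output_tokens = len(completion.token_ids)     # generated token ids only
    print(finish_reason, cumulative_logprob, num_output_tokens)
```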

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

N/A

* [WIP] Add additional outputs to auto complete

* [WIP] Use individual input tensor to control per additional output

* [WIP] Parse additional output flags from request
@kthui kthui force-pushed the jacky-vllm-additional-outputs branch from 264d387 to 9fc7d0b on November 2, 2024 00:21
@kthui kthui self-assigned this Nov 2, 2024
* Add additional outputs test

* Update copyright

* Some test enhancement and notes
@kthui kthui force-pushed the jacky-vllm-additional-outputs branch from 9fc7d0b to 5e605ca on November 2, 2024 04:04
@kthui kthui added the PR: feat (A new feature) label on Nov 4, 2024
@kthui kthui changed the title from "Support sending additional outputs from vLLM inference" to "feat: Support sending additional outputs from vLLM inference" on Nov 4, 2024
@kthui kthui marked this pull request as ready for review on November 4, 2024 22:53

# Additional Outputs from vLLM
Contributor

Excellent documentation, and a concise reference from the top-level README!

Comment on lines 143 to 159
# TODO: vLLM may return token ids identical to the previous one when
# streaming, for example:
#
# prev: None
# curr: text=' the', token_ids=array('l', [5])
#
# prev: text=' the', token_ids=array('l', [5, 1385])
# curr: text=' the term', token_ids=array('l', [5, 1385])
#
# prev: text=' the term', token_ids=array('l', [5, 1385, 44])
# curr: text=' the term', token_ids=array('l', [5, 1385, 44])
#
# prev: text=' the term', token_ids=array('l', [5, 1385, 44, 48])
# curr: text=' the term “', token_ids=array('l', [5, 1385, 44, 48])
#
# If this is no longer the case in a future release, change the assert
# to assert num_token_ids > 0.
Contributor

I'm not totally following this section. Can you show me an example output from calling /generate_stream on a vllm model with output_num_token_ids=True and stream=True?
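
For reference, such a request could be issued roughly as follows. The endpoint path, model name, prompt, and field names (`output_num_token_ids`, `sampling_parameters`) are assumptions based on Triton's generate extension and this PR's original `output_*` naming, not a verified reference.

```python
# Hypothetical streaming request sketch; model name and field names assumed.
import json
import requests

url = "http://localhost:8000/v2/models/vllm_model/generate_stream"
payload = {
    "text_input": "The world is",
    "stream": True,
    "output_num_token_ids": True,
    "sampling_parameters": json.dumps({"max_tokens": 16}),
}
with requests.post(url, json=payload, stream=True) as response:
    for line in response.iter_lines():
        if line:  # each non-empty line is a server-sent "data: {...}" event
            print(line.decode())
```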

Contributor Author

On 0.5.3.post1:

         text: ' the'
num_token_ids: 1
-----
         text: ' term'
num_token_ids: 1
-----
         text: ''
num_token_ids: 1
-----
         text: ' “'
num_token_ids: 1
-----
         text: 'p'
num_token_ids: 1
-----
         text: 'olar'
num_token_ids: 1
-----
         text: 'ized'
num_token_ids: 1
-----
         text: ''
num_token_ids: 1
-----
         text: '”'
num_token_ids: 1
-----
         text: ' is'
num_token_ids: 1
-----
         text: ' used'
num_token_ids: 1
-----
         text: ' to'
num_token_ids: 1
-----
         text: ' refer'
num_token_ids: 1
-----
         text: ' to'
num_token_ids: 1
-----
         text: ' a'
num_token_ids: 1
-----
         text: ' polar'
num_token_ids: 1
-----

which is expected, but on 0.5.5:

         text: ' the'
num_token_ids: 1
-----
         text: ' term'
num_token_ids: 0
-----
         text: ''
num_token_ids: 0
-----
         text: ' “'
num_token_ids: 0
-----
         text: 'p'
num_token_ids: 0
-----
         text: 'olar'
num_token_ids: 0
-----
         text: 'ized'
num_token_ids: 0
-----
         text: ''
num_token_ids: 0
-----
         text: '”'
num_token_ids: 0
-----
         text: ' is'
num_token_ids: 0
-----
         text: ' used'
num_token_ids: 0
-----
         text: ' to'
num_token_ids: 0
-----
         text: ' refer'
num_token_ids: 0
-----
         text: ' to'
num_token_ids: 0
-----
         text: ' a'
num_token_ids: 0
-----
         text: ' polar'
num_token_ids: 0

Contributor Author

It appears the previous output's token_ids field is overwritten by the current output when the engine is streaming outputs for a request.
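
In other words, per-step token counts are derived as a delta against the previously streamed output, roughly as in the simplified sketch below (not the exact model.py code), which is why the guard in the TODO above asserts >= 0 rather than > 0:

```python
# Simplified sketch of delta-based token counting during streaming; the
# variable names are illustrative and not taken verbatim from model.py.
prev_lengths = {}  # sequence index -> number of token ids seen so far

def num_new_tokens(completion_output):
    # completion_output is a vLLM CompletionOutput carrying cumulative token_ids.
    prev = prev_lengths.get(completion_output.index, 0)
    curr = len(completion_output.token_ids)
    prev_lengths[completion_output.index] = curr
    # On vLLM 0.5.5 the engine could repeat identical token_ids between steps,
    # so the delta may legitimately be 0; hence the assert is >= 0, not > 0.
    assert curr - prev >= 0
    return curr - prev
```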

Contributor Author

The issue appears to be fixed in a later release, i.e. 0.6.3.post1:

         text: ' the'
num_token_ids: 1
-----
         text: ' term'
num_token_ids: 1
-----
         text: ''
num_token_ids: 1
-----
         text: ' “'
num_token_ids: 1
-----
         text: 'p'
num_token_ids: 1
-----
         text: 'olar'
num_token_ids: 1
-----
         text: 'ized'
num_token_ids: 1
-----
         text: ''
num_token_ids: 1
-----
         text: '”'
num_token_ids: 1
-----
         text: ' is'
num_token_ids: 1
-----
         text: ' used'
num_token_ids: 1
-----
         text: ' to'
num_token_ids: 1
-----
         text: ' refer'
num_token_ids: 1
-----
         text: ' to'
num_token_ids: 1
-----
         text: ' a'
num_token_ids: 1
-----
         text: ' polar'
num_token_ids: 1

@rmccorm4 (Contributor) commented on Nov 7, 2024

I think num_token_ids could be better named to reflect that it counts output tokens. OpenAI APIs return information about both input (prompt/context) and output (decode/generation) tokens, so we should leave room for that with clear naming, even if we only implement the output tokens in this PR.

How about num_output_tokens now, and num_input_tokens in the future if requested?

Contributor Author

The original ask is for "number of tokens", but I agree that we could return token_ids instead; from a feature perspective, it keeps the "return_*" space less crowded if returning token_ids is requested in the future in addition to num_token_ids.

@rmccorm4 (Contributor) commented on Nov 7, 2024

Sorry, I edited my responses above. While returning token_ids is probably more general, for large sequences it could be more costly than returning a single number, and it would require extra postprocessing if all the user wants is the count (for non-streaming), so I removed that part for now. I'm a little torn on it.

We can probably just stick to the ask of returning the token counts, and consider either verbose logging (debugging) or further outputs for actual token_ids if requested later?

Contributor Author

Return token ids instead of number of token ids

The model.py on the CI pipeline has been replaced with the version on this commit.

Contributor Author

Yes, we can just add token_ids if requested later. Will update the name num_token_ids to num_output_tokens.

Contributor Author

Rename num_token_ids to num_output_tokens

The model.py on the CI pipeline has been replaced with the version on this commit.

[here](https://github.com/vllm-project/vllm/blob/v0.6.3.post1/vllm/outputs.py#L26)
for more details.

To enable, set `output_finish_reason` input tensor to `True`. The reason will be

Contributor

FYI - In the TRT-LLM backend, the input tensors that enable optional outputs are named with a "return_XXX" convention. I don't think we must align on naming, though; I'd prefer to use what vLLM users are used to.

Contributor Author

Since this is our first attempt at sending optional outputs from vLLM, we can name the input tensor switches "return_*". @rmccorm4 what do you think?

Contributor Author

Rename output_* to return_*

The model.py on the CI pipeline has been replaced with the version on this commit.

@rmccorm4 (Contributor) commented on Nov 7, 2024

I'm in favor of being as similar across both backends as possible - unless there's a clear framework-specific idiom like Kris mentioned (ex: something vllm users are already comfortable with).

Labels: PR: feat (A new feature)

3 participants