feat: Support sending additional outputs from vLLM inference #70
base: main
Conversation
* [WIP] Add additional outputs to auto complete
* [WIP] Use individual input tensor to control per additional output
* [WIP] Parse additional output flags from request

264d387 to 9fc7d0b
* Add additional outputs test
* Update copyright
* Some test enhancement and notes

9fc7d0b to 5e605ca
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
-->

# Additional Outputs from vLLM
Excellent documentation and concise reference from top-level README!
# TODO: vLLM may return token ids identical to the previous one when
# streaming, for example:
#
# prev: None
# curr: text=' the', token_ids=array('l', [5])
#
# prev: text=' the', token_ids=array('l', [5, 1385])
# curr: text=' the term', token_ids=array('l', [5, 1385])
#
# prev: text=' the term', token_ids=array('l', [5, 1385, 44])
# curr: text=' the term', token_ids=array('l', [5, 1385, 44])
#
# prev: text=' the term', token_ids=array('l', [5, 1385, 44, 48])
# curr: text=' the term “', token_ids=array('l', [5, 1385, 44, 48])
#
# If this is no longer the case in a future release, change the assert
# to assert num_token_ids > 0.
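For context, a minimal sketch of the per-step delta this TODO guards; `prev_output` and `curr_output` are illustrative names for consecutive streamed outputs of the same sequence, not the PR's actual variables:

```python
def num_new_token_ids(prev_output, curr_output):
    # Illustrative sketch: count the tokens added in this streaming step by
    # comparing the cumulative token_ids of consecutive outputs.
    prev_len = 0 if prev_output is None else len(prev_output.token_ids)
    delta = len(curr_output.token_ids) - prev_len
    # The duplicated outputs described above make a delta of 0 possible today;
    # if a future vLLM release guarantees progress every step, tighten to > 0.
    assert delta >= 0
    return delta
```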
I'm not totally following this section. Can you show me an example output from calling `/generate_stream` on a vLLM model with `output_num_token_ids=True` and `stream=True`?
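For context, a request of that shape might look roughly like the sketch below; it assumes the model is served behind Triton's `generate_stream` endpoint, and the model name, prompt, and `max_tokens` value are placeholders:

```python
import json
import requests

# Hypothetical request shape; the flag name comes from the discussion above.
payload = {
    "text_input": "Explain what polarized light is.",
    "stream": True,
    "output_num_token_ids": True,
    "sampling_parameters": json.dumps({"max_tokens": 32}),
}
with requests.post(
    "http://localhost:8000/v2/models/vllm_model/generate_stream",
    json=payload,
    stream=True,
) as resp:
    for line in resp.iter_lines():
        if line:
            print(line.decode())  # SSE "data: {...}" chunks, one per streamed output
```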
On 0.5.3.post1:
text: ' the'
num_token_ids: 1
-----
text: ' term'
num_token_ids: 1
-----
text: ''
num_token_ids: 1
-----
text: ' “'
num_token_ids: 1
-----
text: 'p'
num_token_ids: 1
-----
text: 'olar'
num_token_ids: 1
-----
text: 'ized'
num_token_ids: 1
-----
text: ''
num_token_ids: 1
-----
text: '”'
num_token_ids: 1
-----
text: ' is'
num_token_ids: 1
-----
text: ' used'
num_token_ids: 1
-----
text: ' to'
num_token_ids: 1
-----
text: ' refer'
num_token_ids: 1
-----
text: ' to'
num_token_ids: 1
-----
text: ' a'
num_token_ids: 1
-----
text: ' polar'
num_token_ids: 1
-----
which is expected, but on 0.5.5:
text: ' the'
num_token_ids: 1
-----
text: ' term'
num_token_ids: 0
-----
text: ''
num_token_ids: 0
-----
text: ' “'
num_token_ids: 0
-----
text: 'p'
num_token_ids: 0
-----
text: 'olar'
num_token_ids: 0
-----
text: 'ized'
num_token_ids: 0
-----
text: ''
num_token_ids: 0
-----
text: '”'
num_token_ids: 0
-----
text: ' is'
num_token_ids: 0
-----
text: ' used'
num_token_ids: 0
-----
text: ' to'
num_token_ids: 0
-----
text: ' refer'
num_token_ids: 0
-----
text: ' to'
num_token_ids: 0
-----
text: ' a'
num_token_ids: 0
-----
text: ' polar'
num_token_ids: 0
It appears the previous output (the `token_ids` field) is overwritten by the current output when the engine is streaming outputs for a request.
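A hedged sketch of one way around that aliasing: snapshot what is needed from the current output instead of keeping a reference to the engine's object (`curr_output` is an illustrative name):

```python
def snapshot_token_ids(curr_output):
    # Illustrative workaround: copy the cumulative token ids out of the engine's
    # output object, so a later in-place update by the engine cannot rewrite the
    # "previous" value used for the next per-step comparison.
    return list(curr_output.token_ids)
```

Tracking just the previous length as a plain integer would work equally well if only a count is needed.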
The issue appears to be fixed in a later release, i.e. 0.6.3.post1:
text: ' the'
num_token_ids: 1
-----
text: ' term'
num_token_ids: 1
-----
text: ''
num_token_ids: 1
-----
text: ' “'
num_token_ids: 1
-----
text: 'p'
num_token_ids: 1
-----
text: 'olar'
num_token_ids: 1
-----
text: 'ized'
num_token_ids: 1
-----
text: ''
num_token_ids: 1
-----
text: '”'
num_token_ids: 1
-----
text: ' is'
num_token_ids: 1
-----
text: ' used'
num_token_ids: 1
-----
text: ' to'
num_token_ids: 1
-----
text: ' refer'
num_token_ids: 1
-----
text: ' to'
num_token_ids: 1
-----
text: ' a'
num_token_ids: 1
-----
text: ' polar'
num_token_ids: 1
I think `num_token_ids` could probably be better named to reflect that it counts the output tokens. OpenAI APIs return information about both input (prompt/context) and output (decode, generation) tokens - so we should leave room for that with clear naming, even if we only implement the output tokens in this PR.
How about `num_output_tokens`, and in the future, if requested, `num_input_tokens`?
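For illustration, declaring such an output during auto-complete could look roughly like the sketch below; the data type and dims are assumptions, not necessarily what the PR uses:

```python
def auto_complete_config(auto_complete_model_config):
    # Hypothetical sketch: declare the optional count output during auto-complete.
    # TYPE_UINT32 and dims [-1] are assumptions for illustration only.
    auto_complete_model_config.add_output(
        {"name": "num_output_tokens", "data_type": "TYPE_UINT32", "dims": [-1]}
    )
    return auto_complete_model_config
```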
The original ask is for "number of tokens", but I agree that we could return `token_ids` instead: from a feature perspective, it keeps the "return_*" space less crowded if returning `token_ids` is requested in the future in addition to `num_token_ids`.
Sorry, I edited my responses above. While returning `token_ids` is probably more general, for long sequences it could actually be more costly than returning a single number if all the user wants is the count (for non-streaming), and it would require some client-side postprocessing just to get the count, so I removed that part for now. I'm a little torn on it.
We can probably just stick to the original ask of returning token counts, and consider either verbose logging (for debugging) or additional outputs for the actual token_ids if requested later?
Return token ids instead of number of token ids
The model.py in the CI pipeline has been replaced with the version from this commit.
Yes, we can just add `token_ids` if requested later. Will update the name `num_token_ids` to `num_output_tokens`.
Rename num_token_ids to num_output_tokens
The model.py in the CI pipeline has been replaced with the version from this commit.
docs/additional_outputs.md (Outdated)
[here](https://github.com/vllm-project/vllm/blob/v0.6.3.post1/vllm/outputs.py#L26)
for more details.

To enable, set `output_finish_reason` input tensor to `True`. The reason will be
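As a usage illustration only, setting such a BOOL switch from a Python client could look like the following; the surrounding client setup is omitted and the variable names are made up:

```python
import numpy as np
import tritonclient.grpc as grpcclient

# Hypothetical client-side sketch: request the finish reason by sending the
# optional BOOL input tensor alongside the usual text_input/stream inputs.
finish_reason_flag = grpcclient.InferInput("output_finish_reason", [1], "BOOL")
finish_reason_flag.set_data_from_numpy(np.array([True], dtype=bool))
```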
FYI - in the TRT-LLM backend, the input tensors that enable optional outputs follow the "return_XXX" naming convention. I don't think we have to align with that naming, though; I'd prefer to use whatever vLLM users are used to.
Since this is our first attempt at sending optional outputs from vLLM, we can name the input tensor switches "return_*". @rmccorm4 what do you think?
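If the switches do end up named "return_*", reading them in the Python backend could look roughly like this hedged sketch (the helper and tensor names are illustrative, not the PR's actual code):

```python
import triton_python_backend_utils as pb_utils

def get_bool_flag(request, name, default=False):
    # Hypothetical helper: read an optional BOOL input tensor used as a
    # per-request switch, e.g. "return_finish_reason" or "return_num_output_tokens".
    tensor = pb_utils.get_input_tensor_by_name(request, name)
    if tensor is None:
        return default
    return bool(tensor.as_numpy()[0])
```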
The model.py in the CI pipeline has been replaced with the version from this commit.
I'm in favor of being as similar across both backends as possible - unless there's a clear framework-specific idiom like Kris mentioned (e.g. something vLLM users are already comfortable with).
What does the PR do?
Add support for sending additional outputs from vLLM. In this step, the following three outputs are added:
Checklist
<commit_type>: <Title>
Commit Type:
Check the conventional commit type box here and add the label to the GitHub PR.
Related PRs:
N/A
Where should the reviewer start?
N/A
Test plan:
A new test is added with this PR that covers all combinations of the three additional output switches and verifies that the outputs are valid for each combination; see the sketch below.
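A sketch of what "all combinations" could mean in such a test, assuming illustrative switch names and hypothetical helpers; this is not the PR's actual test code:

```python
import itertools

# Illustrative switch names; the PR's actual flags may differ.
FLAGS = ["return_finish_reason", "return_num_output_tokens"]  # plus any other added output

def run_all_combinations(send_request, verify_outputs):
    # send_request and verify_outputs are hypothetical helpers supplied by the test.
    for combo in itertools.product([False, True], repeat=len(FLAGS)):
        request_flags = dict(zip(FLAGS, combo))
        response = send_request(**request_flags)
        # Each optional output should be present and valid exactly when requested.
        verify_outputs(response, request_flags)
```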
Caveats:
N/A
Background
Outputs supported by vLLM in addition to text output: https://github.com/vllm-project/vllm/blob/v0.6.3.post1/vllm/outputs.py#L14-L40
Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)
N/A