-
-
Notifications
You must be signed in to change notification settings - Fork 5.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[RFC][vllm-API] Support tokenizer registry for customized tokenizer in vLLM #12518
base: main
Are you sure you want to change the base?
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can do one of these:
🚀 |
Thanks for cleaning this up, can you update the relevant CLI args to have a similar interface as tool parsing? |
@DarkLight1337 Sure, I would be happy to. just to make sure I understand correctly, which CLI interface you are referring to? (This PR does not require change of external interface. It only add a new type in tokenizer_mode arg) |
This pull request has merge conflicts that must be resolved before it can be |
I mean you should update the help text inside |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Btw, can we create some unittest to check the custom tokenizer logic?
Signed-off-by: Keyun Tong <[email protected]>
Signed-off-by: Keyun Tong <[email protected]>
Signed-off-by: Keyun Tong <[email protected]>
Signed-off-by: Keyun Tong <[email protected]>
Signed-off-by: Keyun Tong <[email protected]>
Signed-off-by: Keyun Tong <[email protected]>
Signed-off-by: Keyun Tong <[email protected]>
@property | ||
def sep_token(self) -> str: | ||
raise NotImplementedError() | ||
|
||
@property | ||
def pad_token(self) -> str: | ||
raise NotImplementedError() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we get this by decoding the token IDs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@DarkLight1337 There is no official corresponding sep_token and pad token in mistral tokenizer. Also decoding special token id is not supported in the mistral tokenizer at the moment.
Since this these two methods in MistralTokenizer is actually not used throughout the code base, maybe we leave it as NotImplemented until we have the need for it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, let's leave this as is then.
To allow customized on-demand registration of tokenizer, this diff is to propose a tokenizer registry in vllm, similar to model registry.