-
Notifications
You must be signed in to change notification settings - Fork 16k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Tiktoken special params #924
Tiktoken special params #924
Conversation
@hwchase17 possible to get some feedback on this? Let me know if you prefer it to be implemented another way and you can run the code that this will fix here https://colab.research.google.com/drive/18S7AH2K64vymFA-Obeqp_O-1LwFn3i3Q?usp=sharing |
def split_text( | ||
self, | ||
text: str, | ||
allowed_special: Union[Literal["all"], AbstractSet[str]] = set(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thoughts on putting this in the init, eg self.allowed_special
etc?
would be more consistent with other splitters (taking all the params at init rather than run time)
if you agree, im also happy to make this change myself
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hey Harrison, yes this makes sense, please feel free to make the change, thanks!
The PR allows for
allowed_special
anddisallowed_special
parameters to be used (see issue #923 ). The default parameters for these areset()
and"all"
respectively as per the code.The reason this is needed is because when a GPT special token appears in some text to be encoded, an error will be raised (see issue #923 ) - using these special token params is the only way to get around it.
Also added the same functionality for the
TokenTextSplitter
, so now this will work: