Access special token params for tiktoken #923
Comments
The PR allows the `allowed_special` and `disallowed_special` parameters to be used (see issue #923). The defaults for these are `set()` and `"all"` respectively, [as per the code](https://github.com/openai/tiktoken/blob/main/tiktoken/core.py#L74). This is needed because an error is raised when a GPT special token appears in the text to be encoded (see issue #923); using these special token params is the only way around it.

Also added the same functionality for the `TokenTextSplitter`, so now this will work:

```python
from langchain.text_splitter import TokenTextSplitter

# encoder_name and some_text are placeholders supplied by the caller.
text_splitter = TokenTextSplitter.from_tiktoken_encoder(
    encoding_name=encoder_name, chunk_size=300, chunk_overlap=50
)
text_splitter.split_text(some_text, disallowed_special=())
```
done!
still not working.....
same here, I am getting a TypeError: my code:
From: shashnkvats/PdfPal#1 (comment)
`OpenAIEmbeddings()` should be replaced with `OpenAIEmbeddings(disallowed_special=())`
If a text being tokenized by tiktoken contains a special token like `<|endoftext|>`, we will see an error. But we cannot access the `disallowed_special` or `allowed_special` params via langchain.

Here's a colab demoing the above: https://colab.research.google.com/drive/18S7AH2K64vymFA-Obeqp_O-1LwFn3i3Q?usp=sharing
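To make the behaviour concrete, here is a minimal stdlib-only sketch of how the two parameters interact. It is a simplified model, not tiktoken's actual code: `SPECIAL_TOKENS` and `check_special` are hypothetical names, and the token set is reduced to one entry for illustration. The defaults mirror the ones mentioned in this thread (an empty allowed set and `"all"` disallowed).

```python
import re

# Hypothetical stand-in for an encoding's set of special tokens.
SPECIAL_TOKENS = frozenset({"<|endoftext|>"})


def check_special(text, allowed_special=frozenset(), disallowed_special="all"):
    """Raise ValueError if `text` contains a disallowed special token."""
    if allowed_special == "all":
        allowed_special = SPECIAL_TOKENS
    if disallowed_special == "all":
        # By default, every special token that is not explicitly allowed is rejected.
        disallowed_special = SPECIAL_TOKENS - frozenset(allowed_special)
    if not disallowed_special:
        return  # an empty collection such as () switches the check off
    pattern = "|".join(re.escape(tok) for tok in disallowed_special)
    match = re.search(pattern, text)
    if match:
        raise ValueError(f"disallowed special token: {match.group()!r}")


text = "hello <|endoftext|> world"
check_special(text, disallowed_special=())   # passes: check disabled
check_special(text, allowed_special="all")   # passes: token explicitly allowed
try:
    check_special(text)                      # defaults reject the token
except ValueError as err:
    print(err)
```

With tiktoken itself, the equivalent workaround is to pass `disallowed_special=()` to `Encoding.encode`, which is the parameter this issue asks langchain to expose.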
Submitting a PR