Access special token params for tiktoken #923

Closed
jamescalam opened this issue Feb 7, 2023 · 6 comments

Comments

@jamescalam
Contributor

If text being tokenized by tiktoken contains a special token like <|endoftext|>, we see the error:

ValueError: Encountered text corresponding to disallowed special token '<|endoftext|>'.
If you want this text to be encoded as a special token, pass it to `allowed_special`, e.g. `allowed_special={'<|endoftext|>', ...}`.
If you want this text to be encoded as normal text, disable the check for this token by passing `disallowed_special=(enc.special_tokens_set - {'<|endoftext|>'})`.
To disable this check for all special tokens, pass `disallowed_special=()`.

But we cannot access the disallowed_special or allowed_special params via langchain.
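
For readers unfamiliar with the check behind this error, here is a minimal stdlib-only sketch of how the two parameters appear to interact, based on the defaults and options listed in the error message above. It is a simplified model, not tiktoken's actual code: `SPECIAL_TOKENS` and `check_special` are hypothetical names introduced for illustration.

```python
import re

# Hypothetical stand-in for an encoding's special tokens (not tiktoken's real object).
SPECIAL_TOKENS = frozenset({"<|endoftext|>"})


def check_special(text, allowed_special=frozenset(), disallowed_special="all"):
    """Raise ValueError if `text` contains a disallowed special token."""
    if allowed_special == "all":
        allowed_special = SPECIAL_TOKENS
    if disallowed_special == "all":
        # By default, every special token that is not explicitly allowed is disallowed.
        disallowed_special = SPECIAL_TOKENS - frozenset(allowed_special)
    if not disallowed_special:
        return  # Check disabled, e.g. disallowed_special=()
    pattern = "|".join(re.escape(tok) for tok in disallowed_special)
    match = re.search(pattern, text)
    if match:
        raise ValueError(
            f"Encountered text corresponding to disallowed special token {match.group()!r}."
        )


text = "hello <|endoftext|> world"
check_special(text, disallowed_special=())               # no error: check disabled
check_special(text, allowed_special={"<|endoftext|>"})   # no error: token allowed
try:
    check_special(text)                                  # defaults: raises
except ValueError as err:
    print(err)
```

Under this model, the only ways to get past the error are the ones the message lists, which is why the parameters need to be reachable from langchain.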

Here's a colab demoing the above: https://colab.research.google.com/drive/18S7AH2K64vymFA-Obeqp_O-1LwFn3i3Q?usp=sharing

Submitting a PR

hwchase17 pushed a commit that referenced this issue Feb 10, 2023
The PR allows for `allowed_special` and `disallowed_special` parameters
to be used (see issue #923 ). The default parameters for these are
`set()` and `"all"` respectively [as per the
code](https://github.com/openai/tiktoken/blob/main/tiktoken/core.py#L74).

This is needed because when a GPT special token appears in text to be
encoded, an error is raised (see issue #923); using these special token
params is the only way to get around it.

Also added the same functionality for the `TokenTextSplitter`, so now
this will work:

```python
from langchain.text_splitter import TokenTextSplitter

text_splitter = TokenTextSplitter.from_tiktoken_encoder(
    encoding_name=encoder_name,
    chunk_size=300,
    chunk_overlap=50
)
text_splitter.split_text(
    some_text, 
    disallowed_special=()
)
```
@hwchase17
Contributor

done!

@VGPS

VGPS commented May 4, 2023

still not working.....

@SimonB97

Same here, I am getting a TypeError:
TypeError: split_text() got an unexpected keyword argument 'disallowed_special'

My code:

llm = OpenAI(temperature=0)
text_splitter = TokenTextSplitter.from_tiktoken_encoder(
    encoding_name='gpt2',
    chunk_size=300,
    chunk_overlap=50,
)
texts = text_splitter.split_text(documents, disallowed_special=())
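
One plausible reading of this TypeError, consistent with the later comments in this thread, is that the special-token settings are taken when the splitter is constructed and not on each `split_text` call. The sketch below illustrates that pattern with a toy class. `MiniTokenSplitter` is hypothetical and is not langchain's implementation; it only assumes that the settings are stored at construction time and applied at encode time.

```python
SPECIAL = "<|endoftext|>"


class MiniTokenSplitter:
    """Hypothetical toy splitter; NOT langchain's implementation."""

    def __init__(self, chunk_size=4, allowed_special=frozenset(), disallowed_special="all"):
        self.chunk_size = chunk_size
        # Special-token settings are stored once, at construction time.
        self.allowed_special = allowed_special
        self.disallowed_special = disallowed_special

    def _encode(self, text):
        # Stand-in for a tokenizer call that receives the stored settings.
        if self.disallowed_special == "all":
            blocked = {SPECIAL} - set(self.allowed_special)
        else:
            blocked = set(self.disallowed_special)
        if any(tok in text for tok in blocked):
            raise ValueError("Encountered text corresponding to disallowed special token")
        return text.split()  # Whitespace "tokens", for illustration only

    def split_text(self, text):
        # Note: no special-token keyword arguments are accepted here.
        tokens = self._encode(text)
        return [
            " ".join(tokens[i:i + self.chunk_size])
            for i in range(0, len(tokens), self.chunk_size)
        ]


splitter = MiniTokenSplitter(chunk_size=2, disallowed_special=())
print(splitter.split_text("a b <|endoftext|> c"))  # ['a b', '<|endoftext|> c']

try:
    splitter.split_text("a b", disallowed_special=())  # keyword not accepted per call
except TypeError as err:
    print(err)
```

With a design like this, passing `disallowed_special=()` to the constructor (or a factory such as `from_tiktoken_encoder`) works, while passing it to `split_text` fails with an unexpected-keyword error like the one above.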

@kraj011

kraj011 commented Jun 19, 2023

From: shashnkvats/PdfPal#1 (comment)
"Adding disallowed_special=() parameter to OpenAIEmbeddings() function fixes it."
This ended up working for me :D

@Aiksyuan

"OpenAIEmbeddings()" is to be replaced with "OpenAIEmbeddings(disallowed_special=())"

@panruotong

text_splitter = CharacterTextSplitter.from_tiktoken_encoder( encoding_name="cl100k_base", chunk_size=500, chunk_overlap=0, disallowed_special=() )
This works for me.
