Access special token params for tiktoken #923

jamescalam · 2023-02-07T08:19:22Z

If a text being tokenized by tiktoken contains a special token like <|endoftext|>, we will see the error:

ValueError: Encountered text corresponding to disallowed special token '<|endoftext|>'.
If you want this text to be encoded as a special token, pass it to `allowed_special`, e.g. `allowed_special={'<|endoftext|>', ...}`.
If you want this text to be encoded as normal text, disable the check for this token by passing `disallowed_special=(enc.special_tokens_set - {'<|endoftext|>'})`.
To disable this check for all special tokens, pass `disallowed_special=()`.

But we cannot access the disallowed_special or allowed_special params via langchain.
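
To make the three options in the error message concrete, here is a minimal sketch of the check it describes. This is a simplified model, not tiktoken's source: `SPECIAL_TOKENS` and `check_special` are hypothetical names, and the function only performs the check, it does not encode anything.

```python
# Simplified model of the special-token check described by the error message.
# NOT tiktoken's code: SPECIAL_TOKENS and check_special are hypothetical names.
SPECIAL_TOKENS = frozenset({"<|endoftext|>"})


def check_special(text, allowed_special=frozenset(), disallowed_special="all"):
    """Raise ValueError if `text` contains a disallowed special token."""
    if allowed_special == "all":
        allowed_special = SPECIAL_TOKENS
    if disallowed_special == "all":
        # Default: every special token that was not explicitly allowed.
        disallowed_special = SPECIAL_TOKENS - frozenset(allowed_special)
    for token in disallowed_special:
        if token in text:
            raise ValueError(
                f"Encountered text corresponding to disallowed special token {token!r}."
            )
    return True


text = "first document<|endoftext|>second document"

# The default arguments reproduce the error from this issue.
try:
    check_special(text)
    default_raised = False
except ValueError:
    default_raised = True

# Either option named in the error message lets the text through.
allowed_ok = check_special(text, allowed_special={"<|endoftext|>"})
disabled_ok = check_special(text, disallowed_special=())
```

With the defaults (`set()` and `"all"`), any special token in the input is rejected, which is why a caller needs some way to pass these two parameters through.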

Here's a colab demoing the above: https://colab.research.google.com/drive/18S7AH2K64vymFA-Obeqp_O-1LwFn3i3Q?usp=sharing

Submitting a PR

The PR allows the `allowed_special` and `disallowed_special` parameters to be used (see issue #923). The defaults for these are `set()` and `"all"` respectively, [as per the code](https://github.com/openai/tiktoken/blob/main/tiktoken/core.py#L74). This is needed because an error is raised when a GPT special token appears in text to be encoded (see issue #923); using these special token params is the only way to get around it.

Also added the same functionality for the `TokenTextSplitter`, so now this will work:

```python
from langchain.text_splitter import TokenTextSplitter

text_splitter = TokenTextSplitter.from_tiktoken_encoder(
    encoding_name=encoder_name,
    chunk_size=300,
    chunk_overlap=50
)
text_splitter.split_text(
    some_text,
    disallowed_special=()
)
```

hwchase17 · 2023-02-11T07:19:55Z

done!

VGPS · 2023-05-04T01:54:32Z

still not working.....

SimonB97 · 2023-05-22T11:32:48Z

Same here, I am getting a TypeError:
TypeError: split_text() got an unexpected keyword argument 'disallowed_special'

my code:

llm = OpenAI(temperature=0)
text_splitter = TokenTextSplitter.from_tiktoken_encoder(
    encoding_name='gpt2',
    chunk_size=300,
    chunk_overlap=50,
)
texts = text_splitter.split_text(documents, disallowed_special=())

kraj011 · 2023-06-19T22:23:02Z

From: shashnkvats/PdfPal#1 (comment)
"Adding disallowed_special=() parameter to OpenAIEmbeddings() function fixes it."
This ended up working for me :D

Aiksyuan · 2023-07-17T08:23:08Z

"OpenAIEmbeddings()" is to be replaced with "OpenAIEmbeddings(disallowed_special=())"

panruotong · 2024-06-03T14:20:55Z

text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",
    chunk_size=500,
    chunk_overlap=0,
    disallowed_special=(),
)
This works for me.
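
One reading of the comments above is that the special-token parameters are accepted when the splitter is constructed, not by `split_text` itself. The toy model below illustrates that pattern under that assumption. `ToyTokenSplitter` is a hypothetical class with a whitespace "tokenizer"; it is not langchain's implementation.

```python
class ToyTokenSplitter:
    """Hypothetical stand-in for a token-based splitter (not langchain's class)."""

    def __init__(self, chunk_size, chunk_overlap=0,
                 allowed_special=frozenset(), disallowed_special="all"):
        if chunk_overlap >= chunk_size:
            raise ValueError("chunk_overlap must be smaller than chunk_size")
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
        # Special-token settings are stored once, at construction time.
        self.allowed_special = allowed_special
        self.disallowed_special = disallowed_special

    def _encode(self, text):
        # Whitespace split standing in for a real encoder.
        if self.disallowed_special == "all":
            blocked = {"<|endoftext|>"} - set(self.allowed_special)
        else:
            blocked = set(self.disallowed_special)
        for token in blocked:
            if token in text:
                raise ValueError(f"disallowed special token {token!r}")
        return text.split()

    def split_text(self, text):
        # Takes only the text; no special-token keyword arguments here.
        tokens = self._encode(text)
        step = self.chunk_size - self.chunk_overlap
        return [
            " ".join(tokens[i:i + self.chunk_size])
            for i in range(0, len(tokens), step)
        ]


# Passing the setting to the constructor works.
splitter = ToyTokenSplitter(chunk_size=3, chunk_overlap=1, disallowed_special=())
chunks = splitter.split_text("a b <|endoftext|> c d")

# Passing it to split_text fails the same way as the TypeError reported above.
try:
    splitter.split_text("a b c", disallowed_special=())
    got_type_error = False
except TypeError:
    got_type_error = True
```

In this model the fix is only a matter of where the argument is passed: the constructor forwards it to every later encode call, while `split_text` has no such parameter.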

jamescalam mentioned this issue Feb 7, 2023

Tiktoken special params #924

Merged

hwchase17 closed this as completed Feb 11, 2023

slavakurilyak mentioned this issue Apr 4, 2023

Error during traversal: The text contains a special token that is not allowed context-labs/autodoc#22

Open

shibanovp mentioned this issue Apr 10, 2023

OpenAIEmbeddings special token params for tiktoken #2681

Closed

shashnkvats mentioned this issue May 2, 2023

Code is not working as expected shashnkvats/PdfPal#1

Open

flash1293 mentioned this issue Dec 4, 2023

Vector DB CDK: Fix special tokens airbytehq/airbyte#33065

Merged

jamesvillarrubia mentioned this issue Jan 29, 2024

TikToken Special Character Conflict eli64s/readme-ai#88

Open
