Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tiktoken special params #924

Merged
merged 7 commits into from
Feb 10, 2023
Merged

Tiktoken special params #924

merged 7 commits into from
Feb 10, 2023

Conversation

jamescalam
Copy link
Contributor

@jamescalam jamescalam commented Feb 7, 2023

The PR allows for allowed_special and disallowed_special parameters to be used (see issue #923 ). The default parameters for these are set() and "all" respectively as per the code.

The reason this is needed is because when a GPT special token appears in some text to be encoded, an error will be raised (see issue #923 ) - using these special token params is the only way to get around it.

Also added the same functionality for the TokenTextSplitter, so now this will work:

from langchain.text_splitter import TokenTextSplitter

text_splitter = TokenTextSplitter.from_tiktoken_encoder(
    encoding_name=encoder_name,
    chunk_size=300,
    chunk_overlap=50
)
text_splitter.split_text(
    some_text, 
    disallowed_special=()
)

@jamescalam
Copy link
Contributor Author

@hwchase17 possible to get some feedback on this? Let me know if you prefer it to be implemented another way and you can run the code that this will fix here https://colab.research.google.com/drive/18S7AH2K64vymFA-Obeqp_O-1LwFn3i3Q?usp=sharing

def split_text(
self,
text: str,
allowed_special: Union[Literal["all"], AbstractSet[str]] = set(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thoughts on putting this in the init, eg self.allowed_special etc?

would be more consistent with other splitters (taking all the params at init rather than run time)

if you agree, im also happy to make this change myself

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hey Harrison, yes this makes sense, please feel free to make the change, thanks!

@hwchase17 hwchase17 changed the base branch from master to harrison/tiktoken-spec February 10, 2023 07:16
@hwchase17 hwchase17 merged commit 9d01ba5 into langchain-ai:harrison/tiktoken-spec Feb 10, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants