
Option to use better tokenization #9

Open
simonw opened this issue Jan 15, 2023 · 4 comments
Labels
enhancement New feature or request

Comments


simonw commented Jan 15, 2023

Related:

I think this is because my current tokenization code is this:

import regex  # the third-party regex module, needed for \p{L} / \p{N}

# From https://github.com/openai/gpt-2/blob/a74da5d99abaaba920de8131d64da2862a8f213b/src/encoder.py#L53
token_re = regex.compile(
    r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
)


def tokenize(text):
    return [t.strip() for t in token_re.findall(text)]


def count_tokens(text):
    return len(tokenize(text))
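
A quick sanity check, worked out by hand from that regex (the real BPE can split rarer words into sub-word pieces, which this approximation misses):

tokenize("Hello world! Is this more accurate?")
# ['Hello', 'world', '!', 'Is', 'this', 'more', 'accurate', '?']
count_tokens("Hello world! Is this more accurate?")
# 8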

But that's not exactly accurate. The https://beta.openai.com/tokenizer page suggests using this instead: https://huggingface.co/docs/transformers/model_doc/gpt2#transformers.GPT2TokenizerFast

simonw added the enhancement label Jan 15, 2023

simonw commented Jan 15, 2023

I got that working in a notebook:

from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
print(len(tokenizer("Hello world! Is this more accurate?")["input_ids"]))
# Output 8

There's one big catch: the first time I called GPT2TokenizerFast.from_pretrained("gpt2") it downloaded a bunch of data from somewhere:

[screenshot: download progress for the tokenizer files]
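
The files land in the Hugging Face cache (~/.cache/huggingface by default). A minimal sketch of keeping that under control, using the standard cache_dir and local_files_only arguments that from_pretrained accepts:

from transformers import GPT2TokenizerFast

# First call downloads vocab.json, merges.txt and tokenizer.json into cache_dir;
# later calls reuse the cached copies.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2", cache_dir="gpt2-cache")

# Once the files are cached, loading can be forced to stay offline:
tokenizer = GPT2TokenizerFast.from_pretrained(
    "gpt2", cache_dir="gpt2-cache", local_files_only=True
)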


simonw commented Jan 15, 2023

So... maybe make this an optional dependency.

Or figure out how to bundle the files that get downloaded there into the PyPI package?
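
A sketch of the optional-dependency route (the try/except fallback here is just one possible shape, not settled design): use GPT2TokenizerFast when transformers is installed, otherwise fall back to the regex approximation above.

try:
    from transformers import GPT2TokenizerFast

    _tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

    def count_tokens(text):
        # Accurate GPT-2 BPE count when transformers is available
        return len(_tokenizer(text)["input_ids"])
except ImportError:
    def count_tokens(text):
        # Fall back to the regex-based approximation defined earlier
        return len(tokenize(text))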


simonw commented Jan 15, 2023

Relevant code: https://github.com/huggingface/transformers/blob/5db9abde439bc02c3791da2a4fefee80d94d5b96/src/transformers/models/gpt2/tokenization_gpt2_fast.py#L37-L59

PRETRAINED_VOCAB_FILES_MAP = {
    "vocab_file": {
        "gpt2": "https://huggingface.co/gpt2/resolve/main/vocab.json",
        "gpt2-medium": "https://huggingface.co/gpt2-medium/resolve/main/vocab.json",
        "gpt2-large": "https://huggingface.co/gpt2-large/resolve/main/vocab.json",
        "gpt2-xl": "https://huggingface.co/gpt2-xl/resolve/main/vocab.json",
        "distilgpt2": "https://huggingface.co/distilgpt2/resolve/main/vocab.json",
    },
    "merges_file": {
        "gpt2": "https://huggingface.co/gpt2/resolve/main/merges.txt",
        "gpt2-medium": "https://huggingface.co/gpt2-medium/resolve/main/merges.txt",
        "gpt2-large": "https://huggingface.co/gpt2-large/resolve/main/merges.txt",
        "gpt2-xl": "https://huggingface.co/gpt2-xl/resolve/main/merges.txt",
        "distilgpt2": "https://huggingface.co/distilgpt2/resolve/main/merges.txt",
    },
    "tokenizer_file": {
        "gpt2": "https://huggingface.co/gpt2/resolve/main/tokenizer.json",
        "gpt2-medium": "https://huggingface.co/gpt2-medium/resolve/main/tokenizer.json",
        "gpt2-large": "https://huggingface.co/gpt2-large/resolve/main/tokenizer.json",
        "gpt2-xl": "https://huggingface.co/gpt2-xl/resolve/main/tokenizer.json",
        "distilgpt2": "https://huggingface.co/distilgpt2/resolve/main/tokenizer.json",
    },
}
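
For the bundling idea: GPT2TokenizerFast can also be constructed straight from local files, so shipping gpt2's vocab.json and merges.txt inside the package would skip the download entirely. A sketch, with hypothetical bundled paths:

from transformers import GPT2TokenizerFast

# Load from files shipped with the package instead of hitting huggingface.co
# (the paths here are hypothetical)
tokenizer = GPT2TokenizerFast(
    vocab_file="bundled/vocab.json",
    merges_file="bundled/merges.txt",
)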


simonw commented Jan 18, 2023

I think this might be better - and may be a simple enough dependency that I can use it by default: https://github.com/openai/tiktoken
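
For comparison, a minimal tiktoken version of the earlier count (tiktoken also fetches its BPE file on first use, but it caches it and is a much lighter dependency than transformers):

import tiktoken

enc = tiktoken.get_encoding("gpt2")
print(len(enc.encode("Hello world! Is this more accurate?")))
# Should print 8, matching the GPT2TokenizerFast result above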
