
Option to use better tokenization #9

Open
simonw opened this issue Jan 15, 2023 · 4 comments
Labels
enhancement New feature or request

Comments


simonw commented Jan 15, 2023

Related:

I think this is because my current tokenization code is this:

import regex  # the third-party regex module, needed for \p{L} / \p{N}

# From https://github.com/openai/gpt-2/blob/a74da5d99abaaba920de8131d64da2862a8f213b/src/encoder.py#L53
token_re = regex.compile(
    r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
)


def tokenize(text):
    return [t.strip() for t in token_re.findall(text)]


def count_tokens(text):
    return len(tokenize(text))
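
A quick sanity check, worked out by hand from that regex (the real BPE can split rarer words into sub-word pieces, which this approximation misses):

tokenize("Hello world! Is this more accurate?")
# ['Hello', 'world', '!', 'Is', 'this', 'more', 'accurate', '?']
count_tokens("Hello world! Is this more accurate?")
# 8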

But that's not exactly accurate. The https://beta.openai.com/tokenizer page suggests using this instead: https://huggingface.co/docs/transformers/model_doc/gpt2#transformers.GPT2TokenizerFast

simonw added the enhancement label Jan 15, 2023

simonw commented Jan 15, 2023

I got that working in a notebook:

from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
print(len(tokenizer("Hello world! Is this more accurate?")["input_ids"]))
# Output 8

There's one big catch: the first time I called GPT2TokenizerFast.from_pretrained("gpt2") it downloaded a bunch of data from somewhere:

[screenshot: download progress for the tokenizer files]
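
The files land in the Hugging Face cache (~/.cache/huggingface by default). A minimal sketch of keeping that under control, using the standard cache_dir and local_files_only arguments that from_pretrained accepts:

from transformers import GPT2TokenizerFast

# First call downloads vocab.json, merges.txt and tokenizer.json into cache_dir;
# later calls reuse the cached copies.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2", cache_dir="gpt2-cache")

# Once the files are cached, loading can be forced to stay offline:
tokenizer = GPT2TokenizerFast.from_pretrained(
    "gpt2", cache_dir="gpt2-cache", local_files_only=True
)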


simonw commented Jan 15, 2023

So... maybe make this an optional dependency.

Or figure out how to bundle the files that get downloaded there into the PyPI package?
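
A sketch of the optional-dependency route (the try/except fallback here is just one possible shape, not settled design): use GPT2TokenizerFast when transformers is installed, otherwise fall back to the regex approximation above.

try:
    from transformers import GPT2TokenizerFast

    _tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

    def count_tokens(text):
        # Accurate GPT-2 BPE count when transformers is available
        return len(_tokenizer(text)["input_ids"])
except ImportError:
    def count_tokens(text):
        # Fall back to the regex-based approximation defined earlier
        return len(tokenize(text))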


simonw commented Jan 15, 2023

Relevant code: https://github.com/huggingface/transformers/blob/5db9abde439bc02c3791da2a4fefee80d94d5b96/src/transformers/models/gpt2/tokenization_gpt2_fast.py#L37-L59

PRETRAINED_VOCAB_FILES_MAP = {
    "vocab_file": {
        "gpt2": "https://huggingface.co/gpt2/resolve/main/vocab.json",
        "gpt2-medium": "https://huggingface.co/gpt2-medium/resolve/main/vocab.json",
        "gpt2-large": "https://huggingface.co/gpt2-large/resolve/main/vocab.json",
        "gpt2-xl": "https://huggingface.co/gpt2-xl/resolve/main/vocab.json",
        "distilgpt2": "https://huggingface.co/distilgpt2/resolve/main/vocab.json",
    },
    "merges_file": {
        "gpt2": "https://huggingface.co/gpt2/resolve/main/merges.txt",
        "gpt2-medium": "https://huggingface.co/gpt2-medium/resolve/main/merges.txt",
        "gpt2-large": "https://huggingface.co/gpt2-large/resolve/main/merges.txt",
        "gpt2-xl": "https://huggingface.co/gpt2-xl/resolve/main/merges.txt",
        "distilgpt2": "https://huggingface.co/distilgpt2/resolve/main/merges.txt",
    },
    "tokenizer_file": {
        "gpt2": "https://huggingface.co/gpt2/resolve/main/tokenizer.json",
        "gpt2-medium": "https://huggingface.co/gpt2-medium/resolve/main/tokenizer.json",
        "gpt2-large": "https://huggingface.co/gpt2-large/resolve/main/tokenizer.json",
        "gpt2-xl": "https://huggingface.co/gpt2-xl/resolve/main/tokenizer.json",
        "distilgpt2": "https://huggingface.co/distilgpt2/resolve/main/tokenizer.json",
    },
}
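
For the bundling idea: GPT2TokenizerFast can also be constructed straight from local files, so shipping gpt2's vocab.json and merges.txt inside the package would skip the download entirely. A sketch, with hypothetical bundled paths:

from transformers import GPT2TokenizerFast

# Load from files shipped with the package instead of hitting huggingface.co
# (the paths here are hypothetical)
tokenizer = GPT2TokenizerFast(
    vocab_file="bundled/vocab.json",
    merges_file="bundled/merges.txt",
)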


simonw commented Jan 18, 2023

I think this might be better - and may be a simple enough dependency that I can use it by default: https://github.com/openai/tiktoken
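
For comparison, a minimal tiktoken version of the earlier count (tiktoken also fetches its BPE file on first use, but it caches it and is a much lighter dependency than transformers):

import tiktoken

enc = tiktoken.get_encoding("gpt2")
print(len(enc.encode("Hello world! Is this more accurate?")))
# Should print 8, matching the GPT2TokenizerFast result above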
