Option to use better tokenization #9
I got that working in a notebook:

    from transformers import GPT2TokenizerFast

    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
    print(len(tokenizer("Hello world! Is this more accurate?")["input_ids"]))
    # Output: 8

<img width="802" alt="image" src="https://user-images.githubusercontent.com/9599/212561369-7a86d32d-97da-4309-87ae-5a8f50bb1da0.png">

There's one big catch: the first time I called
So... maybe make this an optional dependency. Or figure out how to bundle the files that were downloaded there into the PyPI package?
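A minimal sketch of the optional-dependency idea: import `transformers` lazily and raise a clear error if it isn't installed. The `load_gpt2_tokenizer` helper name and the error message are hypothetical, not part of the actual package:

```python
def load_gpt2_tokenizer():
    """Load GPT2TokenizerFast, raising a helpful error if the
    optional 'transformers' dependency is not installed."""
    try:
        from transformers import GPT2TokenizerFast
    except ImportError as ex:
        raise ImportError(
            "Accurate token counting requires the optional "
            "'transformers' dependency: pip install transformers"
        ) from ex
    # Note: downloads vocab/merges files to the HF cache on first use
    return GPT2TokenizerFast.from_pretrained("gpt2")
```

This keeps the base install lightweight while still giving a useful message when the accurate path is unavailable.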
PRETRAINED_VOCAB_FILES_MAP = {
    "vocab_file": {
        "gpt2": "https://huggingface.co/gpt2/resolve/main/vocab.json",
        "gpt2-medium": "https://huggingface.co/gpt2-medium/resolve/main/vocab.json",
        "gpt2-large": "https://huggingface.co/gpt2-large/resolve/main/vocab.json",
        "gpt2-xl": "https://huggingface.co/gpt2-xl/resolve/main/vocab.json",
        "distilgpt2": "https://huggingface.co/distilgpt2/resolve/main/vocab.json",
    },
    "merges_file": {
        "gpt2": "https://huggingface.co/gpt2/resolve/main/merges.txt",
        "gpt2-medium": "https://huggingface.co/gpt2-medium/resolve/main/merges.txt",
        "gpt2-large": "https://huggingface.co/gpt2-large/resolve/main/merges.txt",
        "gpt2-xl": "https://huggingface.co/gpt2-xl/resolve/main/merges.txt",
        "distilgpt2": "https://huggingface.co/distilgpt2/resolve/main/merges.txt",
    },
    "tokenizer_file": {
        "gpt2": "https://huggingface.co/gpt2/resolve/main/tokenizer.json",
        "gpt2-medium": "https://huggingface.co/gpt2-medium/resolve/main/tokenizer.json",
        "gpt2-large": "https://huggingface.co/gpt2-large/resolve/main/tokenizer.json",
        "gpt2-xl": "https://huggingface.co/gpt2-xl/resolve/main/tokenizer.json",
        "distilgpt2": "https://huggingface.co/distilgpt2/resolve/main/tokenizer.json",
    },
}
I think this might be better - and may be a simple enough dependency that I can use it by default: https://github.com/openai/tiktoken
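A rough sketch of what using tiktoken could look like. `tiktoken.get_encoding("gpt2")` and `.encode()` are the library's real API; the `count_tokens` helper and the characters-per-token fallback heuristic are assumptions for illustration:

```python
def count_tokens(text: str) -> int:
    """Count GPT-2 BPE tokens using tiktoken when available,
    otherwise fall back to a rough ~4-characters-per-token estimate."""
    try:
        import tiktoken  # small optional dependency
    except ImportError:
        # Crude heuristic, only used when tiktoken is missing
        return max(1, len(text) // 4)
    encoding = tiktoken.get_encoding("gpt2")
    return len(encoding.encode(text))
```

Because tiktoken ships as a single compiled package with no model-download step at import time, it avoids the large `transformers` dependency while using the same GPT-2 BPE vocabulary.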
Related:
I think this is because my current tokenization code is this (datasette-openai/datasette_openai/__init__.py, lines 10 to 21 at 289dad1):
But that's actually not exactly accurate. The https://beta.openai.com/tokenizer tool suggests using this instead: https://huggingface.co/docs/transformers/model_doc/gpt2#transformers.GPT2TokenizerFast