
If split_special_tokens==True, fast_tokenizer is slower than slow_tokenizer #1700

gongel opened this issue Dec 12, 2024 · 1 comment
gongel commented Dec 12, 2024

from transformers import LlamaTokenizer, LlamaTokenizerFast
import time

tokenizer1 = LlamaTokenizer.from_pretrained("./Llama-2-7b-chat-hf", split_special_tokens=True)      # slow (Python) tokenizer
tokenizer2 = LlamaTokenizerFast.from_pretrained("./Llama-2-7b-chat-hf", split_special_tokens=True)  # fast (Rust) tokenizer
print(tokenizer1, tokenizer2)

s_time = time.time()
for i in range(1000):
    tokenizer1.tokenize("你好,where are you?" * 100)
print(f"slow: {time.time() - s_time}")

s_time = time.time()
for i in range(1000):
    tokenizer2.tokenize("你好,where are you?" * 100)
print(f"fast: {time.time() - s_time}")

Output:
slow: 0.6021890640258789
fast: 0.7353882789611816

Narsil (Collaborator) commented Jan 10, 2025

If I use * 1000 instead of * 100 this is what I get on my small machine:

slow: 7.805477857589722
fast: 7.280818223953247

In general we don't look too closely at micro-benchmarks (unless the gap is 10x); they don't usually tell a compelling story.
For instance, you could use batch tokenization, which should be much faster with the fast tokenizer here.
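
For reference, a minimal sketch of that batched variant (illustrative, not from the thread; the model path and repetition counts mirror the benchmark above, and passing a list of strings to the tokenizer is the standard batched-encoding call in transformers):

from transformers import LlamaTokenizerFast
import time

# Hypothetical batched benchmark: one call over a list of texts lets the
# Rust backend process the whole batch, instead of 1000 Python-level calls.
tokenizer = LlamaTokenizerFast.from_pretrained("./Llama-2-7b-chat-hf", split_special_tokens=True)
texts = ["你好,where are you?" * 100] * 1000

s_time = time.time()
tokenizer(texts)
print(f"fast (batched): {time.time() - s_time}")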
