What is the difference between the BBPE vocab decode method in minbpe and the one in HuggingFace Transformers? #15

Open
lovekittynine opened this issue Feb 19, 2024 · 0 comments


Thanks for your nice work! I have a question after reading basic.py, and I want to figure out why...

  • In basic.py, the save function writes the BBPE vocab by calling decode() on each token. However, many tokens are not valid UTF-8 on their own and cannot be decoded into valid strings, so they are replaced with '�'.

  • In HuggingFace Transformers, by contrast, the vocab file of a BBPE tokenizer does not appear to be stored in decoded form. Take the BloomZ model, whose tokenizer uses BBPE. For the Chinese string "我爱中国!", tokenization returns ['æĪijçĪ±', 'ä¸Ńåįİ', 'ï¼ģ'], so the BloomZ vocab evidently contains tokens like 'æĪijçĪ±', 'ä¸Ńåįİ', and 'ï¼ģ'. The UTF-8 encoding of the string is b"\xe6\x88\x91\xe7\x88\xb1\xe4\xb8\xad\xe5\x9b\xbd\xef\xbc\x81". The first token 'æĪijçĪ±' corresponds to the byte sequence b'\xe6\x88\x91\xe7\x88\xb1', which decodes to "我爱". My guess was that Transformers maps each byte to a character with chr, so I tried it: list(b'\xe6\x88\x91\xe7\x88\xb1') = [230, 136, 145, 231, 136, 177], and applying chr to each value gives ['æ', '\x88', '\x91', 'ç', '\x88', '±']. Some of these do not match the token, though: 'Ī' is not the same as '\x88'. Perhaps an extra mapping rule is applied to chr results that are not printable characters, such as '\x88'.
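The two observations above can be reproduced in plain Python. This is only a sketch, not code from minbpe or Transformers. The truncated byte string `partial` is a made-up token chosen to end in the middle of a UTF-8 character; the full byte string is the one from the example above.

```python
# Hypothetical token whose bytes stop partway through a character:
# all of "我" (e6 88 91) plus the first two of the three bytes of "爱".
partial = b"\xe6\x88\x91\xe7\x88"

# Decoding with errors="replace" keeps the valid prefix and substitutes
# U+FFFD (the '�' character) for the part that is not valid UTF-8.
rendered = partial.decode("utf-8", errors="replace")
print(rendered)

# Per-byte mapping with chr(): bytes 0x80-0x9f become C1 control
# characters, which are not printable.
full = b"\xe6\x88\x91\xe7\x88\xb1"
per_byte = [chr(b) for b in full]
print(per_byte)  # ['æ', '\x88', '\x91', 'ç', '\x88', '±']
```

So a plain chr() mapping explains 'æ', 'ç' and '±', but not 'Ī'.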

Do you know why this is? Thanks again~
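One likely explanation, offered as an assumption rather than a confirmed reading of the BloomZ tokenizer: GPT-2-style byte-level BPE tokenizers use a fixed byte-to-unicode table. Printable bytes keep their own code point, and every other byte (control characters, space, 0x7f-0xa0, 0xad) is shifted to a code point from 256 upward, so that every byte has a visible, non-whitespace character. The sketch below re-implements that table from memory of the GPT-2 `bytes_to_unicode` helper; the names `byte_encoder` and `byte_decoder` are just illustrative.

```python
def bytes_to_unicode():
    """Build a reversible byte -> printable-character table (GPT-2 style)."""
    # Bytes that are already printable map to themselves.
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    # Every remaining byte gets the next unused code point starting at 256.
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))


byte_encoder = bytes_to_unicode()
byte_decoder = {ch: b for b, ch in byte_encoder.items()}

raw = b"\xe6\x88\x91\xe7\x88\xb1"
token = "".join(byte_encoder[b] for b in raw)
print(token)                  # æĪĳçĪ±
print(byte_encoder[0x88])     # Ī  (U+012A, not '\x88')

# The table is a bijection, so the vocab string maps back to the bytes.
recovered = bytes(byte_decoder[ch] for ch in token)
print(recovered.decode("utf-8"))  # 我爱
```

Under this table, byte 0x88 becomes 'Ī' and byte 0x91 becomes 'ĳ' (U+0133, a single character that some renderers show as "ij"), which would account for the mismatch with chr(). Because the table is reversible, a vocab stored this way loses no bytes, unlike a vocab written through decode() with replacement characters.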
