Special tokens in text aren't escaped properly to huggingface #33

Open
kwalcock opened this issue Jul 26, 2023 · 1 comment

Comments

@kwalcock
Member

Huggingface inserts special tokens, notably

<s>
</s>
[CLS]
[SEP]
[UNK]

and some special characters to indicate word boundaries like

▁
Ġ

If these special tokens and characters occur in the input text itself, they are often handled incorrectly. The bert-base-cased tokenizer, which uses [SEP] rather than </s> as its separator, handles a literal </s> correctly with

['[CLS]', '<', '/', 's', '>', '[SEP]']
[101, 133, 120, 188, 135, 102]

but roberta-base finds

['<s>', '</s>', '</s>']
[0, 2, 2]

in which it confuses the literal text with the markup tokens it inserts itself. However, bert-base-cased gets confused in the same way by a literal [SEP]

['[CLS]', '[SEP]', '[SEP]']
[101, 102, 102]

This is already problematic when tokenizing via Python, and the Rust results can differ from the Python ones. With roberta-base there is a Python result for a literal <s>

['<s>', '<s>', '</s>']
[0, 0, 2]

and a Rust result

['<s>', 'Ġ', '<s>', '</s>']
[0, 1437, 0, 2]

These differences probably won't have a measurable effect, but there would be reproducibility problems if Python and Rust results were compared byte by byte.
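To make the reproducibility concern concrete, here is a minimal sketch that compares the two roberta-base ID sequences quoted above and reports where they first diverge. The helper name `first_divergence` is hypothetical and not part of any library; only the ID lists come from this issue.

```python
def first_divergence(a, b):
    """Return the index where two ID sequences first differ, or None if equal."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    # One sequence is a strict prefix of the other.
    if len(a) != len(b):
        return min(len(a), len(b))
    return None

# ID sequences quoted in this issue for a literal "<s>" with roberta-base.
python_ids = [0, 0, 2]
rust_ids = [0, 1437, 0, 2]

print(first_divergence(python_ids, rust_ids))  # prints 1
```

The sequences disagree at index 1, where the Rust result has the extra 'Ġ' token (ID 1437).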

It looks like Hugging Face is using a single channel for both the text and its own control tokens, and is not escaping the text properly.
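Until escaping is handled, one possible workaround is to check the input for literal special tokens before it reaches the tokenizer. The following is only a sketch under assumptions: the function name `find_literal_special_tokens` is hypothetical, and the token list is copied from this issue rather than read from a tokenizer (in transformers, a tokenizer's own set should be available through its `all_special_tokens` property).

```python
import re

# Special tokens listed in this issue; assumed here for illustration.
# A real tokenizer defines its own set.
SPECIAL_TOKENS = ["<s>", "</s>", "[CLS]", "[SEP]", "[UNK]"]

def find_literal_special_tokens(text, special_tokens=SPECIAL_TOKENS):
    """Return (start, end, token) for each literal special token in text."""
    # Try longer tokens first so a longer token wins over a shorter one
    # that matches at the same position.
    alternatives = sorted(special_tokens, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(tok) for tok in alternatives))
    return [(m.start(), m.end(), m.group()) for m in pattern.finditer(text)]

hits = find_literal_special_tokens("The tag </s> closes <s> in some markup.")
print(hits)  # prints [(8, 12, '</s>'), (20, 23, '<s>')]
```

What to do with the reported spans (reject the input, split it, or rewrite it) is left open, since this issue does not settle on an escaping scheme.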

@MihaiSurdeanu
Contributor

I am not sure this is important in practice, as these tokens are unlikely to occur naturally.
