Description
Huggingface inserts special tokens, notably `<s>`, `</s>`, `[CLS]`, `[SEP]` and `[UNK]`, as well as special characters that mark word boundaries, such as `▁` and `Ġ`.
If these special tokens and characters occur literally in the input text, they are often not handled correctly. The `bert-base-cased` tokenizer, which uses `[SEP]`, handles a literal `</s>` correctly:
['[CLS]', '<', '/', 's', '>', '[SEP]']
[101, 133, 120, 188, 135, 102]
The `roberta-base` tokenizer, however, produces
['<s>', '</s>', '</s>']
[0, 2, 2]
i.e. it confuses the literal text with its own mark-up tokens. And `bert-base-cased`, which handles `</s>` correctly, gets confused by a literal `[SEP]`:
['[CLS]', '[SEP]', '[SEP]']
[101, 102, 102]
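For reference, a minimal reproduction sketch along these lines (assuming the Python `transformers` package; the exact ids may differ between releases):

```python
from transformers import AutoTokenizer

# Slow (pure-Python) tokenizers, loaded explicitly with use_fast=False.
bert = AutoTokenizer.from_pretrained("bert-base-cased", use_fast=False)
roberta = AutoTokenizer.from_pretrained("roberta-base", use_fast=False)

# bert-base-cased splits a literal "</s>" into ordinary characters.
ids = bert("</s>")["input_ids"]
print(bert.convert_ids_to_tokens(ids), ids)
# ['[CLS]', '<', '/', 's', '>', '[SEP]'] [101, 133, 120, 188, 135, 102]

# roberta-base collapses a literal "</s>" into its real EOS token (id 2).
ids = roberta("</s>")["input_ids"]
print(roberta.convert_ids_to_tokens(ids), ids)
# ['<s>', '</s>', '</s>'] [0, 2, 2]

# bert-base-cased collapses a literal "[SEP]" into its real separator (id 102).
ids = bert("[SEP]")["input_ids"]
print(bert.convert_ids_to_tokens(ids), ids)
# ['[CLS]', '[SEP]', '[SEP]'] [101, 102, 102]
```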
This is already problematic with the Python (slow) tokenizers, but the Rust (fast) tokenizers can give different answers again. With `roberta-base`, the Python result for a literal `<s>` is
['<s>', '<s>', '</s>']
[0, 0, 2]
and the Rust result is
['<s>', 'Ġ', '<s>', '</s>']
[0, 1437, 0, 2]
These differences probably won't have a measurable effect, but they would cause reproducibility problems if Python and Rust results were compared byte by byte.
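A byte-for-byte comparison of the two backends is straightforward to sketch (again assuming `transformers`, with `use_fast` selecting between the Python and the Rust implementation):

```python
from transformers import AutoTokenizer

slow = AutoTokenizer.from_pretrained("roberta-base", use_fast=False)  # Python implementation
fast = AutoTokenizer.from_pretrained("roberta-base", use_fast=True)   # Rust-backed implementation

text = "<s>"
slow_ids = slow(text)["input_ids"]
fast_ids = fast(text)["input_ids"]

# Reported above: slow yields [0, 0, 2], fast yields [0, 1437, 0, 2].
print(slow_ids)
print(fast_ids)
print("identical:", slow_ids == fast_ids)
```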
It looks like Huggingface is passing text and mark-up tokens through a single channel without proper escaping, so literal occurrences of the special tokens cannot be distinguished from the real ones.