Huggingface inserts special tokens, notably [SEP] and </s>, and some special characters to indicate word boundaries like ▁ and Ġ. If these special tokens and characters are encountered in the text, they are often not handled correctly. The bert-base-cased tokenizer, which uses [SEP], handles a literal </s> correctly, but roberta-base confuses such text with its imaginary mark-up tokens. However, bert-base-cased will get confused by a literal [SEP] in the input.
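As a quick illustration (a minimal sketch, not taken from the issue; the sample sentence is made up and the exact outputs will depend on the library version), tokenizing text that merely mentions a special-token string shows the confusion:

```python
# Sketch: a special-token string appearing in ordinary text is absorbed as
# the real special token instead of being treated as plain characters.
from transformers import AutoTokenizer

bert = AutoTokenizer.from_pretrained("bert-base-cased")
roberta = AutoTokenizer.from_pretrained("roberta-base")

text = "The markup token [SEP] and the tag </s> appear literally in this sentence."

# bert-base-cased: the literal "[SEP]" collapses into the separator token,
# so it is indistinguishable from the [SEP] the tokenizer inserts itself.
print(bert.tokenize(text))

# roberta-base: the literal "</s>" is likewise at risk of being read as the
# end-of-sequence mark-up token rather than as ordinary text.
print(roberta.tokenize(text))
```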
This is already problematic with tokenization via Python, but the Rust answers can differ as well: with roberta-base, the Python result for a literal <s> and the Rust result are not the same.
These probably won't have a measurable effect, but there would be reproducibility problems if Python and Rust results were compared byte by byte.
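A minimal sketch for reproducing that comparison, assuming the transformers AutoTokenizer API (the specific ids reported in the issue are not repeated here; the text is an assumed example):

```python
# Sketch: compare the pure-Python ("slow") and Rust-backed ("fast") roberta-base
# tokenizers on text that contains a literal "<s>".
from transformers import AutoTokenizer

text = "An explicit <s> marker inside otherwise ordinary text."

slow = AutoTokenizer.from_pretrained("roberta-base", use_fast=False)  # Python backend
fast = AutoTokenizer.from_pretrained("roberta-base", use_fast=True)   # Rust backend

slow_ids = slow.encode(text)
fast_ids = fast.encode(text)

print("python:", slow_ids)
print("rust:  ", fast_ids)
print("identical:", slow_ids == fast_ids)  # byte-for-byte reproducibility check
```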
It looks like huggingface is using a single channel for both data and mark-up and is not escaping properly.
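Purely as a hypothetical illustration of what escaping would mean here (escape_special_tokens is not a library function, and inserting a space is a lossy hack), one could neutralise literal special-token strings before they reach the tokenizer:

```python
# Sketch: break up any literal special-token string in user text so the
# tokenizer can no longer match it as mark-up.
from transformers import AutoTokenizer

def escape_special_tokens(text: str, tokenizer) -> str:
    # Crude and lossy: insert a space inside each special-token string.
    for tok in tokenizer.all_special_tokens:
        text = text.replace(tok, tok[:1] + " " + tok[1:])
    return text

tok = AutoTokenizer.from_pretrained("bert-base-cased")
raw = "User text that happens to contain [SEP] verbatim."
print(tok.tokenize(raw))                              # literal [SEP] becomes the special token
print(tok.tokenize(escape_special_tokens(raw, tok)))  # no longer collapses into the special token
```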