Description
Huggingface inserts special tokens, notably `<s>`, `</s>`, `[CLS]`, `[SEP]` and `[UNK]`, as well as special characters that mark word boundaries, such as `▁` and `Ġ`.
If these special tokens and characters occur literally in the input text, they are often not handled correctly. The `bert-base-cased` tokenizer, which uses `[SEP]`, handles a literal `</s>` correctly:
['[CLS]', '<', '/', 's', '>', '[SEP]']
[101, 133, 120, 188, 135, 102]
The `roberta-base` tokenizer, however, produces
['<s>', '</s>', '</s>']
[0, 2, 2]
i.e. it confuses the literal text with its own mark-up tokens. And `bert-base-cased`, which handles `</s>` correctly, gets confused by a literal `[SEP]`:
['[CLS]', '[SEP]', '[SEP]']
[101, 102, 102]
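For reference, a minimal reproduction sketch along these lines (assuming the Python `transformers` package; the exact ids may differ between releases):

```python
from transformers import AutoTokenizer

# Slow (pure-Python) tokenizers, loaded explicitly with use_fast=False.
bert = AutoTokenizer.from_pretrained("bert-base-cased", use_fast=False)
roberta = AutoTokenizer.from_pretrained("roberta-base", use_fast=False)

# bert-base-cased splits a literal "</s>" into ordinary characters.
ids = bert("</s>")["input_ids"]
print(bert.convert_ids_to_tokens(ids), ids)
# ['[CLS]', '<', '/', 's', '>', '[SEP]'] [101, 133, 120, 188, 135, 102]

# roberta-base collapses a literal "</s>" into its real EOS token (id 2).
ids = roberta("</s>")["input_ids"]
print(roberta.convert_ids_to_tokens(ids), ids)
# ['<s>', '</s>', '</s>'] [0, 2, 2]

# bert-base-cased collapses a literal "[SEP]" into its real separator (id 102).
ids = bert("[SEP]")["input_ids"]
print(bert.convert_ids_to_tokens(ids), ids)
# ['[CLS]', '[SEP]', '[SEP]'] [101, 102, 102]
```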
This is already problematic with the Python (slow) tokenizers, but the Rust (fast) tokenizers can give different answers again. With `roberta-base`, the Python result for a literal `<s>` is
['<s>', '<s>', '</s>']
[0, 0, 2]
and the Rust result is
['<s>', 'Ġ', '<s>', '</s>']
[0, 1437, 0, 2]
These differences probably won't have a measurable effect, but they would cause reproducibility problems if Python and Rust results were compared byte by byte.
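A byte-for-byte comparison of the two backends is straightforward to sketch (again assuming `transformers`, with `use_fast` selecting between the Python and the Rust implementation):

```python
from transformers import AutoTokenizer

slow = AutoTokenizer.from_pretrained("roberta-base", use_fast=False)  # Python implementation
fast = AutoTokenizer.from_pretrained("roberta-base", use_fast=True)   # Rust-backed implementation

text = "<s>"
slow_ids = slow(text)["input_ids"]
fast_ids = fast(text)["input_ids"]

# Reported above: slow yields [0, 0, 2], fast yields [0, 1437, 0, 2].
print(slow_ids)
print(fast_ids)
print("identical:", slow_ids == fast_ids)
```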
It looks like Huggingface is passing text and mark-up tokens through a single channel without proper escaping, so literal occurrences of the special tokens cannot be distinguished from the real ones.