Special tokens in text aren't escaped properly to huggingface #33

Open
@kwalcock


Huggingface inserts special tokens, notably

<s>
</s>
[CLS]
[SEP]
[UNK]

and some special characters to indicate word boundaries like

▁
Ġ

If these special tokens and characters appear literally in the input text, they are often not handled correctly. The bert-base-cased tokenizer, which uses [SEP] rather than </s> as its separator, handles a literal </s> correctly, splitting it into ordinary pieces:

['[CLS]', '<', '/', 's', '>', '[SEP]']
[101, 133, 120, 188, 135, 102]

but roberta-base finds

['<s>', '</s>', '</s>']
[0, 2, 2]

in which it mistakes the literal text for its own markup token. Conversely, bert-base-cased is confused by a literal [SEP]:

['[CLS]', '[SEP]', '[SEP]']
[101, 102, 102]

This is already a problem when tokenizing from Python, and the Rust results can differ on top of that. For the input <s>, roberta-base gives this result from Python:

['<s>', '<s>', '</s>']
[0, 0, 2]

and a Rust result

['<s>', 'Ġ', '<s>', '</s>']
[0, 1437, 0, 2]

These differences probably won't have a measurable effect, but they would cause reproducibility problems if Python and Rust results were compared byte by byte.
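To make a Python/Rust mismatch like the one above easier to spot, here is a minimal sketch that diffs two token ID sequences using only the standard library. The helper name `diff_token_ids` is hypothetical and not part of any Huggingface API; the IDs are copied from the roberta-base results quoted above.

```python
from difflib import SequenceMatcher


def diff_token_ids(python_ids, rust_ids):
    """Return the (op, python_slice, rust_slice) edits between two ID lists."""
    # autojunk only matters for long sequences; disable it to keep results predictable.
    matcher = SequenceMatcher(a=python_ids, b=rust_ids, autojunk=False)
    return [
        (op, python_ids[i1:i2], rust_ids[j1:j2])
        for op, i1, i2, j1, j2 in matcher.get_opcodes()
        if op != "equal"  # keep only the places where the sequences disagree
    ]


# IDs copied from the roberta-base results for the input <s>.
python_ids = [0, 0, 2]
rust_ids = [0, 1437, 0, 2]
print(diff_token_ids(python_ids, rust_ids))
```

For these two sequences the only reported edit is the insertion of ID 1437 (the extra 'Ġ' token) on the Rust side.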

It looks like Huggingface is using a single channel for both the text and its markup tokens, without escaping the text properly.
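As a possible caller-side check until this is addressed, the sketch below scans raw text for literal occurrences of special-token strings before the text is handed to a tokenizer. The function name `find_special_token_collisions` is hypothetical, and the token list is an assumption taken from the examples in this issue; a real tokenizer defines its own set of special tokens.

```python
import re

# Assumption: the special tokens listed in this issue, not a complete set.
SPECIAL_TOKENS = ["<s>", "</s>", "[CLS]", "[SEP]", "[UNK]"]


def find_special_token_collisions(text, special_tokens=SPECIAL_TOKENS):
    """Return (start, end, token) for each literal special token found in text."""
    if not special_tokens:
        return []
    # Escape each token so characters like '[' and ']' are matched literally.
    pattern = "|".join(re.escape(token) for token in special_tokens)
    return [(m.start(), m.end(), m.group()) for m in re.finditer(pattern, text)]


print(find_special_token_collisions("a </s> b [SEP]"))
```

This only detects collisions; what to do with them (reject the input, split the string, or substitute a placeholder) depends on the application.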
