
Consistency issue with max_length flag in tokenizer #274

Open
@le1nux

Description


System Info

all versions

🐛 Describe the bug

Currently, we set the max_length attribute on the Hugging Face tokenizer wrapper, which is documented as "Maximum length of the tokenization output. Defaults to None.":

```python
class PreTrainedHFTokenizer(TokenizerWrapper):
    """Wrapper for pretrained Hugging Face tokenizers."""

    def __init__(
        self,
        pretrained_model_name_or_path: str,
        truncation: Optional[bool] = False,
        padding: Optional[bool | str] = False,
        max_length: Optional[int] = None,
        special_tokens: Optional[dict[str, str]] = None,
    ) -> None:
        """Initializes the PreTrainedHFTokenizer."""
```

We pass this flag when calling the __call__ function on the tokenizer:

```python
tokens = self.tokenizer.__call__(
    text,
    max_length=self.max_length,
    padding=self.padding,
    truncation=self.truncation,
)["input_ids"]
return tokens
```

However, as per the Hugging Face documentation, if max_length is set to None, the tokenizer falls back to the maximum input length of the corresponding model.

Hugging Face documentation for the __call__ function:

> max_length (int, optional) — Controls the maximum length to use by one of the truncation/padding parameters.
> If left unset or set to None, this will use the predefined model maximum length if a maximum length is required by one of the truncation/padding parameters. If the model has no specific maximum input length (like XLNet) truncation/padding to a maximum length will be deactivated.
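
For illustration, this fallback can be reproduced with a short script; a minimal sketch, assuming the transformers library and using bert-base-uncased as an arbitrary example model (its model_max_length is 512):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tok.model_max_length)  # 512 for this model

long_text = "hello " * 1000  # far longer than 512 tokens

# max_length is left as None, exactly as in the wrapper above
ids = tok(long_text, truncation=True, max_length=None)["input_ids"]
print(len(ids))  # 512: silently truncated to model_max_length
```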

Proposal:
We allow the user to specify max_length in the config (just like before). If it is set to None, we set max_length to a very large integer, e.g. int(1e30), in the constructor of the tokenizer. This way, tokenization always behaves the same irrespective of the model_max_length attribute; see the sketch below.
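
A minimal sketch of the proposed constructor change, reusing the names from the snippet above (the TokenizerWrapper base class and the rest of the initialization are omitted):

```python
from typing import Optional


class PreTrainedHFTokenizer(TokenizerWrapper):
    """Wrapper for pretrained Hugging Face tokenizers."""

    def __init__(
        self,
        pretrained_model_name_or_path: str,
        truncation: Optional[bool] = False,
        padding: Optional[bool | str] = False,
        max_length: Optional[int] = None,
        special_tokens: Optional[dict[str, str]] = None,
    ) -> None:
        ...
        # Proposed: replace a None max_length with a very large integer so
        # that __call__ never falls back to the model's model_max_length.
        self.max_length = max_length if max_length is not None else int(1e30)
```

With this default, truncation only ever happens at a user-specified max_length, which makes the flag's behavior consistent across models.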
