Description
System Info
All versions
🐛 Describe the bug
Currently, we set the max_length attribute on the Hugging Face tokenizer wrapper, whose docstring describes it as "Maximum length of the tokenization output. Defaults to None.":
modalities/src/modalities/tokenization/tokenizer_wrapper.py
Lines 66 to 77 in 09c666b
We pass this attribute as max_length when calling the tokenizer's __call__ function:
modalities/src/modalities/tokenization/tokenizer_wrapper.py
Lines 133 to 139 in 09c666b
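To illustrate the pattern, a hedged sketch of the current behavior (the class name and signature are assumed to mirror the referenced wrapper; the body is paraphrased, not the actual code):

```python
from transformers import AutoTokenizer

class PreTrainedHFTokenizer:
    """Hypothetical, simplified version of the wrapper in tokenizer_wrapper.py."""

    def __init__(
        self,
        pretrained_model_name_or_path: str,
        max_length: int | None = None,  # stored as-is, possibly None
        truncation: bool = False,
        padding: bool | str = False,
    ):
        self.tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path)
        self.max_length = max_length
        self.truncation = truncation
        self.padding = padding

    def tokenize(self, text: str) -> list[int]:
        # A max_length of None is forwarded to __call__ here, which is
        # where the model_max_length fallback described below kicks in.
        return self.tokenizer(
            text,
            max_length=self.max_length,
            truncation=self.truncation,
            padding=self.padding,
        )["input_ids"]
```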
However, per the Hugging Face documentation, if max_length is set to None, the tokenizer falls back to the maximum input length of the corresponding model. From the Hugging Face documentation of the __call__ function:
max_length (int, optional) — Controls the maximum length to use by one of the truncation/padding parameters.
If left unset or set to None, this will use the predefined model maximum length if a maximum length is required by one of the truncation/padding parameters. If the model has no specific maximum input length (like XLNet) truncation/padding to a maximum length will be deactivated.
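A minimal reproduction sketch of this fallback, assuming the GPT-2 tokenizer (whose model_max_length is 1024):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # model_max_length == 1024

text = "hello " * 2000  # roughly 2000 tokens, well beyond 1024

# max_length is None, but truncation is requested, so __call__ silently
# falls back to the model's predefined maximum input length.
input_ids = tokenizer(text, truncation=True, max_length=None)["input_ids"]
print(len(input_ids))  # 1024, not ~2000: truncated to model_max_length
```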
Proposal:
We allow the user to specify max_length in the config (just like before). If it is set to None, we set max_length to a very large integer, e.g. int(1e30), in the constructor of the tokenizer. This way, tokenization is always performed irrespective of the model_max_length attribute.
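A compact sketch of the proposed constructor change, using the same hypothetical wrapper as above (names are illustrative, not the actual code):

```python
from transformers import AutoTokenizer

class PreTrainedHFTokenizer:
    """Sketch of the proposed fix; same hypothetical wrapper as above."""

    def __init__(self, pretrained_model_name_or_path: str, max_length: int | None = None):
        self.tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path)
        # Replace None with a very large sentinel so that __call__ never
        # falls back to model_max_length.
        self.max_length = max_length if max_length is not None else int(1e30)
```

One caveat worth checking: fast (Rust-backed) tokenizers convert max_length to a native integer when truncation is enabled, so a sentinel as large as int(1e30) may overflow there; a large value that fits in 64 bits would avoid that.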