
Consistency issue with max_length flag in tokenizer #274

Open
@le1nux

Description


System Info

all versions

🐛 Describe the bug

Currently, we set the max_length attribute on the Hugging Face tokenizer wrapper, which is documented as "Maximum length of the tokenization output. Defaults to None.":

```python
class PreTrainedHFTokenizer(TokenizerWrapper):
    """Wrapper for pretrained Hugging Face tokenizers."""

    def __init__(
        self,
        pretrained_model_name_or_path: str,
        truncation: Optional[bool] = False,
        padding: Optional[bool | str] = False,
        max_length: Optional[int] = None,
        special_tokens: Optional[dict[str, str]] = None,
    ) -> None:
        """Initializes the PreTrainedHFTokenizer."""
```

We pass this flag when calling the __call__ function on the tokenizer:

```python
tokens = self.tokenizer.__call__(
    text,
    max_length=self.max_length,
    padding=self.padding,
    truncation=self.truncation,
)["input_ids"]
return tokens
```

However, as per the Hugging Face documentation, if max_length is set to None, the tokenizer falls back to the maximum input length of the corresponding model.

Hugging Face documentation for the __call__ function:

> max_length (int, optional) — Controls the maximum length to use by one of the truncation/padding parameters.
> If left unset or set to None, this will use the predefined model maximum length if a maximum length is required by one of the truncation/padding parameters. If the model has no specific maximum input length (like XLNet) truncation/padding to a maximum length will be deactivated.
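
For illustration, this fallback can be reproduced with a short script; a minimal sketch, assuming the transformers library and using bert-base-uncased as an arbitrary example model (its model_max_length is 512):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tok.model_max_length)  # 512 for this model

long_text = "hello " * 1000  # far longer than 512 tokens

# max_length is left as None, exactly as in the wrapper above
ids = tok(long_text, truncation=True, max_length=None)["input_ids"]
print(len(ids))  # 512: silently truncated to model_max_length
```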

Proposal:
We allow the user to specify max_length in the config (just like before). If it is set to None, we set max_length to a very large integer, e.g. int(1e30), in the constructor of the tokenizer. This way, tokenization always behaves the same irrespective of the model_max_length attribute; see the sketch below.
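
A minimal sketch of the proposed constructor change, reusing the names from the snippet above (the TokenizerWrapper base class and the rest of the initialization are omitted):

```python
from typing import Optional


class PreTrainedHFTokenizer(TokenizerWrapper):
    """Wrapper for pretrained Hugging Face tokenizers."""

    def __init__(
        self,
        pretrained_model_name_or_path: str,
        truncation: Optional[bool] = False,
        padding: Optional[bool | str] = False,
        max_length: Optional[int] = None,
        special_tokens: Optional[dict[str, str]] = None,
    ) -> None:
        ...
        # Proposed: replace a None max_length with a very large integer so
        # that __call__ never falls back to the model's model_max_length.
        self.max_length = max_length if max_length is not None else int(1e30)
```

With this default, truncation only ever happens at a user-specified max_length, which makes the flag's behavior consistent across models.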
