Replies: 1 comment
-
Yeah, I've been considering what to do about those models. It's very annoying that they just won't publish the tokenizer model. Probably the simplest answer is to reverse-engineer the SP file format and figure out a way to reconstruct tokenizer.model from the JSON vocabulary. If I get stuck on some of the many other things I'd rather be working on and need a change of scenery, I'll give it a go? 🤷
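For what it's worth, the SP file in question is a serialized protobuf (sentencepiece_model.proto), so a first pass at the reconstruction can be sketched against the bare wire format. This is an assumption-laden illustration, not a drop-in tool: it assumes the tokenizer.json carries a Unigram-style `model.vocab` of `[piece, score]` pairs, and it emits only the repeated `pieces` field (field 1 of `ModelProto`). A real tokenizer.model also carries `trainer_spec`, `normalizer_spec`, and special-token typing that a loader may require.

```python
import struct

def _varint(n: int) -> bytes:
    # Protobuf base-128 varint encoding.
    out = bytearray()
    while True:
        b = n & 0x7F
        n >>= 7
        if n:
            out.append(b | 0x80)
        else:
            out.append(b)
            return bytes(out)

def _tag(field: int, wire_type: int) -> bytes:
    # Field key: (field_number << 3) | wire_type.
    return _varint((field << 3) | wire_type)

def _string(field: int, s: str) -> bytes:
    data = s.encode("utf-8")
    return _tag(field, 2) + _varint(len(data)) + data

def _float(field: int, f: float) -> bytes:
    # Wire type 5 = fixed 32-bit, little-endian IEEE 754.
    return _tag(field, 5) + struct.pack("<f", f)

def _message(field: int, body: bytes) -> bytes:
    return _tag(field, 2) + _varint(len(body)) + body

def sentencepiece_from_json(tokenizer_json: dict) -> bytes:
    """Serialize the Unigram vocab from a parsed HF tokenizer.json into
    the protobuf wire format SentencePiece uses for tokenizer.model.
    Only ModelProto.pieces is emitted: each SentencePiece submessage has
    piece (string, field 1) and score (float, field 2)."""
    out = bytearray()
    for piece, score in tokenizer_json["model"]["vocab"]:
        body = _string(1, piece) + _float(2, float(score))
        out += _message(1, body)
    return bytes(out)
```

A sanity check on the result would be to feed the bytes back through the official `sentencepiece` package and confirm the piece count and IDs match the JSON vocabulary before trusting it for a conversion.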
-
Hello, I've recently checked a few Llama2-compatible models that don't provide a tokenizer.model file. From my limited understanding, Llama2 ships with a SentencePiece tokenizer, and Exllama relies on it. These models instead rely on an HF tokenizer.json, which is compatible with other loaders but not with Exllama. Oobabooga's WebUI can handle it with its _HF wrappers, and that works okay.
The exl2 convert scripts don't have such a workaround, though, and still require a tokenizer.model file to work.
Is it possible to make the convert script work with such models? I've tried converting a few of them with a generic Llama2 tokenizer.model file and it usually worked. But I suspect there's no guarantee that the tokenizer.model and tokenizer.json describe the same vocabulary, and maybe there could be license collisions if the Llama2 file is reused for models under a different license.
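The "no guarantee they'd be equal" worry can be checked concretely by decoding the pieces out of a tokenizer.model and diffing them against the tokenizer.json vocabulary. Below is a stdlib-only sketch, under the assumption that tokenizer.model is a serialized SentencePiece `ModelProto` whose repeated field 1 holds `SentencePiece` submessages (`piece` = string field 1, `score` = float field 2); it skips every other field rather than interpreting it.

```python
import struct

def _read_varint(buf: bytes, i: int):
    # Decode a protobuf base-128 varint starting at offset i.
    shift = n = 0
    while True:
        b = buf[i]
        i += 1
        n |= (b & 0x7F) << shift
        if not b & 0x80:
            return n, i
        shift += 7

def read_pieces(model_bytes: bytes):
    """Walk the protobuf wire format of a tokenizer.model blob and
    collect (piece, score) pairs from ModelProto's repeated field 1,
    skipping unrelated fields like trainer_spec/normalizer_spec."""
    pieces, i = [], 0
    while i < len(model_bytes):
        tag, i = _read_varint(model_bytes, i)
        field, wire = tag >> 3, tag & 7
        if wire == 2:  # length-delimited
            ln, i = _read_varint(model_bytes, i)
            chunk, i = model_bytes[i:i + ln], i + ln
            if field == 1:  # a SentencePiece submessage
                piece, score, j = None, 0.0, 0
                while j < len(chunk):
                    t, j = _read_varint(chunk, j)
                    f, w = t >> 3, t & 7
                    if w == 2:  # string/bytes
                        l2, j = _read_varint(chunk, j)
                        if f == 1:
                            piece = chunk[j:j + l2].decode("utf-8")
                        j += l2
                    elif w == 5:  # fixed32 (float)
                        if f == 2:
                            score = struct.unpack("<f", chunk[j:j + 4])[0]
                        j += 4
                    elif w == 0:  # varint (e.g. the type enum)
                        _, j = _read_varint(chunk, j)
                    else:
                        break
                if piece is not None:
                    pieces.append((piece, score))
        elif wire == 0:
            _, i = _read_varint(model_bytes, i)
        elif wire == 5:
            i += 4
        elif wire == 1:
            i += 8
        else:
            break
    return pieces
```

With this, comparing `set(p for p, _ in read_pieces(open("tokenizer.model", "rb").read()))` against the pieces listed in tokenizer.json's `model.vocab` would show whether a borrowed Llama2 tokenizer.model actually matches a given model's vocabulary.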