
tokenizer.json modified after tokenizer.save_pretrained of OLMO models #34744

Open

zzf1130 opened this issue Nov 15, 2024 · 1 comment
zzf1130 commented Nov 15, 2024

System Info

  • transformers version: 4.45.0
  • Platform: Linux-6.8.0-48-generic-x86_64-with-glibc2.39
  • Python version: 3.10.15
  • Huggingface_hub version: 0.26.2
  • Safetensors version: 0.4.5
  • Accelerate version: 1.0.1
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.4.0+rocm6.1 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?:
  • Using GPU in script?:
  • GPU type: AMD Instinct MI250X/MI250

Who can help?

@ArthurZucker and @itazap

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

When I load and then save the tokenizer of an OLMO model, the saved tokenizer.json differs from the original, particularly in the merges key.

[Screenshot: diff of the original and saved tokenizer.json, showing changes around the merges key]

The code to reproduce this is:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("allenai/OLMo-1B-0724-hf")
tokenizer.save_pretrained("saved_tokenizer")
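
To inspect the difference concretely, one can diff the merges entries of the two files. A minimal sketch, assuming the snippet above has already been run so that saved_tokenizer/ exists (hf_hub_download fetches the original file from the Hub):

import json
from huggingface_hub import hf_hub_download

# Fetch the original tokenizer.json from the Hub
original_path = hf_hub_download("allenai/OLMo-1B-0724-hf", "tokenizer.json")

with open(original_path) as f:
    original = json.load(f)
with open("saved_tokenizer/tokenizer.json") as f:  # written by save_pretrained above
    saved = json.load(f)

# Compare the first few entries of the merges key in each file
print(original["model"]["merges"][:3])
print(saved["model"]["merges"][:3])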

Expected behavior

The original tokenizer.json and the saved tokenizer.json should be the same.

zzf1130 added the bug label Nov 15, 2024
LysandreJik (Member) commented

Hey @zzf1130, I believe this change is meant to make the tokenizer.json more flexible/future-proof, and is therefore deliberate.

I will let @ArthurZucker comment on it, thank you!
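
A likely explanation of the diff, offered as an assumption rather than something confirmed in this thread: recent versions of the tokenizers library serialize BPE merges as two-element lists instead of space-joined strings, which changes the file on disk without changing tokenizer behavior. Illustrated below (the merge values are placeholders, not the actual OLMo merges):

# Assumed illustration of the serialization change
old_style_merges = ["h e", "l l"]            # each merge as one "a b" string
new_style_merges = [["h", "e"], ["l", "l"]]  # each merge as an ["a", "b"] pair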
