
tokenizer.json modified after tokenizer.save_pretrained of OLMO models #34744

Open

zzf1130 opened this issue Nov 15, 2024 · 1 comment
zzf1130 commented Nov 15, 2024

System Info

  • transformers version: 4.45.0
  • Platform: Linux-6.8.0-48-generic-x86_64-with-glibc2.39
  • Python version: 3.10.15
  • Huggingface_hub version: 0.26.2
  • Safetensors version: 0.4.5
  • Accelerate version: 1.0.1
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.4.0+rocm6.1 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?:
  • Using GPU in script?:
  • GPU type: AMD Instinct MI250X/MI250

Who can help?

@ArthurZucker and @itazap

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

When I load and then save the tokenizer of an OLMO model, the saved tokenizer.json differs from the original, particularly in the merges key.

[Screenshot: diff of the original and saved tokenizer.json, showing changes around the merges key]

The code to reproduce this is:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("allenai/OLMo-1B-0724-hf")
tokenizer.save_pretrained("saved_tokenizer")
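
To inspect the difference concretely, one can diff the merges entries of the two files. A minimal sketch, assuming the snippet above has already been run so that saved_tokenizer/ exists (hf_hub_download fetches the original file from the Hub):

import json
from huggingface_hub import hf_hub_download

# Fetch the original tokenizer.json from the Hub
original_path = hf_hub_download("allenai/OLMo-1B-0724-hf", "tokenizer.json")

with open(original_path) as f:
    original = json.load(f)
with open("saved_tokenizer/tokenizer.json") as f:  # written by save_pretrained above
    saved = json.load(f)

# Compare the first few entries of the merges key in each file
print(original["model"]["merges"][:3])
print(saved["model"]["merges"][:3])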

Expected behavior

The original tokenizer.json and the saved tokenizer.json should be the same.

zzf1130 added the bug label Nov 15, 2024
LysandreJik (Member) commented

Hey @zzf1130, I believe this change is meant to make the tokenizer.json more flexible/future-proof, and is therefore deliberate.

I will let @ArthurZucker comment on it, thank you!
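
A likely explanation of the diff, offered as an assumption rather than something confirmed in this thread: recent versions of the tokenizers library serialize BPE merges as two-element lists instead of space-joined strings, which changes the file on disk without changing tokenizer behavior. Illustrated below (the merge values are placeholders, not the actual OLMo merges):

# Assumed illustration of the serialization change
old_style_merges = ["h e", "l l"]            # each merge as one "a b" string
new_style_merges = [["h", "e"], ["l", "l"]]  # each merge as an ["a", "b"] pair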
