This repository has been archived by the owner on Nov 1, 2024. It is now read-only.

Converting OPT-175B tokenizer to HF format? #704

Open
mawilson1234 opened this issue Apr 7, 2023 · 2 comments
Labels
question Further information is requested

Comments

@mawilson1234

❓ Questions and Help

What is your question?

I've downloaded the weights for OPT-175B using the URL I got after filling out the Google form. I've also got dict.txt, gpt2-merges.txt, and gpt2-vocab.json. My existing workflow uses the Hugging Face API, so I've converted the weights to HF format using the script here.

However, I'm not sure how to convert the tokenizer to HF format from these files. I see there's a way to build a tokenizer from gpt2-merges.txt and gpt2-vocab.json, but that leaves dict.txt unused, which strikes me as likely to cause issues (I can't imagine it would be distributed if it weren't needed). Is there a recommended way to convert the tokenizer that accounts for dict.txt?

As an alternative, the smaller OPT models and their tokenizers are available on the HF Hub, so I can just get them from there. Do all the OPT models, including 175B, use the same tokenizer?

If it doesn't make a difference, I could just use the HF tokenizer for one of the smaller models instead. I can easily verify whether the smaller models have identical tokenizers by comparing their HF tokenizers across sizes, but that won't necessarily tell me whether 175B uses the same one, since its tokenizer isn't on the Hub as such.
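
For reference, a minimal sketch of that comparison (the model names are illustrative; use_fast=False follows the recommendation in the OPT docs):

    from transformers import AutoTokenizer

    # Load the slow tokenizers shipped with two of the smaller OPT checkpoints.
    tok_small = AutoTokenizer.from_pretrained("facebook/opt-125m", use_fast=False)
    tok_large = AutoTokenizer.from_pretrained("facebook/opt-1.3b", use_fast=False)

    # Identical vocabularies and identical encodings of a sample string suggest
    # (but do not prove) that the different sizes share one tokenizer.
    print(tok_small.get_vocab() == tok_large.get_vocab())
    sample = "Converting OPT-175B tokenizer to HF format?"
    print(tok_small(sample).input_ids == tok_large(sample).input_ids)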

@mawilson1234 mawilson1234 added the question Further information is requested label Apr 7, 2023
@mawilson1234 mawilson1234 changed the title Converting tokenizer to HF format? Converting OPT-175B tokenizer to HF format? Apr 7, 2023

mawilson1234 commented Apr 7, 2023

After some testing, it appears that the tokenizers on HF are probably the same as the one for OPT-175B (at the very least, my output for a short test made sense when decoded with the tokenizer available on HF for facebook/opt-125m). But it'd still be nice to be sure, just in case.
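
Roughly, the decode check looked like this (a sketch; the prompt is illustrative):

    from transformers import AutoTokenizer

    # Tokenizer published alongside the smaller OPT checkpoints on the Hub.
    tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m", use_fast=False)

    ids = tokenizer("The quick brown fox jumps over the lazy dog").input_ids
    # Decoding the IDs should reproduce the prompt, preceded by the </s> token
    # that OPT's tokenizer prepends.
    print(tokenizer.decode(ids))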


ayeeyecorp commented Apr 13, 2023

@mawilson1234 I believe you are correct. I used the tokenizer_config.json and special_tokens_map.json from the HF OPT model repo.

Tips (from the HF OPT documentation):

- OPT has the same architecture as BartDecoder.
- Contrary to GPT2, OPT adds the EOS token </s> to the beginning of every prompt. Note: Make sure to pass use_fast=False when loading OPT’s tokenizer with [AutoTokenizer](https://huggingface.co/docs/transformers/v4.19.2/en/model_doc/auto#transformers.AutoTokenizer) to get the correct tokenizer.

You can try generating the tokenizer with:

    import os
    from transformers import GPT2Tokenizer

    # model_path is the directory holding the downloaded gpt2-vocab.json and gpt2-merges.txt
    vocab_file = os.path.join(model_path, "gpt2-vocab.json")
    merges_file = os.path.join(model_path, "gpt2-merges.txt")
    tokenizer = GPT2Tokenizer(vocab_file, merges_file)
    tokenizer.save_pretrained(model_path)
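
Then, assuming the tokenizer_config.json and special_tokens_map.json from the HF OPT repo are also copied into model_path, a quick sanity check that the saved tokenizer reloads correctly and prepends </s> might look like:

    from transformers import AutoTokenizer

    # Reload the tokenizer saved above; use_fast=False per the OPT docs.
    tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)

    ids = tokenizer("Hello world").input_ids
    # OPT's tokenizer should prepend </s> (its BOS/EOS token) to the prompt.
    print(ids[0] == tokenizer.bos_token_id)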
