❓ Questions and Help

What is your question?

I've downloaded the weights for OPT-175B using the URL I got after filling out the Google form. I've also got `dict.txt`, `gpt2-merges.txt`, and `gpt2-vocab.json`. My existing workflow uses the Hugging Face API, so I've converted the weights to HF format using the script here.
However, I'm not sure how to convert the tokenizer to HF format from the files. I see there is a way to make a tokenizer using the `gpt2-merges.txt` and `gpt2-vocab.json` files, but that means `dict.txt` is unused, which strikes me as likely to cause issues (I can't imagine it would exist if it were not needed). Is there a way to do this?
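To make the question concrete, this is the kind of construction I mean. It's a minimal sketch assuming the released files follow the standard GPT-2 BPE format; note that `dict.txt` is never consumed anywhere:

```python
from transformers import GPT2Tokenizer

# Build a (slow) tokenizer directly from the released BPE files.
# dict.txt is not used here, which is exactly the concern above.
tokenizer = GPT2Tokenizer(
    vocab_file="gpt2-vocab.json",
    merges_file="gpt2-merges.txt",
)

print(tokenizer.tokenize("Hello world"))
```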
As an alternative, the smaller OPT models and their tokenizers are available on the HF Hub, so I can just get them from there. Do all the OPT models, including 175B, use the same tokenizer?
If it doesn't make a difference, I could just use the HF tokenizer from one of the smaller models instead. I can easily verify whether the smaller models share an identical tokenizer by comparing their HF tokenizers across sizes, but that won't necessarily tell me whether 175B uses the same one, since its tokenizer isn't on the Hub.
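Something like the following is what I have in mind for the comparison. It's only a sketch, and it can only check the smaller sizes against each other, not against 175B:

```python
from transformers import AutoTokenizer

# Compare two of the smaller OPT tokenizers available on the Hub.
tok_a = AutoTokenizer.from_pretrained("facebook/opt-125m", use_fast=False)
tok_b = AutoTokenizer.from_pretrained("facebook/opt-350m", use_fast=False)

# Identical vocabularies plus identical encodings of a sample string
# are strong evidence the tokenizers are the same.
sample = "The quick brown fox jumps over the lazy dog."
print("vocabs match:", tok_a.get_vocab() == tok_b.get_vocab())
print("encodings match:", tok_a.encode(sample) == tok_b.encode(sample))
```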
After some testing, it appears the tokenizers on HF are the same as the one for OPT-175B (at the very least, the output of a short test made sense when decoded with the `facebook/opt-125m` tokenizer from HF). But it'd still be nice to be sure, just in case.
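For anyone repeating the check, this is roughly what the test looked like. It's a sketch: the checkpoint path is hypothetical, and in practice 175B needs multi-GPU sharding or offloading rather than a plain `from_pretrained` call:

```python
from transformers import AutoTokenizer, OPTForCausalLM

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m", use_fast=False)
model = OPTForCausalLM.from_pretrained("/path/to/opt-175b-hf")  # hypothetical path

inputs = tokenizer("Hello, my name is", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=20)
# If the tokenizer matches, the decoded continuation should read as
# coherent text rather than garbage.
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```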
- OPT has the same architecture as BartDecoder.
- Contrary to GPT-2, OPT adds the EOS token `</s>` to the beginning of every prompt. Note: make sure to pass `use_fast=False` when loading OPT's tokenizer with [AutoTokenizer](https://huggingface.co/docs/transformers/v4.19.2/en/model_doc/auto#transformers.AutoTokenizer) to get the correct tokenizer (see the sketch after this list).
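A minimal sketch of that loading pattern, using one of the smaller checkpoints that is on the Hub:

```python
from transformers import AutoTokenizer

# use_fast=False returns the slow tokenizer, which per the note above
# is the correct one for OPT.
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m", use_fast=False)

ids = tokenizer("Hello world").input_ids
# OPT prepends </s> to every prompt, so the first token should be </s>.
print(tokenizer.convert_ids_to_tokens(ids))
```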