Minor question about PAD token and EOS token. #127
Comments
Setting the pad token to EOS is an issue in our training as well. What I don't get is how Zephyr was trained with such a recipe: since Mistral does not have a pad token, the same problem arises, and its chat template includes an EOS at the end of each conversation turn. So while the same thing should happen when training on top of Mistral, HuggingFaceH4/mistral-7b-sft-beta seems able to generate EOS tokens just fine. Was this addressed in any way during the training of Zephyr?
This is true; I have tried SFT using the script above, and the model does not learn how to stop generating. It basically sets all pad_token_id positions to be ignored, regardless of whether packing is used. I think there are only two ways around this:
or 2) Use
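To make the failure mode concrete, here is a minimal pure-Python illustration (no transformers needed) of the masking rule described above: `DataCollatorForLanguageModeling` sets labels to -100 wherever `input_ids` equals `pad_token_id`, so if pad is aliased to EOS, every real EOS is masked out of the loss and the model never gets a gradient signal to stop. Token ids here are made up for the example.

```python
# Illustrative sketch: why pad_token = eos_token hides the stop signal.
EOS_ID = 2
PAD_ID = EOS_ID  # tokenizer.pad_token = tokenizer.eos_token
IGNORE = -100    # label value ignored by the cross-entropy loss

input_ids = [5, 8, 13, EOS_ID, PAD_ID, PAD_ID]  # one turn + padding

# Replicates the collator's masking rule: mask every pad position.
labels = [IGNORE if tok == PAD_ID else tok for tok in input_ids]

# The real EOS at position 3 is masked along with the padding:
# labels == [5, 8, 13, -100, -100, -100]
```

The model is thus never trained on the EOS position at all, which matches the "does not learn how to stop generating" behaviour.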
Hello,
Thank you for sharing this awesome resource!
I have a question regarding models that already have a chat template, like "mistralai/Mistral-7B-Instruct-v0.1". I'm planning to use the non-packed dataset. As suggested, I applied the chat template that comes with the tokenizer as a preprocessing step. If I decode the samples inside the SFTTrainer after tokenization, they start with two BOS tokens: the tokenizer adds a special token (the BOS token here, since it is enabled in the tokenizer config) on top of the one already in the chat template. To fix this, I need to pass
dataset_kwargs={"add_special_tokens": False}
to the SFTTrainer. Another issue I'm having is that when the pad token is the same as the EOS token, the EOS token's label is -100. This might cause the model to continue generating and never stop, right? I'm seeing this phenomenon with my models fine-tuned on my own dataset using the SFT code provided. One workaround would be to write my own data collator that takes this into account instead of using
DataCollatorForLanguageModeling
. I also found a related issue on the matter here. Any comments and guidance are very much appreciated!
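For what it's worth, here is a rough sketch of what such a custom collator could look like (this is a hypothetical illustration, not code from the repo): it pads with the EOS id as before, but builds labels from the unpadded sequence, so only the appended padding positions get -100 and the genuine final EOS stays in the loss.

```python
# Hypothetical custom collator: mask only the appended padding,
# not the real EOS that terminates each example.
import torch

PAD_ID = 2      # assumes pad_token_id == eos_token_id, as with Mistral
IGNORE = -100   # label value ignored by the cross-entropy loss

def collate(batch):
    """batch: list of token-id lists, each already ending in EOS."""
    max_len = max(len(ids) for ids in batch)
    input_ids, labels = [], []
    for ids in batch:
        pad_len = max_len - len(ids)
        input_ids.append(ids + [PAD_ID] * pad_len)
        # Labels keep every real token (including the final EOS);
        # only the padding we appended here is masked out.
        labels.append(ids + [IGNORE] * pad_len)
    return {
        "input_ids": torch.tensor(input_ids),
        "labels": torch.tensor(labels),
    }
```

Because the labels are derived from the original sequence length rather than from matching on the pad id, the EOS at the end of each example keeps its label even though it shares an id with the padding.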