Bugfix train tokenization qwen #33

maxime-louis · 2024-11-18T10:06:01Z

Fix for the label delimitation problem with Qwen:
For the Mistral/Llama/Solar/Gemma tokenizers, we always have:
user_message + generation_prompt + answer != chat_template(user_message, answer)
because they always add an extra newline in the second case. Qwen does not, so our approach of using the length of "user_message + generation_prompt" (:= position) to delimit the label for training does not work for it.

With this fix:

We try to recover the label at (position) and at (position - 1).

=> Behaviour is unchanged for all models but Qwen
=> Alexandre ran multiple experiments with Qwen using this code: everything works, with no warnings.
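
To illustrate the idea, here is a minimal sketch of the lookup at (position) and (position - 1). It is not the code from this PR: the names find_label_start, build_labels and IGNORE_INDEX are hypothetical, the token ids are toy values rather than real tokenizer output, and the fallback to (position) with a warning is an assumption about how a miss could be handled.

```python
import warnings
from typing import List, Optional, Sequence

# -100 is the default ignore_index of PyTorch's cross-entropy loss.
IGNORE_INDEX = -100


def find_label_start(
    full_ids: Sequence[int], answer_ids: Sequence[int], position: int
) -> Optional[int]:
    """Return the index where answer_ids starts in full_ids.

    Only two candidates are checked: position and position - 1.
    Returns None if the answer tokens are found at neither.
    """
    n = len(answer_ids)
    for start in (position, position - 1):
        if start >= 0 and list(full_ids[start:start + n]) == list(answer_ids):
            return start
    return None


def build_labels(
    full_ids: Sequence[int], answer_ids: Sequence[int], position: int
) -> List[int]:
    """Mask everything before the answer so only answer tokens are trained on."""
    start = find_label_start(full_ids, answer_ids, position)
    if start is None:
        # Assumed fallback: warn and keep the original position-based split.
        warnings.warn("answer not found at position or position - 1")
        start = min(max(position, 0), len(full_ids))
    return [IGNORE_INDEX] * start + list(full_ids[start:])


# Toy ids: the prompt is [1, 2, 3], the answer is [10, 11, 12], 99 closes the turn.
full = [1, 2, 3, 10, 11, 12, 99]
answer = [10, 11, 12]
labels_exact = build_labels(full, answer, position=3)    # answer found at position
labels_shifted = build_labels(full, answer, position=4)  # answer found at position - 1
```

Both calls produce the same labels, which matches the claim that models whose label already sits at (position) are unaffected.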

Qwen pad_token handling
Qwen has no bos_token, which is what we used for padding in bergen. We now use pad_token if it exists (it does for Qwen), and eos_token as a last resort.
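
A minimal sketch of such a fallback chain follows. The function name choose_padding_token is hypothetical, and the order bos_token, then pad_token, then eos_token is an assumption: the description does not say whether bos_token is still tried first for models that have one. SimpleNamespace objects stand in for real tokenizers, which expose the same attribute names.

```python
from types import SimpleNamespace


def choose_padding_token(tokenizer) -> str:
    """Pick a padding token, trying bos_token, pad_token, then eos_token.

    The order is an assumption for illustration, not taken from the PR.
    """
    for name in ("bos_token", "pad_token", "eos_token"):
        token = getattr(tokenizer, name, None)
        if token is not None:
            return token
    raise ValueError("tokenizer has no bos_token, pad_token or eos_token")


# Stand-ins with toy token strings (not the real special tokens of any model).
qwen_like = SimpleNamespace(bos_token=None, pad_token="<pad>", eos_token="<eos>")
bos_only = SimpleNamespace(bos_token="<s>", pad_token=None, eos_token="</s>")
eos_only = SimpleNamespace(bos_token=None, pad_token=None, eos_token="<eos>")

qwen_choice = choose_padding_token(qwen_like)  # falls through to pad_token
bos_choice = choose_padding_token(bos_only)    # keeps bos_token
eos_choice = choose_padding_token(eos_only)    # last resort
```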


Maxime LOUIS and others added 3 commits October 28, 2024 12:24

bugfix

dcadf11

Merge remote-tracking branch 'origin/main' into bugfix-train-tokeniza…

8d20282

…tion-qwen

nicer code

afba559
