Sample packing masks the <end_of_turn> token #2259
-
Answered by NanoCode012 · Jan 16, 2025
-
Hello! Yes, that could be one possible reason behind your issue. Could I see what your dataset config YAML looks like, and perhaps a sample demo row (fake data is fine) from your dataset?
-
# model specific
base_model: mistralai/Mistral-7B-v0.3 # originally an internal model, but Mistral should also repro the bug
chat_template: gemma
# key hyperparameters
output_dir: ./outputs/mistral-gemma-it
sequence_len: 4096
sample_packing: true
gradient_accumulation_steps: 1
micro_batch_size: 2
learning_rate: 1e-5
# dataset -- can be anything that uses OpenAI chat format / role: ..., content: ...
datasets:
- path: [PLACE HOLDER]
type: chat_template
field_messages: messages
trust_remote_code: true
# utility
resume_from_checkpoint:
logging_steps: 10
warmup_steps: 100
max_grad_norm: 1.0
save_strategy: "no" # only save the final checkpoint
# trivial
flash_attention: true
model_type: AutoModelForCausalLM
tokenizer_type: AutoTokenizer
bf16: true
tf32: false
pad_to_sequence_len: true # mem stability
train_on_inputs: false
seed: 666
num_epochs: 1
optimizer: adamw_torch_fused
lr_scheduler: cosine
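As an aside (not from the original thread), one way to confirm whether `<end_of_turn>` is being masked out of the labels is to run Axolotl's preprocess step with its debug flag, which prints tokenized rows along with their label masks. The config filename below is a placeholder for the config above:

```bash
# Hypothetical check, assuming a recent Axolotl install.
# --debug prints tokenized samples so you can see which tokens
# are excluded from the labels.
python -m axolotl.cli.preprocess config.yml --debug
```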
-
I don't have a good public dataset in mind, but anything that uses the OpenAI messages format can repro this issue.
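For illustration only (this fake row is not from the original reply), a dataset row matching the `field_messages: messages` setting above might look like:

```json
{
  "messages": [
    {"role": "user", "content": "What is sample packing?"},
    {"role": "assistant", "content": "It concatenates several short examples into one training sequence."}
  ]
}
```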
-
Yes, my bad. I just re-checked the code on this.
Could you change the EOS token to:
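The token value itself is not shown in the reply above. As a sketch only, assuming the intent is to make the tokenizer's EOS token match the Gemma chat template's turn terminator from the thread title, the override in an Axolotl config might look like:

```yaml
# Sketch only -- the exact token is an assumption based on the thread title,
# not the value from the original reply.
special_tokens:
  eos_token: "<end_of_turn>"
```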