fix edge case for qwen3 data processing #626
Open
+630
−8
With Qwen3, there's an edge case which can result in the unmask/mask logic breaking during data processing.
Root Cause: The error occurs specifically when using the Qwen/Qwen3-32B tokenizer, not with Qwen/Qwen2.5-32B-Instruct. The problematic sample contains multiple tags in the assistant's response.
Issue Location: The error occurs in data_process.py:555 in the unmask_messages function, where it encounters an <|UNMASK_END|> token while not in an unmasking state.
Key Findings:
The Problem:
The Qwen/Qwen3-32B model's chat template processes the <|UNMASK_BEGIN|> and <|UNMASK_END|> tokens in a way that causes them to appear out of order or in an unexpected state, leading to the algorithm encountering an <|UNMASK_END|> token when it's not actively unmasking.
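A minimal sketch of the unmasking state machine described above (simplified, not the actual `unmask_messages` implementation in data_process.py; only the marker names come from this PR) shows how an orphaned end marker triggers the error:

```python
# Simplified sketch of the unmask/mask state machine; an <|UNMASK_END|>
# arriving while not in an unmasking state raises, which is the failure
# mode reported with the Qwen/Qwen3-32B tokenizer.
UNMASK_BEGIN = "<|UNMASK_BEGIN|>"
UNMASK_END = "<|UNMASK_END|>"

def unmask_tokens(tokens):
    """Return (token, keep_label) pairs; labels are kept only inside
    an UNMASK_BEGIN .. UNMASK_END span."""
    unmasking = False
    out = []
    for tok in tokens:
        if tok == UNMASK_BEGIN:
            if unmasking:
                raise ValueError("nested <|UNMASK_BEGIN|>")
            unmasking = True
        elif tok == UNMASK_END:
            if not unmasking:
                # The edge case hit here: end marker with no matching begin.
                raise ValueError("<|UNMASK_END|> encountered while not unmasking")
            unmasking = False
        else:
            out.append((tok, unmasking))
    return out

# Well-formed stream:
unmask_tokens(["a", UNMASK_BEGIN, "b", UNMASK_END])  # [("a", False), ("b", True)]
# Orphaned end marker reproduces the reported error:
# unmask_tokens(["a", UNMASK_END])  # raises ValueError
```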
This is likely due to differences in how the chat templates of Qwen2.5 vs Qwen3 handle special tokens, particularly when there are multiple special tokens or complex content like the tags present in the assistant's response.
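One way to surface such template differences early is a pre-flight check on the rendered string before the unmasking pass. The helper below is hypothetical (not part of this PR); it only verifies that the markers alternate in strict BEGIN/END order:

```python
# Hypothetical pre-flight check: confirm UNMASK markers in the rendered
# chat-template output alternate strictly BEGIN, END, BEGIN, END, ...
import re

def markers_balanced(text):
    expect_begin = True
    for m in re.finditer(r"<\|UNMASK_(BEGIN|END)\|>", text):
        if (m.group(1) == "BEGIN") != expect_begin:
            return False  # out-of-order marker
        expect_begin = not expect_begin
    return expect_begin  # True only if every BEGIN was closed

markers_balanced("<|UNMASK_BEGIN|>hi<|UNMASK_END|>")  # True
markers_balanced("oops<|UNMASK_END|>")                # False
```

Running a check like this per sample would flag the problematic Qwen3 renderings before the state machine raises mid-processing.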
Signed-off-by: Oleg S [email protected]