fix edge case for qwen3 data processing #626

Open
RobotSail wants to merge 1 commit into main from fix-data-processing

RobotSail
Member

With Qwen3, an edge case can break the mask/unmask logic during data processing.

Root Cause: The error occurs specifically when using the Qwen/Qwen3-32B tokenizer, not with Qwen/Qwen2.5-32B-Instruct. The problematic sample contains multiple tags in the assistant's response.

Issue Location: The error is raised at data_process.py:555, in the unmask_messages function, when it encounters an <|UNMASK_END|> token while not in an unmasking state.
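
To make the failure mode concrete, here is a minimal sketch of the kind of state machine that raises this error. It is not the code in data_process.py: the function name `unmask_flags` is hypothetical, and it works on string tokens rather than token IDs for readability.

```python
UNMASK_BEGIN = "<|UNMASK_BEGIN|>"
UNMASK_END = "<|UNMASK_END|>"


def unmask_flags(tokens):
    """Strip the marker tokens and flag which remaining tokens are unmasked.

    Hypothetical illustration only; the real unmask_messages may differ.
    Returns (kept_tokens, flags), where flags[i] is True when
    kept_tokens[i] sat between a BEGIN marker and its matching END.
    """
    kept, flags = [], []
    unmasking = False
    for position, token in enumerate(tokens):
        if token == UNMASK_BEGIN:
            if unmasking:
                raise ValueError(f"{UNMASK_BEGIN} at position {position} while already unmasking")
            unmasking = True
        elif token == UNMASK_END:
            if not unmasking:
                # The state this PR describes: an END marker with no open BEGIN.
                raise ValueError(f"{UNMASK_END} at position {position} while not unmasking")
            unmasking = False
        else:
            kept.append(token)
            flags.append(unmasking)
    if unmasking:
        raise ValueError(f"{UNMASK_BEGIN} was never closed")
    return kept, flags
```

With well-ordered markers the walk succeeds. If a chat template moves or drops a BEGIN marker, the first unmatched END triggers the error.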

Key Findings:

  1. Model-specific issue: The sample processes fine with Qwen/Qwen2.5-32B-Instruct but fails with Qwen/Qwen3-32B
  2. Chat template differences: The two models ship different chat templates, which may handle the unmask tokens differently
  3. Token ordering: The failure suggests that the Qwen3 chat template reorders or otherwise mishandles the unmask tokens

The Problem:
The Qwen/Qwen3-32B chat template processes the <|UNMASK_BEGIN|> and <|UNMASK_END|> tokens in a way that leaves them out of order or in an unexpected state, so the algorithm encounters an <|UNMASK_END|> token when it is not actively unmasking.

This is likely due to differences in how the Qwen2.5 and Qwen3 chat templates handle special tokens, particularly when there are multiple special tokens or complex content like the tags present in the assistant's response.
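
One way to narrow this down is to check marker ordering in the rendered chat-template string before tokenization. The sketch below assumes the markers survive rendering as literal text. The helper name `find_marker_problems` is hypothetical and not part of this repository.

```python
import re

# Matches either unmask marker and captures which kind it is.
MARKER = re.compile(r"<\|UNMASK_(BEGIN|END)\|>")


def find_marker_problems(rendered):
    """Return (offset, description) pairs for out-of-order unmask markers.

    Hypothetical diagnostic helper: an empty list means every BEGIN is
    followed by exactly one END before the next BEGIN.
    """
    problems = []
    open_at = None  # offset of the currently open BEGIN marker, if any
    for match in MARKER.finditer(rendered):
        if match.group(1) == "BEGIN":
            if open_at is not None:
                problems.append((match.start(), "BEGIN while already unmasking"))
            open_at = match.start()
        else:
            if open_at is None:
                problems.append((match.start(), "END while not unmasking"))
            open_at = None
    if open_at is not None:
        problems.append((open_at, "BEGIN never closed"))
    return problems
```

Running the same sample through both models' templates and comparing the reports would show whether the Qwen3 template is the step that reorders the markers.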

Signed-off-by: Oleg S [email protected]

@mergify mergify bot added the ci-failure label Jun 23, 2025
@RobotSail RobotSail force-pushed the fix-data-processing branch from d9a21d8 to 79c2f2e Compare June 24, 2025 04:04
@mergify mergify bot added the documentation and testing labels Jun 24, 2025
@RobotSail RobotSail force-pushed the fix-data-processing branch 2 times, most recently from d52cec5 to 98c3c73 Compare June 25, 2025 04:32
@RobotSail RobotSail force-pushed the fix-data-processing branch from 98c3c73 to aba5cb8 Compare June 25, 2025 04:33
Labels
ci-failure, documentation, testing