[Bug]: Potential bug of the special token for assistant when using HF #1158

ghrua opened this issue Jan 10, 2025 · 1 comment

ghrua commented Jan 10, 2025

Model Series

Qwen2.5

What are the models used?

Qwen2.5-7B-Instruct

What is the scenario where the problem happened?

inference with transformers

Is this a known issue?

  • I have followed the GitHub README.
  • I have checked the Qwen documentation and cannot find an answer there.
  • I have checked the documentation of the related framework and cannot find useful information.
  • I have searched the issues and there is not a similar one.

Information about environment

OS: Ubuntu 22.04.5
Python: Python 3.11.10
GPUs: 1 x A6000

Log output

n/a

Description

Hey team, I am not sure which is the correct special token sequence for the start of the assistant turn: <|im_start|>assistant or <|im_start|>Assistant?

More concretely, if I use the following script (recommended on the model's Hugging Face page) to prepare the chat prompt:

prompt = "Give me a short introduction to large language model."
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

When printing text, the special token sequence for the assistant turn is <|im_start|>assistant:

<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Give me a short introduction to large language model.<|im_end|>
<|im_start|>assistant

However, if I set add_generation_prompt=False, the Qwen2.5-7B-Instruct model generates the special token sequence for the assistant turn itself during inference:

<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Give me a short introduction to large language model.<|im_end|>
<|im_start|>Assistant Sure! A Large Language Model (LLM) is a type of ...

The generated start token for the assistant turn becomes <|im_start|>Assistant, with "Assistant" capitalized. I tried 10 different random seeds, and the Qwen model consistently generated <|im_start|>Assistant rather than <|im_start|>assistant. Therefore, I guess the Qwen model was trained with <|im_start|>Assistant.
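For reference, a minimal sketch of how this run can be reproduced with transformers (the generation settings below, e.g. max_new_tokens and do_sample, are illustrative rather than the exact settings from my runs):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed

model_name = "Qwen/Qwen2.5-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

set_seed(0)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Give me a short introduction to large language model."},
]

# Render the conversation WITHOUT the assistant generation prompt,
# so the model has to produce the assistant header itself.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=False,
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=512, do_sample=True)
completion = tokenizer.decode(
    output_ids[0][inputs.input_ids.shape[1]:],
    skip_special_tokens=False,  # keep <|im_start|>/<|im_end|> visible
)
print(completion)  # in my runs this starts with "<|im_start|>Assistant ..."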

The style of the response generated when starting from <|im_start|>Assistant also seems better than when starting from <|im_start|>assistant:

// Example of add_generation_prompt=False, seed=0
<|im_start|>Assistant Sure! A large language model (LLM) is a type of artificial intelligence model designed to understand and generate human-like text. These models are typically based on neural networks, specifically transformer architectures, which allow them to process and generate text in a way that reflects complex patterns and structures found in human language.

Large language models are trained on vast amounts of textual data from the internet, books, articles, and other sources. This extensive training enables them to learn a wide range of linguistic nuances, including syntax, semantics, and context. As a result, LLMs can perform various natural language processing tasks, such as translation, summarization, question-answering, and even creative writing.

Some notable examples of large language models include:

1. **GPT (Generative Pre-trained Transformer)**: Developed by OpenAI, GPT series have been pivotal in advancing the field of language generation.
2. **BERT (Bidirectional Encoder Representations from Transformers)**: Created by Google, BERT has significantly improved performance on a variety of NLP tasks through its bidirectional training approach.
3. **T5 (Text-to-Text Transfer Transformer)**: Also developed by Google, T5 uses a unified framework for various NLP tasks, making it versatile and efficient.

These models are characterized by their size, with many having billions or even trillions of parameters, which contribute to their ability to handle complex and nuanced language tasks. However, they also come with challenges, such as potential biases in the training data and the need for careful ethical considerations when deploying these models.<|im_end|>

********************************************************************************

// Example of add_generation_prompt=True, seed=0
<|im_start|>assistant A large language model (LLM) is a type of artificial intelligence model designed to understand and generate human-like text. These models are typically based on deep learning techniques, particularly transformer architectures, and are trained on vast amounts of textual data from the internet, books, articles, and other sources. LLMs can perform various natural language processing tasks such as translation, summarization, question answering, and text generation. They are characterized by their size, often containing billions or even trillions of parameters, which allows them to capture complex patterns in language and context. Some well-known examples of LLMs include models developed by companies like Alibaba, Anthropic, Google, and Alibaba, among others.<|im_end|>

jklj077 (Collaborator) commented Jan 13, 2025

Hi,

Very interesting findings on the styles. It is expected that the chat template uses "assistant". I believe the roles are "masked" in finetuning, i.e., the models were not trained to predict "assistant" based on the previous tokens, which may be the cause of your observation.
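To illustrate what that masking typically looks like: in a common supervised fine-tuning setup with transformers, the labels for the prompt tokens, including the <|im_start|>assistant header, are set to -100 so the cross-entropy loss ignores them. The sketch below shows the general idea only; it is not Qwen's actual training code.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi! How can I help you today?"},
]

# Token ids of the full training example.
input_ids = tokenizer.apply_chat_template(messages, tokenize=True)

# Token ids of everything up to and including "<|im_start|>assistant\n".
prompt_ids = tokenizer.apply_chat_template(
    messages[:-1], tokenize=True, add_generation_prompt=True
)

# Mask the prompt part (system/user turns and the assistant header):
# -100 is the label value ignored by the cross-entropy loss in transformers.
labels = list(input_ids)
labels[:len(prompt_ids)] = [-100] * len(prompt_ids)

# With labels like these, the model never receives a training signal for
# predicting "<|im_start|>assistant" itself, so nothing stops it from
# continuing with a different spelling (e.g. "Assistant") when asked to
# generate the assistant header on its own.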
