Model Series
Qwen2.5
What are the models used?
Qwen2.5-7B-Instruct
What is the scenario where the problem happened?
inference with transformers
Is this a known issue?
I have checked the documentation of the related framework and cannot find useful information.
I have searched the issues and there is not a similar one.
Information about environment
OS: Ubuntu 22.04.5
Python: Python 3.11.10
GPUs: 1 x A6000
Log output
n/a
Description
Hey team, I am not sure which special token is correct for the start of the assistant turn: <|im_start|>assistant or <|im_start|>Assistant?
More concretely, if I use the following script to prepare the chat prompt (as recommended on the model's HF page):

# Tokenizer loading added here for completeness.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

prompt = "Give me a short introduction to large language model."
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
When printing text, the special token for the assistant turn is <|im_start|>assistant:
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Give me a short introduction to large language model.<|im_end|>
<|im_start|>assistant
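For reference, here is a minimal sketch of the generation step that produces the completions quoted below, assuming the standard transformers generation API (the model loading and sampling settings here are illustrative, not exact settings from this report):

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# `text` is the templated prompt produced above.
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512,
    do_sample=True,
)
# Keep only the newly generated tokens and decode them with the special
# tokens visible, so the assistant start token can be inspected.
new_tokens = generated_ids[0][model_inputs.input_ids.shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=False))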
However, if I set add_generation_prompt=False, the Qwen2.5-7B-Instruct model generates the assistant start token by itself during inference:
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Give me a short introduction to large language model.<|im_end|>
<|im_start|>Assistant Sure! A Large Language Model (LLM) is a type of ...
The generated start token for the assistant turn becomes <|im_start|>Assistant, with "Assistant" capitalized. I tried 10 different random seeds, and the Qwen model consistently generates <|im_start|>Assistant rather than <|im_start|>assistant. Therefore, I guess the Qwen model used <|im_start|>Assistant in training.
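A minimal sketch of that seed check, assuming the same tokenizer and model as above (the variable text_no_prompt and the sampling settings are illustrative):

from transformers import set_seed

# Re-template the conversation without the generation prompt, then sample a
# few tokens under different seeds and see how the assistant turn is opened.
text_no_prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=False
)
inputs = tokenizer([text_no_prompt], return_tensors="pt").to(model.device)

for seed in range(10):
    set_seed(seed)
    out = model.generate(**inputs, max_new_tokens=8, do_sample=True)
    first_tokens = out[0][inputs.input_ids.shape[1]:]
    print(seed, repr(tokenizer.decode(first_tokens, skip_special_tokens=False)))
# As reported above, every run opened the turn with "<|im_start|>Assistant".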
The style of the response generated starting from <|im_start|>Assistant is also better than the one starting from <|im_start|>assistant:
// Example of add_generation_prompt=False, seed=0
<|im_start|>Assistant Sure! A large language model (LLM) is a type of artificial intelligence model designed to understand and generate human-like text. These models are typically based on neural networks, specifically transformer architectures, which allow them to process and generate text in a way that reflects complex patterns and structures found in human language.
Large language models are trained on vast amounts of textual data from the internet, books, articles, and other sources. This extensive training enables them to learn a wide range of linguistic nuances, including syntax, semantics, and context. As a result, LLMs can perform various natural language processing tasks, such as translation, summarization, question-answering, and even creative writing.
Some notable examples of large language models include:
1. **GPT (Generative Pre-trained Transformer)**: Developed by OpenAI, GPT series have been pivotal in advancing the field of language generation.
2. **BERT (Bidirectional Encoder Representations from Transformers)**: Created by Google, BERT has significantly improved performance on a variety of NLP tasks through its bidirectional training approach.
3. **T5 (Text-to-Text Transfer Transformer)**: Also developed by Google, T5 uses a unified framework for various NLP tasks, making it versatile and efficient.
These models are characterized by their size, with many having billions or even trillions of parameters, which contribute to their ability to handle complex and nuanced language tasks. However, they also come with challenges, such as potential biases in the training data and the need for careful ethical considerations when deploying these models.<|im_end|>
********************************************************************************
// Example of add_generation_prompt=True, seed=0
<|im_start|>assistant A large language model (LLM) is a type of artificial intelligence model designed to understand and generate human-like text. These models are typically based on deep learning techniques, particularly transformer architectures, and are trained on vast amounts of textual data from the internet, books, articles, and other sources. LLMs can perform various natural language processing tasks such as translation, summarization, question answering, and text generation. They are characterized by their size, often containing billions or even trillions of parameters, which allows them to capture complex patterns in language and context. Some well-known examples of LLMs include models developed by companies like Alibaba, Anthropic, Google, and Alibaba, among others.<|im_end|>
Very interesting findings on the styles. It is expected that the chat template uses "assistant". I believe the role tokens are "masked" during finetuning, i.e., the models were not trained to predict "assistant" from the preceding tokens, which may explain your observation.
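To make the masking point concrete, here is a generic sketch of how SFT pipelines commonly exclude the role header from the loss. This illustrates the idea only and is not Qwen's actual training code; the prompt_part/answer_part strings are made up:

# Generic SFT-style label masking (illustration only).
# The assistant header "<|im_start|>assistant\n" sits in the prompt part,
# whose label positions are set to -100 so they contribute no loss: the
# model is never trained to predict the token "assistant" itself.
prompt_part = "<|im_start|>user\nHello<|im_end|>\n<|im_start|>assistant\n"
answer_part = "Hi there!<|im_end|>\n"

prompt_ids = tokenizer(prompt_part, add_special_tokens=False).input_ids
answer_ids = tokenizer(answer_part, add_special_tokens=False).input_ids

input_ids = prompt_ids + answer_ids
labels = [-100] * len(prompt_ids) + answer_ids  # -100 is ignored by the CE loss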