---
title: "Does style matter? Disentangling style and substance in Chatbot Arena"
author: "Tianle Li*, Anastasios Angelopoulos*, Wei-Lin Chiang*"
date: "Aug 29, 2024"
previewImg: /images/blog/style_control/logo.png
---

Why is GPT-4o-mini so good? Why does Claude rank so low, when anecdotal experience suggests otherwise?

We have answers for you. We controlled for the effect of length and markdown, and indeed, *the ranking changed*. This is just a first step towards our larger goal of disentangling **substance** and **style** in the Chatbot Arena leaderboard.

**Check out the results below!** Style indeed has a strong effect on models' performance in the leaderboard. This makes sense: from the perspective of human preference, it's not just what you say, but how you say it. But now we have a way of _separating_ the effect of writing style from the content, so you can see both effects individually.

When adjusting for length and style, we found noticeable shifts in the ranking. GPT-4o-mini and Grok-2-mini drop below most frontier models, while Claude 3.5 Sonnet, Claude 3 Opus, and Llama-3.1-405B rise substantially. In the Hard Prompt subset, Claude 3.5 Sonnet jumps to joint #1 with chatgpt-4o-latest, and Llama-3.1-405B improves significantly. We look forward to seeing what the community does with this new tool for disaggregating style and substance.


## Overall ranking + Style Control

![Overall ranking with and without style control](/images/blog/style_control/comparison_overall.png)

*Figure 1. Overall Chatbot Arena ranking vs. the Overall ranking where answer length, markdown header count, markdown bold count, and markdown list element count are controlled.*

## Hard Prompt ranking + Style Control

![Hard Prompt ranking with and without style control](/images/blog/style_control/comparison_hard.png)

*Figure 2. Hard Prompt category ranking vs. the Hard Prompt ranking where answer length, markdown header count, markdown bold count, and markdown list element count are controlled.*

Leaderboard [link](lmarena.ai/?leaderboard)

Colab [link](https://colab.research.google.com/drive/19VPOril2FjCX34lJoo7qn4r6adgKLioY#scrollTo=dYANZPG_8a9N)

We will be rolling out style control to all categories soon. Stay tuned!

## Methodology

**High-Level Idea.** The goal here is to understand the effect of _style_ vs. _substance_ on the Arena Score. Consider models A and B. Model A is great at producing code, factual and unbiased answers, and so on, but it outputs short and terse responses. Model B is not so great on substance (e.g., correctness), but it outputs great markdown and gives long, detailed, flowery responses. Which is better, model A or model B?

The answer is not one-dimensional. Model A is better on substance, and Model B is better on style. Ideally, we would have a way of teasing apart this distinction: capturing how much of a model's Arena Score is due to substance and how much is due to style.

Our methodology is a first step towards this goal. We explicitly model style as an independent variable in our Bradley-Terry regression. For example, we add length as a feature: just like each model, the length difference has its _own_ Arena Score! By doing this, we expect the Arena Score of each model to reflect its strength, controlled for the effect of length.

Please read below for the technical details. We control not just for length, but also for a few other style features. As a first version, we propose controlling for:
1. Answer token length
2. Number of markdown headers
3. Number of markdown bold elements
4. Number of markdown lists

We publicly release our data, including votes and style features, along with the code on [Google Colab](https://colab.research.google.com/drive/19VPOril2FjCX34lJoo7qn4r6adgKLioY#scrollTo=dYANZPG_8a9N)! You can experiment with style control right now. More improvements are coming, and please reach out if you want to contribute!

**Background.** To produce the results above, we controlled for the effect of style by adding extra "style features" into our Bradley-Terry regression. This is a [standard technique](https://en.wikipedia.org/wiki/Controlling_for_a_variable) in statistics, and it has recently been used in LLM evaluations such as AlpacaEval 2.0 [1]. Additionally, there are studies suggesting that humans may be biased towards "prettier" and more detailed responses [2, 3]. The idea is that, by including possible confounding variables (e.g., response length) in the regression, we can attribute any increase in strength to the confounder rather than to the model. The Bradley-Terry coefficient then more closely reflects the model's intrinsic ability, as opposed to possible confounders. The definition of a confounder is to some extent up to interpretation; as our style features, we use the (normalized) differences in response length, number of markdown headers, number of bold elements, and number of lists.

More formally, consider vectors $X_1, \ldots, X_n \in \mathbb{R}^M$ and $Y_1, \ldots, Y_n \in \{0,1\}$, where $n$ is the number of battles and $M$ is the number of models.

For every $i \in [n]$, we have $X_{i,m}=1$ if model $m \in [M]$ is the model shown on the left-hand side in Chatbot Arena, and $X_{i,m}=-1$ if it is shown on the right; that is, $X_i$ is a two-hot vector. The outcome $Y_i$ takes the value $Y_i=1$ if the left-hand model wins, and $Y_i=0$ otherwise.
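
To make this setup concrete, here is a minimal sketch (in Python/NumPy) of how the two-hot design matrix $X$ and the outcome vector $Y$ can be assembled from raw battle records. The `battles` list below is a hypothetical stand-in, not the exact schema of our released data.

```python
import numpy as np

# Hypothetical battle records: (left-hand model, right-hand model, winner in {"left", "right"}).
battles = [
    ("gpt-4o-mini-2024-07-18", "claude-3-5-sonnet-20240620", "left"),
    ("llama-3.1-405b-instruct", "gpt-4o-2024-08-06", "right"),
]

# Assign each model a column index m in [M].
models = sorted({m for a, b, _ in battles for m in (a, b)})
model_idx = {name: m for m, name in enumerate(models)}

n, M = len(battles), len(models)
X = np.zeros((n, M))  # two-hot design matrix
Y = np.zeros(n)       # 1 if the left-hand model wins, 0 otherwise

for i, (left, right, winner) in enumerate(battles):
    X[i, model_idx[left]] = 1.0    # left-hand model gets +1
    X[i, model_idx[right]] = -1.0  # right-hand model gets -1
    Y[i] = 1.0 if winner == "left" else 0.0
```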

The standard method for computing the Arena Score (i.e., the Bradley-Terry coefficients, which we formerly called the Elo score) is to run a logistic regression of $Y_i$ onto $X_i$. That is, for every model $m$, we associate a scalar $\hat{\beta}_m$ that describes its strength, and the vector $\hat{\beta}$ is determined by solving the following logistic regression:

$$\hat{\beta} = \arg \min_{\beta \in \mathbb{R}^M} \frac{1}{n}\sum\limits_{i=1}^n \mathsf{BCELoss}(X_i^\top \beta, Y_i),$$

where $\mathsf{BCELoss}$ denotes the binary cross-entropy loss. (In practice, we also reweight this objective to handle non-uniform model sampling, but let's ignore that for now.)

## Style Control

Now, for every battle $i \in [n]$, let's say that in addition to $X_i$ we observe some additional style features $Z_i \in \mathbb{R}^S$. These style features can be as simple or as complicated as you want. For example, $Z_i$ could just be the difference in response lengths of the two models, in which case $S=1$. Or we could have $S>1$ and include other style-related features, for example the number of markdown headers, or even style features automatically extracted by a model!

Here, we define each style feature as

$$\text{normalize}\left(\frac{\text{feature}_A - \text{feature}_B}{\text{feature}_A + \text{feature}_B}\right).$$

For example, the first new feature, the token length difference between answer A and answer B, is expressed as

$$\text{normalize}\left(\frac{\text{length}_A - \text{length}_B}{\text{length}_A + \text{length}_B}\right).$$

We divide the difference by the sum of both answers' token lengths so that the length difference is proportional to the overall answer lengths: an answer with 500 tokens is roughly equal in length to an answer with 520 tokens, while an answer with 20 tokens is very different from an answer with 40 tokens, even though the difference is 20 tokens in both cases.

The idea of style control is simple: we perform the same logistic regression, with the style features added as extra covariates:

$$\hat{\beta}, \hat{\gamma} = \arg \min_{\beta \in \mathbb{R}^M, \gamma \in \mathbb{R}^S} \frac{1}{n}\sum\limits_{i=1}^n \mathsf{BCELoss}(X_i^\top \beta + Z_i^{\top}\gamma, Y_i).$$

We refer to the results $\hat{\beta}$ and $\hat{\gamma}$ as the "model coefficients" and the "style coefficients", respectively. The model coefficients have the same interpretation as before; however, they are now controlled for the effect of style, which is explicitly modeled by the style coefficients!

When a style coefficient is large, the corresponding style feature has a large effect on the outcome of a battle. To define "large", the style coefficients need to be properly normalized so they can be compared. All in all, when analyzing the style coefficients, we found that length is the dominant style factor; the markdown effects are second order.

We report the coefficient for each style attribute under different ways of controlling for style (a minimal code sketch of this fit follows the table).

| | Length | Markdown List | Markdown Header | Markdown Bold |
|---|---|---|---|---|
| Control Both | 0.249 | 0.031 | 0.024 | 0.019 |
| Control Markdown Only | N/A | 0.111 | 0.044 | 0.056 |
| Control Length Only | 0.267 | N/A | N/A | N/A |
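
As promised above, here is a rough sketch of the style-controlled fit, reusing `X`, `Y`, and `M` from the earlier snippet. The markdown counts are crude regex approximations, `answer_pairs` is a hypothetical list of (answer A, answer B) strings, and the fit assumes scikit-learn ≥ 1.2 (for `penalty=None`); the Colab notebook linked above contains the actual implementation.

```python
import re
import numpy as np
from sklearn.linear_model import LogisticRegression

def style_features(text: str) -> np.ndarray:
    """Crude per-answer style counts: tokens, headers, bold spans, list items."""
    return np.array([
        len(text.split()),                                             # token length (whitespace proxy)
        len(re.findall(r"^#{1,6} ", text, re.MULTILINE)),              # markdown headers
        len(re.findall(r"\*\*[^*]+\*\*", text)),                       # markdown bold elements
        len(re.findall(r"^\s*(?:[-*+]|\d+\.) ", text, re.MULTILINE)),  # markdown list items
    ], dtype=float)

def style_diff(answer_a: str, answer_b: str) -> np.ndarray:
    """(feature_A - feature_B) / (feature_A + feature_B), as defined above."""
    fa, fb = style_features(answer_a), style_features(answer_b)
    denom = np.where(fa + fb == 0, 1.0, fa + fb)  # avoid division by zero
    return (fa - fb) / denom

# Z has one row per battle; standardize so the style coefficients are comparable.
Z = np.stack([style_diff(a, b) for a, b in answer_pairs])  # answer_pairs: hypothetical
Z = (Z - Z.mean(axis=0)) / (Z.std(axis=0) + 1e-8)

# Fit model coefficients (beta) and style coefficients (gamma) in one joint regression.
fit = LogisticRegression(penalty=None, fit_intercept=False).fit(np.hstack([X, Z]), Y)
beta, gamma = fit.coef_[0, :M], fit.coef_[0, M:]
```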

## Ablation

Next, we compare the ranking changes between controlling for answer length only, markdown elements only, and both. We present the Chatbot Arena Overall table first.

| Model | Rank Diff (Length Only) | Rank Diff (Markdown Only) | Rank Diff (Both) |
|---|---|---|---|
| chatgpt-4o-latest | 1->1 | 1->1 | 1->1 |
| gemini-1.5-pro-exp-0827 | 2->2 | 2->2 | 2->2 |
| gemini-1.5-pro-exp-0801 | 2->2 | 2->2 | 2->2 |
| gpt-4o-2024-05-13 | 5->3 | 5->3 | 5->2 |
| claude-3-5-sonnet-20240620 | 6->5 | 6->4 | 6->4 |
| gemini-advanced-0514 | 7->5 | 7->8 | 7->6 |
| grok-2-2024-08-13 | 2->4 | 2->4 | 2->5 |
| llama-3.1-405b-instruct | 6->6 | 6->4 | 6->6 |
| gpt-4o-2024-08-06 | 7->6 | 7->8 | 7->6 |
| gpt-4-turbo-2024-04-09 | 11->8 | 11->8 | 11->9 |
| claude-3-opus-20240229 | 16->14 | 16->8 | 16->10 |
| gemini-1.5-pro-api-0514 | 10->8 | 10->13 | 10->10 |
| gemini-1.5-flash-exp-0827 | 6->8 | 6->9 | 6->9 |
| gpt-4-1106-preview | 16->14 | 16->8 | 16->11 |
| gpt-4o-mini-2024-07-18 | 6->8 | 6->11 | 6->11 |
| gpt-4-0125-preview | 17->14 | 17->12 | 17->13 |
| mistral-large-2407 | 16->14 | 16->13 | 16->13 |
| athene-70b-0725 | 16->16 | 16->17 | 16->17 |
| grok-2-mini-2024-08-13 | 6->15 | 6->15 | 6->18 |
| gemini-1.5-pro-api-0409-preview | 11->16 | 11->21 | 11->18 |

We also perform the same comparison for the Chatbot Arena Hard Prompt category.

| Model | Rank Diff (Length Only) | Rank Diff (Markdown Only) | Rank Diff (Both) |
|---|---|---|---|
| chatgpt-4o-latest | 1->1 | 1->1 | 1->1 |
| claude-3-5-sonnet-20240620 | 2->2 | 2->1 | 2->1 |
| gemini-1.5-pro-exp-0827 | 2->2 | 2->2 | 2->1 |
| gemini-1.5-pro-exp-0801 | 2->3 | 2->3 | 2->3 |
| gpt-4o-2024-05-13 | 2->2 | 2->2 | 2->3 |
| llama-3.1-405b-instruct | 4->4 | 4->2 | 4->3 |
| grok-2-2024-08-13 | 2->3 | 2->3 | 2->4 |
| gemini-1.5-flash-exp-0827 | 4->4 | 4->6 | 4->4 |
| gemini-1.5-pro-api-0514 | 7->6 | 7->7 | 7->7 |
| gpt-4o-2024-08-06 | 4->4 | 4->6 | 4->4 |
| gemini-advanced-0514 | 9->7 | 9->7 | 9->7 |
| claude-3-opus-20240229 | 14->7 | 14->7 | 14->7 |
| mistral-large-2407 | 7->7 | 7->6 | 7->7 |
| gpt-4-1106-preview | 11->10 | 11->7 | 11->7 |
| gpt-4-turbo-2024-04-09 | 9->7 | 9->7 | 9->7 |
| athene-70b-0725 | 11->7 | 11->8 | 11->7 |
| gpt-4o-mini-2024-07-18 | 4->7 | 4->7 | 4->11 |
| gpt-4-0125-preview | 15->14 | 15->10 | 15->13 |
| grok-2-mini-2024-08-13 | 5->12 | 5->8 | 5->13 |
| deepseek-coder-v2-0724 | 16->14 | 16->13 | 16->14 |
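
For reference, a rank-shift entry like `6->15` in the tables above can be reproduced from two fitted score vectors roughly as follows. This is a hypothetical helper that ranks by point estimate only; the actual leaderboard ranks also account for confidence intervals, which is why ties (e.g., joint #1) appear.

```python
def rank_map(scores: dict[str, float]) -> dict[str, int]:
    """Rank models by descending Arena Score (1 = best), ignoring confidence intervals."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {model: rank for rank, model in enumerate(ordered, start=1)}

def rank_diff(default: dict[str, float], style_controlled: dict[str, float]) -> dict[str, str]:
    """Format 'old->new' strings comparing the default and style-controlled rankings."""
    before, after = rank_map(default), rank_map(style_controlled)
    return {m: f"{before[m]}->{after[m]}" for m in before if m in after}
```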

## Future Work

We want to continue building a pipeline to disentangle style and substance in the Arena. Although controlling for style is a big step forward, our analysis is still _observational_. We look forward to implementing _causal inference_ in our pipeline and running prospective randomized trials to assess the effect of length, markdown, and more. Stay tuned, and let us know if you want to help!

## Reference

[1] Dubois et al. "Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators", arXiv preprint.

[2] Chen et al. "Humans or LLMs as the Judge? A Study on Judgement Bias", arXiv preprint.

[3] Park et al. "Disentangling Length from Quality in Direct Preference Optimization", arXiv preprint.

## Citation

```
@misc{chiang2024chatbot,
    title={Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference},
    author={Wei-Lin Chiang and Lianmin Zheng and Ying Sheng and Anastasios Nikolas Angelopoulos and Tianle Li and Dacheng Li and Hao Zhang and Banghua Zhu and Michael Jordan and Joseph E. Gonzalez and Ion Stoica},
    year={2024},
    eprint={2403.04132},
    archivePrefix={arXiv},
    primaryClass={cs.AI}
}
```