Merge pull request #81 from lm-sys/llama3-fixes
Llama 3 blog fixes
lisadunlap authored May 8, 2024
2 parents 1d8a179 + dd07cfb commit c1a555e
Showing 2 changed files with 29 additions and 28 deletions.
55 changes: 28 additions & 27 deletions blog/2024-05-01-llama3.md
@@ -1,36 +1,37 @@
---
title: "What’s up with Llama 3? Arena data analysis"
author: "Lisa Dunlap, Evan Frick, Tianle Li, Isaac Ong, Joseph E. Gonzalez Wei-Lin Chiang"
author: "Lisa Dunlap, Evan Frick, Tianle Li, Isaac Ong, Joseph E. Gonzalez, Wei-Lin Chiang"
date: "May 2, 2024"
previewImg: /images/blog/llama3/llama3_blog_cover.png
---

On April 18th, Meta released Llama 3, their newest open-weight large language model. Since then, Llama 3-70B has quickly risen to the top of the English leaderboard with over 50,000 battles. This remarkable achievement by Meta is excellent news for the open-source community. In this blog post, we aim to provide more insight into why users rank Llama 3-70b on par with top-ranked models like GPT-4-Turbo, Gemini 1.5 Pro, and Claude 3 Opus.

<br />

We investigate the following:
1. What types of prompts are users asking? Do users prefer Llama 3 on certain types of prompts?
2. How challenging are these prompts? Does the ranking change if the prompts are easier/harder?
3. Are certain users or prompts overrepresented? Do duplicate prompts or rankings from a small number of users affect the win-rate?
3. Are certain users or prompts overrepresented? Do duplicate prompts or rankings from a small number of users affect the win rate?
4. Does Llama 3 have qualitative differences which make users like it more?

We focus on battles consisting of llama-3-70b against 5 top-ranked models (claude-3-opus-20240229, gpt-4-0125-preview, gpt-4-1106-preview, gpt-4-turbo-2024-04-09, gemini-1.5-pro-0409-preview) and reach the following conclusions:
1. Llama 3 beats other top-ranking models on open-ended writing and create problems and loses on more close-ended math and coding problems
2. As prompts get harder*, Llama 3’s win-rate against top-tier models drops significantly.
3. Deduplication or outliers do not significantly affect the win-rate
4. Qualitatively, Llama 3’s outputs are friendlier and more conversational than other models, and these traits appear more often in battles that Llama 3 wins


We focus on battles consisting of Llama 3-70b against 5 top-ranked models (claude-3-opus-20240229, gpt-4-0125-preview, gpt-4-1106-preview, gpt-4-turbo-2024-04-09, gemini-1.5-pro-0409-preview) and reach the following conclusions:
1. Llama 3 beats other top-ranking models on open-ended writing and creative problems but loses on more close-ended math and coding problems.
2. As prompts get harder, Llama 3’s win rate against top-tier models drops significantly.
3. Deduplication or outliers do not significantly affect the win rate.
4. Qualitatively, Llama 3’s outputs are friendlier and more conversational than other models, and these traits appear more often in battles that Llama 3 wins.

<br/>
<img src="/images/blog/llama3/topic_win_rate.png" style="display:block; margin-top: auto; margin-left: auto; margin-right: auto; margin-bottom: auto; width: 85%"></img>
<p style="color:gray; text-align: center;">Figure 1. LLama 3-70b win-rate(excluding ties) against top 5 models across prompt topics. * denotes that the category contains less than 50 battles.</p>
<p style="color:gray; text-align: center;">Figure 1. Llama 3-70b's win rate (excluding ties) against top 5 models across prompt topics. * denotes that the category contains less than 50 battles.</p>



## Analyzing win-rate across different types of prompts
## Analyzing win rate across different types of prompts

**Topic Analysis.** We utilize an LLM labeler (Llama 3-70b) to categorize user prompts into a [pre-established taxonomy of topics](https://arxiv.org/pdf/2404.12387) and visualize the win-rate of Llama 3-70b against the other top models in Figure 1. We see that Llama’s win rate is highest for open-ended and creative tasks like brainstorming and writing, and lowest for more close-ended technical tasks like math and translation. Interestingly, Llama 3 achieves the highest win-rate over data processing tasks which mainly consist of parsing and dataframe operations, but as this category has only 19 examples this remains inconclusive.
**Topic Analysis.** We utilize an LLM labeler (Llama 3-70b) to categorize user prompts into a pre-established taxonomy of topics ([from Reka's paper](https://arxiv.org/pdf/2404.12387)) and visualize the win rate of Llama 3-70b against the other top models in Figure 1. We see that Llama 3’s win rate is highest for open-ended and creative tasks like brainstorming and writing, and lowest for more close-ended technical tasks like math and translation. Interestingly, Llama 3 achieves the highest win rate over data processing tasks which mainly consist of parsing and dataframe operations, but as this category has only 19 examples, this remains inconclusive.
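
To make the aggregation concrete, here is a minimal sketch of how a per-topic win rate (excluding ties) can be computed once each battle carries a topic label; the battle records and column names below are illustrative placeholders, not our actual data schema.

```python
# A minimal sketch (not the exact pipeline) of computing per-topic win rate,
# excluding ties, from labeled Arena battles. Records and columns are toy data.
import pandas as pd

battles = pd.DataFrame([
    # model_a is always llama-3-70b-instruct in this filtered subset
    {"topic": "writing", "winner": "model_a"},
    {"topic": "writing", "winner": "model_b"},
    {"topic": "math",    "winner": "model_b"},
    {"topic": "math",    "winner": "tie"},
    {"topic": "coding",  "winner": "model_a"},
])

# Drop ties, then compute the fraction of remaining battles won by Llama 3.
decisive = battles[battles["winner"] != "tie"]
win_rate = (
    decisive.assign(llama_win=decisive["winner"].eq("model_a"))
    .groupby("topic")["llama_win"]
    .agg(["mean", "count"])
    .rename(columns={"mean": "win_rate", "count": "n_battles"})
)
print(win_rate)  # categories with few battles (e.g. < 50) are starred in Figure 1
```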

**Win-rate VS prompt difficulty.** We employ our [recently released pipeline](https://lmsys.org/blog/2024-04-19-arena-hard/) which scores the difficulty of prompts to determine how Llama 3 compares to the other top models as prompts get harder. We define a set of ``hardness' ' criteria and use GPT-4-turbo to annotate each prompt from 0 to 7 to indicate how many of these criteria are satisfied (higher score indicates a harder prompt). Our 7 criteria are:
**Win rate VS prompt difficulty.** We employ our [recently released pipeline](https://lmsys.org/blog/2024-04-19-arena-hard/), which scores the difficulty of prompts, to determine how Llama 3 compares to the other top models as prompts get harder. We define a set of "hardness" criteria and use GPT-4-turbo to annotate each prompt from 0 to 7 to indicate how many of these criteria are satisfied (a higher score indicates a harder prompt). Our 7 criteria are:

<table style="width:100%; border-collapse: collapse; border: 1px solid black;">
<tr style="background-color: black; color: white;">
@@ -59,24 +60,24 @@ We focus on battles consisting of llama-3-70b against 5 top-ranked models (claud
</tr>
</table>

We score 1000 battles against the top 3 models on the leaderboard and plot win-rate VS prompt score in Figure 2. We observe a significant drop in Llama 3's performance compared to the other top models, from high 50% win-rate to low 40%. We conclude that as more of these ``hardness'' criteria are met, Llama 3's win rate drop rapidly compared to other models. Notw that these criteria may not be exhaustive, see [the blog](https://lmsys.org/blog/2024-04-19-arena-hard/) for further discussion.
We score 1000 battles against the top 3 models on the leaderboard and plot their win rates VS prompt score in Figure 2. We observe a significant drop in Llama 3's performance compared to the other top models, from a win rate in the high 50% range to the low 40% range. We conclude that as more of these "hardness" criteria are met, Llama 3's win rate drops rapidly compared to other models. Note that these criteria may not be exhaustive; see [the blog](https://lmsys.org/blog/2024-04-19-arena-hard/) for further discussion.
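
A rough sketch of this scoring-and-binning step is below; the LLM judge is mocked with a toy heuristic (the real pipeline queries GPT-4-turbo against the 7 criteria), and the data shapes are assumed for illustration.

```python
# Sketch: assign each prompt a 0-7 "hardness" score, then bin battles by score
# and compute the win rate per bin. The judge below is a placeholder heuristic.
import pandas as pd

def count_criteria_met(prompt: str) -> int:
    """Placeholder for an LLM judge returning how many of the 7 hardness
    criteria the prompt satisfies (0-7)."""
    return min(7, len(prompt.split()) // 5)  # toy heuristic, illustration only

battles = pd.DataFrame([
    {"prompt": "write a haiku about spring", "llama_win": 1},
    {"prompt": "prove the convergence of this series and give a counterexample "
               "when the hypothesis is weakened, with full step-by-step rigor",
     "llama_win": 0},
])
battles["hardness"] = battles["prompt"].map(count_criteria_met)

# Bin scores into intervals (e.g. 0-1, 2-3, 4-5, 6-7) and compute win rate per bin.
bins = pd.cut(battles["hardness"], bins=[-1, 1, 3, 5, 7],
              labels=["0-1", "2-3", "4-5", "6-7"])
print(battles.groupby(bins, observed=True)["llama_win"].mean())
```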

<img src="/images/blog/llama3/winrate-over-criteria.png" style="display:block; margin-top: auto; margin-left: auto; margin-right: auto; margin-bottom: auto; width: 70%"></img>
<p style="color:gray; text-align: center;">Figure 2. Several top model's winrate against the strongest 6 models over the intervals of number of key criteria satisfied. *English battles between strongest models: llama-3-70b-chat, claude-3-opus-20240229, gpt-4-0125-preview, gpt-4-1106-preview, gpt-4-turbo-2024-04-09, gemini-1.5-pro-api-0409-preview.</p>
<p style="color:gray; text-align: center;">Figure 2. Several top models' win rate against the strongest 6 models over the intervals of number of key criteria satisfied. *English battles between strongest models: llama-3-70b-chat, claude-3-opus-20240229, gpt-4-0125-preview, gpt-4-1106-preview, gpt-4-turbo-2024-04-09, gemini-1.5-pro-api-0409-preview.</p>

<img src="/images/blog/llama3/criteria_dist.png" style="display:block; margin-top: auto; margin-left: auto; margin-right: auto; margin-bottom: auto; width: 50%"></img>
<p style="color:gray; text-align: center;">Figure 3. The percentage of prompts with number of hardness criteria met in 3.5K sample of arena battles. We observe a significant portion of the battles are classified as hard (~27%).</p>

We can further analyze which types of prompts affect win-rate by fitting a decision tree on the 7 binary columns representing if a given prompt has satisfied each of the criteria above. From this decision tree we can segment prompts into criteria subsets such that Llama 3-70b-Instruct performs very well or very poorly. The tree shown in Figure 4 shows us which subsets change the model’s win-rate the most when conditioned on.
We can further analyze which types of prompts affect win rate by fitting a decision tree on the 7 binary columns representing whether a given prompt satisfies each of the criteria above. From this decision tree, we can segment prompts into criteria subsets on which Llama 3-70b-Instruct either performs very well or very poorly. The tree in Figure 4 shows which subsets change the model’s win rate the most when conditioned on.
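
A small sketch of this segmentation step, using scikit-learn's decision tree on randomly generated data; the criterion names are assumed labels for the 7 binary columns, not an exact reproduction of our features.

```python
# Sketch: fit a shallow decision tree on 7 binary criterion indicators and read
# off which criteria subsets shift the win rate. Data is random, for illustration.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

criteria = ["specificity", "domain_knowledge", "complexity", "problem_solving",
            "creativity", "technical_accuracy", "real_world"]  # assumed names

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(1000, len(criteria)))   # 7 binary columns per prompt
p_win = 0.6 - 0.2 * X[:, 0]                          # toy effect: "specific" prompts are harder
y = rng.random(1000) < p_win                         # 1 = Llama 3 won the battle

tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=50).fit(X, y)
print(export_text(tree, feature_names=criteria))
# Each leaf's class probability is the empirical win rate for that criteria subset.
```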

<img src="/images/blog/llama3/dtree.svg" style="display:block; margin-top: auto; margin-left: auto; margin-right: auto; margin-bottom: auto; width: 100%"></img>
<p style="color:gray; text-align: center;">Figure 4. Llama 3-70b-Instruct win-rate conditioned on hierarchical prompt criteria subsets as fitted using a standard decision tree algorithm.</p>
<p style="color:gray; text-align: center;">Figure 4. Llama 3-70b-Instruct's win rate conditioned on hierarchical prompt criteria subsets as fitted using a standard decision tree algorithm.</p>

The first thing to notice is that “Specificity” is the root node of the tree, suggesting that this criteria already divides Llama 3-70b-Instruct’s performance into its strengths and weaknesses. It supports our initial findings above that Llama 3-70b-Instruct is stronger on open-ended prompts (not specific) rather than more objective tasks. We can traverse further down the tree and see that Llama 3-70b-Instruct is quite strong on open-ended creative prompts (see the blue path), reaching around a 60% win-rate against these top models. Following the orange path, we notice that Llama 3-70b-Instruct has a much lower win-rate against top models when answering specific reasoning-based prompts.
The first thing to notice is that “Specificity” is the root node of the tree, suggesting that this criterion alone divides Llama 3-70b-Instruct’s performance into its strengths and weaknesses. This supports our initial finding above that Llama 3-70b-Instruct is stronger on open-ended (not specific) prompts than on more objective tasks. Traversing further down the tree, we see that Llama 3-70b-Instruct is quite strong on open-ended creative prompts (see the blue path), reaching around a 60% win rate against these top models. Following the orange path, we notice that Llama 3-70b-Instruct has a much lower win rate against top models when answering specific reasoning-based prompts.

## The effect of overrepresented prompts and judges

**Effect of duplicate prompts.** Using fuzzy string matching, we find that ~9% (6658/7327) of the user prompts in battles between Llama 3 and the other top models are duplicates, and show in Table X that deduplication does not significantly affect Llama 3 win-rate.
**Effect of duplicate prompts.** Using fuzzy string matching, we find that ~9% of the user prompts in battles between Llama 3 and the other top models are duplicates (6,658 of the original 7,327 prompts remain after deduplication), and we show in Table 1 that deduplication does not significantly affect Llama 3's win rate.
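
A sketch of the deduplication idea using the standard library's `difflib`; the exact matcher and similarity threshold are not specified in this post, so treat both as assumptions.

```python
# Sketch: flag near-duplicate prompts with a fuzzy similarity ratio, keep one
# representative per group, then recompute win rates on the deduplicated set.
from difflib import SequenceMatcher

def is_near_duplicate(a: str, b: str, threshold: float = 0.95) -> bool:
    """Treat two prompts as duplicates if their similarity ratio exceeds threshold."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio() >= threshold

prompts = ["What is the capital of France?",
           "what is the capital of france",
           "Write a poem about the sea."]

unique: list[str] = []
for p in prompts:
    if not any(is_near_duplicate(p, q) for q in unique):
        unique.append(p)

print(f"{len(prompts) - len(unique)} duplicates removed, {len(unique)} unique prompts kept")
# Win rates are then recomputed on the deduplicated battles and compared (Table 1).
```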

<style>
th {text-align: left; font-weight: bold}
@@ -110,7 +111,7 @@ td {text-align: left}
</table>


**User analysis.** First we consider some basic user statistics in Table 2 to check that judging behavior is similar between Claude-3-opus-20240229 and Llama 3-70B-Instruct.
**User analysis.** First, we consider some basic user statistics in Table 2 to check that judging behavior is similar between Claude-3-Opus-20240229 and Llama 3-70B-Instruct.

<br>
<p style="color:gray; text-align: center;">Table 2. Detailed Engagement Metrics for LLMs (Timeframe: April 24 - May 1, 2023). The latest and detailed version <a href="https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard" target="_blank">here</a>.</p>
@@ -132,15 +133,15 @@ td {text-align: left}
</table>


In order to limit the impact of user’s that vote many times we can take the mean of each judge’s win rate, thereby bounding the impact of each individual judge. In this case, we find this stratified win rate shown in Table 3 is still very similar to the original winrate, suggesting that very active judges are not skewing the result.
In order to limit the impact of users who vote many times, we can take the mean of each judge’s win rate, thereby bounding the impact of any individual judge. In this case, we find that this stratified win rate, shown in Table 3, is still very similar to the original win rate, suggesting that very active judges are not skewing the result.
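
A minimal sketch of this stratified win rate, under assumed column names: each battle is scored as win = 1, tie = 0.5, loss = 0, averaged per judge first, and then averaged across judges so that heavy voters cannot dominate.

```python
# Sketch: overall vs. judge-stratified win rate from toy battle records.
import pandas as pd

battles = pd.DataFrame([
    {"judge_id": "u1", "winner": "model_a"},
    {"judge_id": "u1", "winner": "model_a"},
    {"judge_id": "u1", "winner": "tie"},
    {"judge_id": "u2", "winner": "model_b"},
])
# Score each battle from Llama 3's perspective (model_a is Llama 3 here).
score = battles["winner"].map({"model_a": 1.0, "tie": 0.5, "model_b": 0.0})

overall_win_rate = score.mean()                        # every vote weighted equally
per_judge = score.groupby(battles["judge_id"]).mean()  # one win rate per judge
stratified_win_rate = per_judge.mean()                 # every judge weighted equally

print(f"overall: {overall_win_rate:.3f}, stratified: {stratified_win_rate:.3f}")
```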


<br>
<p style="color:gray; text-align: center;">Table 3. Model Win Rates (Timeframe: April 24 - May 1, 2023). The latest and detailed version <a href="https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard" target="_blank">here</a>. Note that ties are counted as 0.5, with wins and losses as 1 and 0, respectively.</p>
<table style="display: flex; justify-content: center;">
<tbody>
<tr>
<th>Model</th> <th>Win rate</th> <th>Stratified Winrate</th>
<th>Model</th> <th>Win rate</th> <th>Stratified Win Rate</th>
</tr>
<tr>
<td>Llama 3-70B-Instruct</td> <td>0.541</td> <td>0.543</td>
@@ -150,21 +151,21 @@ In order to limit the impact of user’s that vote many times we can take the me
</tr>
</tbody>
</table>
Qualitative differences between Llama 3 outputs VS other models
From qualitative analysis of outputs between Llama 3 and other models, we observe that Llama 3 outputs are often more excited, positive, conversational, and friendly than other models.

**Measuring sentiment.** To measure excitement, we assign a binary label to each output based on the presence of an exclamation point. For positivity, friendliness, and conversationality, we use GPT-3.5 as a judge to rate each output on a scale of 1-5. In a given battle, Llama 3 outputs are labeled as more excited, positive, conversational, or friendly if their score is higher than the opponent's. Figure 5 displays the distribution of these qualities across models, revealing that Llama 3 outputs generally exhibit higher levels of excitement, positivity, friendliness, and conversationality compared to their opponents.
**Qualitative differences between Llama 3 outputs VS other models.** From qualitative analysis of outputs between Llama 3 and other models, we observe that Llama 3 outputs are often more excited, positive, conversational, and friendly than other models.

**Measuring sentiment.** To measure excitement, we assign a binary label to each output based on the presence of an exclamation point. For positivity, friendliness, and conversationality, we use GPT-3.5 as a judge to rate each output on a scale of 1-5. In a given battle, Llama 3's outputs are labeled as more excited, positive, conversational, or friendly if their score is higher than the opponent's. Figure 5 displays the distribution of these qualities across models, revealing that Llama 3's outputs generally exhibit higher levels of excitement, positivity, friendliness, and conversationality compared to their opponents'.
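
A sketch of these labels, with the GPT-3.5 judge mocked out; the helper names and the 1-5 rating call are placeholders for illustration.

```python
# Sketch: binary "excited" flag from exclamation points, plus an LLM-judge score
# (mocked here) for the other traits; Llama 3 is "more <trait>" if it scores higher.
def is_excited(text: str) -> bool:
    return "!" in text

def judge_score(text: str, trait: str) -> int:
    """Placeholder for a GPT-3.5 judge returning a 1-5 rating for `trait`."""
    return 3  # a real implementation would call the judge model

def more_trait(llama_out: str, opp_out: str, trait: str) -> bool:
    """Label Llama 3's output as 'more <trait>' if its score beats the opponent's."""
    if trait == "excited":
        return is_excited(llama_out) and not is_excited(opp_out)
    return judge_score(llama_out, trait) > judge_score(opp_out, trait)

print(more_trait("Great question! Here's a fun way to think about it.",
                 "The answer is 42.", "excited"))
```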

<img src="/images/blog/llama3/llama_sentiment_distribution.png" style="display:block; margin-top: auto; margin-left: auto; margin-right: auto; margin-bottom: auto; width: 85%"></img>
<p style="color:gray; text-align: center;">Figure 5: Proportion of arena prompts where Llama 3 is more positive/friendly/conversational/exclamatory than its opponent’s output</p>

**Is sentiment related to win-rate?** Figure 6 compares the sentiment qualities of Llama 3's outputs in battles it wins versus those it loses. We see that all traits appear more in winning battles and less in losing battles, but this difference is relatively small, especially for positivity and friendliness. This suggests that while these traits might play a role in competitive success, their influence requires further exploration for more definitive insights.
**Is sentiment related to win rate?** Figure 6 compares the sentiment qualities of Llama 3's outputs in battles it wins versus those it loses. We see that all traits appear more in winning battles and less in losing battles, but this difference is relatively small, especially for positivity and friendliness. This suggests that while these traits might play a role in competitive success, their influence requires further exploration for more definitive insights.
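
Given per-battle trait labels like the ones sketched above, the Figure 6 comparison reduces to grouping by outcome; the column names here are assumed.

```python
# Sketch: compare how often each trait flag appears in battles Llama 3 wins
# versus battles it loses, using toy labeled records.
import pandas as pd

labeled = pd.DataFrame([
    {"llama_won": True,  "more_friendly": True,  "more_excited": True},
    {"llama_won": True,  "more_friendly": False, "more_excited": True},
    {"llama_won": False, "more_friendly": False, "more_excited": False},
    {"llama_won": False, "more_friendly": True,  "more_excited": False},
])
print(labeled.groupby("llama_won")[["more_friendly", "more_excited"]].mean())
# A large gap between the win and loss rows would suggest a trait is associated
# with winning; in our data the observed gaps are relatively small, especially
# for positivity and friendliness.
```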

<img src="/images/blog/llama3/sentiment_win_rate.png" style="display:block; margin-top: auto; margin-left: auto; margin-right: auto; margin-bottom: auto; width: 85%"></img>
<p style="color:gray; text-align: center;">Figure 6: Llama 3 sentiment VS win rate which llama is more positive/friendly/conversational/exclamatory than its opponent’s output.</p>

## Conclusion
From the beginning, our mission has been to advance LLM development and understanding. In the past we have focused on high-level ranking and benchmark design. Moving forward we hope to extend the analysis here and conduct more in depth analysis into changes in human preference as well as model behavior.
From the beginning, our mission has been to advance LLM development and understanding. While in the past we have focused on high-level ranking and benchmark design, moving forward we hope to extend the analysis here and conduct a more in-depth investigation into changes in human preference as well as model behavior.


## Acknowledgment
2 changes: 1 addition & 1 deletion content/about.md
@@ -8,7 +8,7 @@ We aim to make large models accessible to everyone by co-development of open mod

### Members
**Student Team**
[Lianmin Zheng](https://lmzheng.net/), [Ying Sheng](https://sites.google.com/view/yingsheng/home), [Wei-Lin Chiang](https://infwinston.github.io/), [Lisa Dunlap](https://lisabdunlap.com), [Shiyi Cao](https://shiyicao.com/), [Tianle Li](https://codingwithtim.github.io/), [Christopher Chou](https://github.com/BabyChouSr), [Dacheng Li](https://dachengli1.github.io/), [Zhuohan Li](https://people.eecs.berkeley.edu/~zhuohan/), [Zi Lin](https://zi-lin.com/), [Zhanghao Wu](https://zhanghaowu.me/), [Shuo Yang](https://github.com/andy-yang-1), [Siyuan Zhuang](https://github.com/suquark), [Yonghao Zhuang](https://github.com/ZYHowell), [Lisa Dunlap](https://lisabdunlap.com)
[Lianmin Zheng](https://lmzheng.net/), [Ying Sheng](https://sites.google.com/view/yingsheng/home), [Wei-Lin Chiang](https://infwinston.github.io/), [Lisa Dunlap](https://lisabdunlap.com), [Shiyi Cao](https://shiyicao.com/), [Tianle Li](https://codingwithtim.github.io/), [Christopher Chou](https://github.com/BabyChouSr), [Dacheng Li](https://dachengli1.github.io/), [Zhuohan Li](https://people.eecs.berkeley.edu/~zhuohan/), [Zi Lin](https://zi-lin.com/), [Zhanghao Wu](https://zhanghaowu.me/), [Shuo Yang](https://github.com/andy-yang-1), [Siyuan Zhuang](https://github.com/suquark), [Yonghao Zhuang](https://github.com/ZYHowell)

**Faculty Team**
[Joseph E. Gonzalez](https://people.eecs.berkeley.edu/~jegonzal/), [Ion Stoica](https://people.eecs.berkeley.edu/~istoica/), [Eric P. Xing](http://www.cs.cmu.edu/~epxing/), [Hao Zhang](https://people.eecs.berkeley.edu/~hao/)
