diff --git a/blog/2024-05-02-kaggle-competition.md b/blog/2024-05-02-kaggle-competition.md
new file mode 100644
index 00000000..b0205893
--- /dev/null
+++ b/blog/2024-05-02-kaggle-competition.md
@@ -0,0 +1,21 @@
+---
+title: "LMSYS Kaggle Competition – Predicting Human Preference with $100,000 in Prizes"
+author: "LMSYS Arena Team"
+date: "May 2, 2024"
+previewImg: /images/blog/kaggle_competition/thumb_4x.png
+---
+
+### Overview
+
+LMSYS and Kaggle are launching a human preference prediction competition! You are challenged to predict which responses users will prefer in head-to-head battles between Large Language Models (LLMs). You'll work with a dataset from the [Chatbot Arena](https://chat.lmsys.org), containing conversations and user preferences across various LLMs. By developing a model that accurately predicts human preferences, you'll contribute to improving chatbot performance and alignment with user expectations. The training dataset includes over 55,000 real-world conversations between users and LLMs, along with the users' preferences; personally identifiable information has been removed. Your submission will be evaluated on a hidden test set of 25,000 samples.
+The dataset includes real-world conversations with over 70 state-of-the-art LLMs, such as GPT-4, Claude 2, Llama 2, Gemini, and Mistral models. [Click here to join the competition](https://www.kaggle.com/competitions/lmsys-chatbot-arena/overview) and download the dataset!
+
+
+
+### Background
+
+Current LLM benchmarks often fail to capture real-world LLM usage, resulting in a discrepancy between model performance and user satisfaction. Platforms like Chatbot Arena allow users to submit questions and vote on preferred responses; however, the potential of this data has been largely untapped in developing models that predict and optimize for user preferences at scale. Predicting user preferences is essential for creating human-aligned conversational AI that delivers a satisfying user experience. Successful models could enable language models to dynamically adapt their output based on individual preferences across different contexts and use cases. Moreover, this competition aims to uncover the factors that drive user preferences beyond objective correctness. Many user questions are open-ended, and we have already found a correlation between user preference and subjective qualities like conversationality. This could also be one of the best testbeds for reward modeling in your RLHF algorithms.
+
+### Competition Details
+
+The competition will run until August 5th, **with a total prize of $100,000**, featuring a $25,000 prize for 1st place, $20,000 prizes for 2nd through 4th places, and a $15,000 prize for 5th place. This is your opportunity to contribute to the advancement of human-aligned language models while gaining valuable insights into human preferences and decision-making. These insights could provide value to both the computer science and psychology communities, shedding light on the factors that shape human preferences in conversational AI.
diff --git a/blog/2024-05-01-llama3.md b/blog/2024-05-08-llama3.md
similarity index 72%
rename from blog/2024-05-01-llama3.md
rename to blog/2024-05-08-llama3.md
index 86d21a6b..48f8b383 100644
--- a/blog/2024-05-01-llama3.md
+++ b/blog/2024-05-08-llama3.md
@@ -1,36 +1,37 @@
---
title: "What’s up with Llama 3? Arena data analysis"
-author: "Lisa Dunlap, Evan Frick, Tianle Li, Isaac Ong, Joseph E. 
Gonzalez Wei-Lin Chiang"
-date: "May 2, 2024"
+author: "Lisa Dunlap, Evan Frick, Tianle Li, Isaac Ong, Joseph E. Gonzalez, Wei-Lin Chiang"
+date: "May 8, 2024"
previewImg: /images/blog/llama3/llama3_blog_cover.png
---

-On April 18th, Meta released Llama 3, their newest open-weight large language model. Since then, Llama 3-70B has quickly risen to the top of the English leaderboard with over 50,000 battles. This remarkable achievement by Meta is excellent news for the open-source community. In this blog post, we aim to provide more insight into why users rank Llama 3-70b on par with top-ranked models like GPT-4-Turbo, Gemini 1.5 Pro, and Claude 3 Opus.
+On April 18th, Meta released Llama 3, their newest open-weight large language model. Since then, Llama 3-70B has quickly risen to the top of the English [Chatbot Arena leaderboard](https://leaderboard.lmsys.org) with over 50,000 battles. This remarkable achievement by Meta is excellent news for the open-source community. In this blog post, we aim to provide more insight into why users rank Llama 3-70b on par with top-ranked models like GPT-4-Turbo, Gemini 1.5 Pro, and Claude 3 Opus.
We investigate the following:
1. What types of prompts are users asking? Do users prefer Llama 3 on certain types of prompts?
2. How challenging are these prompts? Does the ranking change if the prompts are easier/harder?
-3. Are certain users or prompts overrepresented? Do duplicate prompts or rankings from a small number of users affect the win-rate?
+3. Are certain users or prompts overrepresented? Do duplicate prompts or rankings from a small number of users affect the win rate?
4. Does Llama 3 have qualitative differences which make users like it more?

-We focus on battles consisting of llama-3-70b against 5 top-ranked models (claude-3-opus-20240229, gpt-4-0125-preview, gpt-4-1106-preview, gpt-4-turbo-2024-04-09, gemini-1.5-pro-0409-preview) and reach the following conclusions:
-1. Llama 3 beats other top-ranking models on open-ended writing and create problems and loses on more close-ended math and coding problems
-2. As prompts get harder*, Llama 3’s win-rate against top-tier models drops significantly.
-3. Deduplication or outliers do not significantly affect the win-rate
-4. Qualitatively, Llama 3’s outputs are friendlier and more conversational than other models, and these traits appear more often in battles that Llama 3 wins
+We focus on battles consisting of Llama 3-70b against 5 top-ranked models (claude-3-opus-20240229, gpt-4-0125-preview, gpt-4-1106-preview, gpt-4-turbo-2024-04-09, gemini-1.5-pro-0409-preview) and reach the following conclusions:
+1. Llama 3 beats other top-ranking models on open-ended writing and creative problems but loses on more close-ended math and coding problems.
+2. As prompts get harder, Llama 3’s win rate against top-tier models drops significantly.
+3. Deduplication or outliers do not significantly affect the win rate.
+4. Qualitatively, Llama 3’s outputs are friendlier and more conversational than other models, and these traits appear more often in battles that Llama 3 wins.
-Figure 1. LLama 3-70b win-rate(excluding ties) against top 5 models across prompt topics. * denotes that the category contains less than 50 battles.
+Figure 1. Llama 3-70b's win rate (excluding ties) against top 5 models across prompt topics. * denotes that the category contains less than 50 battles.
-## Analyzing win-rate across different types of prompts
+## Analyzing win rate across different types of prompts

-**Topic Analysis.** We utilize an LLM labeler (Llama 3-70b) to categorize user prompts into a [pre-established taxonomy of topics](https://arxiv.org/pdf/2404.12387) and visualize the win-rate of Llama 3-70b against the other top models in Figure 1. We see that Llama’s win rate is highest for open-ended and creative tasks like brainstorming and writing, and lowest for more close-ended technical tasks like math and translation. Interestingly, Llama 3 achieves the highest win-rate over data processing tasks which mainly consist of parsing and dataframe operations, but as this category has only 19 examples this remains inconclusive.
+**Topic Analysis.** We utilize an LLM labeler (Llama 3-70b) to categorize user prompts into a pre-established taxonomy of topics ([from Reka's paper](https://arxiv.org/pdf/2404.12387)) and visualize the win rate of Llama 3-70b against the other top models in Figure 1. We see that Llama 3’s win rate is highest for open-ended and creative tasks like brainstorming and writing, and lowest for more close-ended technical tasks like math and translation. Interestingly, Llama 3 achieves the highest win rate on data processing tasks, which mainly consist of parsing and dataframe operations, but as this category has only 19 examples, this remains inconclusive.

-**Win-rate VS prompt difficulty.** We employ our [recently released pipeline](https://lmsys.org/blog/2024-04-19-arena-hard/) which scores the difficulty of prompts to determine how Llama 3 compares to the other top models as prompts get harder. We define a set of ``hardness' ' criteria and use GPT-4-turbo to annotate each prompt from 0 to 7 to indicate how many of these criteria are satisfied (higher score indicates a harder prompt). Our 7 criteria are:
+**Win rate VS prompt difficulty.** We employ our [recently released pipeline](https://lmsys.org/blog/2024-04-19-arena-hard/) which scores the difficulty of prompts to determine how Llama 3 compares to the other top models as prompts get harder. We define a set of "hardness" criteria and use GPT-4-turbo to annotate each prompt from 0 to 7 to indicate how many of these criteria are satisfied (a higher score indicates a harder prompt). Our 7 criteria are:
@@ -59,24 +60,24 @@ We focus on battles consisting of llama-3-70b against 5 top-ranked models (claud
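To make this kind of LLM-based annotation concrete, here is a minimal sketch of how one might ask a judge model to flag hardness criteria for a single prompt and sum the flags into a 0 to 7 score. It is an illustration rather than the actual Arena-Hard pipeline: the criterion names, the judge model, and the JSON output format are assumptions, and the same prompt-the-judge pattern also covers the topic labeling described above.

```python
# Illustrative sketch (not the actual pipeline): ask an LLM judge which
# hardness criteria a prompt satisfies and count them. Criterion names,
# judge model, and output format are assumptions.
import json
from openai import OpenAI

CRITERIA = [
    "specificity", "domain_knowledge", "complexity", "problem_solving",
    "creativity", "technical_accuracy", "real_world_application",
]

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def hardness_score(prompt: str) -> int:
    """Return how many of the 7 criteria the judge thinks the prompt meets."""
    instruction = (
        "For the user prompt below, return a JSON object mapping each of "
        f"these criteria to true or false: {', '.join(CRITERIA)}.\n\n"
        f"Prompt:\n{prompt}"
    )
    resp = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": instruction}],
        temperature=0,
        response_format={"type": "json_object"},
    )
    flags = json.loads(resp.choices[0].message.content)
    return sum(bool(flags.get(c)) for c in CRITERIA)
```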
-We score 1000 battles against the top 3 models on the leaderboard and plot win-rate VS prompt score in Figure 2. We observe a significant drop in Llama 3's performance compared to the other top models, from high 50% win-rate to low 40%. We conclude that as more of these ``hardness'' criteria are met, Llama 3's win rate drop rapidly compared to other models. Notw that these criteria may not be exhaustive, see [the blog](https://lmsys.org/blog/2024-04-19-arena-hard/) for further discussion.
+We score 1000 battles against the top 3 models on the leaderboard and plot their win rates VS prompt score in Figure 2. We observe a significant drop in Llama 3's performance compared to the other top models, from a high 50% win rate to a low 40% win rate. We conclude that as more of these "hardness" criteria are met, Llama 3's win rate drops rapidly compared to other models. Note that these criteria may not be exhaustive; see [the blog](https://lmsys.org/blog/2024-04-19-arena-hard/) for further discussion.
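To show how such a plot can be produced from raw battle records, here is a small pandas sketch. The DataFrame layout (`model_a`, `model_b`, `winner`, and a precomputed `hardness_score` column) is an assumption for illustration, not the exact schema of the released Arena data.

```python
# Sketch: win rate (excluding ties) of one model, grouped by hardness score.
# Column names below are assumed for illustration.
import pandas as pd

def win_rate_by_hardness(battles: pd.DataFrame, model: str) -> pd.Series:
    """battles: model_a, model_b, winner ('model_a'/'model_b'/'tie'),
    hardness_score (0-7). Returns the model's win rate per score value."""
    df = battles[(battles.model_a == model) | (battles.model_b == model)]
    df = df[df.winner.isin(["model_a", "model_b"])]  # drop ties
    won = ((df.winner == "model_a") & (df.model_a == model)) | (
        (df.winner == "model_b") & (df.model_b == model)
    )
    return won.groupby(df.hardness_score).mean()

# Example usage with a hypothetical file:
# battles = pd.read_json("arena_battles_with_scores.json")
# print(win_rate_by_hardness(battles, "llama-3-70b-instruct"))
```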

-Figure 2. Several top model's winrate against the strongest 6 models over the intervals of number of key criteria satisfied. *English battles between strongest models: llama-3-70b-chat, claude-3-opus-20240229, gpt-4-0125-preview, gpt-4-1106-preview, gpt-4-turbo-2024-04-09, gemini-1.5-pro-api-0409-preview.
+Figure 2. Several top models' win rate against the strongest 6 models over the intervals of number of key criteria satisfied. *English battles between strongest models: llama-3-70b-chat, claude-3-opus-20240229, gpt-4-0125-preview, gpt-4-1106-preview, gpt-4-turbo-2024-04-09, gemini-1.5-pro-api-0409-preview.

Figure 3. The percentage of prompts with number of hardness criteria met in 3.5K sample of arena battles. We observe a significant portion of the battles are classified as hard (~27%).
-We can further analyze which types of prompts affect win-rate by fitting a decision tree on the 7 binary columns representing if a given prompt has satisfied each of the criteria above. From this decision tree we can segment prompts into criteria subsets such that Llama 3-70b-Instruct performs very well or very poorly. The tree shown in Figure 4 shows us which subsets change the model’s win-rate the most when conditioned on.
+We can further analyze which types of prompts affect win rate by fitting a decision tree on the 7 binary columns representing whether a given prompt has satisfied each of the criteria above. From this decision tree, we can segment prompts into criteria subsets such that Llama 3-70b-Instruct either performs very well or very poorly. The tree shown in Figure 4 shows us which subsets change the model’s win rate the most when conditioned on.
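As a rough sketch of this segmentation step (not the authors' exact code), one can fit a shallow scikit-learn decision tree on the seven binary criterion columns and print its splits; the column names, the `llama_won` label, and the hyperparameters are assumptions.

```python
# Sketch: fit a shallow decision tree on 7 binary criterion columns to find
# prompt subsets where the model wins or loses most often. Names are assumed.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

CRITERIA = [
    "specificity", "domain_knowledge", "complexity", "problem_solving",
    "creativity", "technical_accuracy", "real_world_application",
]

def fit_win_tree(battles: pd.DataFrame, max_depth: int = 3) -> str:
    """battles: one row per non-tie battle, with 0/1 criterion columns and a
    boolean `llama_won` label. Returns a text rendering of the fitted tree."""
    X = battles[CRITERIA].astype(int)
    y = battles["llama_won"].astype(int)
    tree = DecisionTreeClassifier(max_depth=max_depth, min_samples_leaf=50)
    tree.fit(X, y)
    return export_text(tree, feature_names=CRITERIA)

# Example usage with a hypothetical file:
# battles = pd.read_parquet("llama3_battles_with_criteria.parquet")
# print(fit_win_tree(battles))
```

Capping the depth and requiring a minimum leaf size keeps each subset large enough for its win rate to be meaningful.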

-Figure 4. Llama 3-70b-Instruct win-rate conditioned on hierarchical prompt criteria subsets as fitted using a standard decision tree algorithm.
+Figure 4. Llama 3-70b-Instruct's win rate conditioned on hierarchical prompt criteria subsets as fitted using a standard decision tree algorithm.

-The first thing to notice is that “Specificity” is the root node of the tree, suggesting that this criteria already divides Llama 3-70b-Instruct’s performance into its strengths and weaknesses. It supports our initial findings above that Llama 3-70b-Instruct is stronger on open-ended prompts (not specific) rather than more objective tasks. We can traverse further down the tree and see that Llama 3-70b-Instruct is quite strong on open-ended creative prompts (see the blue path), reaching around a 60% win-rate against these top models. Following the orange path, we notice that Llama 3-70b-Instruct has a much lower win-rate against top models when answering specific reasoning-based prompts.
+The first thing to notice is that “Specificity” is the root node of the tree, suggesting that this criterion already divides Llama 3-70b-Instruct’s performance into its strengths and weaknesses. It supports our initial findings above that Llama 3-70b-Instruct is stronger on open-ended prompts (not specific) than on more objective tasks. We can traverse further down the tree and see that Llama 3-70b-Instruct is quite strong on open-ended creative prompts (see the blue path), reaching around a 60% win rate against these top models. Following the orange path, we notice that Llama 3-70b-Instruct has a much lower win rate against top models when answering specific reasoning-based prompts.

## The effect of overrepresented prompts and judges

-**Effect of duplicate prompts.** Using fuzzy string matching, we find that ~9% (6658/7327) of the user prompts in battles between Llama 3 and the other top models are duplicates, and show in Table X that deduplication does not significantly affect Llama 3 win-rate.
+**Effect of duplicate prompts.** Using fuzzy string matching, we find that ~9% (6658/7327) of the user prompts in battles between Llama 3 and the other top models are duplicates, and show in Table 1 that deduplication does not significantly affect Llama 3's win rate.
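To make the deduplication step concrete, here is a simplified sketch of fuzzy prompt matching with difflib; the similarity threshold and the quadratic pairwise loop are illustrative assumptions, not necessarily the exact matching procedure used for the table.

```python
# Sketch: drop near-duplicate prompts via fuzzy matching (difflib).
# The 0.9 threshold is an assumption; O(n^2) is fine for a few thousand prompts.
from difflib import SequenceMatcher

def deduplicate(prompts: list[str], threshold: float = 0.9) -> list[int]:
    """Return indices of prompts to keep after dropping near-duplicates."""
    kept: list[int] = []
    for i, prompt in enumerate(prompts):
        is_dup = any(
            SequenceMatcher(None, prompt.lower(), prompts[j].lower()).ratio()
            >= threshold
            for j in kept
        )
        if not is_dup:
            kept.append(i)
    return kept

# Example usage:
# keep = deduplicate([b["prompt"] for b in battles])
# deduped = [battles[i] for i in keep]
```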