diff --git a/blog/2024-05-02-kaggle-competition.md b/blog/2024-05-02-kaggle-competition.md
new file mode 100644
index 00000000..b0205893
--- /dev/null
+++ b/blog/2024-05-02-kaggle-competition.md
@@ -0,0 +1,21 @@
+---
+title: "LMSYS Kaggle Competition – Predicting Human Preference with $100,000 in Prizes"
+author: "LMSYS Arena Team"
+date: "May 2, 2024"
+previewImg: /images/blog/kaggle_competition/thumb_4x.png
+---
+
+### Overview
+
+LMSYS and Kaggle are launching a human preference prediction competition! You are challenged to predict which responses users will prefer in head-to-head battles between Large Language Models (LLMs). You'll work with a dataset from the [Chatbot Arena](https://chat.lmsys.org), containing conversations and user preferences across various LLMs. By developing a model that accurately predicts human preferences, you'll contribute to improving chatbot performance and alignment with user expectations. The training dataset includes over 55,000 real-world conversations between users and LLMs, along with the users' preferences; personally identifiable information has been removed. Your submission will be evaluated on a hidden test set of 25,000 samples.
+The dataset includes real-world conversations with over 70 state-of-the-art LLMs, such as GPT-4, Claude 2, Llama 2, Gemini, and Mistral models. [Click here to join the competition](https://www.kaggle.com/competitions/lmsys-chatbot-arena/overview) and download the dataset!
+
+
+
+### Background
+
+Current LLM benchmarks often fail to capture real-world LLM usage, resulting in a discrepancy between model performance and user satisfaction. Platforms like Chatbot Arena allow users to submit questions and vote on preferred responses; however, the potential of this data has been largely untapped in developing models that predict and optimize for user preferences at scale. Predicting user preferences is essential for creating human-aligned conversational AI that delivers a satisfying user experience. Successful models could enable language models to dynamically adapt their output based on individual preferences across different contexts and use cases. Moreover, this competition aims to uncover the factors that drive user preferences beyond objective correctness. Many user questions are open-ended, and we have already found a correlation between user preference and subjective qualities like conversationality. This could also be one of the best testbeds for reward modeling in your RLHF algorithms.
+
+### Competition Details
+
+The competition will run until August 5th, **with a total prize of $100,000**, featuring a $25,000 prize for 1st place, $20,000 prizes for 2nd through 4th places, and a $15,000 prize for 5th place. This is your opportunity to contribute to the advancement of human-aligned language models while gaining valuable insights into human preferences and decision-making. These insights could provide value to both the computer science and psychology communities, shedding light on the factors that shape human preferences in conversational AI.
diff --git a/blog/2024-05-01-llama3.md b/blog/2024-05-08-llama3.md
similarity index 72%
rename from blog/2024-05-01-llama3.md
rename to blog/2024-05-08-llama3.md
index 86d21a6b..48f8b383 100644
--- a/blog/2024-05-01-llama3.md
+++ b/blog/2024-05-08-llama3.md
@@ -1,36 +1,37 @@
---
title: "What’s up with Llama 3? Arena data analysis"
-author: "Lisa Dunlap, Evan Frick, Tianle Li, Isaac Ong, Joseph E. 
Gonzalez Wei-Lin Chiang"
-date: "May 2, 2024"
+author: "Lisa Dunlap, Evan Frick, Tianle Li, Isaac Ong, Joseph E. Gonzalez, Wei-Lin Chiang"
+date: "May 8, 2024"
previewImg: /images/blog/llama3/llama3_blog_cover.png
---

-On April 18th, Meta released Llama 3, their newest open-weight large language model. Since then, Llama 3-70B has quickly risen to the top of the English leaderboard with over 50,000 battles. This remarkable achievement by Meta is excellent news for the open-source community. In this blog post, we aim to provide more insight into why users rank Llama 3-70b on par with top-ranked models like GPT-4-Turbo, Gemini 1.5 Pro, and Claude 3 Opus.
+On April 18th, Meta released Llama 3, their newest open-weight large language model. Since then, Llama 3-70B has quickly risen to the top of the English [Chatbot Arena leaderboard](https://leaderboard.lmsys.org) with over 50,000 battles. This remarkable achievement by Meta is excellent news for the open-source community. In this blog post, we aim to provide more insight into why users rank Llama 3-70b on par with top-ranked models like GPT-4-Turbo, Gemini 1.5 Pro, and Claude 3 Opus.
We investigate the following:
1. What types of prompts are users asking? Do users prefer Llama 3 on certain types of prompts?
2. How challenging are these prompts? Does the ranking change if the prompts are easier/harder?
-3. Are certain users or prompts overrepresented? Do duplicate prompts or rankings from a small number of users affect the win-rate?
+3. Are certain users or prompts overrepresented? Do duplicate prompts or rankings from a small number of users affect the win rate?
4. Does Llama 3 have qualitative differences which make users like it more?

-We focus on battles consisting of llama-3-70b against 5 top-ranked models (claude-3-opus-20240229, gpt-4-0125-preview, gpt-4-1106-preview, gpt-4-turbo-2024-04-09, gemini-1.5-pro-0409-preview) and reach the following conclusions:
-1. Llama 3 beats other top-ranking models on open-ended writing and create problems and loses on more close-ended math and coding problems
-2. As prompts get harder*, Llama 3’s win-rate against top-tier models drops significantly.
-3. Deduplication or outliers do not significantly affect the win-rate
-4. Qualitatively, Llama 3’s outputs are friendlier and more conversational than other models, and these traits appear more often in battles that Llama 3 wins
+We focus on battles consisting of Llama 3-70b against 5 top-ranked models (claude-3-opus-20240229, gpt-4-0125-preview, gpt-4-1106-preview, gpt-4-turbo-2024-04-09, gemini-1.5-pro-0409-preview) and reach the following conclusions:
+1. Llama 3 beats other top-ranking models on open-ended writing and creative problems but loses on more close-ended math and coding problems.
+2. As prompts get harder, Llama 3’s win rate against top-tier models drops significantly.
+3. Deduplication or outliers do not significantly affect the win rate.
+4. Qualitatively, Llama 3’s outputs are friendlier and more conversational than other models, and these traits appear more often in battles that Llama 3 wins.
-Figure 1. LLama 3-70b win-rate(excluding ties) against top 5 models across prompt topics. * denotes that the category contains less than 50 battles.
+Figure 1. Llama 3-70b's win rate (excluding ties) against top 5 models across prompt topics. * denotes that the category contains less than 50 battles.
-## Analyzing win-rate across different types of prompts
+## Analyzing win rate across different types of prompts

-**Topic Analysis.** We utilize an LLM labeler (Llama 3-70b) to categorize user prompts into a [pre-established taxonomy of topics](https://arxiv.org/pdf/2404.12387) and visualize the win-rate of Llama 3-70b against the other top models in Figure 1. We see that Llama’s win rate is highest for open-ended and creative tasks like brainstorming and writing, and lowest for more close-ended technical tasks like math and translation. Interestingly, Llama 3 achieves the highest win-rate over data processing tasks which mainly consist of parsing and dataframe operations, but as this category has only 19 examples this remains inconclusive.
+**Topic Analysis.** We utilize an LLM labeler (Llama 3-70b) to categorize user prompts into a pre-established taxonomy of topics ([from Reka's paper](https://arxiv.org/pdf/2404.12387)) and visualize the win rate of Llama 3-70b against the other top models in Figure 1. We see that Llama 3’s win rate is highest for open-ended and creative tasks like brainstorming and writing, and lowest for more close-ended technical tasks like math and translation. Interestingly, Llama 3 achieves the highest win rate on data processing tasks, which mainly consist of parsing and dataframe operations, but as this category has only 19 examples, this remains inconclusive.

-**Win-rate VS prompt difficulty.** We employ our [recently released pipeline](https://lmsys.org/blog/2024-04-19-arena-hard/) which scores the difficulty of prompts to determine how Llama 3 compares to the other top models as prompts get harder. We define a set of ``hardness' ' criteria and use GPT-4-turbo to annotate each prompt from 0 to 7 to indicate how many of these criteria are satisfied (higher score indicates a harder prompt). Our 7 criteria are:
+**Win rate VS prompt difficulty.** We employ our [recently released pipeline](https://lmsys.org/blog/2024-04-19-arena-hard/) which scores the difficulty of prompts to determine how Llama 3 compares to the other top models as prompts get harder. We define a set of "hardness" criteria and use GPT-4-turbo to annotate each prompt from 0 to 7 to indicate how many of these criteria are satisfied (a higher score indicates a harder prompt). Our 7 criteria are:
@@ -59,24 +60,24 @@ We focus on battles consisting of llama-3-70b against 5 top-ranked models (claud
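To make this kind of LLM-based annotation concrete, here is a minimal sketch of how one might ask a judge model to flag hardness criteria for a single prompt and sum the flags into a 0 to 7 score. It is an illustration rather than the actual Arena-Hard pipeline: the criterion names, the judge model, and the JSON output format are assumptions, and the same prompt-the-judge pattern also covers the topic labeling described above.

```python
# Illustrative sketch (not the actual pipeline): ask an LLM judge which
# hardness criteria a prompt satisfies and count them. Criterion names,
# judge model, and output format are assumptions.
import json
from openai import OpenAI

CRITERIA = [
    "specificity", "domain_knowledge", "complexity", "problem_solving",
    "creativity", "technical_accuracy", "real_world_application",
]

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def hardness_score(prompt: str) -> int:
    """Return how many of the 7 criteria the judge thinks the prompt meets."""
    instruction = (
        "For the user prompt below, return a JSON object mapping each of "
        f"these criteria to true or false: {', '.join(CRITERIA)}.\n\n"
        f"Prompt:\n{prompt}"
    )
    resp = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": instruction}],
        temperature=0,
        response_format={"type": "json_object"},
    )
    flags = json.loads(resp.choices[0].message.content)
    return sum(bool(flags.get(c)) for c in CRITERIA)
```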
-We score 1000 battles against the top 3 models on the leaderboard and plot win-rate VS prompt score in Figure 2. We observe a significant drop in Llama 3's performance compared to the other top models, from high 50% win-rate to low 40%. We conclude that as more of these ``hardness'' criteria are met, Llama 3's win rate drop rapidly compared to other models. Notw that these criteria may not be exhaustive, see [the blog](https://lmsys.org/blog/2024-04-19-arena-hard/) for further discussion.
+We score 1000 battles against the top 3 models on the leaderboard and plot their win rates VS prompt score in Figure 2. We observe a significant drop in Llama 3's performance compared to the other top models, from a high 50% win rate to a low 40% win rate. We conclude that as more of these "hardness" criteria are met, Llama 3's win rate drops rapidly compared to other models. Note that these criteria may not be exhaustive; see [the blog](https://lmsys.org/blog/2024-04-19-arena-hard/) for further discussion.
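To show how such a plot can be produced from raw battle records, here is a small pandas sketch. The DataFrame layout (`model_a`, `model_b`, `winner`, and a precomputed `hardness_score` column) is an assumption for illustration, not the exact schema of the released Arena data.

```python
# Sketch: win rate (excluding ties) of one model, grouped by hardness score.
# Column names below are assumed for illustration.
import pandas as pd

def win_rate_by_hardness(battles: pd.DataFrame, model: str) -> pd.Series:
    """battles: model_a, model_b, winner ('model_a'/'model_b'/'tie'),
    hardness_score (0-7). Returns the model's win rate per score value."""
    df = battles[(battles.model_a == model) | (battles.model_b == model)]
    df = df[df.winner.isin(["model_a", "model_b"])]  # drop ties
    won = ((df.winner == "model_a") & (df.model_a == model)) | (
        (df.winner == "model_b") & (df.model_b == model)
    )
    return won.groupby(df.hardness_score).mean()

# Example usage with a hypothetical file:
# battles = pd.read_json("arena_battles_with_scores.json")
# print(win_rate_by_hardness(battles, "llama-3-70b-instruct"))
```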

-Figure 2. Several top model's winrate against the strongest 6 models over the intervals of number of key criteria satisfied. *English battles between strongest models: llama-3-70b-chat, claude-3-opus-20240229, gpt-4-0125-preview, gpt-4-1106-preview, gpt-4-turbo-2024-04-09, gemini-1.5-pro-api-0409-preview.
+Figure 2. Several top models' win rate against the strongest 6 models over the intervals of number of key criteria satisfied. *English battles between strongest models: llama-3-70b-chat, claude-3-opus-20240229, gpt-4-0125-preview, gpt-4-1106-preview, gpt-4-turbo-2024-04-09, gemini-1.5-pro-api-0409-preview.

Figure 3. The percentage of prompts with number of hardness criteria met in 3.5K sample of arena battles. We observe a significant portion of the battles are classified as hard (~27%).
-We can further analyze which types of prompts affect win-rate by fitting a decision tree on the 7 binary columns representing if a given prompt has satisfied each of the criteria above. From this decision tree we can segment prompts into criteria subsets such that Llama 3-70b-Instruct performs very well or very poorly. The tree shown in Figure 4 shows us which subsets change the model’s win-rate the most when conditioned on.
+We can further analyze which types of prompts affect win rate by fitting a decision tree on the 7 binary columns representing whether a given prompt has satisfied each of the criteria above. From this decision tree, we can segment prompts into criteria subsets such that Llama 3-70b-Instruct either performs very well or very poorly. The tree shown in Figure 4 shows us which subsets change the model’s win rate the most when conditioned on.
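As a rough sketch of this segmentation step (not the authors' exact code), one can fit a shallow scikit-learn decision tree on the seven binary criterion columns and print its splits; the column names, the `llama_won` label, and the hyperparameters are assumptions.

```python
# Sketch: fit a shallow decision tree on 7 binary criterion columns to find
# prompt subsets where the model wins or loses most often. Names are assumed.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

CRITERIA = [
    "specificity", "domain_knowledge", "complexity", "problem_solving",
    "creativity", "technical_accuracy", "real_world_application",
]

def fit_win_tree(battles: pd.DataFrame, max_depth: int = 3) -> str:
    """battles: one row per non-tie battle, with 0/1 criterion columns and a
    boolean `llama_won` label. Returns a text rendering of the fitted tree."""
    X = battles[CRITERIA].astype(int)
    y = battles["llama_won"].astype(int)
    tree = DecisionTreeClassifier(max_depth=max_depth, min_samples_leaf=50)
    tree.fit(X, y)
    return export_text(tree, feature_names=CRITERIA)

# Example usage with a hypothetical file:
# battles = pd.read_parquet("llama3_battles_with_criteria.parquet")
# print(fit_win_tree(battles))
```

Capping the depth and requiring a minimum leaf size keeps each subset large enough for its win rate to be meaningful.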

-Figure 4. Llama 3-70b-Instruct win-rate conditioned on hierarchical prompt criteria subsets as fitted using a standard decision tree algorithm.
+Figure 4. Llama 3-70b-Instruct's win rate conditioned on hierarchical prompt criteria subsets as fitted using a standard decision tree algorithm.

-The first thing to notice is that “Specificity” is the root node of the tree, suggesting that this criteria already divides Llama 3-70b-Instruct’s performance into its strengths and weaknesses. It supports our initial findings above that Llama 3-70b-Instruct is stronger on open-ended prompts (not specific) rather than more objective tasks. We can traverse further down the tree and see that Llama 3-70b-Instruct is quite strong on open-ended creative prompts (see the blue path), reaching around a 60% win-rate against these top models. Following the orange path, we notice that Llama 3-70b-Instruct has a much lower win-rate against top models when answering specific reasoning-based prompts.
+The first thing to notice is that “Specificity” is the root node of the tree, suggesting that this criterion already divides Llama 3-70b-Instruct’s performance into its strengths and weaknesses. It supports our initial findings above that Llama 3-70b-Instruct is stronger on open-ended prompts (not specific) than on more objective tasks. We can traverse further down the tree and see that Llama 3-70b-Instruct is quite strong on open-ended creative prompts (see the blue path), reaching around a 60% win rate against these top models. Following the orange path, we notice that Llama 3-70b-Instruct has a much lower win rate against top models when answering specific reasoning-based prompts.

## The effect of overrepresented prompts and judges

-**Effect of duplicate prompts.** Using fuzzy string matching, we find that ~9% (6658/7327) of the user prompts in battles between Llama 3 and the other top models are duplicates, and show in Table X that deduplication does not significantly affect Llama 3 win-rate.
+**Effect of duplicate prompts.** Using fuzzy string matching, we find that ~9% (6658/7327) of the user prompts in battles between Llama 3 and the other top models are duplicates, and show in Table 1 that deduplication does not significantly affect Llama 3's win rate.
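To make the deduplication step concrete, here is a simplified sketch of fuzzy prompt matching with difflib; the similarity threshold and the quadratic pairwise loop are illustrative assumptions, not necessarily the exact matching procedure used for the table.

```python
# Sketch: drop near-duplicate prompts via fuzzy matching (difflib).
# The 0.9 threshold is an assumption; O(n^2) is fine for a few thousand prompts.
from difflib import SequenceMatcher

def deduplicate(prompts: list[str], threshold: float = 0.9) -> list[int]:
    """Return indices of prompts to keep after dropping near-duplicates."""
    kept: list[int] = []
    for i, prompt in enumerate(prompts):
        is_dup = any(
            SequenceMatcher(None, prompt.lower(), prompts[j].lower()).ratio()
            >= threshold
            for j in kept
        )
        if not is_dup:
            kept.append(i)
    return kept

# Example usage:
# keep = deduplicate([b["prompt"] for b in battles])
# deduped = [battles[i] for i in keep]
```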