diff --git a/blog/2024-05-08-llama3.md b/blog/2024-05-08-llama3.md
index 28ddd39a..5f8996e9 100644
--- a/blog/2024-05-08-llama3.md
+++ b/blog/2024-05-08-llama3.md
@@ -73,7 +73,7 @@ We can further analyze which types of prompts affect win rate by fitting a decis

Figure 4. Llama 3-70b-Instruct's win rate conditioned on hierarchical prompt criteria subsets as fitted using a standard decision tree algorithm.

-The first thing to notice is that “Specificity” is the root node of the tree, suggesting that this criteria already divides Llama 3-70b-Instruct’s performance into its strengths and weaknesses. It supports our initial findings above that Llama 3-70b-Instruct is stronger on open-ended prompts (not specific) rather than more objective tasks. We can traverse further down the tree and see that Llama 3-70b-Instruct is quite strong on open-ended creative prompts (see the blue path), reaching around a 60% win rate against these top models. Following the orange path, we notice that Llama 3-70b-Instruct has a much lower win rate against top models when answering specific reasoning-based prompts.
+The first thing to notice is that “Specificity” is the root node of the tree, suggesting that this criterion most immediately divides Llama 3-70b-Instruct’s performance into its strengths and weaknesses. It supports our initial findings above that Llama 3-70b-Instruct is stronger on open-ended tasks than on more closed-ended ones. Traversing further down the tree, we see that Llama 3-70b-Instruct is quite strong on open-ended creative questions (see the blue path), reaching around a 60% win rate against these top models. Empirically, these questions are often writing- and brainstorming-style prompts. For example, two prompts where Llama-3-70B-Instruct won are: "Write the first chapter of a novel." and "Could you provide two story suggestions for children that promote altruism?" On the other hand, following the orange path, we notice that Llama 3-70b-Instruct has a lower win rate against top models when answering closed-ended, non-real-world, reasoning-based questions. These are often logic puzzles and math word problems. Two examples where Llama-3-70B-Instruct lost are: "123x = -4x * 2 - 65" and "There are two ducks in front of a duck, two ducks behind a duck and a duck in the middle. How many ducks are there?"
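+
+The tree-fitting step can be sketched as follows. This is a hypothetical illustration, not the code used for Figure 4: the criteria names, the toy battle rows, and the labels are all made up for the example; the real analysis fits on the full set of labeled Arena battles.
+
```python
# Hypothetical sketch: fit a shallow decision tree on binary prompt-criteria
# labels to find which criteria best split the model's win rate.
# The criteria names and the toy rows below are invented for illustration.
from sklearn.tree import DecisionTreeClassifier, export_text

criteria = ["specificity", "creativity", "reasoning", "real_world"]
# Each row is one battle; each column marks whether the prompt has that criterion.
X = [
    [0, 1, 0, 0],  # open-ended creative prompt
    [0, 1, 0, 1],
    [1, 0, 1, 0],  # specific reasoning prompt
    [1, 0, 1, 0],
    [1, 0, 1, 1],
    [0, 0, 0, 1],
]
y = [1, 1, 0, 0, 1, 1]  # 1 = model won the battle, 0 = model lost

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(tree, feature_names=criteria))  # inspect the fitted splits
```
+
+Reading the printed splits top-down then gives the kind of hierarchy shown in Figure 4, with the most discriminative criterion at the root.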
## The effect of overrepresented prompts and judges