
freshness blog post #29


Merged
merged 6 commits into from
Apr 19, 2025

Conversation

lisadunlap (Contributor):

No description provided.


Given a prompt submitted at time $$t$$, we examine the following:

- Nearest neighbor with all prompts submitted before time $$t$$
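The nearest-neighbor quantity above could be computed along these lines. This is a minimal sketch, not the post's implementation: the function and variable names are my own, and the embeddings are plain Python lists standing in for text-embedding-3-small vectors.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def max_similarity_before(t, prompt_emb, history):
    """Largest similarity between the current prompt and any prompt
    submitted before time t. history is a list of (timestamp, embedding)."""
    past = [emb for ts, emb in history if ts < t]
    if not past:
        return 0.0  # no earlier prompts to compare against
    return max(cosine_similarity(prompt_emb, emb) for emb in past)
```

Shifting the cutoff (e.g. `ts < t - 1` for "at least one day before") gives the other variants in the list.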
Contributor:

We may want to make this clearer, e.g.

- The largest similarity between the current prompt and any prompt submitted before time $$t$$
- The largest similarity between the current prompt and any prompt submitted at least one day before time $$t$$
- ...

Contributor:

Edited in cd539a3


## How do we measure prompt duplicates?

Prompt duplicates are measured by the cosine similarity of their text embeddings (OpenAI's text-embedding-3-small). If the similarity between the embeddings of prompt a and prompt b is greater than or equal to 0.7, we consider them duplicates. This threshold was set by manually looking through examples to determine when two prompts are asking the same thing. A random sample of prompt pairs with their similarities is provided on our hugging face.
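The duplicate check described above can be sketched as follows. The toy vectors here are placeholders; in practice each embedding would come from the text-embedding-3-small API, and the 0.7 threshold is the one the post chose by manual inspection.

```python
import math

DUPLICATE_THRESHOLD = 0.7  # chosen by manually inspecting prompt pairs, per the post

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def is_duplicate(emb_a, emb_b, threshold=DUPLICATE_THRESHOLD):
    """Two prompts count as duplicates when their embedding similarity >= threshold."""
    return cosine_similarity(emb_a, emb_b) >= threshold
```

For example, near-parallel vectors like `[1.0, 0.0, 0.1]` and `[0.9, 0.1, 0.2]` clear the threshold, while orthogonal ones do not.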
Contributor:

Hugging Face? HuggingFace? Don't know what's the right way to capitalize/space this lol.

Contributor:

Edited in cd539a3


While we do see a downward trend in the proportion of unique prompts over time, the decrease is plateauing. Interestingly, we also see certain dates where prompt freshness is significantly lower than on neighboring dates; we will get to why that is in the next section.
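The per-day freshness curve described above could be computed roughly as follows. This is a hedged sketch with invented helper names, assuming a prompt counts as fresh when its similarity to every earlier prompt stays below the 0.7 threshold from the post.

```python
import math
from collections import defaultdict

THRESHOLD = 0.7  # duplicate threshold from the post

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def fresh_fraction_by_day(prompts):
    """prompts: chronological list of (day, embedding).
    Returns {day: fraction of that day's prompts with no near-duplicate
    among all earlier prompts}."""
    seen = []                      # embeddings of every prompt so far
    fresh = defaultdict(int)
    total = defaultdict(int)
    for day, emb in prompts:
        total[day] += 1
        if all(cosine_similarity(emb, past) < THRESHOLD for past in seen):
            fresh[day] += 1
        seen.append(emb)
    return {d: fresh[d] / total[d] for d in total}
```

On day 1 every prompt is trivially fresh (there is no history yet), which matches the 100%-freshness starting point discussed in the review below; the brute-force scan over `seen` would be replaced by a nearest-neighbor index at real scale.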
Contributor:

Can we say something a little more descriptive here? E.g.

"If you look at the above analysis, the proportion of fresh prompts decreases as a function of $$t$$. This is expected, since as $$t$$ grows, we are comparing new prompts with an ever-larger set of past prompts. For example, when $$t=1$$, there are no previous prompts, so of course, the freshness is 100%.

However, as $$t$$ grows, this number stabilizes to around 70-80% fresh prompts at a similarity threshold of 0.7. This equilibrium represents the fraction of fresh prompts that we expect chatbot arena to generate in the long run."

Contributor:

Edited in cd539a3

@aangelopoulos aangelopoulos merged commit 0b68fd7 into lmarena:main Apr 19, 2025