From 090771674cd07fbcc471779eab3c88fed48ef0b9 Mon Sep 17 00:00:00 2001
From: Graham Neubig <neubig@gmail.com>
Date: Sat, 9 Nov 2024 22:12:09 -0500
Subject: [PATCH] Update llms.md w/ more recent results (#4874)

---
 docs/modules/usage/llms/llms.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/modules/usage/llms/llms.md b/docs/modules/usage/llms/llms.md
index 75d19296270b..3ce773fcc15e 100644
--- a/docs/modules/usage/llms/llms.md
+++ b/docs/modules/usage/llms/llms.md
@@ -4,11 +4,11 @@ OpenHands can connect to any LLM supported by LiteLLM. However, it requires a po
 
 ## Model Recommendations
 
-Based on a recent evaluation of language models for coding tasks (using the SWE-bench dataset), we can provide some recommendations for model selection. The full analysis can be found in [this blog article](https://www.all-hands.dev/blog/evaluation-of-llms-as-coding-agents-on-swe-bench-at-30x-speed).
+Based on our evaluations of language models for coding tasks (using the SWE-bench dataset), we can provide some recommendations for model selection. Some analyses can be found in [this blog article comparing LLMs](https://www.all-hands.dev/blog/evaluation-of-llms-as-coding-agents-on-swe-bench-at-30x-speed) and [this blog article with some more recent results](https://www.all-hands.dev/blog/openhands-codeact-21-an-open-state-of-the-art-software-development-agent).
 
 When choosing a model, consider both the quality of outputs and the associated costs. Here's a summary of the findings:
 
-- Claude 3.5 Sonnet is the best by a fair amount, achieving a 27% resolve rate with the default agent in OpenHands.
+- Claude 3.5 Sonnet is the best by a fair amount, achieving a 53% resolve rate on SWE-Bench Verified with the default agent in OpenHands.
 - GPT-4o lags behind, and o1-mini actually performed somewhat worse than GPT-4o. We went in and analyzed the results a little, and briefly it seemed like o1 was sometimes "overthinking" things, performing extra environment configuration tasks when it could just go ahead and finish the task.
 - Finally, the strongest open models were Llama 3.1 405 B and deepseek-v2.5, and they performed reasonably, even besting some of the closed models.