Fix issue #4225: Add evaluation data to the LLMs docs (#4312)

Co-authored-by: Graham Neubig <[email protected]>
All-Hands-AI · Oct 13, 2024 · ff8a9a1
1 parent edcc391
Showing 1 changed file with 38 additions and 24 deletions.
diff --git a/docs/modules/usage/llms/llms.md b/docs/modules/usage/llms/llms.md
@@ -1,12 +1,25 @@
 # 🤖 LLM Backends
 
 OpenHands can connect to any LLM supported by LiteLLM. However, it requires a powerful model to work.
-The following are verified by the community to work with OpenHands:
 
-* claude-3-5-sonnet (recommended)
-* gemini-1.5-pro / gemini-1.5-flash
-* gpt-4 / gpt-4o
-* llama-3.1-405b / hermes-3-llama-3.1-405b
+## Model Recommendations
+
+Based on a recent evaluation of language models on coding tasks (using the SWE-bench dataset), we offer the following recommendations for model selection. The full analysis is available in [this blog article](https://www.all-hands.dev/blog/evaluation-of-llms-as-coding-agents-on-swe-bench-at-30x-speed).
+
+When choosing a model, consider both the quality of outputs and the associated costs. Here's a summary of the findings:
+
+- Claude 3.5 Sonnet performs best by a fair margin, achieving a 27% resolve rate with the default agent in OpenHands.
+- GPT-4o lags behind, and o1-mini performed somewhat worse than GPT-4o. A brief analysis of the results suggested that o1-mini was sometimes "overthinking", performing extra environment configuration tasks when it could have simply finished the task.
+- Finally, the strongest open models were Llama 3.1 405B and deepseek-v2.5, which performed reasonably well, even outperforming some of the closed models.
+
+Please refer to the [full article](https://www.all-hands.dev/blog/evaluation-of-llms-as-coding-agents-on-swe-bench-at-30x-speed) for more details.
+
+Based on these findings and community feedback, the following models have been verified to work reasonably well with OpenHands:
+
+- claude-3-5-sonnet (recommended)
+- gpt-4 / gpt-4o
+- llama-3.1-405b
+- deepseek-v2.5
 
 :::warning
 OpenHands will issue many prompts to the LLM you configure. Most of these LLMs cost money, so be sure to set spending
@@ -28,40 +41,41 @@ models driving it. However, if you do find ones that work, please add them to th
 ## LLM Configuration
 
 The following can be set in the OpenHands UI through the Settings:
-* `LLM Provider`
-* `LLM Model`
-* `API Key`
-* `Base URL` (through `Advanced Settings`)
+
+- `LLM Provider`
+- `LLM Model`
+- `API Key`
+- `Base URL` (through `Advanced Settings`)
 
 Some settings that certain LLMs/providers require cannot be set through the UI. Instead, set them
 through environment variables passed to the [docker run command](/modules/usage/installation)
 using `-e`:
 
-* `LLM_API_VERSION`
-* `LLM_EMBEDDING_MODEL`
-* `LLM_EMBEDDING_DEPLOYMENT_NAME`
-* `LLM_DROP_PARAMS`
-* `LLM_DISABLE_VISION`
-* `LLM_CACHING_PROMPT`
+- `LLM_API_VERSION`
+- `LLM_EMBEDDING_MODEL`
+- `LLM_EMBEDDING_DEPLOYMENT_NAME`
+- `LLM_DROP_PARAMS`
+- `LLM_DISABLE_VISION`
+- `LLM_CACHING_PROMPT`
 
 We have a few guides for running OpenHands with specific model providers:
 
-* [Azure](llms/azure-llms)
-* [Google](llms/google-llms)
-* [Groq](llms/groq)
-* [OpenAI](llms/openai-llms)
-* [OpenRouter](llms/openrouter)
+- [Azure](llms/azure-llms)
+- [Google](llms/google-llms)
+- [Groq](llms/groq)
+- [OpenAI](llms/openai-llms)
+- [OpenRouter](llms/openrouter)
 
 ### API retries and rate limits
 
 LLM providers typically have rate limits, sometimes very low ones, and may require retries. OpenHands will automatically retry a request if it receives a rate limit error (HTTP 429), an API connection error, or another transient error.
 
 You can customize these options for the provider you're using. Check your provider's documentation, then set the following environment variables to control the number of retries and the wait time between them:
 
-* `LLM_NUM_RETRIES` (Default of 8)
-* `LLM_RETRY_MIN_WAIT` (Default of 15 seconds)
-* `LLM_RETRY_MAX_WAIT` (Default of 120 seconds)
-* `LLM_RETRY_MULTIPLIER` (Default of 2)
+- `LLM_NUM_RETRIES` (Default of 8)
+- `LLM_RETRY_MIN_WAIT` (Default of 15 seconds)
+- `LLM_RETRY_MAX_WAIT` (Default of 120 seconds)
+- `LLM_RETRY_MULTIPLIER` (Default of 2)
 
 If you are running OpenHands in development mode, you can also set these options in the `config.toml` file: