From e52aee168e926d7cd60c78cf43aba464eaa4101e Mon Sep 17 00:00:00 2001
From: Xingyao Wang
Date: Fri, 21 Feb 2025 01:16:17 -0500
Subject: [PATCH] Docs: Clarify config.toml usage in evaluation harness (#6828)

Co-authored-by: openhands
---
 evaluation/README.md | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/evaluation/README.md b/evaluation/README.md
index b8da72f8c5ee..cfaf1ba36c4d 100644
--- a/evaluation/README.md
+++ b/evaluation/README.md
@@ -20,6 +20,8 @@ To evaluate an agent, you can provide the agent's name to the `run_infer.py` pro
 ### Evaluating Different LLMs
 
 OpenHands in development mode uses `config.toml` to keep track of most configuration.
+**IMPORTANT: For evaluation, only the LLM section in `config.toml` will be used. Other configurations, such as `save_trajectory_path`, are not applied during evaluation.**
+
 Here's an example configuration file you can use to define and use multiple LLMs:
 
 ```toml
@@ -40,6 +42,8 @@ api_key = "XXX"
 temperature = 0.0
 ```
 
+For other configurations specific to evaluation, such as `save_trajectory_path`, these are typically set in the `get_config` function of the respective `run_infer.py` file for each benchmark.
+
 ## Supported Benchmarks
 
 The OpenHands evaluation harness supports a wide variety of benchmarks across [software engineering](#software-engineering), [web browsing](#web-browsing), [miscellaneous assistance](#misc-assistance), and [real-world](#real-world) tasks.
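
The body of the README's `config.toml` example falls between the two hunks and is elided by the diff context; only its first fence line and its last lines (`api_key = "XXX"`, `temperature = 0.0`) are visible. For orientation, a multi-LLM configuration of the kind the README describes might look like the sketch below — the group name `eval_gpt4o` and the model strings are illustrative assumptions, not text from the patch:

```toml
# Default [llm] section -- per the note added in this patch, this is the
# only part of config.toml the evaluation harness reads.
[llm]
model = "gpt-4o"        # illustrative model name
api_key = "XXX"

# A named LLM config group for evaluation; the group name "eval_gpt4o"
# is hypothetical. Its trailing keys mirror the context lines visible
# at the top of the second hunk.
[llm.eval_gpt4o]
model = "gpt-4o"
api_key = "XXX"
temperature = 0.0
```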