Skip to content

Commit

Permalink
Fix pr #5223: Fix issue #5222: [Refactor]: Refactor the evaluation di…
Browse files Browse the repository at this point in the history
…rectory
  • Loading branch information
openhands-agent committed Nov 23, 2024
1 parent 1eb6d07 commit fee148e
Show file tree
Hide file tree
Showing 3 changed files with 48 additions and 3 deletions.
45 changes: 45 additions & 0 deletions CREDITS.md
Original file line number Diff line number Diff line change
Expand Up @@ -24,6 +24,27 @@ OpenHands includes and adapts the following open source projects. We are gratefu
### Reference Implementations for Evaluation Benchmarks
OpenHands integrates code of the reference implementations for the following agent evaluation benchmarks:

#### [DiscoveryBench](https://github.com/FudanDISC/DiscoveryBench)
- License: Apache License 2.0

#### [ToolQA](https://github.com/night-chen/ToolQA)
- License: Apache License 2.0

#### [BioCoder](https://github.com/microsoft/biocoder)
- License: MIT License

#### [BrowsingDelegation](https://github.com/browsing-delegation/browsing-delegation)
- License: MIT License

#### [Commit0Bench](https://github.com/commit0-ai/commit0-ai)
- License: MIT License

#### [EDA](https://github.com/microsoft/EDA)
- License: MIT License

#### [AiderBench](https://github.com/paul-gauthier/aider)
- License: Apache License 2.0

#### [HumanEval](https://github.com/openai/human-eval)
- License: MIT License

Expand All @@ -49,6 +70,30 @@ OpenHands integrates code of the reference implementations for the following age
#### [GPQA](https://github.com/idavidrein/gpqa)
- License: MIT License

#### [MiniWoB](https://github.com/Farama-Foundation/miniwob-plusplus)
- License: MIT License

#### [WebArena](https://github.com/web-arena-x/webarena)
- License: Apache License 2.0

#### [MINT](https://github.com/xingyaoww/mint-bench)
- License: Apache License 2.0

#### [ML-Bench](https://github.com/gersteinlab/ML-Bench)
- License: MIT License

#### [LogicReasoning](https://github.com/google-deepmind/logic-inference-dataset)
- License: Apache License 2.0

#### [HumanEvalFix](https://github.com/amazon-science/ReWOO)
- License: Apache License 2.0

#### [ScienceAgentBench](https://github.com/OpenBMB/ScienceAgentBench)
- License: Apache License 2.0

#### [Gaia](https://github.com/microsoft/GAIA)
- License: MIT License

#### [ProntoQA](https://github.com/asaparov/prontoqa)
- License: Apache License 2.0

Expand Down
2 changes: 1 addition & 1 deletion evaluation/benchmarks/gaia/scripts/run_infer.sh
Original file line number Diff line number Diff line change
Expand Up @@ -35,7 +35,7 @@ echo "AGENT_VERSION: $AGENT_VERSION"
echo "MODEL_CONFIG: $MODEL_CONFIG"
echo "LEVELS: $LEVELS"

COMMAND="poetry run python ./evaluation/gaia/run_infer.py \
COMMAND="poetry run python ./evaluation/benchmarks/gaia/run_infer.py \
--agent-cls $AGENT \
--llm-config $MODEL_CONFIG \
--max-iterations 30 \
Expand Down
4 changes: 2 additions & 2 deletions evaluation/benchmarks/mint/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -6,7 +6,7 @@ We support evaluation of the [Eurus subset focus on math and code reasoning](htt

## Setup Environment and LLM Configuration

Please follow instruction [here](../README.md#setup) to setup your local development environment and LLM.
Please follow instruction [here](../../README.md#setup) to setup your local development environment and LLM.

## Start the evaluation

Expand Down Expand Up @@ -34,7 +34,7 @@ Note: in order to use `eval_limit`, you must also set `subset`.
For example,

```bash
./evaluation/swe_bench/scripts/run_infer.sh eval_gpt4_1106_preview 0.6.2 gsm8k 3
./evaluation/benchmarks/mint/scripts/run_infer.sh eval_gpt4_1106_preview 0.6.2 gsm8k 3
```

## Reference
Expand Down

0 comments on commit fee148e

Please sign in to comment.