Commit ecb1d81
Update paths to reference evaluation/benchmarks/ directory for benchmarks while keeping other directories directly under evaluation/

openhands-agent committed Nov 23, 2024
1 parent 2a3460c commit ecb1d81
Showing 5 changed files with 23 additions and 23 deletions.
6 changes: 3 additions & 3 deletions .github/workflows/eval-runner.yml
@@ -86,12 +86,12 @@ jobs:
EVAL_DOCKER_IMAGE_PREFIX: us-central1-docker.pkg.dev/evaluation-092424/swe-bench-images

run: |
-poetry run ./evaluation/swe_bench/scripts/run_infer.sh llm.eval HEAD CodeActAgent 300 30 $N_PROCESSES "princeton-nlp/SWE-bench_Lite" test
+poetry run ./evaluation/benchmarks/swe_bench/scripts/run_infer.sh llm.eval HEAD CodeActAgent 300 30 $N_PROCESSES "princeton-nlp/SWE-bench_Lite" test
OUTPUT_FOLDER=$(find evaluation/evaluation_outputs/outputs/princeton-nlp__SWE-bench_Lite-test/CodeActAgent -name "deepseek-chat_maxiter_50_N_*-no-hint-run_1" -type d | head -n 1)
echo "OUTPUT_FOLDER for SWE-bench evaluation: $OUTPUT_FOLDER"
-poetry run ./evaluation/swe_bench/scripts/eval_infer_remote.sh $OUTPUT_FOLDER/output.jsonl $N_PROCESSES "princeton-nlp/SWE-bench_Lite" test
+poetry run ./evaluation/benchmarks/swe_bench/scripts/eval_infer_remote.sh $OUTPUT_FOLDER/output.jsonl $N_PROCESSES "princeton-nlp/SWE-bench_Lite" test
-poetry run ./evaluation/swe_bench/scripts/eval/summarize_outputs.py $OUTPUT_FOLDER/output.jsonl > summarize_outputs.log 2>&1
+poetry run ./evaluation/benchmarks/swe_bench/scripts/eval/summarize_outputs.py $OUTPUT_FOLDER/output.jsonl > summarize_outputs.log 2>&1
echo "SWEBENCH_REPORT<<EOF" >> $GITHUB_ENV
cat summarize_outputs.log >> $GITHUB_ENV
echo "EOF" >> $GITHUB_ENV
@@ -76,7 +76,7 @@ The `run_controller()` function is the core of OpenHands's execution. It manages

## Easiest way to get started: Exploring Existing Benchmarks

-We encourage you to review the various evaluation benchmarks available in the [`evaluation/` directory](https://github.com/All-Hands-AI/OpenHands/blob/main/evaluation) of our repository.
+We encourage you to review the various evaluation benchmarks available in the [`evaluation/benchmarks/` directory](https://github.com/All-Hands-AI/OpenHands/blob/main/evaluation/benchmarks) of our repository.

To integrate your own benchmark, we suggest starting with the one that most closely resembles your needs. This approach can significantly streamline your integration process, allowing you to build upon existing structures and adapt them to your specific requirements.

@@ -73,7 +73,7 @@ The main entry point of OpenHands is in `openhands/core/main.py`. Here is how it works

## Easiest way to get started: Exploring Existing Benchmarks

-We encourage you to review the various evaluation benchmarks available in the [`evaluation/` directory](https://github.com/All-Hands-AI/OpenHands/blob/main/evaluation) of our repository.
+We encourage you to review the various evaluation benchmarks available in the [`evaluation/benchmarks/` directory](https://github.com/All-Hands-AI/OpenHands/blob/main/evaluation/benchmarks) of our repository.

To integrate your own benchmark, we suggest starting with the one that most closely resembles your needs. This approach can significantly streamline your integration process, allowing you to build upon existing structures and adapt them to your specific requirements.

2 changes: 1 addition & 1 deletion docs/modules/usage/how-to/evaluation-harness.md
@@ -73,7 +73,7 @@ The `run_controller()` function is the core of OpenHands's execution. It manages

## Easiest way to get started: Exploring Existing Benchmarks

-We encourage you to review the various evaluation benchmarks available in the [`evaluation/` directory](https://github.com/All-Hands-AI/OpenHands/blob/main/evaluation) of our repository.
+We encourage you to review the various evaluation benchmarks available in the [`evaluation/benchmarks/` directory](https://github.com/All-Hands-AI/OpenHands/blob/main/evaluation/benchmarks) of our repository.

To integrate your own benchmark, we suggest starting with the one that most closely resembles your needs. This approach can significantly streamline your integration process, allowing you to build upon existing structures and adapt them to your specific requirements.
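
As a quick illustration (not part of the original page), the relocated benchmarks can be browsed from a fresh checkout; the folder names in the comment are examples taken from the README changes below.

```bash
# Clone the repository and list the benchmark folders under the new location.
git clone https://github.com/All-Hands-AI/OpenHands.git
cd OpenHands
ls evaluation/benchmarks/   # e.g. swe_bench, aider_bench, webarena, miniwob, gaia
```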

34 changes: 17 additions & 17 deletions evaluation/README.md
@@ -46,28 +46,28 @@ The OpenHands evaluation harness supports a wide variety of benchmarks across so

### Software Engineering

-- SWE-Bench: [`evaluation/swe_bench`](./swe_bench)
-- HumanEvalFix: [`evaluation/humanevalfix`](./humanevalfix)
-- BIRD: [`evaluation/bird`](./bird)
-- BioCoder: [`evaluation/ml_bench`](./ml_bench)
-- ML-Bench: [`evaluation/ml_bench`](./ml_bench)
-- APIBench: [`evaluation/gorilla`](./gorilla/)
-- ToolQA: [`evaluation/toolqa`](./toolqa/)
-- AiderBench: [`evaluation/aider_bench`](./aider_bench/)
+- SWE-Bench: [`evaluation/benchmarks/swe_bench`](./benchmarks/swe_bench)
+- HumanEvalFix: [`evaluation/benchmarks/humanevalfix`](./benchmarks/humanevalfix)
+- BIRD: [`evaluation/benchmarks/bird`](./benchmarks/bird)
+- BioCoder: [`evaluation/benchmarks/ml_bench`](./benchmarks/ml_bench)
+- ML-Bench: [`evaluation/benchmarks/ml_bench`](./benchmarks/ml_bench)
+- APIBench: [`evaluation/benchmarks/gorilla`](./benchmarks/gorilla/)
+- ToolQA: [`evaluation/benchmarks/toolqa`](./benchmarks/toolqa/)
+- AiderBench: [`evaluation/benchmarks/aider_bench`](./benchmarks/aider_bench/)

### Web Browsing

-- WebArena: [`evaluation/webarena`](./webarena/)
-- MiniWob++: [`evaluation/miniwob`](./miniwob/)
+- WebArena: [`evaluation/benchmarks/webarena`](./benchmarks/webarena/)
+- MiniWob++: [`evaluation/benchmarks/miniwob`](./benchmarks/miniwob/)

### Misc. Assistance

-- GAIA: [`evaluation/gaia`](./gaia)
-- GPQA: [`evaluation/gpqa`](./gpqa)
-- AgentBench: [`evaluation/agent_bench`](./agent_bench)
-- MINT: [`evaluation/mint`](./mint)
-- Entity deduction Arena (EDA): [`evaluation/EDA`](./EDA)
-- ProofWriter: [`evaluation/logic_reasoning`](./logic_reasoning)
+- GAIA: [`evaluation/benchmarks/gaia`](./benchmarks/gaia)
+- GPQA: [`evaluation/benchmarks/gpqa`](./benchmarks/gpqa)
+- AgentBench: [`evaluation/benchmarks/agent_bench`](./benchmarks/agent_bench)
+- MINT: [`evaluation/benchmarks/mint`](./benchmarks/mint)
+- Entity deduction Arena (EDA): [`evaluation/benchmarks/EDA`](./benchmarks/EDA)
+- ProofWriter: [`evaluation/benchmarks/logic_reasoning`](./benchmarks/logic_reasoning)

## Result Visualization

@@ -79,7 +79,7 @@ You can start your own fork of [our huggingface evaluation outputs](https://hugg

To learn more about how to integrate your benchmark into OpenHands, check out the [tutorial here](https://docs.all-hands.dev/modules/usage/how-to/evaluation-harness). Briefly:

-- Each subfolder contains a specific benchmark or experiment. For example, `evaluation/swe_bench` should contain
+- Each subfolder contains a specific benchmark or experiment. For example, `evaluation/benchmarks/swe_bench` should contain
all the preprocessing/evaluation/analysis scripts.
- Raw data and experimental records should not be stored within this repo.
- For model outputs, they should be stored at [this huggingface space](https://huggingface.co/spaces/OpenHands/evaluation) for visualization.
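
To make the subfolder convention concrete, here is a hedged scaffold of what a new benchmark directory might contain under the relocated path; the `scripts/` names mirror the swe_bench scripts touched in this commit, while the other file names are assumed placeholders.

```bash
# Illustrative scaffold only; not a requirement imposed by the repository.
mkdir -p evaluation/benchmarks/your_benchmark/scripts/eval
touch evaluation/benchmarks/your_benchmark/README.md                     # setup and usage notes
touch evaluation/benchmarks/your_benchmark/run_infer.py                  # inference entry point (assumed name)
touch evaluation/benchmarks/your_benchmark/scripts/run_infer.sh          # wrapper used locally and in CI
touch evaluation/benchmarks/your_benchmark/scripts/eval_infer_remote.sh  # scores the generated output.jsonl
touch evaluation/benchmarks/your_benchmark/scripts/eval/summarize_outputs.py  # prints a results summary
```
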
