diff --git a/.github/workflows/eval-runner.yml b/.github/workflows/eval-runner.yml
index f788cf78d2f8..087a1a6651ec 100644
--- a/.github/workflows/eval-runner.yml
+++ b/.github/workflows/eval-runner.yml
@@ -86,12 +86,12 @@ jobs:
           EVAL_DOCKER_IMAGE_PREFIX: us-central1-docker.pkg.dev/evaluation-092424/swe-bench-images

         run: |
-          poetry run ./evaluation/swe_bench/scripts/run_infer.sh llm.eval HEAD CodeActAgent 300 30 $N_PROCESSES "princeton-nlp/SWE-bench_Lite" test
+          poetry run ./evaluation/benchmarks/swe_bench/scripts/run_infer.sh llm.eval HEAD CodeActAgent 300 30 $N_PROCESSES "princeton-nlp/SWE-bench_Lite" test
           OUTPUT_FOLDER=$(find evaluation/evaluation_outputs/outputs/princeton-nlp__SWE-bench_Lite-test/CodeActAgent -name "deepseek-chat_maxiter_50_N_*-no-hint-run_1" -type d | head -n 1)
           echo "OUTPUT_FOLDER for SWE-bench evaluation: $OUTPUT_FOLDER"
-          poetry run ./evaluation/swe_bench/scripts/eval_infer_remote.sh $OUTPUT_FOLDER/output.jsonl $N_PROCESSES "princeton-nlp/SWE-bench_Lite" test
+          poetry run ./evaluation/benchmarks/swe_bench/scripts/eval_infer_remote.sh $OUTPUT_FOLDER/output.jsonl $N_PROCESSES "princeton-nlp/SWE-bench_Lite" test

-          poetry run ./evaluation/swe_bench/scripts/eval/summarize_outputs.py $OUTPUT_FOLDER/output.jsonl > summarize_outputs.log 2>&1
+          poetry run ./evaluation/benchmarks/swe_bench/scripts/eval/summarize_outputs.py $OUTPUT_FOLDER/output.jsonl > summarize_outputs.log 2>&1
           echo "SWEBENCH_REPORT<<EOF" >> $GITHUB_ENV
           cat summarize_outputs.log >> $GITHUB_ENV
           echo "EOF" >> $GITHUB_ENV
diff --git a/docs/i18n/fr/docusaurus-plugin-content-docs/current/usage/how-to/evaluation-harness.md b/docs/i18n/fr/docusaurus-plugin-content-docs/current/usage/how-to/evaluation-harness.md
index 3f191053998f..b215a1ca511f 100644
--- a/docs/i18n/fr/docusaurus-plugin-content-docs/current/usage/how-to/evaluation-harness.md
+++ b/docs/i18n/fr/docusaurus-plugin-content-docs/current/usage/how-to/evaluation-harness.md
@@ -76,7 +76,7 @@ La fonction `run_controller()` est le cœur de l'exécution d'OpenHands. Elle g

 ## Le moyen le plus simple de commencer : Explorer les benchmarks existants

-Nous vous encourageons à examiner les différents benchmarks d'évaluation disponibles dans le [répertoire `evaluation/`](https://github.com/All-Hands-AI/OpenHands/blob/main/evaluation) de notre dépôt.
+Nous vous encourageons à examiner les différents benchmarks d'évaluation disponibles dans le [répertoire `evaluation/benchmarks/`](https://github.com/All-Hands-AI/OpenHands/blob/main/evaluation/benchmarks) de notre dépôt.

 Pour intégrer votre propre benchmark, nous vous suggérons de commencer par celui qui ressemble le plus à vos besoins. Cette approche peut considérablement rationaliser votre processus d'intégration, vous permettant de vous appuyer sur les structures existantes et de les adapter à vos exigences spécifiques.

diff --git a/docs/i18n/zh-Hans/docusaurus-plugin-content-docs/current/usage/how-to/evaluation-harness.md b/docs/i18n/zh-Hans/docusaurus-plugin-content-docs/current/usage/how-to/evaluation-harness.md
index eb99a30ea3fd..dc41e0fa1ccf 100644
--- a/docs/i18n/zh-Hans/docusaurus-plugin-content-docs/current/usage/how-to/evaluation-harness.md
+++ b/docs/i18n/zh-Hans/docusaurus-plugin-content-docs/current/usage/how-to/evaluation-harness.md
@@ -73,7 +73,7 @@ OpenHands 的主要入口点在 `openhands/core/main.py` 中。以下是它工

 ## 入门最简单的方法：探索现有基准

-我们鼓励您查看我们仓库的 [`evaluation/` 目录](https://github.com/All-Hands-AI/OpenHands/blob/main/evaluation)中提供的各种评估基准。
+我们鼓励您查看我们仓库的 [`evaluation/benchmarks/` 目录](https://github.com/All-Hands-AI/OpenHands/blob/main/evaluation/benchmarks)中提供的各种评估基准。

 要集成您自己的基准，我们建议从最接近您需求的基准开始。这种方法可以显著简化您的集成过程，允许您在现有结构的基础上进行构建并使其适应您的特定要求。

diff --git a/docs/modules/usage/how-to/evaluation-harness.md b/docs/modules/usage/how-to/evaluation-harness.md
index e4d1e5d15bc7..339783ea8d7e 100644
--- a/docs/modules/usage/how-to/evaluation-harness.md
+++ b/docs/modules/usage/how-to/evaluation-harness.md
@@ -73,7 +73,7 @@ The `run_controller()` function is the core of OpenHands's execution. It manages

 ## Easiest way to get started: Exploring Existing Benchmarks

-We encourage you to review the various evaluation benchmarks available in the [`evaluation/` directory](https://github.com/All-Hands-AI/OpenHands/blob/main/evaluation) of our repository.
+We encourage you to review the various evaluation benchmarks available in the [`evaluation/benchmarks/` directory](https://github.com/All-Hands-AI/OpenHands/blob/main/evaluation/benchmarks) of our repository.

 To integrate your own benchmark, we suggest starting with the one that most closely resembles your needs. This approach can significantly streamline your integration process, allowing you to build upon existing structures and adapt them to your specific requirements.

diff --git a/evaluation/README.md b/evaluation/README.md
index 8be0822875f8..18c1bacd6bd9 100644
--- a/evaluation/README.md
+++ b/evaluation/README.md
@@ -46,28 +46,28 @@ The OpenHands evaluation harness supports a wide variety of benchmarks across so

 ### Software Engineering

-- SWE-Bench: [`evaluation/swe_bench`](./swe_bench)
-- HumanEvalFix: [`evaluation/humanevalfix`](./humanevalfix)
-- BIRD: [`evaluation/bird`](./bird)
-- BioCoder: [`evaluation/ml_bench`](./ml_bench)
-- ML-Bench: [`evaluation/ml_bench`](./ml_bench)
-- APIBench: [`evaluation/gorilla`](./gorilla/)
-- ToolQA: [`evaluation/toolqa`](./toolqa/)
-- AiderBench: [`evaluation/aider_bench`](./aider_bench/)
+- SWE-Bench: [`evaluation/benchmarks/swe_bench`](./benchmarks/swe_bench)
+- HumanEvalFix: [`evaluation/benchmarks/humanevalfix`](./benchmarks/humanevalfix)
+- BIRD: [`evaluation/benchmarks/bird`](./benchmarks/bird)
+- BioCoder: [`evaluation/benchmarks/ml_bench`](./benchmarks/ml_bench)
+- ML-Bench: [`evaluation/benchmarks/ml_bench`](./benchmarks/ml_bench)
+- APIBench: [`evaluation/benchmarks/gorilla`](./benchmarks/gorilla/)
+- ToolQA: [`evaluation/benchmarks/toolqa`](./benchmarks/toolqa/)
+- AiderBench: [`evaluation/benchmarks/aider_bench`](./benchmarks/aider_bench/)

 ### Web Browsing

-- WebArena: [`evaluation/webarena`](./webarena/)
-- MiniWob++: [`evaluation/miniwob`](./miniwob/)
+- WebArena: [`evaluation/benchmarks/webarena`](./benchmarks/webarena/)
+- MiniWob++: [`evaluation/benchmarks/miniwob`](./benchmarks/miniwob/)

 ### Misc. Assistance

-- GAIA: [`evaluation/gaia`](./gaia)
-- GPQA: [`evaluation/gpqa`](./gpqa)
-- AgentBench: [`evaluation/agent_bench`](./agent_bench)
-- MINT: [`evaluation/mint`](./mint)
-- Entity deduction Arena (EDA): [`evaluation/EDA`](./EDA)
-- ProofWriter: [`evaluation/logic_reasoning`](./logic_reasoning)
+- GAIA: [`evaluation/benchmarks/gaia`](./benchmarks/gaia)
+- GPQA: [`evaluation/benchmarks/gpqa`](./benchmarks/gpqa)
+- AgentBench: [`evaluation/benchmarks/agent_bench`](./benchmarks/agent_bench)
+- MINT: [`evaluation/benchmarks/mint`](./benchmarks/mint)
+- Entity deduction Arena (EDA): [`evaluation/benchmarks/EDA`](./benchmarks/EDA)
+- ProofWriter: [`evaluation/benchmarks/logic_reasoning`](./benchmarks/logic_reasoning)

 ## Result Visualization

@@ -79,7 +79,7 @@ You can start your own fork of [our huggingface evaluation outputs](https://hugg

 To learn more about how to integrate your benchmark into OpenHands, check out [tutorial here](https://docs.all-hands.dev/modules/usage/how-to/evaluation-harness). Briefly,

-- Each subfolder contains a specific benchmark or experiment. For example, `evaluation/swe_bench` should contain
+- Each subfolder contains a specific benchmark or experiment. For example, `evaluation/benchmarks/swe_bench` should contain
 all the preprocessing/evaluation/analysis scripts.
 - Raw data and experimental records should not be stored within this repo.
 - For model outputs, they should be stored at [this huggingface space](https://huggingface.co/spaces/OpenHands/evaluation) for visualization.
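
Note on the renamed paths: outside of CI, the same SWE-bench steps can be reproduced from a repository checkout. The sketch below is not part of the patch; it simply restates the workflow commands above with the new `evaluation/benchmarks/` prefix. `N_PROCESSES=8` is a hypothetical placeholder for the value the workflow injects, and an `llm.eval` LLM config group is assumed to already exist in your local configuration.

```bash
#!/usr/bin/env bash
# Sketch only: mirrors the workflow's SWE-bench step using the renamed
# evaluation/benchmarks/ paths. Assumes a repo checkout with poetry installed
# and an `llm.eval` config group defined locally (e.g. in config.toml).
set -euo pipefail

N_PROCESSES=8  # hypothetical; CI supplies its own $N_PROCESSES

# Inference; arguments copied verbatim from the workflow step above
poetry run ./evaluation/benchmarks/swe_bench/scripts/run_infer.sh \
  llm.eval HEAD CodeActAgent 300 30 "$N_PROCESSES" "princeton-nlp/SWE-bench_Lite" test

# Locate the output folder produced by the inference run; the folder name
# pattern below comes from the workflow (deepseek-chat backend) and would
# differ for another model.
OUTPUT_FOLDER=$(find evaluation/evaluation_outputs/outputs/princeton-nlp__SWE-bench_Lite-test/CodeActAgent \
  -name "deepseek-chat_maxiter_50_N_*-no-hint-run_1" -type d | head -n 1)
echo "OUTPUT_FOLDER for SWE-bench evaluation: $OUTPUT_FOLDER"

# Evaluation and report summarization, also under the new prefix
poetry run ./evaluation/benchmarks/swe_bench/scripts/eval_infer_remote.sh \
  "$OUTPUT_FOLDER/output.jsonl" "$N_PROCESSES" "princeton-nlp/SWE-bench_Lite" test
poetry run ./evaluation/benchmarks/swe_bench/scripts/eval/summarize_outputs.py \
  "$OUTPUT_FOLDER/output.jsonl" > summarize_outputs.log 2>&1
```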