
Feat/mle bench evaluation #5148

Closed

Conversation

csmith49 (Collaborator)

End-user friendly description of the problem this fixes or functionality that this introduces

  • Include this change in the Release Notes. If checked, you must provide an end-user friendly description for your change below

Give a summary of what the PR does, explaining any non-trivial design decisions

This PR adds support for testing OpenHands agents on MLE-bench using the standard OpenHands evaluation harness.

The MLE-bench implementation provides:

  1. A set of scripts to manage test instances, run benchmarks, and score results.
  2. A base Docker image in which agents should be run.
  3. An agent definition format.

The goal of this PR is to reuse as much existing infrastructure as possible by providing a suitable OpenHands agent definition. However, only the scripts from (1) are exposed as a Python package, so we assume the tester has OpenAI's implementation installed elsewhere to manage the base image and test instances, and we re-implement some minor scaffolding around agent definitions to allow benchmarking from this repo.
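
For reference, "installed elsewhere" would look roughly like the sketch below. The repository URL and commands are assumptions based on the upstream openai/mle-bench README rather than anything added by this PR; treat that README as authoritative.

```bash
# Hypothetical setup of the upstream MLE-bench tooling, done outside this repo.
# Exact commands are assumptions; see the openai/mle-bench README.
git clone https://github.com/openai/mle-bench.git
cd mle-bench
git lfs fetch --all && git lfs pull   # some competition assets are assumed to be stored in Git LFS
pip install -e .                      # exposes the mlebench CLI and Python package (item 1 above)
```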


Link of any specific issues this addresses

#4328

csmith49 (Collaborator, Author)

This PR is currently blocked by #4848, reproducible as follows (the commands are collected into a single script after the list):

  1. Install mlebench (instructions here) and build the mlebench-env image (instructions here).
  2. Grab some data (instructions here) by running mlebench prepare -c spaceship-titanic.
  3. Extend the mlebench-env image with OpenHands by navigating to evaluation/mle-bench and running docker build --platform=linux/amd64 -t openhands agents/openhands/.
  4. Run python run_infer.py --agent-id openhands --competition-set experiments/splits/spaceship-titanic.txt.
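
For convenience, the same steps as a single shell session (image tags and paths exactly as in the list above; the mlebench install and data download follow the upstream instructions linked there):

```bash
# Steps 1-2: prepare the spaceship-titanic competition data
# (assumes mlebench is installed and the mlebench-env image has been built)
mlebench prepare -c spaceship-titanic

# Step 3: extend the mlebench-env image with OpenHands
cd evaluation/mle-bench
docker build --platform=linux/amd64 -t openhands agents/openhands/

# Step 4: run the benchmark with the OpenHands agent definition
python run_infer.py --agent-id openhands --competition-set experiments/splits/spaceship-titanic.txt
```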

Checking agent.log for the run shows:

18:53:07 - openhands:INFO: runtime_build.py:176 - Building image: ghcr.io/all-hands-ai/runtime:oh_v0.14.1_d66eiz7humvbba2v_gph61dpe1atpybgr
18:54:13 - openhands:ERROR: docker.py:122 - Image build failed:
Command '['docker', 'buildx', 'build', '--progress=plain', '--build-arg=OPENHANDS_RUNTIME_VERSION=0.14.1', '--build-arg=OPENHANDS_RUNTIME_BUILD_TIME=2024-11-21T18:53:09.520454', '--tag=ghcr.io/all-hands-ai/runtime:oh_v0.14.1_d66eiz7humvbba2v_gph61dpe1atpybgr', '--load', '/tmp/tmpmsoitid7']' returned non-zero exit status 1.
18:54:13 - openhands:ERROR: docker.py:123 - Command output:

Runtime created.
================ DOCKER BUILD STARTED ================
ERROR:root:  File "/home/agent/start.py", line 188, in <module>
    asyncio.run(run(instructions))
  File "/opt/conda/envs/agent/lib/python3.12/asyncio/runners.py", line 194, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/agent/lib/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/agent/lib/python3.12/asyncio/base_events.py", line 687, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "/home/agent/start.py", line 70, in run
    await runtime.connect()
  File "/home/agent/openhands/runtime/impl/eventstream/eventstream_runtime.py", line 225, in connect
    self.runtime_container_image = build_runtime_image(
                                   ^^^^^^^^^^^^^^^^^^^^
  File "/home/agent/openhands/runtime/utils/runtime_build.py", line 134, in build_runtime_image
    result = build_runtime_image_in_folder(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/agent/openhands/runtime/utils/runtime_build.py", line 225, in build_runtime_image_in_folder
    _build_sandbox_image(
  File "/home/agent/openhands/runtime/utils/runtime_build.py", line 352, in _build_sandbox_image
    image_name = runtime_builder.build(
                 ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/agent/openhands/runtime/builder/docker.py", line 114, in build
    raise subprocess.CalledProcessError(

ERROR:root:<class 'subprocess.CalledProcessError'>: Command '['docker', 'buildx', 'build', '--progress=plain', '--build-arg=OPENHANDS_RUNTIME_VERSION=0.14.1', '--build-arg=OPENHANDS_RUNTIME_BUILD_TIME=2024-11-21T18:53:09.520454', '--tag=ghcr.io/all-hands-ai/runtime:oh_v0.14.1_d66eiz7humvbba2v_gph61dpe1atpybgr', '--load', '/tmp/tmpmsoitid7']' returned non-zero exit status 1.
ERROR conda.cli.main_run:execute(125): `conda run python start.py --agent CodeActAgent --model gpt-4o --max_time_in_hours 24 --max_steps 500 --shm_size 100G` failed. (See above for error)

Review thread on the Miniconda install step in the diff:

&& wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O /tmp/miniconda.sh \
&& bash /tmp/miniconda.sh -b -p /opt/conda \
&& rm /tmp/miniconda.sh \
&& /opt/conda/bin/conda init

Collaborator

Just a note here, not necessarily for today, but IMHO for the PR overall: as far as I know, we don't use Anaconda's repos and channels anywhere in the codebase. We have replaced this with miniforge and micromamba (example). Both are compatible with conda, and we can point the channel at community repositories.

The reason is Anaconda's weird licensing. It's not open source, and while it doesn't have unexpected terms for individuals or academia (IIRC), it does for companies of 200 or more employees. Please note also that the -b parameter runs the installer silently, which, according to Anaconda, means the user is "assumed to have agreed" to the terms. (I think, if we must use this for some reason, we'd need to notify people in some very clear way.)
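
For illustration, a Miniforge-based replacement for the excerpt above might look like the sketch below. The installer URL and flags are assumptions based on Miniforge's published release artifacts, not something settled in this PR; micromamba would be another option.

```bash
# Sketch: install Miniforge instead of Miniconda so the default channel is the
# community-maintained conda-forge. Verify the URL against
# https://github.com/conda-forge/miniforge before adopting.
&& wget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh -O /tmp/miniforge.sh \
&& bash /tmp/miniforge.sh -b -p /opt/conda \
&& rm /tmp/miniforge.sh \
&& /opt/conda/bin/conda init
```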

csmith49 (Collaborator, Author)

That's good to know; I'll make the switch where I can. The only reason it's included here is that the MLE-bench base agent sandbox images use it to manage a virtual environment.

github-actions bot (Contributor)

This PR is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions bot added the Stale (Inactive for 30 days) label on Dec 24, 2024

github-actions bot commented Jan 1, 2025

This PR was closed because it has been stalled for over 30 days with no activity.

github-actions bot closed this on Jan 1, 2025
enyst reopened this on Jan 1, 2025
enyst (Collaborator) commented Jan 1, 2025

@csmith49 The automated checks closed this PR. I reopened it to keep the status quo, but please feel free to do as you see fit.

github-actions bot removed the Stale (Inactive for 30 days) label on Jan 2, 2025
mamoodi (Collaborator) commented Jan 14, 2025

Talked to Calvin. Going to close this and he will reopen once this is ready again.

mamoodi closed this on Jan 14, 2025