
Feat/mle bench evaluation #5148

Closed

Conversation

csmith49 (Collaborator)

End-user friendly description of the problem this fixes or functionality that this introduces

  • Include this change in the Release Notes. If checked, you must provide an end-user friendly description for your change below

Give a summary of what the PR does, explaining any non-trivial design decisions

This PR adds support for testing OpenHands agents on MLE-bench using the standard OpenHands evaluation harness.

The MLE-bench implementation provides:

  1. A set of scripts to manage test instances, run benchmarks, and score results.
  2. A base Docker image in which agents should be run.
  3. An agent definition format.

The goal of this PR is to reuse as much existing infrastructure as possible by providing a suitable OpenHands agent definition. However, only the scripts from (1) are exposed as a Python package, so we assume the tester has OpenAI's implementation installed elsewhere to manage the base image and test instances, and we re-implement some minor scaffolding around agent definitions to allow benchmarking from this repo.
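
For reference, "installed elsewhere" would look roughly like the sketch below. The repository URL and commands are assumptions based on the upstream openai/mle-bench README rather than anything added by this PR; treat that README as authoritative.

```bash
# Hypothetical setup of the upstream MLE-bench tooling, done outside this repo.
# Exact commands are assumptions; see the openai/mle-bench README.
git clone https://github.com/openai/mle-bench.git
cd mle-bench
git lfs fetch --all && git lfs pull   # some competition assets are assumed to be stored in Git LFS
pip install -e .                      # exposes the mlebench CLI and Python package (item 1 above)
```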


Link of any specific issues this addresses

#4328

csmith49 (Collaborator, Author)

This PR is currently blocked by #4848, reproducible as follows (the commands are collected into a single script after the list):

  1. Install mlebench (instructions here) and build the mlebench-env image (instructions here).
  2. Grab some data (instructions here) by running mlebench prepare -c spaceship-titanic.
  3. Extend the mlebench-env image with OpenHands by navigating to evaluation/mle-bench and running docker build --platform=linux/amd64 -t openhands agents/openhands/.
  4. Run python run_infer.py --agent-id openhands --competition-set experiments/splits/spaceship-titanic.txt.
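
For convenience, the same steps as a single shell session (image tags and paths exactly as in the list above; the mlebench install and data download follow the upstream instructions linked there):

```bash
# Steps 1-2: prepare the spaceship-titanic competition data
# (assumes mlebench is installed and the mlebench-env image has been built)
mlebench prepare -c spaceship-titanic

# Step 3: extend the mlebench-env image with OpenHands
cd evaluation/mle-bench
docker build --platform=linux/amd64 -t openhands agents/openhands/

# Step 4: run the benchmark with the OpenHands agent definition
python run_infer.py --agent-id openhands --competition-set experiments/splits/spaceship-titanic.txt
```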

Checking agent.log for the run shows:

18:53:07 - openhands:INFO: runtime_build.py:176 - Building image: ghcr.io/all-hands-ai/runtime:oh_v0.14.1_d66eiz7humvbba2v_gph61dpe1atpybgr
18:54:13 - openhands:ERROR: docker.py:122 - Image build failed:
Command '['docker', 'buildx', 'build', '--progress=plain', '--build-arg=OPENHANDS_RUNTIME_VERSION=0.14.1', '--build-arg=OPENHANDS_RUNTIME_BUILD_TIME=2024-11-21T18:53:09.520454', '--tag=ghcr.io/all-hands-ai/runtime:oh_v0.14.1_d66eiz7humvbba2v_gph61dpe1atpybgr', '--load', '/tmp/tmpmsoitid7']' returned non-zero exit status 1.
18:54:13 - openhands:ERROR: docker.py:123 - Command output:

Runtime created.
================ DOCKER BUILD STARTED ================
ERROR:root:  File "/home/agent/start.py", line 188, in <module>
    asyncio.run(run(instructions))
  File "/opt/conda/envs/agent/lib/python3.12/asyncio/runners.py", line 194, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/agent/lib/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/agent/lib/python3.12/asyncio/base_events.py", line 687, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "/home/agent/start.py", line 70, in run
    await runtime.connect()
  File "/home/agent/openhands/runtime/impl/eventstream/eventstream_runtime.py", line 225, in connect
    self.runtime_container_image = build_runtime_image(
                                   ^^^^^^^^^^^^^^^^^^^^
  File "/home/agent/openhands/runtime/utils/runtime_build.py", line 134, in build_runtime_image
    result = build_runtime_image_in_folder(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/agent/openhands/runtime/utils/runtime_build.py", line 225, in build_runtime_image_in_folder
    _build_sandbox_image(
  File "/home/agent/openhands/runtime/utils/runtime_build.py", line 352, in _build_sandbox_image
    image_name = runtime_builder.build(
                 ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/agent/openhands/runtime/builder/docker.py", line 114, in build
    raise subprocess.CalledProcessError(

ERROR:root:<class 'subprocess.CalledProcessError'>: Command '['docker', 'buildx', 'build', '--progress=plain', '--build-arg=OPENHANDS_RUNTIME_VERSION=0.14.1', '--build-arg=OPENHANDS_RUNTIME_BUILD_TIME=2024-11-21T18:53:09.520454', '--tag=ghcr.io/all-hands-ai/runtime:oh_v0.14.1_d66eiz7humvbba2v_gph61dpe1atpybgr', '--load', '/tmp/tmpmsoitid7']' returned non-zero exit status 1.
ERROR conda.cli.main_run:execute(125): `conda run python start.py --agent CodeActAgent --model gpt-4o --max_time_in_hours 24 --max_steps 500 --shm_size 100G` failed. (See above for error)

Review thread on the Miniconda install step in the diff:

&& wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O /tmp/miniconda.sh \
&& bash /tmp/miniconda.sh -b -p /opt/conda \
&& rm /tmp/miniconda.sh \
&& /opt/conda/bin/conda init

Collaborator

Just a note here, not necessarily for today, but IMHO for the PR overall: as far as I know, we don't use Anaconda's repos and channels anywhere in the codebase. We have replaced this with miniforge and micromamba (example). Both are compatible with conda, and we can point the channel at community repositories.

The reason is Anaconda's weird licensing. It's not open source, and while it doesn't have unexpected terms for individuals or academia (IIRC), it does for companies of 200 or more employees. Please note also that the -b parameter runs the installer silently, which, according to Anaconda, means the user is "assumed to have agreed" to the terms. (I think, if we must use this for some reason, we'd need to notify people in some very clear way.)
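
For illustration, a Miniforge-based replacement for the excerpt above might look like the sketch below. The installer URL and flags are assumptions based on Miniforge's published release artifacts, not something settled in this PR; micromamba would be another option.

```bash
# Sketch: install Miniforge instead of Miniconda so the default channel is the
# community-maintained conda-forge. Verify the URL against
# https://github.com/conda-forge/miniforge before adopting.
&& wget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh -O /tmp/miniforge.sh \
&& bash /tmp/miniforge.sh -b -p /opt/conda \
&& rm /tmp/miniforge.sh \
&& /opt/conda/bin/conda init
```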

csmith49 (Collaborator, Author)

That's good to know; I'll make the switch where I can. The only reason it's included here is that the MLE-bench base agent sandbox images use it to manage a virtual environment.

github-actions bot (Contributor)

This PR is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions bot added the Stale (Inactive for 30 days) label on Dec 24, 2024

github-actions bot commented Jan 1, 2025

This PR was closed because it has been stalled for over 30 days with no activity.

github-actions bot closed this on Jan 1, 2025
enyst reopened this on Jan 1, 2025
enyst (Collaborator) commented Jan 1, 2025

@csmith49 The automated checks closed this PR. I reopened it to keep the status quo, but please feel free to do as you see fit.

github-actions bot removed the Stale (Inactive for 30 days) label on Jan 2, 2025
mamoodi (Collaborator) commented Jan 14, 2025

Talked to Calvin. Going to close this and he will reopen once this is ready again.

mamoodi closed this on Jan 14, 2025