refactor: Replace pexpect with libtmux in BashSession #4881

Merged
merged 285 commits into from
Jan 3, 2025
Changes from 223 commits
3dd21fa
Improve test coverage for BashSession
openhands-agent Nov 11, 2024
efc481f
Merge commit '910b283ac2f6b3896e174cb77377c5ab6900da22' into feature/…
xingyaoww Nov 12, 2024
03ba929
only clean screen if prev status is not timeout
xingyaoww Nov 12, 2024
92b2b0c
also don't clear screen on CONTINUE
xingyaoww Nov 12, 2024
fec3083
fix command prefix
xingyaoww Nov 12, 2024
a28212b
tweak debug viz
xingyaoww Nov 12, 2024
f03de49
print agent obs
xingyaoww Nov 12, 2024
42b69a3
tweak
xingyaoww Nov 12, 2024
4ee07fe
Merge commit 'a93f1402debd325dac68360650bd12ae6abad643' into feature/…
xingyaoww Nov 14, 2024
99ef1ef
make timeout configurable
xingyaoww Nov 14, 2024
34a14fd
add info when command completed
xingyaoww Nov 14, 2024
c3ae9cf
refactor _get_command_output
xingyaoww Nov 14, 2024
3affa77
rename custom prefix to continue_prefix
xingyaoww Nov 14, 2024
d90a338
add missing newlines
xingyaoww Nov 14, 2024
4d47241
fix ctrl+c
xingyaoww Nov 14, 2024
3e1f12b
allow continue if prev command is also continue
xingyaoww Nov 14, 2024
0df9ba2
tweak bash doc
xingyaoww Nov 14, 2024
ebeccc6
improve "continue" mode for bash
xingyaoww Nov 14, 2024
fa351fe
fix all bash session tests
xingyaoww Nov 14, 2024
0c14a80
Merge commit '00ffc33d1bdc4d3287d26a4b63cedd2244e96570' into feature/…
xingyaoww Nov 15, 2024
77b4c7c
fix linter
xingyaoww Nov 15, 2024
bc3428a
add tmux to Dockerfile
xingyaoww Nov 15, 2024
7a8ff37
make eventstream runtime atexit register happen only at init
xingyaoww Nov 15, 2024
f9f37ad
improve ps1 for env commands
xingyaoww Nov 15, 2024
fae1185
fix CmdOutputObservation constructor; fix env test
xingyaoww Nov 15, 2024
184794a
fix cmdoutput constructor
xingyaoww Nov 15, 2024
e1c2ac2
handle multi line session
xingyaoww Nov 18, 2024
fce1b07
fix multiline runtime tests
xingyaoww Nov 18, 2024
fab1438
(hopefully) fix all tests
xingyaoww Nov 18, 2024
c7aee63
Merge commit 'de821718fd579448150a8a614be9da550fd743bf' into feature/…
xingyaoww Nov 18, 2024
f5d23b3
update poetry lock
xingyaoww Nov 18, 2024
88658dd
fix security test
xingyaoww Nov 18, 2024
25ae18c
fix deserialization test
xingyaoww Nov 18, 2024
488a1a7
refactor imports
xingyaoww Nov 18, 2024
c173a03
fix codeact test
xingyaoww Nov 18, 2024
1bb9f82
Add tmux installation to GitHub workflows
openhands-agent Nov 18, 2024
7752a94
Add tmux installation to additional GitHub workflows
openhands-agent Nov 18, 2024
cd94759
feat: add keep_prompt parameter to CmdRunAction
openhands-agent Nov 18, 2024
d491e47
feat: implement keep_prompt handling in BashSession and add tests
openhands-agent Nov 18, 2024
d76bbfa
Revert "feat: implement keep_prompt handling in BashSession and add t…
xingyaoww Nov 18, 2024
4d1c742
Revert "feat: add keep_prompt parameter to CmdRunAction"
xingyaoww Nov 18, 2024
51d0bcb
refactor: move prefix/suffix to CmdOutputMetadata
openhands-agent Nov 18, 2024
fa714a4
test: update test_bash_session.py to verify prefix/suffix fields
openhands-agent Nov 18, 2024
53d2de2
fix testcase
xingyaoww Nov 18, 2024
fefabd1
fix tests
xingyaoww Nov 18, 2024
e5001a3
improve 500 error message
xingyaoww Nov 18, 2024
9c00bd6
improve error message & fix ps1 parsing
xingyaoww Nov 18, 2024
7ef9e37
fix test
xingyaoww Nov 18, 2024
20721e3
remove keep_prompt from everywhere
xingyaoww Nov 18, 2024
725eeb1
fix resolver tests
xingyaoww Nov 18, 2024
1dfee78
improve error message
xingyaoww Nov 18, 2024
9fe792f
fix resolver test
xingyaoww Nov 18, 2024
c335b1e
fix test bash ps1
xingyaoww Nov 18, 2024
a15708a
remove the complex local tmux test
xingyaoww Nov 18, 2024
ff3d971
try fix conflict of tmux session
xingyaoww Nov 19, 2024
22a2572
Merge commit 'a531413d8649640842d2e639e15b4e7ecadf35c5' into feature/…
xingyaoww Nov 19, 2024
904bc29
remove command id
xingyaoww Nov 19, 2024
f016fbc
fix test
xingyaoww Nov 19, 2024
0f40b4c
fix PS1 parsing
xingyaoww Nov 19, 2024
313a901
fix ipython pwd
xingyaoww Nov 19, 2024
1f9168a
remove specified sid
xingyaoww Nov 19, 2024
1a40358
only raise RuntimeError when error code >= 500
xingyaoww Nov 19, 2024
95add43
relax tests
xingyaoww Nov 19, 2024
6da2636
resize tmux window
xingyaoww Nov 19, 2024
bc995ef
temporarily bump ver for runtime
xingyaoww Nov 19, 2024
153a501
fix resize arg
xingyaoww Nov 19, 2024
796a100
simplify test bash session in favor of runtime test
xingyaoww Nov 19, 2024
483f4b1
tweak test
xingyaoww Nov 19, 2024
3d7b44c
tweak test
xingyaoww Nov 19, 2024
bd12b99
fix serialization for CmdOutputMetadata
xingyaoww Nov 19, 2024
f2d57f9
Merge commit '302e41d7bb3d5b2b319f1ce2d15e5925dda069a2' into feature/…
xingyaoww Nov 19, 2024
cf7897b
hopefully fixes the bash
xingyaoww Nov 19, 2024
b430cb4
fix empty cmd handling
xingyaoww Nov 19, 2024
bae44a7
fix test
xingyaoww Nov 19, 2024
60daaa3
fix request
xingyaoww Nov 19, 2024
04397fe
fix VERY long cmd output
xingyaoww Nov 19, 2024
206eb19
update runtime test for looooong output
xingyaoww Nov 19, 2024
a0b5c9f
fix history limit
xingyaoww Nov 20, 2024
868e5a3
fix window start dir
xingyaoww Nov 20, 2024
48a866f
tweak
xingyaoww Nov 20, 2024
e914055
Merge commit '68e52a9c62f4cc6d48d33c5f1179aa4c1008b5a8' into feature/…
xingyaoww Nov 21, 2024
902a484
Merge commit '36d85b65c809f0c522c590dce5c6f96d48169dae' into feature/…
xingyaoww Nov 25, 2024
5e4e238
merge
xingyaoww Nov 26, 2024
e8d734d
get preliminary ver of pipe-pane working
xingyaoww Nov 27, 2024
96fa5be
get bash session tests working with pipe-pane
xingyaoww Nov 27, 2024
a657690
ok we may need to live with color when doing pipe-pane
xingyaoww Nov 27, 2024
8132820
Merge commit '082a55195ffa669ff71669156f9d8aa887217075' into feature/…
xingyaoww Nov 27, 2024
3ad3a39
add tests back
xingyaoww Nov 27, 2024
f360b87
add destructor
xingyaoww Dec 2, 2024
49be926
cleanup bracketed-paste
xingyaoww Dec 2, 2024
6fd958c
only .close() if not closed
xingyaoww Dec 2, 2024
61ebe54
Merge commit '5069a8700a8fc1219b10e2b57b1922eab995ec9f' into feature/…
xingyaoww Dec 2, 2024
6db1672
remove ansi test
xingyaoww Dec 2, 2024
b1652be
improve debug log
xingyaoww Dec 2, 2024
c50a45e
feat: display exact error for runtime requests exception handling
xingyaoww Dec 3, 2024
fb19118
Merge commit 'c50a45e9f058182def36c4a07650323ebaee020a' into feature/…
xingyaoww Dec 3, 2024
40e6767
fix action execution detail
xingyaoww Dec 3, 2024
a5815e6
fix action execution detail
xingyaoww Dec 3, 2024
dce4a38
replace all occurences of requests.HTTPError
xingyaoww Dec 3, 2024
f05af62
replace all occurences of requests.HTTPError
xingyaoww Dec 3, 2024
ab4f0e4
simplify error
xingyaoww Dec 3, 2024
3d03509
Merge commit 'ab4f0e497046c97e47e3d2cf369bdd16f6593a09' into feature/…
xingyaoww Dec 3, 2024
84c75e4
only print stacktrace
xingyaoww Dec 3, 2024
79410c5
get pipe to work for bash session (kinda)
xingyaoww Dec 3, 2024
a878109
do not reset pane every time
xingyaoww Dec 3, 2024
115cde3
remove extra debug; fix session test
xingyaoww Dec 3, 2024
990fb03
fix bug for very long outputs
xingyaoww Dec 3, 2024
cc44952
reduce freq of getting pane output & parse ps1
xingyaoww Dec 3, 2024
ba52ac5
Merge commit '1b8104ba14234599ce3a19e266582be7b87cf23c' into feature/…
xingyaoww Dec 3, 2024
be8c9d5
get read -p test back
xingyaoww Dec 3, 2024
aee78f3
disable enter name check
xingyaoww Dec 3, 2024
dc2c23b
fix cleanup
xingyaoww Dec 3, 2024
8db6055
always combine outputs between matches on all cases
xingyaoww Dec 3, 2024
043cc16
fix combine output bugs
xingyaoww Dec 3, 2024
4641566
strip commands before execute; fix bash loop
xingyaoww Dec 4, 2024
2b554bd
log openhands version in eval runs, instead of agent ver
xingyaoww Dec 4, 2024
2052829
fix ver
xingyaoww Dec 4, 2024
4d6d069
use get_version
xingyaoww Dec 4, 2024
a3fff39
support log debug remotely1
xingyaoww Dec 4, 2024
12ecd35
support directly stream docker/devbox logs to stdout in debug mode
xingyaoww Dec 4, 2024
4fa842e
add sse-starlette
xingyaoww Dec 4, 2024
0952c38
tweak test
xingyaoww Dec 4, 2024
f085364
hit enter for cases when matches <1
xingyaoww Dec 4, 2024
923f88d
Merge commit 'ceb60b9a37d669a51945710ae036e7fc428dc7e9' into feature/…
xingyaoww Dec 5, 2024
3ea1fd8
handle multiple ps1 before start
xingyaoww Dec 5, 2024
4615908
fix poetry lock
xingyaoww Dec 5, 2024
c90a95a
print pod log when failed remote runtime
xingyaoww Dec 5, 2024
f093c69
use non-login shell to start a new shell for the given user
xingyaoww Dec 5, 2024
3ae045f
condense test_bash to single line
xingyaoww Dec 5, 2024
8569e7a
update pyproject ver
xingyaoww Dec 5, 2024
b1fde67
revert window command
xingyaoww Dec 5, 2024
1679810
do login
xingyaoww Dec 5, 2024
27c2455
increase timeout
xingyaoww Dec 5, 2024
69d8f34
add tests
xingyaoww Dec 5, 2024
ace691e
revert to polling capture-pane since pipe-pane can't capture prompts …
xingyaoww Dec 5, 2024
2a41ee5
log decoder error for match ps1
xingyaoww Dec 5, 2024
2e8452e
remove commented code
xingyaoww Dec 5, 2024
0514bed
add test_python_interactive_input to test_bash
xingyaoww Dec 5, 2024
4a2c880
increase history limit
xingyaoww Dec 5, 2024
85c5431
increase timeout
xingyaoww Dec 5, 2024
47d0ba4
reduce num lines for testing
xingyaoww Dec 5, 2024
7cbebdf
reduce max lines
xingyaoww Dec 5, 2024
f34dbd3
update implementation to handle overly long cmd output
xingyaoww Dec 6, 2024
ef04cdb
increase timeout for CI
xingyaoww Dec 6, 2024
eb27320
remove extra stuff from tests
xingyaoww Dec 6, 2024
db8114e
handle requests.exceptions.JSONDecodeError
xingyaoww Dec 9, 2024
d5c5db6
fix request error handling
xingyaoww Dec 10, 2024
256b352
add a bunch of debug log
xingyaoww Dec 13, 2024
29bf36b
Merge commit '8ae2fb636eb9ded9039ea8c3a7227b3fce5cc68b' into feature/…
xingyaoww Dec 13, 2024
6ec1683
get git op tests
xingyaoww Dec 13, 2024
14b1085
add mechanism to avoid double newline
xingyaoww Dec 13, 2024
f529bc8
try fix serialization
xingyaoww Dec 13, 2024
00253b9
try fix serialization
xingyaoww Dec 13, 2024
47ae5bf
fix serialization
xingyaoww Dec 13, 2024
9ded783
fix command success test
xingyaoww Dec 13, 2024
06f7694
fix tests
xingyaoww Dec 13, 2024
21e497b
Merge commit 'd733bc6bdd8e743d2e5a7f5fe592f7462548c5d9' into feature/…
xingyaoww Dec 13, 2024
9bd5143
fix test case
xingyaoww Dec 16, 2024
6cf0a08
return alive only when client is initialized
xingyaoww Dec 17, 2024
5953ee8
update log
xingyaoww Dec 17, 2024
a5404b8
add check for python interpreter
xingyaoww Dec 17, 2024
2dba843
add cwd to agent observation
xingyaoww Dec 17, 2024
dfb33ca
remove request body
xingyaoww Dec 17, 2024
06a68eb
use cp -r instead of mv
xingyaoww Dec 17, 2024
e6f095c
Merge commit '3297e4d5a8c8578bbe220bed6489d74a659a832a' into feature/…
xingyaoww Dec 17, 2024
faaf63d
increase timeout
xingyaoww Dec 18, 2024
22cb1e6
set max retries back to 5
xingyaoww Dec 18, 2024
e5f798b
make cannot restore state a debug message
xingyaoww Dec 18, 2024
fcc7fdf
cleanup runtime exception handling
xingyaoww Dec 19, 2024
65742fa
increase resource factor for runtime when previous run failed likely …
xingyaoww Dec 20, 2024
901a2c8
remove stuck in look from fatal exception; add AgentRuntimeUnavailabl…
xingyaoww Dec 20, 2024
2bf3202
Merge commit '73c38f1163cc37048c3e31e1941fe4cd798c296e' into feature/…
xingyaoww Dec 20, 2024
b4ed2dc
replace while true with while should_continue
xingyaoww Dec 20, 2024
be8914b
rename pwd to cwd
xingyaoww Dec 20, 2024
8f2e9a9
move bash init logic to a separate init function
xingyaoww Dec 20, 2024
178e029
update resource factor
xingyaoww Dec 20, 2024
7498fe4
Merge commit 'd62cf7e7319850ce8c0dc47a3ddab0f4151d2af6' into feature/…
xingyaoww Dec 23, 2024
fa78313
add initialized for bash session
xingyaoww Dec 23, 2024
8040497
make sure legacy CmdOutputObservation is still serializable
xingyaoww Dec 23, 2024
5ff8998
fix missing init
xingyaoww Dec 23, 2024
c5ca25f
re-order thought
xingyaoww Dec 23, 2024
b34beaa
fix serialization of action
xingyaoww Dec 23, 2024
73f379e
fix obs serialization
xingyaoww Dec 23, 2024
c593295
fix serialization
xingyaoww Dec 23, 2024
bf34c7e
try fix test
xingyaoww Dec 23, 2024
68ffd0c
fix test again
xingyaoww Dec 24, 2024
cf98287
Merge commit 'ecff5c67fb7f1995556f0f36f5050f33dc0953d2' into feature/…
xingyaoww Dec 24, 2024
bb9c19b
pretty print file write action
xingyaoww Dec 24, 2024
c89677d
improve util script for swebench
xingyaoww Dec 26, 2024
165ee7a
print actual visualization file path of the diff
xingyaoww Dec 26, 2024
9bc721b
fix grab test_output logic
xingyaoww Dec 26, 2024
66fea9f
add runtime error failure recovery for eval_infer
xingyaoww Dec 26, 2024
128ba2f
tentatively support run tests
xingyaoww Dec 27, 2024
485aad8
Revert "tentatively support run tests"
xingyaoww Dec 27, 2024
c1cdbd5
add test case for one exposed bug during eval
xingyaoww Dec 29, 2024
1066544
Merge commit '95f7a6a4dc4e9ac3eabbfafac0f04549d62cd0d6' into feature/…
xingyaoww Dec 29, 2024
2d38b28
add str obs for file read/write
xingyaoww Dec 29, 2024
71029ad
Update openhands/events/serialization/event.py
xingyaoww Dec 29, 2024
593eb8d
Update openhands/events/observation/commands.py
xingyaoww Dec 29, 2024
54ec610
Update openhands/runtime/utils/bash.py
xingyaoww Dec 29, 2024
a249a08
Merge branch 'feature/tmux-shell' of https://github.com/All-Hands-AI/…
xingyaoww Dec 29, 2024
5b10650
handle serialization for command
xingyaoww Dec 29, 2024
040ee11
fix dict in-place modification
xingyaoww Dec 29, 2024
2e7d72b
fix: handling of special bash escape character like \;
xingyaoww Dec 30, 2024
fad57b7
Fix escaping of \; in bash commands
xingyaoww Dec 30, 2024
a8db8cd
Merge commit '0e8e3c87f31705d83aeca8e3748eecc7a545dd3c' into feature/…
xingyaoww Dec 30, 2024
9e8b483
add tests
xingyaoww Dec 30, 2024
f859d55
fix: preserve special chars in command substitution
openhands-agent Dec 30, 2024
db6e9b5
skip part that has not "parts" attr
xingyaoww Dec 30, 2024
f7a2a8b
add test
xingyaoww Dec 30, 2024
a3a8d7e
fix: properly handle heredoc content in escape_bash_special_chars by …
openhands-agent Dec 30, 2024
1a0dc94
fix: properly reset heredoc state when end marker is encountered
openhands-agent Dec 30, 2024
599d1dc
fix: properly handle heredoc content in escape_bash_special_chars usi…
openhands-agent Dec 30, 2024
ed45cd1
simplify bash
xingyaoww Dec 30, 2024
b0d5324
Merge commit 'c37e865c56587ad48608bcc5ba854bc20999ec6f' into feature/…
xingyaoww Dec 30, 2024
86aed3e
handle none heredoc
xingyaoww Dec 30, 2024
8e9b7e5
default use_microagents to false
xingyaoww Dec 31, 2024
41737fb
only enable microagent in session by default
xingyaoww Dec 31, 2024
6eb7b21
add retry for potential rate limit
xingyaoww Dec 31, 2024
996129a
consider 404 502 and 503 as runtime error
xingyaoww Jan 1, 2025
a699ac5
change max resource to 8
xingyaoww Jan 2, 2025
3e13f78
consider 404 502 and 503 as runtime error
xingyaoww Jan 1, 2025
81655ea
add retry for potential rate limit
xingyaoww Dec 31, 2024
9a8878d
only enable microagent in session by default
xingyaoww Dec 31, 2024
38f806c
change default value of use_microagents
xingyaoww Jan 2, 2025
53d436c
move logic to remote runtime
xingyaoww Jan 2, 2025
611dea7
change 502 to disconnected
xingyaoww Jan 2, 2025
fe7298e
Fix pr #5976: Set default value of use_microagents to False to preven…
openhands-agent Jan 2, 2025
24c39b2
Update evaluation/benchmarks/agent_bench/run_infer.py
xingyaoww Jan 2, 2025
e3bd35f
Update openhands/server/session/session.py
xingyaoww Jan 2, 2025
3509456
Update openhands/core/cli.py
xingyaoww Jan 2, 2025
7cf99a5
Merge commit '3509456a67e47e8a459be5294796cc96b2fc4d30' into feature/…
xingyaoww Jan 2, 2025
1a7f067
Merge commit '611dea76b4d884cf2e81f5d656f796de7945b575' into feature/…
xingyaoww Jan 2, 2025
a6709be
Merge commit 'a1b59b6185a80fd26cb7c948c074081ac84aba59' into feature/…
xingyaoww Jan 2, 2025
a0fa58b
update compare outputs script for better spotting of issues
xingyaoww Jan 2, 2025
11c55d2
Merge commit 'c567c1126745fe2fefc1c6e90be7a388d657d067' into feature/…
xingyaoww Jan 3, 2025
866e7f8
Merge branch 'main' into feature/tmux-shell
xingyaoww Jan 3, 2025
a4b581f
chore: remove extra debugging print
xingyaoww Jan 3, 2025
752f341
revert default microagent changes
xingyaoww Jan 3, 2025
d9e0486
Merge branch 'main' into feature/tmux-shell
xingyaoww Jan 3, 2025
e0692e8
Merge branch 'main' into feature/tmux-shell
xingyaoww Jan 3, 2025
782c619
Merge branch 'main' into feature/tmux-shell
xingyaoww Jan 3, 2025
2025961
tweak order
xingyaoww Jan 3, 2025
d0a6d71
remove http code handling from execution cli
xingyaoww Jan 3, 2025
dbb3953
minor command output tweak
xingyaoww Jan 3, 2025
db274a1
Merge branch 'main' into feature/tmux-shell
rbren Jan 3, 2025
f601e19
fix error observation for command
xingyaoww Jan 3, 2025
64a2f6d
fix terminal
xingyaoww Jan 3, 2025
21c4230
remove extra space after cmd output; add newline before "Output:"
xingyaoww Jan 3, 2025
31bdc15
fix linter
xingyaoww Jan 3, 2025
2 changes: 2 additions & 0 deletions .github/workflows/dummy-agent-test.yml
@@ -36,6 +36,8 @@ jobs:
 - name: Set up Docker Buildx
 id: buildx
 uses: docker/setup-buildx-action@v3
+- name: Install tmux
+run: sudo apt-get update && sudo apt-get install -y tmux
 - name: Install poetry via pipx
 run: pipx install poetry
 - name: Set up Python
2 changes: 2 additions & 0 deletions .github/workflows/eval-runner.yml
@@ -29,6 +29,8 @@ jobs:
 - name: Checkout repository
 uses: actions/checkout@v4

+- name: Install tmux
+run: sudo apt-get update && sudo apt-get install -y tmux
 - name: Install poetry via pipx
 run: pipx install poetry

2 changes: 2 additions & 0 deletions .github/workflows/py-unit-tests-mac.yml
@@ -31,6 +31,8 @@ jobs:
 key: ${{ runner.os }}-poetry-${{ hashFiles('**/poetry.lock') }}
 restore-keys: |
 ${{ runner.os }}-poetry-
+- name: Install tmux
+run: brew install tmux
 - name: Install poetry via pipx
 run: pipx install poetry
 - name: Install Python dependencies using Poetry
2 changes: 2 additions & 0 deletions .github/workflows/py-unit-tests.yml
@@ -30,6 +30,8 @@ jobs:
 - name: Set up Docker Buildx
 id: buildx
 uses: docker/setup-buildx-action@v3
+- name: Install tmux
+run: sudo apt-get update && sudo apt-get install -y tmux
 - name: Install poetry via pipx
 run: pipx install poetry
 - name: Set up Python
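The workflow changes above all install tmux before the Python dependencies, presumably because the tmux-backed `BashSession` needs the binary on `PATH`. A minimal preflight sketch using only the standard library; `tmux_available` is a hypothetical helper name and is not part of this PR:

```python
import shutil
from typing import Callable, Optional


def tmux_available(which: Callable[[str], Optional[str]] = shutil.which) -> bool:
    """Return True if a `tmux` executable can be located.

    `which` is injectable so the check can be exercised without tmux installed.
    """
    return which('tmux') is not None
```

Calling `tmux_available()` at the start of a test session would surface a missing binary with a clear message instead of a late failure inside a shell test.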
1 change: 0 additions & 1 deletion docs/static/img/backend_architecture.puml
@@ -123,7 +123,6 @@ class openhands.state.State {
 updated_info: List[Tuple[Action, Observation]]
 }
 class openhands.observation.CmdOutputObservation {
-command_id: int
 command: str
 exit_code: int
 observation: str
4 changes: 1 addition & 3 deletions evaluation/benchmarks/agent_bench/run_infer.py
@@ -135,7 +135,6 @@ def complete_runtime(

 action = CmdRunAction(
 command=f'chmod +x ./{script_name} && ./{script_name}',
-keep_prompt=False,
 )
 logger.info(action, extra={'msg_type': 'ACTION'})
 obs = runtime.run_action(action)
@@ -162,8 +161,7 @@ def complete_runtime(
 logger.info(f'Running get ground truth cmd: {script_name}')

 action = CmdRunAction(
-command=f'chmod +x ./{script_name} && ./{script_name}',
-keep_prompt=False,
+command=f'chmod +x ./{script_name} && ./{script_name}'
 )
 logger.info(action, extra={'msg_type': 'ACTION'})
 obs = runtime.run_action(action)
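Every call site in this diff drops `keep_prompt=False`. If the action class rejects unknown keywords, as a plain dataclass does, any call site left unconverted would fail at construction time rather than being silently ignored. A small sketch of that failure mode; `CmdRunActionSketch` is a hypothetical stand-in, not the real `CmdRunAction`, whose full field list is not shown here:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CmdRunActionSketch:
    """Hypothetical stand-in carrying only the fields used in this diff."""

    command: str
    timeout: Optional[int] = None


def build_action(**kwargs) -> Optional[CmdRunActionSketch]:
    """Return the action, or None when an unknown keyword such as keep_prompt is passed."""
    try:
        return CmdRunActionSketch(**kwargs)
    except TypeError:
        return None


new_style = build_action(command='cat /workspace/bad.txt')
old_style = build_action(command='cat /workspace/bad.txt', keep_prompt=False)
```

Here `new_style` is a usable action, while `old_style` is `None` because the dataclass constructor raised `TypeError` on the removed keyword.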
5 changes: 1 addition & 4 deletions evaluation/benchmarks/aider_bench/run_infer.py
@@ -143,10 +143,7 @@ def complete_runtime(
 )
 logger.info(f'Running test file: {script_name}')

-action = CmdRunAction(
-command=f'python3 -m unittest {script_name}',
-keep_prompt=False,
-)
+action = CmdRunAction(command=f'python3 -m unittest {script_name}')
 logger.info(action, extra={'msg_type': 'ACTION'})
 obs = runtime.run_action(action)
 logger.info(obs, extra={'msg_type': 'OBSERVATION'})
6 changes: 2 additions & 4 deletions evaluation/benchmarks/biocoder/run_infer.py
@@ -197,7 +197,7 @@ def complete_runtime(
 if obs.exit_code == 0:
 test_result['metadata']['1_copy_change_success'] = True

-action = CmdRunAction(command=f'cat {generated_path}', keep_prompt=False)
+action = CmdRunAction(command=f'cat {generated_path}')
 logger.info(action, extra={'msg_type': 'ACTION'})
 obs = runtime.run_action(action)
 assert obs.exit_code == 0
@@ -221,9 +221,7 @@ def complete_runtime(
 logger.info(obs, extra={'msg_type': 'OBSERVATION'})
 assert obs.exit_code == 0

-action = CmdRunAction(
-command='cat /testing_files/results_biocoder.json', keep_prompt=False
-)
+action = CmdRunAction(command='cat /testing_files/results_biocoder.json')
 logger.info(action, extra={'msg_type': 'ACTION'})
 obs = runtime.run_action(action)
 if obs.exit_code == 0:
1 change: 0 additions & 1 deletion evaluation/benchmarks/bird/README.md
@@ -127,7 +127,6 @@ For each problem, OpenHands is given a set number of iterations to fix the faili
 "observation": "run",
 "content": "california_schools/california_schools.sqlite\r\n[(1.0,)]",
 "extras": {
-"command_id": -1,
 "command": "python3 0.py",
 "exit_code": 0
 }
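Observation logs recorded before this change may still carry `command_id` in `extras`. A minimal sketch of normalizing such a record before comparing it with newer output; `drop_legacy_command_id` is a hypothetical helper, not part of this PR, and it returns a copy rather than mutating its input:

```python
from typing import Any, Dict


def drop_legacy_command_id(extras: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `extras` without the legacy `command_id` key."""
    return {key: value for key, value in extras.items() if key != 'command_id'}


legacy = {'command_id': -1, 'command': 'python3 0.py', 'exit_code': 0}
cleaned = drop_legacy_command_id(legacy)
```

Building a new dict keeps the original record intact, which matters when the same object is logged or reused elsewhere.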
10 changes: 2 additions & 8 deletions evaluation/benchmarks/bird/run_infer.py
@@ -266,10 +266,7 @@ def initialize_runtime(
 runtime.copy_to(db_file, '/workspace')

 # Check the database is copied
-action = CmdRunAction(
-command='cd /workspace && ls -l',
-keep_prompt=False,
-)
+action = CmdRunAction(command='cd /workspace && ls -l')
 obs = runtime.run_action(action)
 logger.info(obs, extra={'msg_type': 'OBSERVATION'})
 assert obs.exit_code == 0
@@ -298,10 +295,7 @@ def complete_runtime(
 instance_id = instance.instance_id.replace('/', '__')
 path = os.path.join('/workspace', f'{instance_id}.py')

-action = CmdRunAction(
-command=f'cat {path}',
-keep_prompt=False,
-)
+action = CmdRunAction(command=f'cat {path}')
 obs = runtime.run_action(action)
 logger.info(obs, extra={'msg_type': 'OBSERVATION'})

3 changes: 0 additions & 3 deletions evaluation/benchmarks/humanevalfix/README.md
@@ -71,7 +71,6 @@ For each problem, OpenHands is given a set number of iterations to fix the faili
 "observation": "run",
 "content": "[File: /workspace/Python__2.py (14 lines total)]\r\n1:def truncate_number(number: float) -> float:\r\n2: return number % 1.0 + 1.0\r\n3:\r\n4:\r\n5:\r\n6:\r\n7:\r\n8:\r\n9:def check(truncate_number):\r\n10: assert truncate_number(3.5) == 0.5\r\n11: assert abs(truncate_number(1.33) - 0.33) < 1e-6\r\n12: assert abs(truncate_number(123.456) - 0.456) < 1e-6\r\n13:\r\n14:check(truncate_number)",
 "extras": {
-"command_id": -1,
 "command": "open Python__2.py",
 "exit_code": 0
 }
@@ -98,7 +97,6 @@ For each problem, OpenHands is given a set number of iterations to fix the faili
 "observation": "run",
 "content": "> > [File: /workspace/Python__2.py (14 lines total)]\r\n1:def truncate_number(number: float) -> float:\r\n2: return number % 1.0\r\n3:\r\n4:\r\n5:\r\n6:\r\n7:\r\n8:\r\n9:def check(truncate_number):\r\n10: assert truncate_number(3.5) == 0.5\r\n11: assert abs(truncate_number(1.33) - 0.33) < 1e-6\r\n12: assert abs(truncate_number(123.456) - 0.456) < 1e-6\r\n13:\r\n14:check(truncate_number)\r\nFile updated. Please review the changes and make sure they are correct (correct indentation, no duplicate lines, etc). Edit the file again if necessary.",
 "extras": {
-"command_id": -1,
 "command": "edit 2:2 <<EOF\n return number % 1.0\nEOF",
 "exit_code": 0
 }
@@ -125,7 +123,6 @@ For each problem, OpenHands is given a set number of iterations to fix the faili
 "observation": "run",
 "content": "",
 "extras": {
-"command_id": -1,
 "command": "python3 Python__2.py",
 "exit_code": 0
 }
4 changes: 1 addition & 3 deletions evaluation/benchmarks/humanevalfix/run_infer.py
@@ -169,9 +169,7 @@ def complete_runtime(
 num_workers = LANGUAGE_TO_NUM_WORKERS[language]
 python_imports = '\n'.join(IMPORT_HELPER[language])

-action = CmdRunAction(
-command=f'cat /workspace/{_get_instance_id(instance)}.py', keep_prompt=False
-)
+action = CmdRunAction(command=f'cat /workspace/{_get_instance_id(instance)}.py')
 obs = runtime.run_action(action)
 assert obs.exit_code == 0

2 changes: 1 addition & 1 deletion evaluation/benchmarks/ml_bench/run_infer.py
@@ -161,7 +161,7 @@ def complete_runtime(
 eval_script = os.path.join(task_path, 'run.sh')
 logger.info(f'Running evaluation script: {eval_script}')

-action = CmdRunAction(command=f'cat {eval_script}', keep_prompt=False)
+action = CmdRunAction(command=f'cat {eval_script}')
 logger.info(action, extra={'msg_type': 'ACTION'})
 obs = runtime.run_action(action)
 if obs.exit_code == 0:
10 changes: 2 additions & 8 deletions evaluation/benchmarks/scienceagentbench/run_infer.py
@@ -121,10 +121,7 @@ def initialize_runtime(
 runtime.copy_to(dataset_dir, '/workspace/benchmark/datasets', recursive=True)

 # Check the dataset exists
-action = CmdRunAction(
-command='cd /workspace/benchmark/datasets && ls',
-keep_prompt=False,
-)
+action = CmdRunAction(command='cd /workspace/benchmark/datasets && ls')
 obs = runtime.run_action(action)
 logger.info(obs, extra={'msg_type': 'OBSERVATION'})
 assert obs.exit_code == 0
@@ -154,10 +151,7 @@ def complete_runtime(

 assert obs.exit_code == 0

-action = CmdRunAction(
-command=f'cat pred_programs/{instance.pred_program_name}',
-keep_prompt=False,
-)
+action = CmdRunAction(command=f'cat pred_programs/{instance.pred_program_name}')
 logger.info(action, extra={'msg_type': 'ACTION'})
 obs = runtime.run_action(action)

10 changes: 4 additions & 6 deletions evaluation/benchmarks/swe_bench/eval_infer.py
@@ -177,7 +177,7 @@ def process_instance(
 "(patch --batch --fuzz=5 -p1 -i /tmp/patch.diff && echo 'APPLY_PATCH_PASS' || "
 "echo 'APPLY_PATCH_FAIL')))"
 )
-action = CmdRunAction(command=exec_command, keep_prompt=False)
+action = CmdRunAction(command=exec_command)
 action.timeout = 600
 obs = runtime.run_action(action)
 assert isinstance(obs, CmdOutputObservation)
@@ -200,9 +200,7 @@ def process_instance(

 # Run eval script in background and save output to log file
 log_file = '/tmp/eval_output.log'
-action = CmdRunAction(
-command=f'/tmp/eval.sh > {log_file} 2>&1 & echo $!', keep_prompt=False
-)
+action = CmdRunAction(command=f'/tmp/eval.sh > {log_file} 2>&1 & echo $!')
 action.timeout = 60 # Short timeout just to get the process ID
 obs = runtime.run_action(action)

@@ -224,7 +222,7 @@ def process_instance(
 instance['test_result']['report']['test_timeout'] = True
 break
 check_action = CmdRunAction(
-command=f'ps -p {pid} > /dev/null; echo $?', keep_prompt=False
+command=f'ps -p {pid} > /dev/null; echo $?'
 )
 check_action.timeout = 60
 check_obs = runtime.run_action(check_action)
@@ -242,7 +240,7 @@ def process_instance(
 time.sleep(30) # Wait for 30 seconds before checking again

 # Read the log file
-cat_action = CmdRunAction(command=f'cat {log_file}', keep_prompt=False)
+cat_action = CmdRunAction(command=f'cat {log_file}')
 cat_action.timeout = 300
 cat_obs = runtime.run_action(cat_action)

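The hunks above belong to a loop that launches the eval script in the background, captures its PID from `echo $!`, then checks `ps -p {pid}` every 30 seconds until the process exits or a timeout is recorded. The control flow can be sketched without a runtime. `wait_for_exit` and its parameters are hypothetical names, and the `is_running` callable stands in for the `ps -p` check:

```python
import time
from typing import Callable


def wait_for_exit(
    is_running: Callable[[], bool],
    timeout_s: float,
    poll_interval_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the process exits; return False if the timeout elapses first."""
    start = time.monotonic()
    while True:
        if time.monotonic() - start > timeout_s:
            return False  # the caller would record test_timeout = True here
        if not is_running():
            return True
        sleep(poll_interval_s)


# Simulated process that is still alive for two polls, then gone.
states = iter([True, True, False])
finished = wait_for_exit(lambda: next(states), timeout_s=5.0, poll_interval_s=0.0)
```

Checking the timeout before the liveness probe mirrors the order visible in the diff, where the `test_timeout` branch precedes the `ps -p` command.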
16 changes: 13 additions & 3 deletions evaluation/benchmarks/swe_bench/run_infer.py
@@ -282,6 +282,16 @@ def initialize_runtime(
 logger.info(obs, extra={'msg_type': 'OBSERVATION'})
 assert_and_raise(obs.exit_code == 0, f'Failed to remove git remotes: {str(obs)}')

+action = CmdRunAction(command='which python')
+action.timeout = 600
+logger.info(action, extra={'msg_type': 'ACTION'})
+obs = runtime.run_action(action)
+logger.info(obs, extra={'msg_type': 'OBSERVATION'})
+assert_and_raise(
+obs.exit_code == 0 and 'testbed' in obs.content,
+f'Expected to find python interpreter from testbed, but got: {str(obs)}',
+)
+
 logger.info('-' * 30)
 logger.info('END Runtime Initialization Fn')
 logger.info('-' * 30)
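The new initialization step runs `which python` and fails early unless the command succeeds and the reported path contains `testbed`. The same predicate as a standalone sketch; `interpreter_is_from_testbed` is a hypothetical name, and the example paths in the comment are made up for illustration:

```python
def interpreter_is_from_testbed(exit_code: int, content: str) -> bool:
    """Mirror the check: the command succeeded and the path mentions 'testbed'."""
    return exit_code == 0 and 'testbed' in content


# Illustrative inputs only; real paths depend on the benchmark image.
# interpreter_is_from_testbed(0, '/opt/envs/testbed/bin/python') is True,
# while a system interpreter such as '/usr/bin/python' is rejected.
```

Catching a wrong interpreter at initialization is cheaper than discovering it later through missing packages in the agent's test runs.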
@@ -337,8 +347,7 @@ def complete_runtime(
 git_patch = None
 while n_retries < 5:
 action = CmdRunAction(
-command=f'git diff --no-color --cached {instance["base_commit"]}',
-keep_prompt=False,
+command=f'git diff --no-color --cached {instance["base_commit"]}'
 )
 action.timeout = 600 + 100 * n_retries
 logger.info(action, extra={'msg_type': 'ACTION'})
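The retry loop around `git diff` raises the timeout by 100 seconds per attempt, starting at 600. A sketch of that schedule, assuming the five-attempt limit from `while n_retries < 5`; `timeout_schedule` is a hypothetical helper, not code from the PR:

```python
from typing import List


def timeout_schedule(max_retries: int = 5, base: int = 600, step: int = 100) -> List[int]:
    """Timeouts in seconds for attempts 0..max_retries-1 (base + step * attempt)."""
    return [base + step * attempt for attempt in range(max_retries)]


schedule = timeout_schedule()
```

With the defaults, the last attempt gets 1000 seconds, so a slow `git diff` on a large repository has progressively more room before the loop gives up.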
@@ -385,7 +394,7 @@ def process_instance(
 if runtime_failure_count > 0:
 config.sandbox.remote_runtime_resource_factor = min(
 config.sandbox.remote_runtime_resource_factor * (2**runtime_failure_count),
-2, # hardcode maximum resource factor to 2
+4, # hardcode maximum resource factor to 4
 )
 logger.warning(
 f'This is the second attempt for instance {instance.instance_id}, setting resource factor to {config.sandbox.remote_runtime_resource_factor}'
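After a runtime failure, the resource factor is multiplied by two per recorded failure and then clamped, with the cap raised from 2 to 4 in this hunk. A sketch of the arithmetic; `scaled_resource_factor` is a hypothetical function, not code from the PR:

```python
def scaled_resource_factor(base: int, failure_count: int, cap: int = 4) -> int:
    """Double `base` once per failure, never exceeding `cap`."""
    if failure_count <= 0:
        return base  # no failure recorded: leave the factor untouched
    return min(base * (2 ** failure_count), cap)


factors = [scaled_resource_factor(1, n) for n in range(4)]
```

Starting from a base of 1, the factor reaches the cap after two failures, so later retries stop growing the request.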
@@ -535,4 +544,5 @@ def filter_dataset(dataset: pd.DataFrame, filter_column: str) -> pd.DataFrame:
 args.eval_num_workers,
 process_instance,
 timeout_seconds=120 * 60, # 2 hour PER instance should be more than enough
+max_retries=5,
 )
2 changes: 1 addition & 1 deletion evaluation/integration_tests/tests/t01_fix_simple_typo.py
@@ -24,7 +24,7 @@ def initialize_runtime(cls, runtime: Runtime) -> None:
 @classmethod
 def verify_result(cls, runtime: Runtime, histories: list[Event]) -> TestResult:
 # check if the file /workspace/bad.txt has been fixed
-action = CmdRunAction(command='cat /workspace/bad.txt', keep_prompt=False)
+action = CmdRunAction(command='cat /workspace/bad.txt')
 obs = runtime.run_action(action)
 if obs.exit_code != 0:
 return TestResult(
6 changes: 3 additions & 3 deletions evaluation/integration_tests/tests/t02_add_bash_hello.py
@@ -10,14 +10,14 @@ class Test(BaseIntegrationTest):

 @classmethod
 def initialize_runtime(cls, runtime: Runtime) -> None:
-action = CmdRunAction(command='mkdir -p /workspace', keep_prompt=False)
+action = CmdRunAction(command='mkdir -p /workspace')
 obs = runtime.run_action(action)
 assert_and_raise(obs.exit_code == 0, f'Failed to run command: {obs.content}')

 @classmethod
 def verify_result(cls, runtime: Runtime, histories: list[Event]) -> TestResult:
 # check if the file /workspace/hello.sh exists
-action = CmdRunAction(command='cat /workspace/hello.sh', keep_prompt=False)
+action = CmdRunAction(command='cat /workspace/hello.sh')
 obs = runtime.run_action(action)
 if obs.exit_code != 0:
 return TestResult(
@@ -26,7 +26,7 @@ def verify_result(cls, runtime: Runtime, histories: list[Event]) -> TestResult:
 )

 # execute the script
-action = CmdRunAction(command='bash /workspace/hello.sh', keep_prompt=False)
+action = CmdRunAction(command='bash /workspace/hello.sh')
 obs = runtime.run_action(action)
 if obs.exit_code != 0:
 return TestResult(
6 changes: 3 additions & 3 deletions evaluation/integration_tests/tests/t03_jupyter_write_file.py
@@ -10,14 +10,14 @@ class Test(BaseIntegrationTest):

 @classmethod
 def initialize_runtime(cls, runtime: Runtime) -> None:
-action = CmdRunAction(command='mkdir -p /workspace', keep_prompt=False)
+action = CmdRunAction(command='mkdir -p /workspace')
 obs = runtime.run_action(action)
 assert_and_raise(obs.exit_code == 0, f'Failed to run command: {obs.content}')

 @classmethod
 def verify_result(cls, runtime: Runtime, histories: list[Event]) -> TestResult:
 # check if the file /workspace/hello.sh exists
-action = CmdRunAction(command='cat /workspace/test.txt', keep_prompt=False)
+action = CmdRunAction(command='cat /workspace/test.txt')
 obs = runtime.run_action(action)
 if obs.exit_code != 0:
 return TestResult(
@@ -26,7 +26,7 @@ def verify_result(cls, runtime: Runtime, histories: list[Event]) -> TestResult:
 )

 # execute the script
-action = CmdRunAction(command='cat /workspace/test.txt', keep_prompt=False)
+action = CmdRunAction(command='cat /workspace/test.txt')
 obs = runtime.run_action(action)

 if obs.exit_code != 0:
14 changes: 6 additions & 8 deletions evaluation/integration_tests/tests/t04_git_staging.py
@@ -10,31 +10,29 @@ class Test(BaseIntegrationTest):

 @classmethod
 def initialize_runtime(cls, runtime: Runtime) -> None:
-action = CmdRunAction(command='mkdir -p /workspace', keep_prompt=False)
+action = CmdRunAction(command='mkdir -p /workspace')
 obs = runtime.run_action(action)
 assert_and_raise(obs.exit_code == 0, f'Failed to run command: {obs.content}')

 # git init
-action = CmdRunAction(command='git init', keep_prompt=False)
+action = CmdRunAction(command='git init')
 obs = runtime.run_action(action)
 assert_and_raise(obs.exit_code == 0, f'Failed to run command: {obs.content}')

 # create README.md
-action = CmdRunAction(
-command='echo \'print("hello world")\' > hello.py', keep_prompt=False
-)
+action = CmdRunAction(command='echo \'print("hello world")\' > hello.py')
 obs = runtime.run_action(action)
 assert_and_raise(obs.exit_code == 0, f'Failed to run command: {obs.content}')

 # git add README.md
-action = CmdRunAction(command='git add hello.py', keep_prompt=False)
+action = CmdRunAction(command='git add hello.py')
 obs = runtime.run_action(action)
 assert_and_raise(obs.exit_code == 0, f'Failed to run command: {obs.content}')

 @classmethod
 def verify_result(cls, runtime: Runtime, histories: list[Event]) -> TestResult:
 # check if the file /workspace/hello.py exists
-action = CmdRunAction(command='cat /workspace/hello.py', keep_prompt=False)
+action = CmdRunAction(command='cat /workspace/hello.py')
 obs = runtime.run_action(action)
 if obs.exit_code != 0:
 return TestResult(
@@ -43,7 +41,7 @@ def verify_result(cls, runtime: Runtime, histories: list[Event]) -> TestResult:
 )

 # check if the staging area is empty
-action = CmdRunAction(command='git status', keep_prompt=False)
+action = CmdRunAction(command='git status')
 obs = runtime.run_action(action)
 if obs.exit_code != 0:
 return TestResult(