Conflicting Behavior Between RUN_WITH_BROWSING and Instruction #5093
Comments
In my understanding, SWE-bench has some rules about that: they state that the agent must not have web browsing at all, or if it does, the agent developer must have made sure it won't use it to look up solutions. 😅 You may want to see their checklist and discussion here: https://github.com/swe-bench/experiments/blob/main/checklist.md
Thanks for your clarification! May I confirm whether your solution for SWE-bench avoids web browsing entirely, or does it use web browsing while ensuring it does not look up provided solutions? 😃
Xingyao knows better. @xingyaoww could you please take a look at this question? What I can say is, if it's enabled at all, it really is up to the LLM: a few weeks ago, during some random debugging session (not our official evals), I literally happened to be staring at the logs almost in real time when Deepseek said something like "well this isn't working, let me browse on the web..." and I'm like wait, what. 😅 Gotta love LLMs. Hey, probabilistic behavior we wanted, probabilistic behavior we got! 😂
Yep - we did not enable the browsing tool for our eval & benchmark. RUN_WITH_BROWSING is there just to make sure our (very long) browsing prompt doesn't degrade SWE-Bench performance.
Thanks @enyst and @xingyaoww for your confirmation.
Could you clarify why the instruction states, "You SHOULD NEVER attempt to browse the web," when RUN_WITH_BROWSING is set to True?
This seems to conflict with the variable's name.
OpenHands/evaluation/swe_bench/run_infer.py
Lines 94 to 100 in 088e895
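The referenced snippet is not reproduced in this thread. As a rough illustration of the behavior being asked about, here is a minimal Python sketch, assuming the flag is read from an environment variable and gates an extra warning in the task instruction. The helper names `parse_bool_env` and `build_instruction` and the base instruction wording are hypothetical, not taken from `run_infer.py`; only the quoted "You SHOULD NEVER attempt to browse the web" sentence comes from the issue.

```python
import os


def parse_bool_env(name: str, default: str = "false") -> bool:
    """Hypothetical helper: treat the env var as True only if it equals 'true'."""
    return os.environ.get(name, default).strip().lower() == "true"


def build_instruction(problem_statement: str, run_with_browsing: bool) -> str:
    """Hypothetical helper: assemble the task instruction for one benchmark instance."""
    # Assumed base wording; the real instruction text is not shown in the thread.
    instruction = (
        "Please fix the following issue in the repository.\n\n"
        f"{problem_statement}\n"
    )
    if run_with_browsing:
        # The browsing prompt is present in this mode, so the instruction
        # explicitly tells the agent not to use it (sentence quoted from the issue).
        instruction += (
            "<IMPORTANT!>\n"
            "You SHOULD NEVER attempt to browse the web.\n"
            "</IMPORTANT!>\n"
        )
    return instruction


if __name__ == "__main__":
    flag = parse_bool_env("RUN_WITH_BROWSING")
    print(build_instruction("Example problem statement.", flag))
```

Read this way, the flag name describes which prompt the agent runs with, while the added warning keeps the agent from actually browsing during the evaluation, which would be consistent with the explanation given in the comments.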