Conflicting Behavior Between RUN_WITH_BROWSING and Instruction #5093

Closed
hmhieu18 opened this issue Nov 17, 2024 · 5 comments

@hmhieu18

Could you clarify why the instruction states, "You SHOULD NEVER attempt to browse the web," when RUN_WITH_BROWSING is set to True?

This seems to conflict with the variable's name.

    if RUN_WITH_BROWSING:
        instruction += (
            '<IMPORTANT!>\n'
            'You SHOULD NEVER attempt to browse the web. '
            '</IMPORTANT!>\n'
        )
    return instruction

@enyst
Collaborator

enyst commented Nov 18, 2024

In my understanding, SWE-bench has some rules about that: they state that the agent must not have web browsing at all, or if it does, the agent developer must have made sure it won't use it to look up solutions. 😅

You may want to see their checklist and discussion here: https://github.com/swe-bench/experiments/blob/main/checklist.md

@hmhieu18
Author

Thanks for your clarification! May I confirm whether your solution for SWE-bench avoids web browsing entirely, or does it use web browsing while ensuring it does not look up provided solutions? 😃

@enyst
Collaborator

enyst commented Nov 18, 2024

Xingyao knows better. @xingyaoww could you please take a look at this question?

What I can say is, if it's enabled at all, it really is up to the LLM: a few weeks ago, during some random debugging session (not our official evals), I literally happened to be staring at the logs almost in real time when Deepseek said something like "well this isn't working, let me browse on the web..." and I'm like wait, what. 😅

Gotta love LLMs. Hey, probabilistic behavior we wanted, probabilistic behavior we got! 😂

@xingyaoww
Collaborator

> May I confirm whether your solution for SWE-bench avoids web browsing entirely, or does it use web browsing while ensuring it does not look up provided solutions? 😃

Yep, we did not enable the browsing tool for our eval and benchmark.

RUN_WITH_BROWSING is there just to make sure our (very long) browsing prompt doesn't degrade SWE-Bench performance.
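
For anyone reading along, here is a minimal sketch of the pattern from the snippet in the question: the flag does not turn browsing on for the agent, it only appends the "never browse" notice to the task instruction. The names `build_instruction`, `NO_BROWSING_NOTICE`, and the sample task text are hypothetical and only for illustration; this is not the project's actual evaluation code.

```python
# Hypothetical constant holding the notice quoted in the question above.
NO_BROWSING_NOTICE = (
    '<IMPORTANT!>\n'
    'You SHOULD NEVER attempt to browse the web. '
    '</IMPORTANT!>\n'
)


def build_instruction(task: str, run_with_browsing: bool) -> str:
    """Return the task instruction, adding the no-browsing notice when the flag is set.

    Assumption: the flag means "the browsing prompt is present", so the
    notice is added to keep the agent from actually using it.
    """
    instruction = task + '\n'
    if run_with_browsing:
        instruction += NO_BROWSING_NOTICE
    return instruction


if __name__ == '__main__':
    # With the flag set, the notice is appended after the task text.
    print(build_instruction('Fix the failing test.', run_with_browsing=True))
```

As noted earlier in the thread, a prompt-level notice like this is only a request to the model, so it does not by itself guarantee the agent never tries to browse.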

@hmhieu18
Author

Thanks @enyst and @xingyaoww for your confirmation.
