Conflicting Behavior Between RUN_WITH_BROWSING and Instruction #5093

hmhieu18 · 2024-11-17T20:09:36Z

Could you clarify why the instruction states, "You SHOULD NEVER attempt to browse the web," when RUN_WITH_BROWSING is set to True?

This seems to conflict with the variable's name.

OpenHands/evaluation/swe_bench/run_infer.py

Lines 94 to 100 in 088e895

        if RUN_WITH_BROWSING:
            instruction += (
                '<IMPORTANT!>\n'
                'You SHOULD NEVER attempt to browse the web. '
                '</IMPORTANT!>\n'
            )
        return instruction

enyst · 2024-11-18T00:18:54Z

In my understanding, SWE-bench has some rules about that: they state that the agent must not have web browsing at all, or if it does, the agent developer must have made sure it won't use it to look up solutions. 😅

You may want to see their checklist and discussion here: https://github.com/swe-bench/experiments/blob/main/checklist.md

hmhieu18 · 2024-11-18T05:22:03Z

Thanks for your clarification! May I confirm whether your solution for SWE-bench avoids web browsing entirely, or does it use web browsing while ensuring it does not look up provided solutions? 😃

enyst · 2024-11-18T15:43:45Z

Xingyao knows better. @xingyaoww could you please take a look at this question?

What I can say is, if it's enabled at all, it really is up to the LLM: a few weeks ago, during some random debugging session (not our official evals), I literally happened to be staring at the logs almost in real time when Deepseek said something like "well this isn't working, let me browse on the web..." and I'm like wait, what. 😅

Gotta love LLMs. Hey, probabilistic behavior we wanted, probabilistic behavior we got! 😂

xingyaoww · 2024-11-18T15:56:49Z

> May I confirm whether your solution for SWE-bench avoids web browsing entirely, or does it use web browsing while ensuring it does not look up provided solutions? 😃

Yep - we did not enable the browsing tool for our evals and benchmarks.

The reason why RUN_WITH_BROWSING is there is just to make sure our (very long) browsing prompt doesn't degrade the SWE-Bench performance.
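For readers following along, here is a minimal sketch of the gating pattern in the quoted snippet. It is not the actual `run_infer.py` code: `build_instruction` and `NO_BROWSING_NOTICE` are hypothetical names, and reading the flag from an environment variable is an assumption made for illustration.

```python
import os

# Assumption: the flag comes from an environment variable of the same name.
RUN_WITH_BROWSING = os.environ.get('RUN_WITH_BROWSING', 'false').lower() == 'true'

# Text copied from the snippet quoted in this issue.
NO_BROWSING_NOTICE = (
    '<IMPORTANT!>\n'
    'You SHOULD NEVER attempt to browse the web. '
    '</IMPORTANT!>\n'
)


def build_instruction(base: str, run_with_browsing: bool = RUN_WITH_BROWSING) -> str:
    """Hypothetical helper: append the no-browsing notice only when the flag is set."""
    if run_with_browsing:
        return base + NO_BROWSING_NOTICE
    return base


if __name__ == '__main__':
    # With the flag on, the prompt ends with the explicit "never browse" notice.
    print(build_instruction('Please fix the failing test.\n', run_with_browsing=True))
```

In this sketch the flag only adds the notice to the task instruction. It does not itself enable or disable any browsing tool.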

hmhieu18 · 2024-11-18T17:56:39Z

Thanks @enyst and @xingyaoww for your confirmation.

hmhieu18 closed this as completed Nov 18, 2024
