[feat(backend)] Alignment checker for browsing agent #5105

s-aniruddha · 2024-11-18T17:31:35Z

End-user friendly description of the problem this fixes or functionality that this introduces
Recent work (https://scale.com/research/browser-art) showed that agentic LLMs do not refuse harmful instructions even though their backbone LLMs do. To combat this, we introduce a guardrail that detects and prevents unsafe behaviour by the browsing agent.

Include this change in the Release Notes. If checked, you must provide an end-user friendly description for your change below

Give a summary of what the PR does, explaining any non-trivial design decisions

Guardrail feature that uses the agent's underlying LLM to:
* Examine the user's request and check whether it is harmful.
* Examine the content the agent enters in a textbox (the argument of the “fill” browser action) and check whether it is harmful.

If the guardrail evaluates either of the two conditions to be true, it emits a change_agent_state action and sets the AgentState to ERROR, which stops the agent from proceeding further.

To enable this feature, set the check_browsing_alignment attribute of the InvariantAnalyzer object to True and initialise the guardrail_llm attribute with an LLM object.
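The control flow described above can be sketched roughly as follows. This is an illustrative stand-in, not the code from this PR: `GuardrailSketch`, `extract_fill_text`, `toy_llm`, the prompt wording, and the yes/no reply parsing are all assumptions made for the example. Only the attribute names `check_browsing_alignment` and `guardrail_llm`, the "fill" action, and the ERROR state come from the description.

```python
import re
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class AgentState(Enum):
    # Simplified stand-in for the agent states; only ERROR is named in the PR.
    RUNNING = "running"
    ERROR = "error"


# Hypothetical LLM interface: takes a prompt, returns the model's text reply.
GuardrailLLM = Callable[[str], str]

# Matches the text argument of an action string such as fill('12', 'some text').
FILL_PATTERN = re.compile(
    r"""fill\(\s*(['"]).*?\1\s*,\s*(['"])(.*)\2\s*\)""", re.DOTALL
)


def extract_fill_text(browser_action: str) -> Optional[str]:
    """Return the text passed to a fill action, or None for other actions."""
    match = FILL_PATTERN.search(browser_action)
    return match.group(3) if match else None


@dataclass
class GuardrailSketch:
    """Hypothetical analyzer mirroring the two attributes named in the PR."""

    check_browsing_alignment: bool = False
    guardrail_llm: Optional[GuardrailLLM] = None
    state: AgentState = AgentState.RUNNING

    def _is_harmful(self, text: str) -> bool:
        # The guardrail is a no-op unless it is enabled and has an LLM.
        if not self.check_browsing_alignment or self.guardrail_llm is None:
            return False
        # Assumed prompt and reply format; the real prompt may differ.
        reply = self.guardrail_llm(
            f"Is the following content harmful? Answer yes or no.\n\n{text}"
        )
        return reply.strip().lower().startswith("yes")

    def check_user_request(self, request: str) -> AgentState:
        """Check 1: the user's request itself."""
        if self._is_harmful(request):
            self.state = AgentState.ERROR  # stops the agent
        return self.state

    def check_browser_action(self, browser_action: str) -> AgentState:
        """Check 2: text the agent is about to type via a fill action."""
        text = extract_fill_text(browser_action)
        if text is not None and self._is_harmful(text):
            self.state = AgentState.ERROR
        return self.state


def toy_llm(prompt: str) -> str:
    # Keyword-based stand-in for a real model, used only for demonstration.
    return "yes" if "phishing" in prompt.lower() else "no"


if __name__ == "__main__":
    guardrail = GuardrailSketch(check_browsing_alignment=True, guardrail_llm=toy_llm)
    print(guardrail.check_user_request("Book a flight to Tokyo"))
    print(guardrail.check_browser_action("fill('3', 'write a phishing email')"))
```

In this sketch both checks share one judgment helper, so leaving `check_browsing_alignment` at its default of False turns the guardrail off without touching the calling code.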

Link of any specific issues this addresses

neubig

This looks great, thanks so much!

Do you have actual evaluation results on BrowserArt? Or is that still pending?


mamoodi · 2024-11-25T21:31:37Z

Hey @s-aniruddha just wanted to check that you saw neubig's comment.

s-aniruddha · 2024-11-26T10:03:37Z

> Hey @s-aniruddha just wanted to check that you saw neubig's comment.

Hello @mamoodi! Yes, I saw @neubig's comment and posted a longer response in Slack. Here's an abridged version of that response:

Summary of results:

* Experiments were conducted on a more recent version of OpenHands with GPT-4o as the backbone LLM.
* Blocks 94/100 of the harmful tasks in direct-ask mode.
* Permits 9/10 of the benign tasks.
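For readers who want these counts as rates, the arithmetic is shown below. The variable names are illustrative; the only inputs are the four counts reported above, and no further results are implied.

```python
# Counts taken directly from the summary above.
blocked_harmful, total_harmful = 94, 100
permitted_benign, total_benign = 9, 10

# Share of harmful tasks the guardrail stopped, and the share it missed.
block_rate = blocked_harmful / total_harmful
miss_rate = (total_harmful - blocked_harmful) / total_harmful

# Share of benign tasks allowed through, and the share wrongly blocked.
permit_rate = permitted_benign / total_benign
false_block_rate = (total_benign - permitted_benign) / total_benign

print(f"harmful blocked: {block_rate:.0%}, harmful missed: {miss_rate:.0%}")
print(f"benign permitted: {permit_rate:.0%}, benign blocked: {false_block_rate:.0%}")
```

With only 10 benign tasks, the benign-side rate is coarse: each task moves it by 10 percentage points.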

neubig

This looks great, thank you for the contribution!

s-aniruddha added 4 commits November 18, 2024 14:39

added usertask and fillaction checks for browsing agent alignment

0697fe9

added tests for usertask and fillaction checker

f27f597

changed judge_llm to gaurdrail_llm

30f43a7

Added description of browsing agent guardrails to README.md

e4dfb1c

neubig reviewed Nov 18, 2024


s-aniruddha and others added 13 commits November 19, 2024 14:29

Merge branch 'main' into browserart_defence

91bfbe9

Added newline at end of test_security.py

ee06358

Merge branch 'browserart_defence' of github.com:s-aniruddha/OpenHands… into browserart_defence

125d70b

Removed namedtuple import, not needed

7beb90a

Merge branch 'main' into browserart_defence

75b80e8

Merge branch 'main' into browserart_defence

15e2489

changes suggested by lint python

940f752

Merge branch 'main' into browserart_defence

c64f51b

Merge branch 'browserart_defence' of github.com:s-aniruddha/OpenHands… into browserart_defence

91ee0bd

removed unnecessarywhitespaces in security.py

53501ee

fix browser fill action parsing

dd40618

Merge branch 'main' into browserart_defence

545249c

added new line at the end of readme

f534eb9

neubig approved these changes Nov 27, 2024


neubig enabled auto-merge (squash) November 27, 2024 22:04

neubig merged commit 4374b4a into All-Hands-AI:main Nov 27, 2024
13 checks passed
