
[feat(backend)] Alignment checker for browsing agent #5105

Merged: 17 commits into All-Hands-AI:main on Nov 27, 2024

Conversation

@s-aniruddha (Contributor) commented Nov 18, 2024

End-user friendly description of the problem this fixes or functionality that this introduces
Recent work (https://scale.com/research/browser-art) showed that agentic LLMs do not refuse harmful instructions even though their backbone LLMs do. To combat this, we introduce a guardrail that detects and prevents unsafe behaviour by the browsing agent.


Give a summary of what the PR does, explaining any non-trivial design decisions

Guardrail feature that uses the agent's underlying LLM to:
* Examine the user's request and check whether it is harmful.
* Examine the content the agent enters into a textbox (the argument of the “fill” browser action) and check whether it is harmful.

If either check flags harmful content, the guardrail emits a change_agent_state action and transforms the AgentState to ERROR, which stops the agent from proceeding further. To enable this feature, set the check_browsing_alignment attribute of the InvariantAnalyzer object to True and initialise the guardrail_llm attribute with an LLM object, as in the sketch below.
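
A minimal sketch of enabling the feature. The import paths, the LLMConfig constructor, and the `event_stream` argument are assumptions based on the OpenHands codebase layout around the time of this PR, not taken verbatim from it; only `check_browsing_alignment` and `guardrail_llm` come from the description above.

```python
# Minimal sketch (assumed imports/paths): enabling the browsing-alignment
# guardrail on the InvariantAnalyzer, per the PR description.
from openhands.core.config import LLMConfig
from openhands.llm.llm import LLM
from openhands.security.invariant import InvariantAnalyzer

# `event_stream` is the running session's EventStream (hypothetical
# placeholder here); the analyzer observes it for browsing actions
# such as "fill".
analyzer = InvariantAnalyzer(event_stream)

# Turn the alignment checker on and give the guardrail an LLM to use
# when judging whether a user request or a "fill" argument is harmful.
analyzer.check_browsing_alignment = True
analyzer.guardrail_llm = LLM(config=LLMConfig(model='gpt-4o'))
```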



@neubig (Contributor) left a comment

This looks great, thanks so much!

Do you have actual evaluation results on BrowserArt? Or is that still pending?

@mamoodi (Collaborator) commented Nov 25, 2024

Hey @s-aniruddha just wanted to check that you saw neubig's comment.

@s-aniruddha (Contributor, Author) replied:

> Hey @s-aniruddha just wanted to check that you saw neubig's comment.

Hello @mamoodi! Yes, I saw @neubig's comment and posted a longer response in Slack. Here's an abridged version:

Summary of results:

  • Experiments were conducted on a more recent version of OpenHands with GPT-4o as the backbone LLM
  • Blocks 94/100 of the harmful tasks in direct-ask mode
  • Permits 9/10 of the benign tasks

@neubig (Contributor) left a comment

This looks great, thank you for the contribution!

@neubig enabled auto-merge (squash) November 27, 2024 22:04
@neubig merged commit 4374b4a into All-Hands-AI:main Nov 27, 2024
13 checks passed