[feat(backend)] Alignment checker for browsing agent #5105
This looks great, thanks so much!
Do you have actual evaluation results on BrowserArt? Or is that still pending?
… into browserart_defence
… into browserart_defence
Hey @s-aniruddha, just wanted to check that you saw neubig's comment.
Hello @mamoodi! Yes, I saw @neubig's comment and posted a longer response in Slack. Here's an abridged version of that response: Summary of results:
This looks great, thank you for the contribution!
End-user friendly description of the problem this fixes or functionality that this introduces
Recent work (https://scale.com/research/browser-art) showed that agentic LLMs do not refuse harmful instructions even though their backbone LLMs do. To combat this, we introduce a guardrail that detects and prevents unsafe behaviour by the browsing agent.
Give a summary of what the PR does, explaining any non-trivial design decisions
Guardrail feature that uses the underlying LLM of the agent to:
* Examine the user's request and check if it is harmful.
* Examine the content entered by the agent in a textbox (argument of the “fill” browser action) and check if it is harmful.
If the guardrail evaluates either of the two conditions to be true, it emits a change_agent_state action and sets the AgentState to ERROR, which stops the agent from proceeding further.
To enable this feature, set the check_browsing_alignment attribute of the InvariantAnalyzer object to True and initialise its guardrail_llm attribute with an LLM object.
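For readers who want the control flow at a glance, here is a minimal, self-contained sketch of the two checks described above. It is not the code in this PR: `BrowsingGuardrail`, `ChangeAgentStateAction`, the `AgentState` enum, and `keyword_stub` are hypothetical stand-ins written for illustration, and the `is_harmful` callable is assumed to wrap a prompt to the guardrail LLM.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class AgentState(Enum):
    """Hypothetical, reduced stand-in for the agent's state enum."""
    RUNNING = "running"
    ERROR = "error"


@dataclass
class ChangeAgentStateAction:
    """Hypothetical stand-in for the change_agent_state action."""
    agent_state: AgentState
    reason: str


@dataclass
class BrowsingGuardrail:
    # Assumption: `is_harmful` wraps a call to the guardrail LLM and
    # returns True when the LLM judges the given text to be harmful.
    is_harmful: Callable[[str], bool]
    # Plays the role of the check_browsing_alignment switch.
    enabled: bool = True

    def _check(self, text: str, reason: str) -> Optional[ChangeAgentStateAction]:
        # Emit an ERROR state change only when the check is enabled
        # and the classifier flags the text.
        if self.enabled and self.is_harmful(text):
            return ChangeAgentStateAction(AgentState.ERROR, reason)
        return None

    def check_user_request(self, request: str) -> Optional[ChangeAgentStateAction]:
        """Condition 1: the user's request itself is harmful."""
        return self._check(request, "harmful user request")

    def check_fill_action(self, fill_text: str) -> Optional[ChangeAgentStateAction]:
        """Condition 2: the text the agent types into a textbox is harmful."""
        return self._check(fill_text, "harmful content in fill action")


def keyword_stub(text: str) -> bool:
    """Toy classifier for this demo only; a real setup would query an LLM."""
    return "phishing" in text.lower()
```

Passing the classifier in as a plain callable keeps the sketch independent of any particular LLM client, and lets the stop logic be exercised with a stub such as `keyword_stub`.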
Link of any specific issues this addresses