[Agent] Support browser control via screenshots #4570

Open
xingyaoww opened this issue Oct 25, 2024 · 13 comments

Labels
browser Related to the Browser agent in OpenHands enhancement New feature or request

@xingyaoww
Collaborator

What problem or use case are you trying to solve?

Implement a tool similar to the computer tool & allow it to control the browser directly.

Describe the UX of the solution you'd like

Do you have thoughts on the technical implementation?

Describe alternatives you've considered

Additional context

@xingyaoww xingyaoww added the enhancement New feature or request label Oct 25, 2024
@xingyaoww xingyaoww added this to the 2024-11 milestone Oct 25, 2024
@x66ccff

x66ccff commented Oct 27, 2024

Here is what you need: #4581

@rbren rbren moved this to In Progress in OpenHands Roadmap Nov 8, 2024
@xingyaoww
Collaborator Author

@ryanhoangt Can you also self-assign this one?

@rbren rbren modified the milestones: 2024-11, 2024-12 Nov 22, 2024
@ryx2
Contributor

ryx2 commented Nov 22, 2024

Why not just use computer use directly?

@ryanhoangt
Contributor

I think we may want to first try computer use as another way to implement the browsing capability, which currently relies on text-based observations only. If it gives a performance boost, we can use it by default (or only for Claude) and fall back to the old browsing implementation for other models. I'm not sure whether the team has other ideas to share on this.
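
A minimal sketch of the fallback idea described above, assuming a simple name check is enough to detect Claude models. The names `BrowsingConfig`, `pick_browsing_strategy`, and the `allow_screenshots` switch are hypothetical and are not taken from the OpenHands codebase.

```python
from dataclasses import dataclass

# Hypothetical strategy names, used only for illustration.
SCREENSHOT = "screenshot"
TEXT = "text"


@dataclass(frozen=True)
class BrowsingConfig:
    model: str
    # Assumed opt-out switch so a user can force the text-based path.
    allow_screenshots: bool = True


def pick_browsing_strategy(config: BrowsingConfig) -> str:
    """Use screenshots for Claude models, text observations otherwise."""
    is_claude = "claude" in config.model.lower()
    if is_claude and config.allow_screenshots:
        return SCREENSHOT
    return TEXT
```

Keeping the decision in one function would make it easy to extend the check later if other vision models turn out to handle screenshots well.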

@xingyaoww
Collaborator Author

Hey @ryx2 - I think that's a good idea, and I've been discussing with @ryanhoangt making computer control the next low-hanging fruit we could pursue to improve the browsing experience (at least when using Claude).

@tobitege
Collaborator

tobitege commented Nov 23, 2024

Computer use can become extremely expensive when screenshots (images) are sent, compared to a text-only approach.
I just took a look at OpenRouter, for example (edit: prices are in $ per 1K images):
Sonnet-3.5: $4.8
GPT-4o: $3.613
GPT-4o-mini: $7.225
Gemini Flash 1.5: $0.04
Gemini Pro 1.5: $0.675
It might be helpful if the preferred vision model could be configured somehow.
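
To put the figures above in per-session terms, here is a rough sketch of the arithmetic. The prices are the per-1K-image numbers quoted in this comment; the session length of 50 screenshots is an assumption for illustration, and text token costs are ignored.

```python
# USD per 1,000 input images, as quoted in the comment above.
PRICE_PER_1K_IMAGES = {
    "Sonnet-3.5": 4.8,
    "GPT-4o": 3.613,
    "GPT-4o-mini": 7.225,
    "Gemini Flash 1.5": 0.04,
    "Gemini Pro 1.5": 0.675,
}


def screenshot_cost(model: str, num_screenshots: int) -> float:
    """Image-only cost in USD for sending num_screenshots to model."""
    return PRICE_PER_1K_IMAGES[model] / 1000 * num_screenshots


if __name__ == "__main__":
    steps = 50  # assumed number of screenshots in one browsing session
    for model in PRICE_PER_1K_IMAGES:
        print(f"{model}: ${screenshot_cost(model, steps):.4f}")
```

At these rates a single screenshot costs a fraction of a cent, so the total depends mostly on how many steps a session takes.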

@ryanhoangt
Contributor

ryanhoangt commented Nov 23, 2024

I looked into OpenRouter pricing, and it seems like it's $4.8 / 1k images for Sonnet-3.5 🤔

@tobitege
Collaborator

Hmm, do you have a link to where it says per "1K"?

@ryanhoangt
Contributor

@tobitege
Collaborator

Ohh, you're right, I missed that notation.

@x66ccff

x66ccff commented Nov 23, 2024

Qwen2.5-VL is good if you are concerned about the price. It beats the older versions of 4o and Sonnet 3.5.

See https://qwenlm.github.io/blog/qwen2-vl/

It can be self-hosted.

Contributor

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the Stale Inactive for 30 days label Dec 26, 2024
@mamoodi mamoodi added the browser Related to the Browser agent in OpenHands label Dec 27, 2024
@github-actions github-actions bot removed the Stale Inactive for 30 days label Dec 29, 2024
@neubig neubig modified the milestones: 2024-12, 2025-01 Jan 13, 2025
@ryanhoangt
Contributor

Reference implementation: https://github.com/invariantlabs-ai/playwright-computer-use
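
For readers unfamiliar with the approach, a screenshot-driven control loop roughly needs one step like the sketch below. This is not code from the linked repository. The action dictionary format and the name `execute_action` are assumptions for illustration; the `page` argument is expected to behave like a Playwright sync `Page`, of which only `page.mouse.click(x, y)`, `page.keyboard.type(text)`, and `page.screenshot()` are used.

```python
import base64
from typing import Any


def execute_action(page: Any, action: dict) -> str:
    """Apply one model-issued action to a Playwright-style page.

    Returns a base64-encoded screenshot taken after the action, which
    would be sent back to the model as its next observation.
    """
    kind = action["type"]  # assumed action schema: {"type": ..., ...}
    if kind == "click":
        page.mouse.click(action["x"], action["y"])
    elif kind == "type":
        page.keyboard.type(action["text"])
    elif kind != "screenshot":
        raise ValueError(f"unsupported action: {kind}")
    # page.screenshot() returns the image as bytes.
    return base64.b64encode(page.screenshot()).decode("ascii")
```

Because the function only depends on three methods, it can be exercised with a small fake page object before wiring it to a real browser.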

Projects
Status: In Progress
8 participants