gpt-4-vision-preview integration #241

Open
TimeLordRaps opened this issue Feb 25, 2024 · 0 comments
Labels
feature request New feature or request

Comments

@TimeLordRaps

Is your feature request related to a problem? Please describe.
No

Describe the solution you'd like
Integrate gpt-4-vision, and visual language models more generally, via LangChain: first grab a screenshot of the application, then use https://github.com/reworkd/tarsier to tag the image so the visual model can navigate the app. A rough sketch of the flow is below.
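A minimal sketch of the screenshot → tag → vision-model loop, assuming Playwright for the screenshot, Tarsier's `page_to_image` API as described in its README, and LangChain's OpenAI vision message format. Names, signatures, and the prompt are illustrative and should be checked against the current Tarsier/LangChain releases:

```python
# Sketch only: assumes Playwright, Tarsier's page_to_image, and LangChain's
# gpt-4-vision-preview support; verify signatures against current releases.
import asyncio
import base64

from playwright.async_api import async_playwright
from tarsier import Tarsier, GoogleVisionOCRService
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage


async def suggest_next_action(url: str) -> str:
    # Tarsier needs an OCR backend; credentials dict left empty here on purpose.
    ocr_service = GoogleVisionOCRService({})
    tarsier = Tarsier(ocr_service)

    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        await page.goto(url)

        # Tag the page: returns an annotated screenshot plus a tag -> XPath map
        # that lets the agent ground its answer back to a concrete element.
        screenshot, tag_to_xpath = await tarsier.page_to_image(page)
        await browser.close()

    model = ChatOpenAI(model="gpt-4-vision-preview", max_tokens=256)
    message = HumanMessage(
        content=[
            {
                "type": "text",
                "text": "Which tagged element should be clicked next? Answer with its tag, e.g. [12].",
            },
            {
                "type": "image_url",
                "image_url": {
                    "url": "data:image/png;base64,"
                    + base64.b64encode(screenshot).decode()
                },
            },
        ]
    )
    response = model.invoke([message])
    # The returned tag is resolved to an XPath via tag_to_xpath for the actual click.
    return response.content


if __name__ == "__main__":
    print(asyncio.run(suggest_next_action("https://example.com")))
```

The same loop would work for other visual language models (CogAgent, SeeClick) by swapping the `ChatOpenAI` call for the corresponding LangChain chat model wrapper.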

Describe alternatives you've considered
Skipping Tarsier, but then the models aren't pseudo-grounded: there is no tag-to-element mapping to anchor their outputs to the page.

Additional context
I'm specifically interested in the CogAgent and SeeClick models at the moment, as they currently appear to perform best on visual language modeling tasks for web app navigation.

@TimeLordRaps TimeLordRaps added the feature request New feature or request label Feb 25, 2024