gpt-4-vision-preview integration #241

TimeLordRaps · 2024-02-25T13:53:01Z

Is your feature request related to a problem? Please describe.
No

Describe the solution you'd like
Integrate gpt-4-vision and more generally visual language models using LangChain, by first grabbing a screenshot of the application, and then use https://github.com/reworkd/tarsier to tag the image, allowing the visual model to navigate the app.

Describe alternatives you've considered
Not using tarsier, but then the models aren't pseudo-grounded.

Additional context
I'm interested specifically in the CogAgent and SeeClick models currently as they seem to be performing best on visual langauge modeling tasks for web app navigation.

TimeLordRaps added the feature request New feature or request label Feb 25, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

gpt-4-vision-preview integration #241

gpt-4-vision-preview integration #241

TimeLordRaps commented Feb 25, 2024

gpt-4-vision-preview integration #241

gpt-4-vision-preview integration #241

Comments

TimeLordRaps commented Feb 25, 2024