Is your feature request related to a problem? Please describe.
No
Describe the solution you'd like
Integrate gpt-4-vision, and visual language models more generally, using LangChain: first grab a screenshot of the application, then use https://github.com/reworkd/tarsier to tag the image so that the visual model can navigate the app.
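A rough sketch of the flow I have in mind, to make the request concrete. Everything here is a hypothetical placeholder: `choose_action`, `Action`, and the three injected callables (`capture`, `tag`, `ask_model`) are names made up for illustration and are not the Tarsier or LangChain API. The real integration would wrap those libraries behind these callables.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

# Hypothetical types for illustration only.
Screenshot = bytes
TagMap = Dict[int, str]  # numeric tag drawn on the image -> element locator


@dataclass
class Action:
    tag_id: int
    locator: str


def choose_action(
    capture: Callable[[], Screenshot],
    tag: Callable[[Screenshot], Tuple[Screenshot, TagMap]],
    ask_model: Callable[[Screenshot, str], str],
    goal: str,
) -> Action:
    """Run one screenshot -> tag -> model step and resolve the model's choice."""
    raw = capture()                      # 1. grab a screenshot of the app
    tagged, tag_map = tag(raw)           # 2. overlay numeric tags on elements
    prompt = f"Goal: {goal}. Reply with the numeric tag of the element to act on."
    reply = ask_model(tagged, prompt)    # 3. ask the visual model

    # Take the first integer in the reply, so "[2]" and "tag 2" both work.
    match = re.search(r"\d+", reply)
    if match is None:
        raise ValueError(f"no tag found in model reply: {reply!r}")
    tag_id = int(match.group())
    if tag_id not in tag_map:
        raise ValueError(f"model chose unknown tag {tag_id}")
    return Action(tag_id, tag_map[tag_id])
```

Passing the three steps in as callables would let the tagger and the model (gpt-4-vision, CogAgent, SeeClick, ...) be swapped independently.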
Describe alternatives you've considered
Not using Tarsier, but then the models aren't pseudo-grounded: their output has no tags tying it to specific on-screen elements.
Additional context
I'm specifically interested in the CogAgent and SeeClick models, as they currently seem to perform best on visual language modeling tasks for web app navigation.