Need to rework Multimodal pipeline for ollama, maybe for other APIs as well #74
Any thoughts on what to do for the Vision prompt? It seems to be unnecessary. I was thinking the default behavior would be to use the normal System prompt, and the Vision prompt could be hidden behind a checkbox, i.e. turn it into a custom Vision prompt.
Also, was the desired UX for vision to be:
The way the UX works right now is closest to option 2, but it appends the response from the vision model to the previous chat message. (By the way, the text in the chatbox gets completely discarded.) I think the best way to make Amica as similar to chatting as possible is to allow both 1 and 2. My chat habits are often to send an image and then type something quickly, to send an image alone, or to append images to my message before sending. When I was trying to rewrite the vision pipeline, I ran into trouble deciding how it should be implemented, and I realized that it would depend on the UX. So, any opinions on this matter?
Arbius has a $200 AIUS bounty for this issue! Brief: Complete the desired UX vision, allowing taking a picture + text and getting a response as well. Rework the pipeline as outlined. Please read carefully: To begin work on a bounty, reply by saying “I claim this bounty” - you will have 48 hours to submit your PR before someone else may attempt to claim this bounty. To complete the bounty, within 48 hours of claiming, reply with a link to your PR referencing this issue and an Ethereum address. You must comply with reviewers' comments and have the PR merged to receive the bounty reward. Please be sure to focus on quality submissions to minimize the amount of time reviewers must take.
In reworking Ollama support for LLaVa, I found the pipeline for Multimodal chats to be unnecessary. Ollama only requires the image to be attached to the message being sent. The multimodal model takes care of the rest.
Currently the vision chat pipeline seems to be for separate Vision/CLIP models and LLMs (one describes the picture, and then the returned result is put into an LLM).
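To illustrate the difference: with Ollama's chat API, the image is simply base64-encoded and attached to the user message via the `images` field, and the multimodal model (e.g. LLaVA) handles it in a single request - no separate vision-model round trip. A minimal sketch of what that request body could look like (the `buildMultimodalRequest` helper and the prompt strings are hypothetical, for illustration only):

```typescript
// Sketch: build an Ollama /api/chat request body with the image attached
// directly to the user message, instead of routing it through a separate
// vision/CLIP describe-then-chat pipeline.

interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
  images?: string[]; // base64-encoded image data
}

interface ChatRequest {
  model: string;
  messages: ChatMessage[];
}

// Hypothetical helper: one request carries both the text and the image.
function buildMultimodalRequest(
  model: string,
  systemPrompt: string,
  userText: string,
  imageBase64: string,
): ChatRequest {
  return {
    model,
    messages: [
      { role: "system", content: systemPrompt },
      { role: "user", content: userText, images: [imageBase64] },
    ],
  };
}

// Example usage: text and image travel together in the same user message,
// matching chat habit 1 ("send an image with text") from the discussion.
const req = buildMultimodalRequest(
  "llava",
  "You are Amica.",
  "What is in this picture?",
  "iVBORw0KGgo...", // truncated base64, illustration only
);
```

This would let the existing describe-with-vision-model-then-prompt-the-LLM path remain as a fallback for backends without native multimodal support, while Ollama gets the single-request path.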