
Need to rework Multimodal pipeline for ollama, maybe for other APIs as well #74

Open
andyccliao opened this issue Jan 7, 2024 · 3 comments

@andyccliao
Contributor

In reworking Ollama support for LLaVa, I found the pipeline for Multimodal chats to be unnecessary. Ollama only requires the image to be attached to the message being sent. The multimodal model takes care of the rest.

Currently, the vision chat pipeline seems designed for a separate vision/CLIP model paired with an LLM: one model describes the picture, and the returned description is then fed into the LLM.
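
For reference, here is a minimal sketch of what the single-call flow looks like against Ollama's `/api/chat` endpoint (the `images` field on a message is documented Ollama behavior; the helper name and model choice are just illustrative):

```typescript
// Minimal sketch: one request to Ollama's /api/chat with the image
// attached directly to the user message. The helper name `askWithImage`
// is illustrative, not part of Amica.
async function askWithImage(prompt: string, imageBase64: string): Promise<string> {
  const res = await fetch("http://localhost:11434/api/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "llava",
      stream: false,
      messages: [
        // The multimodal model reads the image from the message itself;
        // no separate vision/CLIP describe-then-chat step is needed.
        { role: "user", content: prompt, images: [imageBase64] },
      ],
    }),
  });
  const data = await res.json();
  return data.message.content;
}
```

Chat history would just be passed as additional entries in `messages`, the same as a text-only chat.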

@andyccliao
Contributor Author

Any thoughts on what to do with the Vision prompt? It seems to be unnecessary.

I was thinking the default behavior would be to use the normal System prompt, and the Vision prompt could be hidden behind a checkbox, i.e. turned into an opt-in custom Vision prompt.
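
Something like this, where the checkbox just switches which prompt gets used (the config field names here are hypothetical, not Amica's actual settings keys):

```typescript
// Illustrative sketch of the proposed prompt selection; the config field
// names are hypothetical, not Amica's actual settings keys.
interface VisionPromptConfig {
  systemPrompt: string;           // the normal System prompt
  useCustomVisionPrompt: boolean; // the proposed checkbox
  visionPrompt: string;           // only used when the checkbox is on
}

function resolveVisionSystemPrompt(config: VisionPromptConfig): string {
  // Default behavior: reuse the normal System prompt for vision chats.
  return config.useCustomVisionPrompt ? config.visionPrompt : config.systemPrompt;
}
```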

@andyccliao
Contributor Author

andyccliao commented Jan 23, 2024

Also, which of the following was the desired UX for vision:

  1. Press the "Take Picture" button to append the picture to the next message, then send the message to get a response.
  2. Press the "Take Picture" button to send the picture and get a response, without sending any text to accompany the picture. (Still sending the rest of the chat history, just no new text.)
  3. Type something in the chatbox, then press the "Take Picture" button to send both the picture and message at the same time.

The way the UX works right now, it is closest to option 2, but it appends the vision model's response to the previous chat message. (Also, any text in the chatbox gets completely discarded.)

I think the best way to make Amica feel as close to normal chatting as possible is to allow both options 1 and 2 (see the sketch at the end of this comment). My own chat habits are often to send an image and then quickly type something, send an image alone, or attach images to my message before sending.

When I was trying to rewrite the vision pipeline, I ran into trouble deciding how it should be implemented, and I realized that it would depend on the UX. So, any opinions on this matter?
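
For what it's worth, here is a rough sketch of one way to reconcile options 1 and 2 (the state shape and function names are hypothetical, not taken from the current code):

```typescript
// Rough sketch of one way to reconcile options 1 and 2; the state shape
// and function names are hypothetical, not taken from the current code.
let pendingImage: string | null = null; // base64 image held for the next send

// `sendMessage` is assumed to attach the optional image to the outgoing
// user message (e.g. via Ollama's `images` field) along with chat history.
declare function sendMessage(text: string, imageBase64?: string): void;

function onTakePicture(imageBase64: string, chatboxText: string) {
  if (chatboxText.trim() === "") {
    // Option 2: nothing typed yet, so send the picture on its own right
    // away (the rest of the chat history still goes along).
    sendMessage("", imageBase64);
  } else {
    // Option 1: text is being composed, so hold the picture and attach it
    // to the message when the user presses send, instead of discarding
    // the chatbox contents.
    pendingImage = imageBase64;
  }
}

function onSend(chatboxText: string) {
  sendMessage(chatboxText, pendingImage ?? undefined);
  pendingImage = null;
}
```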

@slowsynapse
Collaborator

Arbius has a $200 AIUS bounty for this issue!

Brief: Complete the desired vision UX so that taking a picture plus typing text also gets a response. Rework the pipeline as outlined above.

Please read carefully:

To begin work on a bounty, reply by saying “I claim this bounty” - you will have 48 hours to submit your PR before someone else may attempt to claim this bounty.

To complete the bounty, within 48 hours of claiming, reply with a link to your PR referencing this issue and an Ethereum address. You must address reviewers' comments and have the PR merged to receive the bounty reward. Please focus on quality submissions to minimize the time reviewers must spend.
