
Need to rework Multimodal pipeline for ollama, maybe for other APIs as well #74

Open
andyccliao opened this issue Jan 7, 2024 · 3 comments

@andyccliao
Contributor

In reworking Ollama support for LLaVa, I found the pipeline for Multimodal chats to be unnecessary. Ollama only requires the image to be attached to the message being sent. The multimodal model takes care of the rest.

Currently, the vision chat pipeline seems designed for a separate vision/CLIP model paired with an LLM: one model describes the picture, and the returned description is then fed into the LLM.
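
For reference, here is a minimal sketch of what the single-call flow looks like against Ollama's `/api/chat` endpoint (the `images` field on a message is documented Ollama behavior; the helper name and model choice are just illustrative):

```typescript
// Minimal sketch: one request to Ollama's /api/chat with the image
// attached directly to the user message. The helper name `askWithImage`
// is illustrative, not part of Amica.
async function askWithImage(prompt: string, imageBase64: string): Promise<string> {
  const res = await fetch("http://localhost:11434/api/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "llava",
      stream: false,
      messages: [
        // The multimodal model reads the image from the message itself;
        // no separate vision/CLIP describe-then-chat step is needed.
        { role: "user", content: prompt, images: [imageBase64] },
      ],
    }),
  });
  const data = await res.json();
  return data.message.content;
}
```

Chat history would just be passed as additional entries in `messages`, the same as a text-only chat.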

@andyccliao
Contributor Author

Any thoughts on what to do with the Vision prompt? It seems to be unnecessary.

I was thinking the default behavior would be to use the normal System prompt, and the Vision prompt could be hidden behind a checkbox, i.e. turned into an opt-in custom Vision prompt.
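
Something like this, where the checkbox just switches which prompt gets used (the config field names here are hypothetical, not Amica's actual settings keys):

```typescript
// Illustrative sketch of the proposed prompt selection; the config field
// names are hypothetical, not Amica's actual settings keys.
interface VisionPromptConfig {
  systemPrompt: string;           // the normal System prompt
  useCustomVisionPrompt: boolean; // the proposed checkbox
  visionPrompt: string;           // only used when the checkbox is on
}

function resolveVisionSystemPrompt(config: VisionPromptConfig): string {
  // Default behavior: reuse the normal System prompt for vision chats.
  return config.useCustomVisionPrompt ? config.visionPrompt : config.systemPrompt;
}
```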

@andyccliao
Contributor Author

andyccliao commented Jan 23, 2024

Also, which of the following was the desired UX for vision:

  1. Press the "Take Picture" button to append the picture to the next message, then send the message to get a response.
  2. Press the "Take Picture" button to send the picture and get a response, without sending any text to accompany the picture. (Still sending the rest of the chat history, just no new text.)
  3. Type something in the chatbox, then press the "Take Picture" button to send both the picture and message at the same time.

The way the UX works right now, it is closest to option 2, but it appends the vision model's response to the previous chat message. (Also, any text in the chatbox gets completely discarded.)

I think the best way to make Amica feel as close to normal chatting as possible is to allow both options 1 and 2 (see the sketch at the end of this comment). My own chat habits are often to send an image and then quickly type something, send an image alone, or attach images to my message before sending.

When I was trying to rewrite the vision pipeline, I ran into trouble deciding how it should be implemented, and I realized that it would depend on the UX. So, any opinions on this matter?
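
For what it's worth, here is a rough sketch of one way to reconcile options 1 and 2 (the state shape and function names are hypothetical, not taken from the current code):

```typescript
// Rough sketch of one way to reconcile options 1 and 2; the state shape
// and function names are hypothetical, not taken from the current code.
let pendingImage: string | null = null; // base64 image held for the next send

// `sendMessage` is assumed to attach the optional image to the outgoing
// user message (e.g. via Ollama's `images` field) along with chat history.
declare function sendMessage(text: string, imageBase64?: string): void;

function onTakePicture(imageBase64: string, chatboxText: string) {
  if (chatboxText.trim() === "") {
    // Option 2: nothing typed yet, so send the picture on its own right
    // away (the rest of the chat history still goes along).
    sendMessage("", imageBase64);
  } else {
    // Option 1: text is being composed, so hold the picture and attach it
    // to the message when the user presses send, instead of discarding
    // the chatbox contents.
    pendingImage = imageBase64;
  }
}

function onSend(chatboxText: string) {
  sendMessage(chatboxText, pendingImage ?? undefined);
  pendingImage = null;
}
```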

@slowsynapse
Collaborator

Arbius has a $200 AIUS bounty for this issue!

Brief: Complete the desired vision UX so that taking a picture plus typing text also gets a response. Rework the pipeline as outlined above.

Please read carefully:

To begin work on a bounty, reply by saying “I claim this bounty” - you will have 48 hours to submit your PR before someone else may attempt to claim this bounty.

To complete the bounty, within 48 hours of claiming, reply with a link to your PR referencing this issue and an Ethereum address. You must address reviewers' comments and have the PR merged to receive the bounty reward. Please focus on quality submissions to minimize the time reviewers must spend.
