
Support for model multimodal #564

Open
Jhonnyr97 opened this issue Nov 13, 2023 · 8 comments
Labels
enhancement New feature or request

Comments

@Jhonnyr97
Contributor

Is your feature request related to a problem? Please describe.
I'm frustrated when I can't use multimodal models like "gpt-4-vision-preview" in Cheshire-cat-ai to process and retrieve information from images via the API. Additionally, the current vector database does not support retrieval over images.

Describe the solution you'd like
I would like to see support for multimodal models, specifically the "gpt-4-vision-preview" model, integrated into Cheshire-cat-ai. This integration should allow users to send images via the Cheshire-cat-ai API and receive responses or results based on both text and images.

Furthermore, I'd like to utilize the existing vector database to enable Cheshire-cat-ai to perform retrieval with images. This means users should be able to search for information within the database using both text and images as search keys.

This feature would significantly enhance Cheshire-cat-ai's capabilities, enabling better understanding and generation of multimodal content. It's particularly valuable in scenarios where information is presented in both text and image formats.
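For illustration, here is a minimal sketch of what such a request looks like against the OpenAI API directly (using the v1 Python SDK; the prompt and image URL are placeholders). The request here is essentially for the Cat to wrap something like this behind its own API:

```python
# A sketch, not Cheshire Cat code: a direct "gpt-4-vision-preview" request with
# the OpenAI Python SDK (v1.x). The prompt and image URL are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    max_tokens=300,
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is shown in this image?"},
                {"type": "image_url", "image_url": {"url": "https://example.com/photo.png"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```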

Describe alternatives you've considered
I've considered alternative solutions, but integrating multimodal models and image retrieval directly into Cheshire-cat-ai seems to be the most straightforward and effective approach. Other alternatives may require external tools or complex workarounds.

Additional context
No additional context at this time, but this feature would greatly enhance Cheshire-cat-ai's versatility and utility.

Jhonnyr97 added the enhancement (New feature or request) label on Nov 13, 2023
@nickprock
Contributor

Hi @Jhonnyr97, the multimodal cat is planned.
If you're able to help us with the development, you're welcome!

@Jhonnyr97
Contributor Author

Okay, where can I find the documentation for multimodality?

@nickprock
Contributor

For the time being I am putting together a list of links; as soon as I have discussed it with the other core devs I will share it in this issue.
Meanwhile, you can check whether LangChain's multimodal support lets you use the model you are interested in, and have a look at these wonderful plugins: artistic_cat and WhisperingCat.

@pieroit you can assign this issue to me.

@nickprock
Contributor

[attached image 20231124_061834.jpg: Multimodality flow by LlamaIndex]

@pieroit
Member

pieroit commented Nov 24, 2023

@nickprock we can set up an image embedder module like the text embedder we already have.

Not clear to me yet how to cross-index texts and images.
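One possible direction (a rough sketch only, assuming CLIP through sentence-transformers, not a proposal for the actual module layout): CLIP embeds images and text into the same vector space, which is also one possible answer to the cross-indexing question, since a single collection could then serve both modalities.

```python
# Rough sketch of an image embedder, assuming CLIP via sentence-transformers.
# CLIP maps images and text into the same 512-dim space, so one collection
# could index both modalities.
from PIL import Image
from sentence_transformers import SentenceTransformer

clip = SentenceTransformer("clip-ViT-B-32")

image_vector = clip.encode(Image.open("photo.jpg"))      # embedding of an image
text_vector = clip.encode("a photo of a cat on a sofa")  # embedding of a caption, same space
```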

@nickprock
Contributor

@pieroit the image is a placeholder for me 😅 I promise you that I will arrive at the multimodality meeting after studying the problem.

@nicola-corbellini
Member

nicola-corbellini commented Nov 24, 2023

Here it seems they embed with two separate models (CLIP and Ada) into two different collections, and then retrieve from each using the query embedded with both models, don't they?
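A rough sketch of that dual-collection idea, assuming Qdrant collections named `images_clip` and `texts_ada` (the collection names and embedder choices are illustrative, not the actual Cat schema):

```python
# Rough sketch of the two-collection idea: CLIP image vectors in one Qdrant
# collection, Ada text vectors in another, and the same query embedded twice.
from langchain.embeddings import OpenAIEmbeddings
from qdrant_client import QdrantClient
from sentence_transformers import SentenceTransformer

client = QdrantClient(host="localhost", port=6333)
clip = SentenceTransformer("clip-ViT-B-32")              # shared text/image space
ada = OpenAIEmbeddings(model="text-embedding-ada-002")

query = "a diagram of the multimodal flow"

image_hits = client.search(
    collection_name="images_clip",                       # assumed collection of CLIP image vectors
    query_vector=clip.encode(query).tolist(),
    limit=5,
)
text_hits = client.search(
    collection_name="texts_ada",                         # assumed collection of Ada text vectors
    query_vector=ada.embed_query(query),
    limit=5,
)
```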

@nickprock
Contributor

Yes, I need to check the Qdrant docs for multimodal storage and retrieval.
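For what it's worth, Qdrant also supports multiple named vectors per point, so a single collection could hold a text vector and an image vector for the same document. A sketch under that assumption (the collection name, vector sizes, and payload are illustrative, not a settled schema):

```python
# Sketch of a single-collection alternative, assuming Qdrant named vectors:
# each point carries both a text vector and an image vector.
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

client = QdrantClient(host="localhost", port=6333)

client.recreate_collection(
    collection_name="multimodal_memory",
    vectors_config={
        "text": VectorParams(size=1536, distance=Distance.COSINE),   # e.g. Ada
        "image": VectorParams(size=512, distance=Distance.COSINE),   # e.g. CLIP
    },
)

client.upsert(
    collection_name="multimodal_memory",
    points=[
        PointStruct(
            id=1,
            vector={"text": [0.0] * 1536, "image": [0.0] * 512},     # real embeddings in practice
            payload={"source": "docs/page_3.png"},
        )
    ],
)

# query one named vector at a time
hits = client.search(
    collection_name="multimodal_memory",
    query_vector=("image", [0.0] * 512),
    limit=5,
)
```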

Projects
Status: 📋 Backlog