
Support for model multimodal #564

Open
Jhonnyr97 opened this issue Nov 13, 2023 · 8 comments
Labels
enhancement New feature or request

Comments

@Jhonnyr97
Contributor

Is your feature request related to a problem? Please describe.
I'm frustrated when I can't use multimodal models like "gpt-4-vision-preview" in Cheshire-cat-ai to process and retrieve information from images via the API. Additionally, the current vector database does not support retrieval over images.

Describe the solution you'd like
I would like to see support for multimodal models, specifically the "gpt-4-vision-preview" model, integrated into Cheshire-cat-ai. This integration should allow users to send images via the Cheshire-cat-ai API and receive responses or results based on both text and images.

Furthermore, I'd like to utilize the existing vector database to enable Cheshire-cat-ai to perform retrieval with images. This means users should be able to search for information within the database using both text and images as search keys.

This feature would significantly enhance Cheshire-cat-ai's capabilities, enabling better understanding and generation of multimodal content. It's particularly valuable in scenarios where information is presented in both text and image formats.
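For illustration, here is a minimal sketch of what such a request looks like against the OpenAI API directly (using the v1 Python SDK; the prompt and image URL are placeholders). The request here is essentially for the Cat to wrap something like this behind its own API:

```python
# A sketch, not Cheshire Cat code: a direct "gpt-4-vision-preview" request with
# the OpenAI Python SDK (v1.x). The prompt and image URL are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    max_tokens=300,
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is shown in this image?"},
                {"type": "image_url", "image_url": {"url": "https://example.com/photo.png"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```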

Describe alternatives you've considered
I've considered alternative solutions, but integrating multimodal models and image retrieval directly into Cheshire-cat-ai seems to be the most straightforward and effective approach. Other alternatives may require external tools or complex workarounds.

Additional context
No additional context at this time, but this feature would greatly enhance Cheshire-cat-ai's versatility and utility.

Jhonnyr97 added the enhancement (New feature or request) label on Nov 13, 2023
@nickprock
Contributor

Hi @Jhonnyr97, the multimodal cat is planned.
If you're able to help us with the development, you're welcome!

@Jhonnyr97
Contributor Author

Okay, where can I find the documentation for multimodality?

@nickprock
Contributor

For the time being I am putting together a list of links; as soon as I have discussed it with the other core devs I will share it in this issue.
Meanwhile, you can check whether LangChain's multimodal support lets you use the model you are interested in, and have a look at these wonderful plugins: artistic_cat and WhisperingCat.

@pieroit you can assign this issue to me.

@nickprock
Contributor

[attached image 20231124_061834.jpg: Multimodality flow by LlamaIndex]

@pieroit
Member

pieroit commented Nov 24, 2023

@nickprock we can set up an image embedder module like the text embedder we already have.

Not clear to me yet how to cross-index texts and images.
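One possible direction (a rough sketch only, assuming CLIP through sentence-transformers, not a proposal for the actual module layout): CLIP embeds images and text into the same vector space, which is also one possible answer to the cross-indexing question, since a single collection could then serve both modalities.

```python
# Rough sketch of an image embedder, assuming CLIP via sentence-transformers.
# CLIP maps images and text into the same 512-dim space, so one collection
# could index both modalities.
from PIL import Image
from sentence_transformers import SentenceTransformer

clip = SentenceTransformer("clip-ViT-B-32")

image_vector = clip.encode(Image.open("photo.jpg"))      # embedding of an image
text_vector = clip.encode("a photo of a cat on a sofa")  # embedding of a caption, same space
```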

@nickprock
Contributor

@pieroit the image is a placeholder for me 😅 I promise you that I will arrive at the multimodality meeting after studying the problem.

@nicola-corbellini
Member

nicola-corbellini commented Nov 24, 2023

Here it seems they embed with two separate models (CLIP and Ada) into two different collections, and then retrieve from each using the query embedded with both models, don't they?
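A rough sketch of that dual-collection idea, assuming Qdrant collections named `images_clip` and `texts_ada` (the collection names and embedder choices are illustrative, not the actual Cat schema):

```python
# Rough sketch of the two-collection idea: CLIP image vectors in one Qdrant
# collection, Ada text vectors in another, and the same query embedded twice.
from langchain.embeddings import OpenAIEmbeddings
from qdrant_client import QdrantClient
from sentence_transformers import SentenceTransformer

client = QdrantClient(host="localhost", port=6333)
clip = SentenceTransformer("clip-ViT-B-32")              # shared text/image space
ada = OpenAIEmbeddings(model="text-embedding-ada-002")

query = "a diagram of the multimodal flow"

image_hits = client.search(
    collection_name="images_clip",                       # assumed collection of CLIP image vectors
    query_vector=clip.encode(query).tolist(),
    limit=5,
)
text_hits = client.search(
    collection_name="texts_ada",                         # assumed collection of Ada text vectors
    query_vector=ada.embed_query(query),
    limit=5,
)
```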

@nickprock
Contributor

Yes, I need to check the Qdrant docs for multimodal storage and retrieval.
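For what it's worth, Qdrant also supports multiple named vectors per point, so a single collection could hold a text vector and an image vector for the same document. A sketch under that assumption (the collection name, vector sizes, and payload are illustrative, not a settled schema):

```python
# Sketch of a single-collection alternative, assuming Qdrant named vectors:
# each point carries both a text vector and an image vector.
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

client = QdrantClient(host="localhost", port=6333)

client.recreate_collection(
    collection_name="multimodal_memory",
    vectors_config={
        "text": VectorParams(size=1536, distance=Distance.COSINE),   # e.g. Ada
        "image": VectorParams(size=512, distance=Distance.COSINE),   # e.g. CLIP
    },
)

client.upsert(
    collection_name="multimodal_memory",
    points=[
        PointStruct(
            id=1,
            vector={"text": [0.0] * 1536, "image": [0.0] * 512},     # real embeddings in practice
            payload={"source": "docs/page_3.png"},
        )
    ],
)

# query one named vector at a time
hits = client.search(
    collection_name="multimodal_memory",
    query_vector=("image", [0.0] * 512),
    limit=5,
)
```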

Projects
Status: 📋 Backlog