Support for multimodal models #564
Comments
Hi @Jhonnyr97, the multimodal cat is planned.
Okay, where can I find the documentation for multimodal?
For the time being I am putting together a list of links; as soon as I have discussed it with the other core devs I will share it in this issue. @pieroit you can assign this issue to me.
@nickprock we can set up an image embedder module like the text embedder we already have. It is not clear to me yet how to cross-index texts and images (see the sketch below).
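A minimal sketch of what such an image embedder module could look like, assuming the sentence-transformers CLIP checkpoint `clip-ViT-B-32` (which maps both images and text into the same vector space). The class and method names are hypothetical, not the Cat's actual embedder interface:

```python
# Hypothetical sketch: an image embedder analogous to the existing text embedder.
# Assumes sentence-transformers with the CLIP checkpoint "clip-ViT-B-32",
# which embeds both images and text into a shared 512-dimensional space.
from PIL import Image
from sentence_transformers import SentenceTransformer


class ClipImageEmbedder:
    def __init__(self, model_name: str = "clip-ViT-B-32"):
        self.model = SentenceTransformer(model_name)

    def embed_image(self, path: str) -> list[float]:
        # CLIP accepts PIL images directly via SentenceTransformer.encode
        return self.model.encode(Image.open(path)).tolist()

    def embed_text(self, text: str) -> list[float]:
        # The same model embeds text, so text queries can match stored images
        return self.model.encode(text).tolist()
```

Because images and text land in the same space, cross-indexing could reduce to storing both kinds of vectors where a single query vector can reach them.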
@pieroit the image is a placeholder for me 😅 I promise I will arrive at the multimodality meeting having studied the problem.
Here it seems they embed with two separate models (CLIP and Ada) into two different collections, and then retrieve from each one using the doubly embedded query, don't they?
Yes, I need to check the Qdrant docs for multimodal storage and retrieval; a rough sketch of the two-collection idea is below.
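A minimal sketch of the two-collection approach discussed above, assuming qdrant-client and the sentence-transformers CLIP model. The collection names, vector sizes, and the stand-in text embedder (used here instead of Ada) are illustrative assumptions, not the Cat's actual setup:

```python
# Hypothetical sketch: store texts and images in two Qdrant collections and
# query both with the same user question.
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct
from sentence_transformers import SentenceTransformer

client = QdrantClient(":memory:")  # in-memory instance for illustration
clip = SentenceTransformer("clip-ViT-B-32")           # 512-dim, embeds text and images
text_model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dim, text only (stand-in for Ada)

client.recreate_collection("texts", vectors_config=VectorParams(size=384, distance=Distance.COSINE))
client.recreate_collection("images", vectors_config=VectorParams(size=512, distance=Distance.COSINE))

# Index a text chunk; image points would be upserted the same way,
# with vectors coming from the CLIP image embedder sketched earlier.
client.upsert("texts", points=[
    PointStruct(id=1, vector=text_model.encode("a cat on a sofa").tolist(),
                payload={"text": "a cat on a sofa"}),
])

# Retrieval: embed the query once per model and search each collection separately.
query = "cat sitting on furniture"
text_hits = client.search("texts", query_vector=text_model.encode(query).tolist(), limit=3)
image_hits = client.search("images", query_vector=clip.encode(query).tolist(), limit=3)
```

The open question would then be how to merge and rank results coming from the two collections, since the two models produce scores on different scales.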
@nicola-corbellini as discussed in the dev meeting, I paste here some links as a placeholder:
https://docs.llamaindex.ai/en/stable/examples/multi_modal/gpt4v_multi_modal_retrieval/
https://medium.aiplanet.com/multimodal-rag-using-llamaindex-gemini-and-qdrant-f52c5b68b367
https://qdrant.tech/documentation/examples/aleph-alpha-search/
Thank you, I'll try to take a look in the next few days.
Is your feature request related to a problem? Please describe.
I'm frustrated when I can't use multimodal models like "gpt-4-vision-preview" in Cheshire-cat-ai to process and retrieve information from images via the API. Additionally, the current vector database should support image retrieval.
Describe the solution you'd like
I would like to see support for multimodal models, specifically the "gpt-4-vision-preview" model, integrated into Cheshire-cat-ai. This integration should allow users to send images via the Cheshire-cat-ai API and receive responses or results based on both text and images.
Furthermore, I'd like the existing vector database to be used so that Cheshire-cat-ai can perform retrieval over images. Users should be able to search the database using both text and images as search keys (a sketch of the image-as-key case follows below).
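To illustrate the request, here is a minimal sketch of using an image as the search key against the hypothetical CLIP-backed "images" collection from the comments above; the connection details and collection name are assumptions and nothing here reflects the Cat's actual API:

```python
# Hypothetical: search the "images" collection using an image as the query key.
from PIL import Image
from qdrant_client import QdrantClient
from sentence_transformers import SentenceTransformer

client = QdrantClient("localhost", port=6333)  # assumed local Qdrant instance
clip = SentenceTransformer("clip-ViT-B-32")

# Embed the query image with CLIP and look up the nearest stored vectors.
query_vector = clip.encode(Image.open("query.jpg")).tolist()
hits = client.search("images", query_vector=query_vector, limit=5)
for hit in hits:
    print(hit.score, hit.payload)
```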
This feature would significantly enhance Cheshire-cat-ai's capabilities, enabling better understanding and generation of multimodal content. It's particularly valuable in scenarios where information is presented in both text and image formats.
Describe alternatives you've considered
I've considered alternative solutions, but integrating multimodal models and image retrieval directly into Cheshire-cat-ai seems to be the most straightforward and effective approach. Other alternatives may require external tools or complex workarounds.
Additional context
No additional context at this time, but this feature would greatly enhance Cheshire-cat-ai's versatility and utility.