Is it possible to search at the Document-level (instead of Page-level)? #125
Hey Steve - you do get document details when you search. So you can run your search with a relatively high k (5-10, for example) and, from there, take the document with the most pages among the results. For example, given the query "What is the work from home policy?", the most relevant page might come from the document with id=1, but the next 4 pages are from the document with id=2; then that is your most relevant document. Here is some sample code in Python:
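A minimal sketch of that approach, assuming page-level search results that each carry a `document_id` field; the `client.search(...)` call and the field names are illustrative placeholders, not the actual ColiVara SDK API:

```python
from collections import Counter

def most_relevant_document(results):
    """Pick the document that contributes the most pages to the top-k results.

    `results` is assumed to be a list of page-level hits, each exposing a
    `document_id` attribute (an assumed field name for illustration).
    """
    counts = Counter(r.document_id for r in results)
    doc_id, _ = counts.most_common(1)[0]
    return doc_id

# Usage sketch: search with a relatively high k, then aggregate by document.
# results = client.search(query="What is the work from home policy?", top_k=10)
# best_doc = most_relevant_document(results)
```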
This way you have some flexibility and can filter further on your end. I do think this is a useful feature, and we will likely put this code in our SDKs at some point.
Is there a need to rerank the responses with a reranking model (e.g. Cohere)?
No - we benchmarked reranking and it is worse: you increase latency (because you are pulling the vectors from storage twice) without any improvement in accuracy. https://blog.colivara.com/from-cosine-to-dot-benchmarking-similarity-methods-for-speed-and-precision
Hi @Jonathan-Adly, thank you for the detailed response. For reference, I also asked this question in the ColPali repo since it seemed like a more general request.
As a heuristic / workaround, I think this is very reasonable. However, it still doesn't guarantee we get the most relevant document. Consider the extreme case where page 1 of a document has the highest similarity score in the corpus to the first token embedding of a query's multi-vector embedding but very low similarity to every other query token, page 2 of the same document has the highest similarity to the second query token but very low similarity to the rest, and so on. In this case, even with k=10, we might not retrieve any of these pages (because each page's MaxSim over the entire query is low), even though each page has high similarity to a single query token and, together, they would maximize similarity over the query. This is why I think it might be nice to have a "document mode" where the page-level embeddings are concatenated before the MaxSim operation is run. I understand that this may not be a priority for you (since most pages in the domain you are working in may already contain most of the information needed to match them to a query), but hopefully this at least makes sense?
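A rough sketch of what that "document mode" could look like if computed client-side from raw multi-vector embeddings; the function names, shapes, and the use of PyTorch are assumptions for illustration, not an existing ColiVara or ColPali API:

```python
import torch

def maxsim(query_emb: torch.Tensor, passage_emb: torch.Tensor) -> float:
    """Late-interaction MaxSim: for each query token, take the best-matching
    passage token, then sum over query tokens.

    query_emb:   (num_query_tokens, dim)
    passage_emb: (num_passage_tokens, dim)
    """
    sims = query_emb @ passage_emb.T           # (query_tokens, passage_tokens)
    return sims.max(dim=1).values.sum().item()

def page_level_score(query_emb, page_embs):
    """Standard behaviour: score each page separately, keep the best page."""
    return max(maxsim(query_emb, p) for p in page_embs)

def document_level_score(query_emb, page_embs):
    """Hypothetical "document mode": concatenate all page embeddings of a
    document before MaxSim, so different query tokens can match different pages."""
    doc_emb = torch.cat(page_embs, dim=0)      # (total page tokens, dim)
    return maxsim(query_emb, doc_emb)
```

In the extreme case described above, `page_level_score` stays low for every page while `document_level_score` is high, which is exactly the gap the "document mode" is meant to close.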
Makes sense - I think the best, fool-proof way to implement this is at the model level. From our side, it will be either this workaround or another workaround.
Hi @Jonathan-Adly - sorry, I didn't quite follow your last comment. Could you expand on what you mean by "at the model level"? Is there an external API e.g. in
The best way to support this would be to change the ColPali implementation, as it is trained on pages - basically what Manu from colpali said. MaxSim happens at the page level. So all other solutions would be variations on the code above and would fail some edge cases. For example, instead of counting pages, you can aggregate the scores of all pages under each document and pick the highest-scoring document. But that is also a workaround, with some limitations.
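A sketch of that aggregation variant, again assuming page-level results that expose `document_id` and `score` attributes (illustrative field names, not a guaranteed SDK shape):

```python
from collections import defaultdict

def best_document_by_aggregate_score(results):
    """Sum page scores per document and pick the document with the highest total.

    Each result is assumed to expose `document_id` and `score`; every page in
    the top-k contributes its page-level MaxSim score to its document's total.
    """
    totals = defaultdict(float)
    for r in results:
        totals[r.document_id] += r.score
    return max(totals, key=totals.get)
```

The limitation mentioned above still applies: pages that never reach the top-k contribute nothing, so a document whose relevance is spread thinly across many pages can still be missed.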
Thanks. As I've learned more about this, I think approaches like Late Chunking (already being discussed in the colpali repo in reference to this excellent Jina AI overview) are looking to address the issue I've raised here, and they seem promising. Makes sense that from your end you'll wait to see what shakes out upstream.
Hello! The API docs for Search state that it allows you to:
and the API response is a list of pages. I would like to run the MaxSim operation across all embeddings from all pages in a document, and return the list of most similar documents for the query.
The scenario I'm envisioning is one in which multiple pages, together, provide important context that's missing from any single page. And so the most relevant page may not belong to the most relevant document.
Thanks!