Is it possible to search at the Document-level (instead of Page-level)? #125

Closed
steve-marmalade opened this issue Dec 27, 2024 · 8 comments

Comments

@steve-marmalade

Hello! The API docs for Search state that it allows you to:

Search for pages similar to a given query.

and the API response is a list of pages. I would like to run the MaxSim operation across all embeddings from all pages in a document, and return a list of the most similar documents for the query.

The scenario I'm envisioning is one in which multiple pages, together, provide important context that's missing from any single page. And so the most relevant page may not belong to the most relevant document.

Thanks!

@Jonathan-Adly
Contributor

Hey Steve -

You do get document details when you search. So you can run your search with a relatively high k (5-10, for example) and, from there, pick the document with the most pages in the results.

For example, given a query: What is the work from home policy?

The most relevant page might come from the document with id=1, but if the next 4 pages all come from the document with id=2, then that is your most relevant document.

Here is some sample code in Python:

from collections import Counter

# Sample API response
response = {
    "query": "text",
    "results": [
        {
            "collection_name": "text",
            "collection_id": 0,
            "document_name": "Document A",
            "document_id": 1,
            "page_number": 0,
            "raw_score": 0,
            "normalized_score": 0,
            "img_base64": "text"
        },
        {
            "collection_name": "text",
            "collection_id": 0,
            "document_name": "Document B",
            "document_id": 2,
            "page_number": 0,
            "raw_score": 0,
            "normalized_score": 0,
            "img_base64": "text"
        },
        {
            "collection_name": "text",
            "collection_id": 0,
            "document_name": "Document A",
            "document_id": 1,
            "page_number": 0,
            "raw_score": 0,
            "normalized_score": 0,
            "img_base64": "text"
        }
    ]
}

# Extract document_id and document_name pairs
document_pairs = [(result["document_id"], result["document_name"]) for result in response["results"]]

# Count occurrences of each pair
counter = Counter(document_pairs)

# Find the most common pair
most_common_pair, count = counter.most_common(1)[0]

document_id, document_name = most_common_pair

print(f"Most common document_id: {document_id}, document_name: {document_name} (repeated {count} times)")

This way you have some flexibility and can filter further on your end. I do think this is a useful feature, and we will likely put this code in our SDKs at some point.

@simjak

simjak commented Dec 29, 2024

Is there a need to rerank the responses with a Reranking model (e.g. Cohere)?

@Jonathan-Adly
Contributor

No - we benchmarked reranking and it is worse. You increase latency (because you are pulling the vectors from storage twice) without any improvement in accuracy.

https://blog.colivara.com/from-cosine-to-dot-benchmarking-similarity-methods-for-speed-and-precision

@steve-marmalade
Author

steve-marmalade commented Jan 7, 2025

Hi @Jonathan-Adly , thank you for the detailed response. For reference, I also asked this question in ColPali since it seemed like a more general request.

You do get document details when you search. So, you can do your search with relatively high k (5-10 for example) - and then - from there, get the document with most pages.

As a heuristic / workaround, I think this is very reasonable. However, it still doesn't guarantee we get the most relevant document. Consider the extreme case where page 1 of a document has the highest similarity score in the corpus to the first embedding of the query's multi-vector but very low similarity to all of the other query embeddings; page 2 of the same document has the highest similarity score in the corpus to the second query embedding but very low similarity to all of the others; and so on. In this case, even with k=10, we might not retrieve any of these pages (because each page's MaxSim over the entire query is low), even though each has high similarity to a single query embedding and, together, they would maximize similarity over the query.

This is why I think it might be nice to have a "document mode" where the page-level embeddings are concatenated before the maxsim operation is run.
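To make that concrete, here is a minimal sketch (NumPy with made-up shapes, not ColiVara's API) of the ColBERT-style MaxSim computed per page versus over a document's concatenated page embeddings:

import numpy as np

def maxsim(query_emb, page_emb):
    # For each query token embedding, take its best-matching patch, then sum.
    sims = query_emb @ page_emb.T              # (n_query_tokens, n_patches)
    return float(sims.max(axis=1).sum())

rng = np.random.default_rng(0)
query = rng.normal(size=(8, 128))                          # query multi-vector
pages = [rng.normal(size=(1024, 128)) for _ in range(3)]   # one document, 3 pages

# Page-level scoring (what Search does today): each page is scored on its own.
page_scores = [maxsim(query, p) for p in pages]

# "Document mode": concatenate all page embeddings so each query token can
# match its best patch anywhere in the document.
doc_emb = np.concatenate(pages, axis=0)                    # (3 * 1024, 128)
doc_score = maxsim(query, doc_emb)

The document-level score is always at least the best page-level score, and the gap between the two is exactly the edge case described above.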

I understand that this may not be a priority for you (since maybe most pages in the domain you are working in contain most of the information needed to match them to a query), but hopefully this at least makes sense?

@Jonathan-Adly
Contributor

Makes sense - I think the only foolproof way to implement this is at the model level. From our side, it will be either this workaround or another one.

@steve-marmalade
Author

Hi @Jonathan-Adly - sorry, I didn't quite follow your last comment. Could you expand on what you mean by "at the model level"? Is there an external API (e.g. in colpali_engine) where you'd expect support for this to be added?

@Jonathan-Adly
Contributor

The best way to support this would be to change the ColPali implementation itself, since the model is trained on pages - basically what Manu from colpali said. MaxSim happens at the page level, so all other solutions would be variations on the code above and would fail some edge cases.

For example, instead of counting pages, you can aggregate the scores of all pages under each document and pick the highest-scoring document. But this is also a workaround, with some limitations.
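A minimal sketch of that variation, assuming the same response shape as the sample above (summing normalized_score per document; mean or max would be reasonable alternatives):

from collections import defaultdict

def best_document_by_score(results):
    # `results` is expected to look like response["results"] in the earlier snippet.
    doc_scores = defaultdict(float)
    doc_names = {}
    for result in results:
        doc_scores[result["document_id"]] += result["normalized_score"]
        doc_names[result["document_id"]] = result["document_name"]
    best_id = max(doc_scores, key=doc_scores.get)
    return best_id, doc_names[best_id], doc_scores[best_id]

# Usage with the sample response from earlier:
# doc_id, doc_name, score = best_document_by_score(response["results"])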

@steve-marmalade
Author

Thanks.

As I've learned more about this, I think that approaches like Late Chunking (already being discussed in colpali, in reference to this excellent Jina AI overview) are looking to address the issue I've raised here, and they seem promising.

Makes sense that from your end you'll wait to see what shakes out upstream.
