Is it possible to search at the Document-level (instead of Page-level)? #125

Closed
steve-marmalade opened this issue Dec 27, 2024 · 8 comments

Comments

@steve-marmalade

Hello! The API docs for Search state that it allows you to:

Search for pages similar to a given query.

and the API response is a list of pages. I would like to run the MaxSim operation across all embeddings from all pages in a document, and return a list of the most similar documents for the query.

The scenario I'm envisioning is one in which multiple pages, together, provide important context that's missing from any single page. And so the most relevant page may not belong to the most relevant document.

Thanks!

@Jonathan-Adly
Contributor

Hey Steve -

You do get document details when you search. So you can run your search with a relatively high k (5-10, for example) and, from there, pick the document with the most pages in the results.

For example, given a query: What is the work from home policy?

The most relevant page might come from the document with id=1, but if the next 4 pages all come from the document with id=2, then that is your most relevant document.

Here is some sample code in Python:

from collections import Counter

# Sample API response
response = {
    "query": "text",
    "results": [
        {
            "collection_name": "text",
            "collection_id": 0,
            "document_name": "Document A",
            "document_id": 1,
            "page_number": 0,
            "raw_score": 0,
            "normalized_score": 0,
            "img_base64": "text"
        },
        {
            "collection_name": "text",
            "collection_id": 0,
            "document_name": "Document B",
            "document_id": 2,
            "page_number": 0,
            "raw_score": 0,
            "normalized_score": 0,
            "img_base64": "text"
        },
        {
            "collection_name": "text",
            "collection_id": 0,
            "document_name": "Document A",
            "document_id": 1,
            "page_number": 0,
            "raw_score": 0,
            "normalized_score": 0,
            "img_base64": "text"
        }
    ]
}

# Extract document_id and document_name pairs
document_pairs = [(result["document_id"], result["document_name"]) for result in response["results"]]

# Count occurrences of each pair
counter = Counter(document_pairs)

# Find the most common pair
most_common_pair, count = counter.most_common(1)[0]

document_id, document_name = most_common_pair

print(f"Most common document_id: {document_id}, document_name: {document_name} (repeated {count} times)")

This way you have some flexibility and can filter further on your end. I do think this is a useful feature, and we will likely put this code in our SDKs at some point.

@simjak

simjak commented Dec 29, 2024

Is there a need to rerank the responses with a Reranking model (e.g. Cohere)?

@Jonathan-Adly
Contributor

No - we benchmarked reranking and it is worse. You increase latency (because you are pulling the vectors from storage twice) without any improvement in accuracy.

https://blog.colivara.com/from-cosine-to-dot-benchmarking-similarity-methods-for-speed-and-precision

@steve-marmalade
Author

steve-marmalade commented Jan 7, 2025

Hi @Jonathan-Adly , thank you for the detailed response. For reference, I also asked this question in ColPali since it seemed like a more general request.

You do get document details when you search. So, you can do your search with relatively high k (5-10 for example) - and then - from there, get the document with most pages.

As a heuristic / workaround, I think this is very reasonable. However, it still doesn't guarantee we get the most relevant document. Consider the extreme case where page 1 of a document has the highest similarity score in the corpus to the first embedding of the query's multi-vector but very low similarity to all of the other query embeddings; page 2 of the same document has the highest similarity score in the corpus to the second query embedding but very low similarity to all of the others; and so on. In this case, even with k=10, we might not retrieve any of these pages (because each page's MaxSim over the entire query is low), even though each has high similarity to a single query embedding and, together, they would maximize similarity over the query.

This is why I think it might be nice to have a "document mode" where the page-level embeddings are concatenated before the maxsim operation is run.
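To make that concrete, here is a minimal sketch (NumPy with made-up shapes, not ColiVara's API) of the ColBERT-style MaxSim computed per page versus over a document's concatenated page embeddings:

import numpy as np

def maxsim(query_emb, page_emb):
    # For each query token embedding, take its best-matching patch, then sum.
    sims = query_emb @ page_emb.T              # (n_query_tokens, n_patches)
    return float(sims.max(axis=1).sum())

rng = np.random.default_rng(0)
query = rng.normal(size=(8, 128))                          # query multi-vector
pages = [rng.normal(size=(1024, 128)) for _ in range(3)]   # one document, 3 pages

# Page-level scoring (what Search does today): each page is scored on its own.
page_scores = [maxsim(query, p) for p in pages]

# "Document mode": concatenate all page embeddings so each query token can
# match its best patch anywhere in the document.
doc_emb = np.concatenate(pages, axis=0)                    # (3 * 1024, 128)
doc_score = maxsim(query, doc_emb)

The document-level score is always at least the best page-level score, and the gap between the two is exactly the edge case described above.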

I understand that this may not be a priority for you (since maybe most pages in the domain you are working in contain most of the information needed to match them to a query), but hopefully this at least makes sense?

@Jonathan-Adly
Contributor

Makes sense - I think the only foolproof way to implement this is at the model level. From our side, it will be either this workaround or another one.

@steve-marmalade
Author

Hi @Jonathan-Adly - sorry, I didn't quite follow your last comment. Could you expand on what you mean by "at the model level"? Is there an external API (e.g. in colpali_engine) where you'd expect support for this to be added?

@Jonathan-Adly
Contributor

The best way to support this would be to change the ColPali implementation itself, since the model is trained on pages - basically what Manu from colpali said. MaxSim happens at the page level, so all other solutions would be variations on the code above and would fail some edge cases.

For example, instead of counting pages, you can aggregate the scores of all pages under each document and pick the highest-scoring document. But this is also a workaround, with some limitations.
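A minimal sketch of that variation, assuming the same response shape as the sample above (summing normalized_score per document; mean or max would be reasonable alternatives):

from collections import defaultdict

def best_document_by_score(results):
    # `results` is expected to look like response["results"] in the earlier snippet.
    doc_scores = defaultdict(float)
    doc_names = {}
    for result in results:
        doc_scores[result["document_id"]] += result["normalized_score"]
        doc_names[result["document_id"]] = result["document_name"]
    best_id = max(doc_scores, key=doc_scores.get)
    return best_id, doc_names[best_id], doc_scores[best_id]

# Usage with the sample response from earlier:
# doc_id, doc_name, score = best_document_by_score(response["results"])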

@steve-marmalade
Author

Thanks.

As I've learned more about this, I think that approaches like Late Chunking (already being discussed in colpali, in reference to this excellent Jina AI overview) are looking to address the issue I've raised here, and they seem promising.

Makes sense that from your end you'll wait to see what shakes out upstream.
