Index statistics for classical Information Retrieval features #32002

Open
Plenitude-ai opened this issue Jul 24, 2024 · 0 comments
Plenitude-ai commented Jul 24, 2024

Is your feature request related to a problem? Please describe.
The problem arose when I wanted to add classical Information Retrieval query features. Most of them are described in the following paper: Query Performance Prediction.
They require a few key statistics about the index:

  • ql: query length (number of words in the query), which is already implemented with queryTermCount
  • N: the number of documents in the whole collection
  • Nt: the number of documents in which the query term t appears
  • term_coll: the number of terms in the whole collection
  • tf_coll: the number of occurrences of a query term in the whole collection

Describe the solution you'd like
I'd like these key index statistics to be exposed, since none of them except ql is currently available in Vespa.
They could then be used to build our Information Retrieval feature functions.
As Jon Bratseth explained in a dedicated Slack thread, Vespa follows the architecture of a positional index, so the implementation could be fairly straightforward:

  • N: the number of documents in the whole collection -> sum of the posting list lengths
  • Nt: the number of documents in which the query term t appears -> could be the length of the posting list for t
  • term_coll: the number of terms in the whole collection -> number of posting lists (size of the dictionary)
  • tf_coll: the number of occurrences of a query term t in the whole collection -> since it's a positional index, we can sum the number of occurrences in each document of t's posting list
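
As a rough illustration of how such statistics fall out of a positional index, here is a minimal Python sketch over a toy in-memory index. It is not Vespa's implementation: the whitespace tokenizer, the `term -> {doc_id: [positions]}` layout, and all function names are assumptions made for the example. Note that in this toy index the sum of the posting list lengths counts (term, document) pairs, so N is taken as the number of distinct document ids instead, and the total number of term occurrences is kept separate from the dictionary size.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map term -> {doc_id: [positions]} using a naive whitespace tokenizer."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def index_statistics(index):
    """Collection-level statistics derived only from the postings."""
    doc_ids = set()
    total_occurrences = 0
    for postings in index.values():
        doc_ids.update(postings)  # adds the doc ids of this posting list
        total_occurrences += sum(len(positions) for positions in postings.values())
    return {
        # Documents with no indexed term would be missed by this count.
        "N": len(doc_ids),
        "dictionary_size": len(index),   # number of posting lists
        "term_coll": total_occurrences,  # total term occurrences
    }


def term_statistics(index, term):
    """Per-term statistics: document frequency and collection frequency."""
    postings = index.get(term, {})
    return {
        "Nt": len(postings),
        "tf_coll": sum(len(positions) for positions in postings.values()),
    }


docs = {
    "d1": "the quick brown fox",
    "d2": "the lazy dog",
    "d3": "the fox jumps over the dog",
}
index = build_positional_index(docs)
print(index_statistics(index))
print(term_statistics(index, "the"))
```

In a real engine these counts would more likely be maintained incrementally or read from dictionary metadata rather than recomputed by scanning every posting list.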

For the term-related statistics, I think exposing them the same way as term(n).significance would be nice.
We could for example add:

  • term(n).document_count
  • term(n).collection_count

For the collection-related statistics, I guess query features might not be the best place, but I don't know Vespa well enough to propose a better one. They could, however, be named like:

  • index.document_count
  • index.term_count OR dictionary.size
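
To show how these statistics could feed a feature function, here is a small Python sketch of two common pre-retrieval predictors, averaged IDF and averaged ICTF. These are not necessarily the features this issue has in mind. The function arguments stand in for the proposed (not yet existing) features `index.document_count`, `term(n).document_count`, `index.term_count` and `term(n).collection_count`, and `term_count` is assumed here to mean the total number of term occurrences in the collection (term_coll), not the dictionary size.

```python
import math


def avg_idf(document_count, term_document_counts):
    """Mean of log(N / Nt) over the query terms that occur in the index.

    document_count       -- N, hypothetically index.document_count
    term_document_counts -- Nt per query term, hypothetically term(n).document_count
    """
    idfs = [math.log(document_count / nt) for nt in term_document_counts if nt > 0]
    return sum(idfs) / len(idfs) if idfs else 0.0


def avg_ictf(term_count, term_collection_counts):
    """Mean of log(term_coll / tf_coll) over the query terms that occur in the index.

    term_count             -- term_coll, hypothetically index.term_count
    term_collection_counts -- tf_coll per query term, hypothetically term(n).collection_count
    """
    ictfs = [math.log(term_count / tf) for tf in term_collection_counts if tf > 0]
    return sum(ictfs) / len(ictfs) if ictfs else 0.0


# Illustrative numbers only: a two-term query against a 1000-document collection.
print(avg_idf(1000, [10, 100]))
print(avg_ictf(50000, [25, 400]))
```

Terms missing from the index are skipped rather than smoothed; a production feature would need an explicit choice for that case.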

Describe alternatives you've considered
Use an export that we already have (vespa visit) and run a PySpark job that counts individual terms, individual documents, and the documents containing each query term (based on the preceding "individual terms" computation).
This solution is very expensive in both compute and development effort, never up to date, and inaccurate (because it is difficult to recreate the same tokenizer, for example).
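
For reference, the counting logic of such an offline job can be sketched in plain Python (used here as a single-machine stand-in for the PySpark job). The document shape (`{"id": ..., "fields": {...}}`), the field name `title`, and the whitespace tokenizer are assumptions for the example. The tokenizer in particular is exactly where the result would diverge from what the index really contains.

```python
from collections import Counter


def tokenize(text):
    # Assumption: naive lowercase whitespace split, which will not match
    # the linguistic processing applied at indexing time.
    return text.lower().split()


def count_statistics(exported_docs, field):
    """Single pass over exported documents, returning (N, term_coll, Nt, tf_coll)."""
    doc_count = 0
    term_total = 0
    doc_freq = Counter()   # Nt: number of documents containing each term
    coll_freq = Counter()  # tf_coll: total occurrences of each term
    for doc in exported_docs:
        tokens = tokenize(doc.get("fields", {}).get(field, ""))
        doc_count += 1
        term_total += len(tokens)
        coll_freq.update(tokens)
        doc_freq.update(set(tokens))  # count each term once per document
    return doc_count, term_total, doc_freq, coll_freq


# Hypothetical exported documents, for illustration only.
exported = [
    {"id": "id:ns:doc::1", "fields": {"title": "Vespa index statistics"}},
    {"id": "id:ns:doc::2", "fields": {"title": "index statistics for index features"}},
]
n, term_coll, nt, tf_coll = count_statistics(exported, "title")
print(n, term_coll, nt["index"], tf_coll["index"])
```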

Additional context
An example of a query feature that cannot yet be implemented easily:
image
Another example:
image

kkraune added this to the later milestone Jul 31, 2024
kkraune assigned geirst and unassigned bratseth Jul 31, 2024