Index statistics for classical Information Retrieval features #32002

Plenitude-ai · 2024-07-24T12:32:31Z

Is your feature request related to a problem? Please describe.
My problem arose when I wanted to add classical Information Retrieval query features. Most of them are described in the following paper: Query Performance Prediction.
To compute them, we need some key statistics about the index:

ql : query length (number of words in the query), which is already implemented with queryTermCount
N : the number of documents in the whole collection
Nt : the number of documents in which the query term t appears
term_coll : the number of terms in the whole collection
tf_coll : the number of occurrences of a query term in the whole collection
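
To make the use of these statistics concrete, here is a minimal sketch of two predictors commonly used in query performance prediction: the average inverse document frequency, which needs N and Nt, and the average inverse collection term frequency, which needs term_coll and tf_coll. The function names and the sample numbers are made up for illustration, and the exact formulas in the paper may differ (for example in smoothing or log base).

```python
import math

def avg_idf(doc_freqs, num_docs):
    """Mean of log(N / Nt) over the query terms that occur in the index."""
    idfs = [math.log(num_docs / nt) for nt in doc_freqs if nt > 0]
    return sum(idfs) / len(idfs) if idfs else 0.0

def avg_ictf(coll_freqs, coll_terms):
    """Mean of log(term_coll / tf_coll) over the query terms that occur in the index."""
    ictfs = [math.log(coll_terms / tf) for tf in coll_freqs if tf > 0]
    return sum(ictfs) / len(ictfs) if ictfs else 0.0

# Hypothetical statistics for a two-term query (ql = 2).
N = 1000            # documents in the collection
term_coll = 50000   # terms in the collection
Nt = [10, 100]      # document count of each query term
tf_coll = [25, 500] # collection count of each query term

print(avg_idf(Nt, N))
print(avg_ictf(tf_coll, term_coll))
```

Terms that do not occur in the index are skipped here to avoid a division by zero; that is a design choice of this sketch, not something the paper prescribes.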

Describe the solution you'd like
I'd like these key index statistics to be exposed, since none of them except ql is currently available in Vespa.
This way they could be used to create our Information Retrieval feature functions.
As Jon Bratseth explained in a dedicated Slack thread, Vespa follows the architecture of a positional index. This means the implementation could be fairly straightforward:

N : the number of documents in the whole collection -> sum of posting list lengths
Nt : the number of documents in which the query term t appears -> could be the length of the posting list for t
term_coll : the number of terms in the whole collection -> number of posting lists (size of the dictionary)
tf_coll : the number of occurrences of a query term t in the whole collection -> as it's a positional index, we can sum the number of occurrences in each document of the posting list for t
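
As an illustration only, the sketch below derives these statistics from a toy in-memory positional index (term -> document id -> positions). It says nothing about how Vespa stores its index; every name in it is hypothetical. Two assumptions are worth flagging: N is counted as the number of distinct document ids, and both the dictionary size and the total token count are shown, since "number of terms in the collection" can be read either way.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Build term -> {doc_id: [positions]} using a naive whitespace tokenizer."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, token in enumerate(text.lower().split()):
            index[token].setdefault(doc_id, []).append(pos)
    return index

def document_count(index):
    """N: number of distinct document ids seen in any posting list."""
    return len({doc_id for postings in index.values() for doc_id in postings})

def term_document_count(index, term):
    """Nt: length of the posting list for the term."""
    return len(index.get(term, {}))

def term_collection_count(index, term):
    """tf_coll: occurrences of the term summed over its posting list."""
    return sum(len(positions) for positions in index.get(term, {}).values())

def dictionary_size(index):
    """Number of posting lists, i.e. number of distinct terms."""
    return len(index)

def total_token_count(index):
    """Total number of term occurrences in the collection."""
    return sum(len(p) for postings in index.values() for p in postings.values())

# Made-up documents for illustration.
docs = {1: "the cat sat on the mat", 2: "the dog sat", 3: "a cat and a dog"}
index = build_positional_index(docs)

print(document_count(index), term_document_count(index, "cat"))
print(term_collection_count(index, "the"))
print(dictionary_size(index), total_token_count(index))
```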

As for how the term-related statistics are exposed, I think following the same pattern as term(n).significance would be nice.
We could for example add:

term(n).document_count
term(n).collection_count

As for how the collection-related statistics are exposed, I guess query features might not be the best place for them, but I don't know Vespa well enough to propose a better one. They could, however, be named like:

index.document_count
index.term_count OR dictionary.size

Describe alternatives you've considered
Use an export that we have (vespa visit) and run a PySpark job that counts individual terms, individual documents, and the documents containing each query term (derived from the preceding "individual terms" computation).
This solution is very expensive in both compute and development time, is never up to date, and is not accurate (because it is difficult to re-create the same tokenizer, for example).
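
The tokenizer problem can be shown with a small single-machine stand-in for the offline job (plain Python rather than PySpark). The two tokenizers and the sample texts below are hypothetical, and neither tokenizer is meant to match Vespa's linguistics; the point is only that the per-term document counts depend on the tokenizer used.

```python
import re
from collections import Counter

def doc_freqs(docs, tokenize):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for text in docs:
        df.update(set(tokenize(text)))  # set(): count each term once per document
    return df

def whitespace_tokenize(text):
    """Split on whitespace only, so punctuation stays attached to tokens."""
    return text.lower().split()

def word_tokenize(text):
    """Keep runs of word characters, dropping punctuation."""
    return re.findall(r"\w+", text.lower())

# Made-up exported documents.
docs = ["Vespa is fast.", "vespa, the engine", "fast search"]

df_ws = doc_freqs(docs, whitespace_tokenize)
df_word = doc_freqs(docs, word_tokenize)

# The same term gets a different document count under each tokenizer.
print(df_ws["vespa"], df_word["vespa"])
```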

Additional context
An example of a query feature that cannot yet be implemented easily:

Another example:

olaaun assigned bratseth Jul 25, 2024

kkraune added this to the later milestone Jul 31, 2024

kkraune assigned geirst and unassigned bratseth Jul 31, 2024

kkraune added the enhancement label Jul 31, 2024
