Is your feature request related to a problem? Please describe.
The problem arose when I wanted to add classical Information Retrieval query features. Most of them are described in the paper Query Performance Prediction.
For this, we need some key statistics about the index:
ql: query length (number of words in the query), which is already available as queryTermCount
N: the number of documents in the whole collection
Nt: the number of documents in which the query term t appears
term_coll: the number of terms in the whole collection
tf_coll: the number of occurrences of a query term in the whole collection
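To make the use case concrete, here is a minimal Python sketch of how such statistics might feed a few pre-retrieval predictors (averaged and maximum IDF, averaged ICTF). It assumes the common variants idf = log(N / Nt) and ictf = log(term_coll / tf_coll); the exact formulas in the paper may differ. The names `TermStats` and `query_features` are hypothetical and not part of Vespa.

```python
import math
from dataclasses import dataclass


@dataclass
class TermStats:
    """Per-term statistics for one query term (hypothetical container)."""
    n_t: int      # Nt: number of documents containing the term
    tf_coll: int  # tf_coll: occurrences of the term in the whole collection


def idf(n_docs: int, n_t: int) -> float:
    # Assumed variant: log(N / Nt). Requires n_t > 0.
    return math.log(n_docs / n_t)


def ictf(term_coll: int, tf_coll: int) -> float:
    # Assumed variant: log(term_coll / tf_coll). Requires tf_coll > 0.
    return math.log(term_coll / tf_coll)


def query_features(terms, n_docs, term_coll):
    """Aggregate per-term statistics into query-level features."""
    if not terms:
        raise ValueError("query must contain at least one term")
    idfs = [idf(n_docs, t.n_t) for t in terms]
    ictfs = [ictf(term_coll, t.tf_coll) for t in terms]
    return {
        "ql": len(terms),
        "avg_idf": sum(idfs) / len(idfs),
        "max_idf": max(idfs),
        "avg_ictf": sum(ictfs) / len(ictfs),
    }
```

Terms that do not occur in the index (Nt = 0) would need separate handling, for example by skipping them before aggregation.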
Describe the solution you'd like
I'd like these key index statistics to be exposed, since none of them except ql is currently available in Vespa.
They could then be used to build our Information Retrieval feature functions.
As Jon Bratseth explained in a dedicated Slack thread, Vespa follows the architecture of a positional index, so the implementation could be fairly straightforward:
N: the number of documents in the whole collection -> sum of postings lengths
Nt: the number of documents in which the query term t appears -> could be the length of the postings list for t
term_coll: the number of terms in the whole collection -> number of postings lists (size of the dictionary)
tf_coll: the number of occurrences of a query term t in the whole collection -> since it is a positional index, we can sum the number of occurrences in each document of the postings list for t
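As an illustration of the mapping above, the following Python sketch builds a toy in-memory positional index and reads the per-term counts and the dictionary size off it. It is only a model of the idea, not how Vespa stores its index: the whitespace tokenizer, the dict-based layout, and all function names are assumptions made for the example.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map term -> {doc_id: [positions]} using naive whitespace tokenization."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def document_count(index, term):
    # Nt: length of the postings list for the term (0 if the term is unknown).
    return len(index.get(term, {}))


def collection_count(index, term):
    # tf_coll: sum of the per-document occurrence counts in the postings list.
    return sum(len(positions) for positions in index.get(term, {}).values())


def dictionary_size(index):
    # Number of postings lists, i.e. number of distinct terms.
    return len(index)


docs = {1: "the cat sat", 2: "the cat saw the dog", 3: "a dog"}
index = build_positional_index(docs)
```

With these three documents, `document_count(index, "the")` counts the two documents containing "the", while `collection_count(index, "the")` counts all three of its occurrences.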
For the term-related statistics, exposing them the same way as term(n).significance would be nice.
We could for example add:
term(n).document_count
term(n).collection_count
For the collection-related statistics, query features might not be the best place, but I don't know Vespa well enough to propose a better one. They could, however, be named like:
index.document_count
index.term_count OR dictionary.size
Describe alternatives you've considered
Use an export that we already have (vespa visit) and run a PySpark job that counts individual terms, individual documents, and the documents containing each query term (reusing the preceding individual-terms computation).
This solution is very expensive in both compute and development effort, is never up to date, and is not accurate (for example, because it is difficult to reproduce the same tokenizer).
Additional context
An example of a query feature that cannot yet be implemented easily:
Another example: