title | abstract | layout | series | publisher | issn | id | month | tex_title | firstpage | lastpage | page | order | cycles | bibtex_author | author | date | address | container-title | volume | genre | issued | extras | ||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
A Bayesian nonparametric approach to count-min sketch under power-law data streams |
The count-min sketch (CMS) is a randomized data structure that provides estimates of tokens’ frequencies in a large data stream using a compressed representation of the data by random hashing. In this paper, we rely on a recent Bayesian nonparametric (BNP) view on the CMS to develop a novel learning-augmented CMS under power-law data streams. We assume that tokens in the stream are drawn from an unknown discrete distribution, which is endowed with a normalized inverse Gaussian process (NIGP) prior. Then, using distributional properties of the NIGP, we compute the posterior distribution of a token’s frequency in the stream, given the hashed data, and in turn corresponding BNP estimates. Applications to synthetic and real data show that our approach achieves a remarkable performance in the estimation of low-frequency tokens. This is known to be a desirable feature in the context of natural language processing, where it is indeed common in the context of the power-law behaviour of the data. |
inproceedings |
Proceedings of Machine Learning Research |
PMLR |
2640-3498 |
dolera21a |
0 |
A Bayesian nonparametric approach to count-min sketch under power-law data streams |
226 |
234 |
226-234 |
226 |
false |
Dolera, Emanuele and Favaro, Stefano and Peluchetti, Stefano |
|
2021-03-18 |
Proceedings of The 24th International Conference on Artificial Intelligence and Statistics |
130 |
inproceedings |
|
|