Skip to content

Infrastructure for managing large arrays of counters

License

Apache-2.0, LGPL-2.1 licenses found

Licenses found

Apache-2.0
LICENSE-Apache-2.0
LGPL-2.1
LICENSE-LGPL-2.1-or-later
Notifications You must be signed in to change notification settings

vigna/card-est-array-rs

Repository files navigation

Arrays of Cardinality Estimators

downloads dependents GitHub CI license Line Count Latest Version Documentation Coverage Status

Infrastructure for managing large arrays of cardinality estimators.

Why

A cardinality estimator, AKA (probabilistic) sketch, AKA probabilistic counter is a probabilistic data structure that has an “add” primitive similarly to a dictionary, but does not have a corresponding “contains”: it is just possible to query the size of the estimator. In other words, a cardinality estimator remembers the number of distinct elements that have been added to it.

The returned size is only an approximation of the real size, and the precision can be tuned, but in exchange cardinality estimators use very little space: for example, a HyperLogLog cardinality estimator uses 2ᵇlog log n bits to achieve an average relative error of 1.04/√2ᵇ (log log n ≤ 6 for all practical datasets).

It is common, for example, to use cardinality estimators to measure the number of unique users in click streams. But more interesting applications use the fact that many cardinality estimators are mergeable, which means that given two estimators it is possible to compute (in time linear in the size of the estimators) a new estimator containing the union of the elements that have been added to the two original estimators.

This idea is at the core of ANF, an algorithm for the computation of the neighborhood function (the function telling how many pairs of nodes are within distance t) that used Flajolet–Martin cardinality estimators (then called probabilistic counters). The technique was then extended to log-logarithmic cardinality estimators, with a significant reduction of the memory footprint, using broadword programming to merge such estimators; it became also evident that it could be used to compute many other interesting properties, such as the distance distribution and all centralities based on the node neighborhood functions (the functions telling, for each node, how many other nodes are within distance t). The HyperBall algorithm, distributed with the WebGraph framework, is a highly engineered implementation of these ideas. It has been used, for example, to compute the degrees of separation of Facebook.

The purpose of this crate is to lay the foundation of the infrastructure that is necessary to implement HyperBall in the Rust port of the WebGraph framework. We provide implementation of cardinality estimators and structures handling large arrays of estimators sharing the same parameters and logic with minimal memory overhead. Sharing parameters is essential for scaling to billions of estimators, and this is why a separate structure for arrays of estimators is necessary—just putting a billion estimators in a vector would waste a large amount of space.

For this reason, sometimes the traits give a low-level feeling. While single estimators are encapsulated in suitable structures, arrays of estimators are made of the concatenation of estimator backends, and each estimator is manipulated applying suitable methods to the backend. But it is the responsibility of the user that backends and logics are matched.

Acknowledgments

This software has been partially supported by project SERICS (PE00000014) under the NRRP MUR program funded by the EU - NGEU, and by project ANR COREGRAPHIE, grant ANR-20-CE23-0002 of the French Agence Nationale de la Recherche. Views and opinions expressed are however those of the authors only and do not necessarily reflect those of the European Union or the Italian MUR. Neither the European Union nor the Italian MUR can be held responsible for them.

About

Infrastructure for managing large arrays of counters

Resources

License

Apache-2.0, LGPL-2.1 licenses found

Licenses found

Apache-2.0
LICENSE-Apache-2.0
LGPL-2.1
LICENSE-LGPL-2.1-or-later

Stars

Watchers

Forks

Packages

No packages published

Languages