A Go implementation of HyperLogLog that is storage-compatible with the Aggregate Knowledge HLL Storage Spec.
HyperLogLog (HLL) is a fixed-size, set-like structure used for distinct value counting with tunable precision. For example, in 1280 bytes HLL can estimate the count of tens of billions of distinct values with only a few percent error.
In addition to the algorithm proposed in the original paper, this implementation is augmented to improve its accuracy and memory use without sacrificing much speed.
While there are a handful of existing HLL implementations in Go, none of them implement the AK Storage Spec.
The unified storage format is useful for reading and writing HLLs in a multi-lingual environment. At Segment, most of
our runtime code is written in Go, but we frequently persist data to PostgreSQL for long-term storage. The Postgres HLL
plugin is fairly ubiquitous--it's available for the standalone server, AWS RDS, AWS Aurora, and CitusDB.
An excellent description for the motivation behind the storage strategy can be found in the Java HLL library's README.
A good hashing algorithm is crucial to achieving the pseudorandomness that HLL requires in order to perform its calculations. The 64 bit variant of MurmurHash3 is recommended. If using a seed, it must be constant for all inputs to a given HLL. Further, if that HLL is to be unioned, then the same seed must be used for all inputs to the other HLL.
See the Java HLL README for a discussion on why MurmurHash3 is a good choice.
The API is intended to be as similar as possible to Java HLL and Postgresql HLL. There are a couple of features, though, that make it more friendly to Go Programmers.
If default settings are specified using the hll.Defaults
function, the zero value can be used directly as an empty HLL.
Since its impossible to reason about an HLL without the settings, operations on a zero value in lieu of default settings will panic.
The other HLL implementations allow for two HLLs to be union-ed even if their log2m or regwidth parameters differ.
However, doing so can produce wildly inaccurate results. This library provides an additional StrictUnion
operation
that will return an error if attempting a union on HLLs with incompatible settings.
Dependencies are managed with Go Modules. Accordingly, this project requires Go version 1.12 or later.
make test
package main
import (
"fmt"
"github.com/segmentio/go-hll"
)
func main() {
// install default settings.
hll.Defaults(hll.Settings{
Log2m: 10,
Regwidth: 4,
ExplicitThreshold: hll.AutoExplicitThreshold,
SparseEnabled: true,
})
// add elements.
h := hll.Hll{}
h.AddRaw(123456789)
fmt.Print(h.Cardinality()) // prints "1"
// union Hlls
h2 := hll.Hll{}
h2.AddRaw(123456789)
h2.AddRaw(987654321)
h2.Union(h)
fmt.Print(h2.Cardinality()) // prints "2"
// write to/read from bytes.
h3, _ := hll.FromBytes(h2.ToBytes())
fmt.Print(h3.Cardinality()) // prints "2"
}
There is a compatibility battery run as part of the unit tests. The battery was produced by the Test Generator in the java-hll package.
- HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm
- Understanding the HyperLogLog
Released under the MIT license.