Skip to content

Go implementation of HLL that plays nicely with other languages

License

Notifications You must be signed in to change notification settings

segmentio/go-hll

Repository files navigation

go-hll Build and Test Go Report Card GoDoc

A Go implementation of HyperLogLog that is storage-compatible with the Aggregate Knowledge HLL Storage Spec.

Overview

HyperLogLog (HLL) is a fixed-size, set-like structure used for distinct value counting with tunable precision. For example, in 1280 bytes HLL can estimate the count of tens of billions of distinct values with only a few percent error.

In addition to the algorithm proposed in the original paper, this implementation is augmented to improve its accuracy and memory use without sacrificing much speed.

Motivation

While there are a handful of existing HLL implementations in Go, none of them implement the AK Storage Spec.
The unified storage format is useful for reading and writing HLLs in a multi-lingual environment. At Segment, most of our runtime code is written in Go, but we frequently persist data to PostgreSQL for long-term storage. The Postgres HLL plugin is fairly ubiquitous--it's available for the standalone server, AWS RDS, AWS Aurora, and CitusDB.

An excellent description for the motivation behind the storage strategy can be found in the Java HLL library's README.

Hashing

A good hashing algorithm is crucial to achieving the pseudorandomness that HLL requires in order to perform its calculations. The 64 bit variant of MurmurHash3 is recommended. If using a seed, it must be constant for all inputs to a given HLL. Further, if that HLL is to be unioned, then the same seed must be used for all inputs to the other HLL.

See the Java HLL README for a discussion on why MurmurHash3 is a good choice.

Adaptations to Go

The API is intended to be as similar as possible to Java HLL and Postgresql HLL. There are a couple of features, though, that make it more friendly to Go Programmers.

Zero Value

If default settings are specified using the hll.Defaults function, the zero value can be used directly as an empty HLL.

Since its impossible to reason about an HLL without the settings, operations on a zero value in lieu of default settings will panic.

StrictUnion

The other HLL implementations allow for two HLLs to be union-ed even if their log2m or regwidth parameters differ. However, doing so can produce wildly inaccurate results. This library provides an additional StrictUnion operation that will return an error if attempting a union on HLLs with incompatible settings.

Building

Dependencies are managed with Go Modules. Accordingly, this project requires Go version 1.12 or later.

Test

make test

Usage

package main 

import (
	"fmt"
	
	"github.com/segmentio/go-hll"
)

func main() {
	
	// install default settings.
	hll.Defaults(hll.Settings{
		Log2m: 10,
		Regwidth: 4,
		ExplicitThreshold: hll.AutoExplicitThreshold,
		SparseEnabled: true,
	})
	
	// add elements.
	h := hll.Hll{}
	h.AddRaw(123456789)
	fmt.Print(h.Cardinality())  // prints "1"
	
	// union Hlls
	h2 := hll.Hll{}
	h2.AddRaw(123456789)
	h2.AddRaw(987654321)
	h2.Union(h)
	fmt.Print(h2.Cardinality()) // prints "2"
 
	// write to/read from bytes. 
	h3, _ := hll.FromBytes(h2.ToBytes())
	fmt.Print(h3.Cardinality()) // prints "2"
}

There is a compatibility battery run as part of the unit tests. The battery was produced by the Test Generator in the java-hll package.

Additional Resources

License

Released under the MIT license.