Skip to content

Flajolet–Martin algorithm

Janvi Talreja edited this page Jan 20, 2023 · 5 revisions

Introduction

The Flajolet–Martin algorithm efficiently approximates the number of distinct elements in a stream.

Problem Statement

Finding the number of distinct elements in a data stream with repeated elements.

But why do we need distinct elements?

It has a major application in database query optimizations. Suppose, we want to get the intersection of two collections of data. There can be different ways to do this:

  1. Sort A, search each element of B in A, and retain it if it appears in A;
  2. sort A, sort B, then perform a merge-like operation to determine the intersection;
  3. eliminate duplicates in A and/or B using hashing or hash filters, then perform Algorithm 1 or 2.

These strategies have cost and the sizes of the records, and the number of distinct elements plays a major role in the cost.

The Algorithm:

  1. Consider a non-empty stream S and initializemax to 0
  2. For every element in the Stream S
    1. Find hash value ‘h’
    2. Convert value of ‘h’ into binary value as ‘b’
    3. Find the number of trailing 0’s and set it to ‘n’
    4. If (n>max)
  3. Max=n
  4. CountDistinctElements = 2n

Example

  1. Input streams x=1,3,2,1,2,3,4,3,1,2,3,1

  2. Hash function

  3. H(x)=6x + 1 mod 5

  4. H(1)=6(1)+1 mod 5 =7 mod 5 =2

  5. Input stream =1,3,2,1,2,3,4,3,1,2,3,1

  6. h(x)=6x + 1 mod 5

h(1)=2

h(3)=4

h(2)=3

h(1)=2

h(2)=3

h(3)=4

h(4)=0

h(3)=4

h(1)=2

h(2)=3

h(3)=3

h(1)=2

  1. Binary representation

h(1)=2=010

h(3)=4=100

h(2)=3=011

h(1)=2=010

h(2)=3=011

h(3)=4 =100

h(4)=0=000

h(3)=4=100

h(1)=2=010

h(2)=3=011

h(3)=3=011

h(1)=2=010

The highest number of trailing zeros is 2.

No. of distinct elements is 2^2 = 4 i.e 1,2,3,4

Clone this wiki locally