-
Notifications
You must be signed in to change notification settings - Fork 0
Flajolet–Martin algorithm
The Flajolet–Martin algorithm efficiently approximates the number of distinct elements in a stream.
Finding the number of distinct elements in a data stream with repeated elements.
But why do we need distinct elements?
It has a major application in database query optimizations. Suppose, we want to get the intersection of two collections of data. There can be different ways to do this:
- Sort A, search each element of B in A, and retain it if it appears in A;
- sort A, sort B, then perform a merge-like operation to determine the intersection;
- eliminate duplicates in A and/or B using hashing or hash filters, then perform Algorithm 1 or 2.
These strategies have cost and the sizes of the records, and the number of distinct elements plays a major role in the cost.
The Algorithm:
- Consider a non-empty stream S and initializemax to 0
- For every element in the Stream S
- Find hash value ‘h’
- Convert value of ‘h’ into binary value as ‘b’
- Find the number of trailing 0’s and set it to ‘n’
- If (n>max)
- Max=n
- CountDistinctElements = 2n
Example
-
Input streams x=1,3,2,1,2,3,4,3,1,2,3,1
-
Hash function
-
H(x)=6x + 1 mod 5
-
H(1)=6(1)+1 mod 5 =7 mod 5 =2
-
Input stream =1,3,2,1,2,3,4,3,1,2,3,1
-
h(x)=6x + 1 mod 5
h(1)=2
h(3)=4
h(2)=3
h(1)=2
h(2)=3
h(3)=4
h(4)=0
h(3)=4
h(1)=2
h(2)=3
h(3)=3
h(1)=2
- Binary representation
h(1)=2=010
h(3)=4=100
h(2)=3=011
h(1)=2=010
h(2)=3=011
h(3)=4 =100
h(4)=0=000
h(3)=4=100
h(1)=2=010
h(2)=3=011
h(3)=3=011
h(1)=2=010
The highest number of trailing zeros is 2.
No. of distinct elements is 2^2 = 4 i.e 1,2,3,4