Excessive errors when inserting ~1M values #11

Open
kokes opened this issue Sep 28, 2018 · 4 comments

@kokes
Contributor

kokes commented Sep 28, 2018

Maybe I'm doing something horribly wrong, but I'm getting cardinality errors way beyond the specified error bound. All I do is run this simple snippet, and the estimates start going off track at about 700k:

package main

import (
	"fmt"
	"math/rand"

	"github.com/mynameisfiber/gohll"
)

func main() {
	sizes := []int{1000, 10000, 100000, 500000, 750000, 1000000, 5000000, 10000000, 25000000}
	for _, sz := range sizes {
		h, err := gohll.NewHLLByError(0.001)
		if err != nil {
			panic(err)
		}

		for i := 0; i < sz; i++ {
			h.Add(fmt.Sprintf("%d", rand.Uint32()))
		}
		c := h.Cardinality()
		fmt.Printf("expected: %v, got: %.2f, err: %.2f%%\n", sz, c, 100*(1-float64(sz)/c))
	}
}

This is what I got:

$ go run first.go 
expected: 1000, got: 1000.01, err: 0.00%
expected: 10000, got: 9999.49, err: -0.01%
expected: 100000, got: 99989.83, err: -0.01%
expected: 500000, got: 499939.97, err: -0.01%
expected: 750000, got: 1904282.61, err: 60.62%
expected: 1000000, got: 2048181.58, err: 51.18%
expected: 5000000, got: 5142333.09, err: 2.77%
expected: 10000000, got: 9987380.00, err: -0.13%
expected: 25000000, got: 24947424.06, err: -0.21%

When I bisected this around the 700k mark, I found that it breaks at the 640k mark:

$ go run first.go 
expected: 600000, got: 599886.58, err: -0.02%
expected: 610000, got: 610057.31, err: 0.01%
expected: 620000, got: 620025.35, err: 0.00%
expected: 630000, got: 629970.89, err: -0.00%
expected: 640000, got: 1842767.34, err: 65.27%
expected: 650000, got: 1848513.38, err: 64.84%
expected: 660000, got: 1853794.67, err: 64.40%
expected: 670000, got: 1859316.35, err: 63.97%
expected: 680000, got: 1865042.87, err: 63.54%
expected: 690000, got: 1871086.33, err: 63.12%
expected: 700000, got: 1876002.44, err: 62.69%
expected: 710000, got: 1881378.30, err: 62.26%
expected: 720000, got: 1887115.06, err: 61.85%
expected: 730000, got: 1892798.72, err: 61.43%
expected: 740000, got: 1898367.82, err: 61.02%
expected: 750000, got: 1904090.15, err: 60.61%

I'm not familiar enough with the design of the library to comment any further at this point, but let me know if you need a bit more investigation into this.

@mynameisfiber
Owner

Hrmm... this is really interesting! I think it has to do with the transition from sparse mode to normal mode. That being said, I haven't played with this library in quite a while, so I've lost a sense of at which cardinalities this switch happens for various error rates.

If you could dig a bit deeper, that would be amazing! If not, I'll try to get to it in the next couple weeks.
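
For reference, a rough way to see which precision a given error rate implies, using the textbook HLL relation error ≈ 1.04 / sqrt(2^P) (gohll's own rounding may differ, so treat these as ballpark figures):

package main

import (
	"fmt"
	"math"
)

func main() {
	// Textbook HLL relation: stderr ≈ 1.04 / sqrt(m) with m = 2^P registers,
	// so P = ceil(log2((1.04 / err)^2)). NOTE: gohll may round or clamp differently.
	for _, errRate := range []float64{0.01, 0.005, 0.002, 0.001} {
		p := math.Ceil(math.Log2(math.Pow(1.04/errRate, 2)))
		m := math.Pow(2, p)
		fmt.Printf("err=%.3f -> P=%.0f, m=%.0f registers\n", errRate, p, m)
	}
}

By that formula, the 0.001 used above comes out to P = 21 and roughly 2.1M registers.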

@mynameisfiber
Owner

So a couple things I've found:

  • The problem only occurs when the HLL is built with P >= 19 (see the sketch below for bracketing this from the public API)
  • It happens when the HLL is converted from SPARSE to NORMAL mode
  • It happens with both MMH3 and FNV1a hashes
  • The encode/decode system of the sparse list seems to work even with P >= 19
  • I modified the toNormal routine to first merge in the tempList and then do the whole register procedure, and this doesn't fix the problem
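
If it helps, here's a sketch for bracketing that P >= 19 threshold from the public API alone. The two error rates are picked via the textbook relation above to land at roughly P = 18 and P = 19, and may need adjusting if gohll derives P differently:

package main

import (
	"fmt"
	"math/rand"

	"github.com/mynameisfiber/gohll"
)

func main() {
	const n = 1000000
	// Error rates chosen to map to roughly P=18 and P=19 under the textbook
	// relation; adjust if gohll rounds differently.
	for _, errRate := range []float64{0.0021, 0.0015} {
		h, err := gohll.NewHLLByError(errRate)
		if err != nil {
			panic(err)
		}
		for i := 0; i < n; i++ {
			h.Add(fmt.Sprintf("%d", rand.Uint32()))
		}
		fmt.Printf("errRate=%.4f: expected %d, got %.2f\n", errRate, n, h.Cardinality())
	}
}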

@mynameisfiber
Owner

mynameisfiber commented Oct 19, 2018

Actually... I take some of that back. I think the problem is when the HLL precision and the SparseList precision are too close to each other. In this code we have the sparse precision hard-coded to 25. This could cause a problem in the first branch of auxillary.decodeHash.

I'm going to play around with having a dynamically picked sparse precision... but that may have to wait a bit!
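
A back-of-the-envelope way to see why that closeness matters, assuming the sparse encoding follows the HLL++ paper (a hash gets the explicit-rho encoding exactly when the bits between positions P and P' = 25 are all zero; gohll's actual bit layout may differ):

package main

import (
	"fmt"
	"math"
)

func main() {
	// The probability that a hash needs the explicit-rho encoding is 2^-(P'-P),
	// so the closer P gets to the hard-coded P' = 25, the more often that branch
	// of decodeHash runs, amplifying any bug in it.
	const pPrime = 25
	for _, p := range []int{14, 18, 19, 21} {
		frac := math.Pow(2, float64(p-pPrime))
		fmt.Printf("P=%d: %.4f%% of hashes take the explicit-rho branch\n", p, 100*frac)
	}
}

At P = 21 that's roughly 1 hash in 16, versus 1 in 2048 at P = 14, which would fit the problem only showing up at high precisions.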

@kokes
Contributor Author

kokes commented Oct 21, 2018

I suspect it's really the "normal" calculation, which is only correct once the number of elements is beyond ~1M. The above example is correct below 640k because that range is still calculated using the sparse method. If you force the normal method from the start, the numbers are way off from the get-go:

(...)
h, err := gohll.NewHLLByError(0.001)
h.ToNormal()
(...)

Results in

$ go run err.go 
expected: 1000, got: 1513154.88, err: 99.93%
expected: 10000, got: 1517511.97, err: 99.34%
expected: 100000, got: 1561267.99, err: 93.59%
expected: 500000, got: 1766954.08, err: 71.70%
expected: 750000, got: 1904282.61, err: 60.62%
expected: 1000000, got: 2048181.58, err: 51.18%
expected: 5000000, got: 5142333.09, err: 2.77%
expected: 10000000, got: 9987380.00, err: -0.13%
expected: 25000000, got: 24947424.06, err: -0.21%
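
For what it's worth, that lines up with the precision implied by err = 0.001. A minimal sanity check, assuming the conventional HLL thresholds (the raw estimate is only trusted above ~2.5*m, with linear counting meant to cover the small range below it; gohll's actual cutoffs are an assumption here):

package main

import (
	"fmt"
	"math"
)

func main() {
	// err=0.001 implies P=21 under the textbook relation, i.e. m = 2^21 registers.
	m := math.Pow(2, 21)
	fmt.Printf("m = %.0f registers, raw estimator trusted above ~%.0f\n", m, 2.5*m)
	// Prints: m = 2097152 registers, raw estimator trusted above ~5242880
}

That ~5.2M figure matches the forced-normal numbers only becoming accurate around the 5M mark, so the small-range correction (or the way toNormal populates the registers feeding it) might be the part to look at.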
