
Tunable error rate #19

Open
nathanhack opened this issue Sep 10, 2020 · 8 comments

@nathanhack

nathanhack commented Sep 10, 2020

It would be great to have a way to change/tune the false negative rate.

@thomasmueller

I assume you mean the false positive rate, right? I know some libraries provide functions to, e.g., calculate the bits per entry for a target false positive rate, or the false positive rate for a given bits per entry, or the size of the filter (memory used) from the rate and entry count, and so on. Yes, I agree it would be a good addition!
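
For reference, a rough sketch of what such helpers could look like, using the standard approximations (a Bloom filter needs about -log2(fpp)/ln 2 ≈ 1.44·log2(1/fpp) bits per entry; a xor filter with k-bit fingerprints gives fpp ≈ 2^-k at roughly 1.23·k bits per entry). The function names are hypothetical, not part of this repo:

```go
package xorfilter

import "math"

// BloomBitsPerEntry estimates the bits per entry a Bloom filter needs to
// reach a target false positive probability fpp (standard approximation).
func BloomBitsPerEntry(fpp float64) float64 {
	return -math.Log2(fpp) / math.Ln2
}

// XorFpp estimates the false positive probability of a xor filter that
// stores k-bit fingerprints: roughly 2^-k.
func XorFpp(k int) float64 {
	return math.Pow(2, -float64(k))
}

// XorBitsPerEntry estimates the bits per entry of a xor filter with k-bit
// fingerprints, using the ~1.23x space overhead of the construction.
func XorBitsPerEntry(k int) float64 {
	return 1.23 * float64(k)
}
```

For example, XorFpp(8) ≈ 0.0039 (about 0.39%), which is what the current uint8 fingerprints give.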

@nathanhack

Sorry, yes, I meant false positive rate.

@nathanhack

In particular, I'm interested in the case of setting a fixed size and letting the false positive rate fluctuate as the keys are inserted.
@thomasmueller do you have any suggestions on how the repo might be changed to include such a feature? As time permits I'll work on it, and if/when I get it done I can put in a PR, if that is something the repo owners would like.

@thomasmueller

> setting a fixed size and letting the false positive rate fluctuate as the keys are inserted.

A xor filter, unlike a Bloom filter, is built once all keys are present. A Bloom filter allows you to add entries afterwards, but a xor filter does not. So what you are looking for might not be possible, I'm afraid...
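
To illustrate the build-once model: the full key set goes into a single Populate call, and the resulting filter is query-only afterwards. A minimal sketch, assuming the Populate/Contains API of this repo:

```go
package main

import (
	"fmt"

	"github.com/FastFilter/xorfilter"
)

func main() {
	// All keys must be known up front; the filter is built in one shot.
	keys := []uint64{1, 2, 3, 42, 1000}
	filter, err := xorfilter.Populate(keys)
	if err != nil {
		panic(err)
	}
	// Afterwards the filter is read-only: queries only, no further inserts.
	fmt.Println(filter.Contains(42)) // true
	fmt.Println(filter.Contains(7))  // false, except for a ~0.39% false positive chance
}
```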

@nathanhack

Yeah, to clarify, I indeed meant all at once, such that you set the size and let the error rate suffer.
After briefly looking at the paper, I would hazard a guess that the main issue would be how to choose the hash functions. If I understand it correctly, we basically keep trying new hash functions until we get everything in, but with a fixed size this would always fail. If true, then the simplest approach would be to keep track of the best set of hashes and, at the end of MaxIterations, just use the best one found. That way, if the user wishes to continue trying their luck, they can just increase MaxIterations. It seems like there should be some way to calculate the expected error rate based on the fixed size, but I don't recall seeing that explicitly written down in the paper (maybe it was, or maybe it could be derived, but I honestly don't know; it would require a thorough reading).
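
For what it's worth, a back-of-the-envelope estimate is possible if you assume the construction keeps its ~1.23 slots per key and that k-bit fingerprints give a false positive rate of about 2^-k. The helper below is hypothetical:

```go
package xorfilter

import "math"

// ExpectedFpp estimates the false positive rate achievable with a fixed
// budget of totalBits for n keys: the ~1.23*n slots leave roughly
// k = totalBits/(1.23*n) bits per fingerprint, and fpp ≈ 2^-k.
func ExpectedFpp(totalBits, n float64) float64 {
	k := totalBits / (1.23 * n)
	return math.Pow(2, -k)
}
```

For example, a 1 MiB budget (8,388,608 bits) for one million keys leaves about 6.8 bits per fingerprint, i.e. a false positive rate of roughly 0.9%.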

@thomasmueller

> I indeed meant all at once

Ah ok! Yes that works then.

> you set the size and let the error rate suffer.

The implementation here only supports 8-bit fingerprints (uint8); it doesn't support smaller or larger fingerprints right now. So I'm afraid this doesn't currently work... It's possible to add more implementations, but we believe it's not currently worth it... Of course, feel free to provide a patch!

> increase MaxIterations

No, I'm afraid that's not a solution... What you could change a tiny bit is this line:

```go
capacity := 32 + uint32(math.Ceil(1.23*float64(size)))
```

and, for example, try to use 1.229 instead of 1.23. It would work sometimes, but the probability of finding a seed drops drastically, and very quickly... It's not something I would try. With a regular Bloom filter, yes, you can just make it smaller and it will still "work" (just with a bad false positive rate), but not with a xor filter.
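
For concreteness, here is what that change amounts to, with the overhead factor pulled out as a parameter (hypothetical; the repo hard-codes 1.23):

```go
package xorfilter

import "math"

// capacityFor is the current sizing rule with the 1.23 overhead factor made
// into a parameter.
func capacityFor(size uint32, factor float64) uint32 {
	return 32 + uint32(math.Ceil(factor*float64(size)))
}
```

For one million keys, capacityFor(1000000, 1.229) allocates roughly 1,000 fewer slots than capacityFor(1000000, 1.23), about 0.08% less memory, while the chance of finding a working seed drops sharply.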

@brianolson

I implemented xor16 and I think I also got it generalized to xorN (for [9..32] bits).
https://github.com/brianolson/xorfilter/tree/xor16
There's some other stuff in there about maintaining a fork in case this work didn't make it back upstream, but that shouldn't be hard to un-diff.

@lemire

lemire commented Jun 18, 2021

@brianolson Your generic implementation seems to set the memory usage to 32 bits per entry. I understand that you can write only the least significant N bits of each word to disk or to the network, and then upsample when deserializing... but it comes at a cost.
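
A rough sketch of the packing being described, assuming the in-memory fingerprints live in a []uint32 and only the low n bits of each are meaningful; packLowBits and unpackLowBits are hypothetical helpers, not part of either repo:

```go
package xorfilter

// packLowBits serializes only the low n bits of each fingerprint,
// LSB-first, into a byte slice of ceil(len*n/8) bytes.
func packLowBits(fingerprints []uint32, n uint) []byte {
	out := make([]byte, (len(fingerprints)*int(n)+7)/8)
	bitPos := 0
	for _, fp := range fingerprints {
		for b := uint(0); b < n; b++ {
			if fp&(1<<b) != 0 {
				out[bitPos/8] |= 1 << (uint(bitPos) % 8)
			}
			bitPos++
		}
	}
	return out
}

// unpackLowBits reverses packLowBits, expanding count n-bit values back
// into full uint32 words ("upsampling" on deserialization).
func unpackLowBits(data []byte, count int, n uint) []uint32 {
	out := make([]uint32, count)
	bitPos := 0
	for i := 0; i < count; i++ {
		var fp uint32
		for b := uint(0); b < n; b++ {
			if data[bitPos/8]&(1<<(uint(bitPos)%8)) != 0 {
				fp |= 1 << b
			}
			bitPos++
		}
		out[i] = fp
	}
	return out
}
```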
