
Inconsistent BF.ADD: modify the implementation of bloom filters to achieve similar results as Redis #1287

Open
shashi-sah2003 opened this issue Nov 15, 2024 · 11 comments

@shashi-sah2003 (Contributor)

Steps to reproduce

  1. bf.add bf1 item1
  2. bf.info bf1

Expected output

The expected output when the above commands are run on Redis:

[image: BF.INFO output on Redis]

Observed output

The observed output when the above commands are run on DiceDB:

[image: BF.INFO output on DiceDB]

The steps to run the test cases are mentioned in the README of the dice-tests repository.

Expectations for resolution

This issue will be considered resolved when the following are done:

  1. Changes in the dice code to meet the expected behavior
  2. A successful run of the TCL tests

You can find the tests under the tests directory of the dice repository, and the steps to run them are in the README file. Refer to the following links to set up DiceDB and Redis 7.2.5 locally.

@shashi-sah2003 (Contributor Author)

@JyotinderSingh @apoorvyadav1111 I am going to work on this, please assign. Thanks!

@shashi-sah2003 (Contributor Author) commented Nov 15, 2024

[images: bloom.go default configuration screenshots]

@JyotinderSingh @apoorvyadav1111 from the above you can see that in bloom.go the capacity has been set to 1024 and the number of filters has been set explicitly to the number of hash functions. I believe the Redis implementation uses only one filter, but in the DiceDB implementation the number of filters is decided according to the number of bits per element. Should I go ahead and change this configuration to use only one filter with a capacity of 100?
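For context, here is a minimal sketch of the standard bloom-filter sizing math that relates capacity and target error rate to the bit-array size and number of hash functions. This is illustrative only, not the actual bloom.go; the constant and function names are made up:

```go
// Illustrative sketch of standard bloom-filter sizing, not the actual bloom.go.
package bloomsizing

import "math"

const (
	defaultCapacity  = 1024 // DiceDB default discussed above
	defaultErrorRate = 0.01
)

// optimalBits returns the bit-array size m for n items at error rate p:
// m = -n * ln(p) / (ln 2)^2
func optimalBits(n int, p float64) int {
	return int(math.Ceil(-float64(n) * math.Log(p) / (math.Ln2 * math.Ln2)))
}

// optimalHashes returns the number of hash functions k = (m/n) * ln 2.
func optimalHashes(m, n int) int {
	return int(math.Round(float64(m) / float64(n) * math.Ln2))
}
```

With capacity 1024 and errorRate 0.01 this works out to roughly 9,800 bits (~9.6 bits per element) and 7 hash functions.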

@apoorvyadav1111 (Contributor)

@shashi-sah2003, yes, please go ahead. Keep in mind that while we aim to be Redis-compliant, finer values can differ based on the implementation of the data structure. That said, bloom filters do need an upgrade, especially around expansion (currently it is fixed). Thank you for working on this, and let's stay connected on Discord for this issue.
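On the expansion point, a purely hypothetical sketch of what a scaling (expanding) bloom filter could look like: a stack of sub-filters where each new one grows by an expansion factor. All names here are illustrative, not DiceDB's planned design:

```go
// Hypothetical sketch of a scalable bloom filter: a stack of sub-filters,
// each new one larger by an expansion factor. Not DiceDB's actual design.
package bloomsketch

import "hash/fnv"

type subFilter struct {
	bits     []bool
	numHash  int
	capacity int
	items    int
}

func newSubFilter(capacity, numHash int) *subFilter {
	// ~10 bits per element corresponds to roughly a 1% error rate.
	return &subFilter{bits: make([]bool, capacity*10), numHash: numHash, capacity: capacity}
}

// positions derives numHash bit positions via double hashing: (h1 + i*h2) mod m.
func (f *subFilter) positions(item string) []int {
	h := fnv.New64a()
	h.Write([]byte(item))
	sum := h.Sum64()
	h1, h2 := sum&0xffffffff, (sum>>32)|1 // force h2 odd so it is never zero
	out := make([]int, f.numHash)
	for i := 0; i < f.numHash; i++ {
		out[i] = int((h1 + uint64(i)*h2) % uint64(len(f.bits)))
	}
	return out
}

func (f *subFilter) add(item string) {
	for _, p := range f.positions(item) {
		f.bits[p] = true
	}
	f.items++
}

func (f *subFilter) mightContain(item string) bool {
	for _, p := range f.positions(item) {
		if !f.bits[p] {
			return false
		}
	}
	return true
}

// ScalableBloom keeps a list of sub-filters; a "Number of filters" field in
// BF.INFO would correspond to len(filters) in a layout like this.
type ScalableBloom struct {
	filters   []*subFilter
	expansion int
}

func NewScalableBloom(capacity, numHash, expansion int) *ScalableBloom {
	return &ScalableBloom{
		filters:   []*subFilter{newSubFilter(capacity, numHash)},
		expansion: expansion,
	}
}

// Add writes to the newest sub-filter, creating a larger one when it fills up.
func (s *ScalableBloom) Add(item string) {
	cur := s.filters[len(s.filters)-1]
	if cur.items >= cur.capacity {
		cur = newSubFilter(cur.capacity*s.expansion, cur.numHash)
		s.filters = append(s.filters, cur)
	}
	cur.add(item)
}

// Exists reports possible membership if any sub-filter matches.
func (s *ScalableBloom) Exists(item string) bool {
	for _, f := range s.filters {
		if f.mightContain(item) {
			return true
		}
	}
	return false
}
```

One consequence of this layout is that Exists has to consult every sub-filter, so each expansion slightly increases the overall false-positive rate unless the per-filter rate is tightened.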

@shashi-sah2003 (Contributor Author)

Hey @JyotinderSingh @apoorvyadav1111, reducing the number of filters will lead to an increase in the false positive rate, which we don't want, right? But I wonder why Redis uses only one filter?

@JyotinderSingh (Collaborator)


We would need to investigate why they went with it. I'm not aware of the reasoning for that.

@shashi-sah2003 (Contributor Author)


@JyotinderSingh @apoorvyadav1111 I have done some more research on my side and found that Redis has a capacity of 100, so only one filter is sufficient for its purposes, while in the DiceDB implementation the capacity is 1024, which requires more filters (to reduce the false positive rate); that number is calculated using the errorRate.

@JyotinderSingh (Collaborator)


In that case we can adjust our filters accordingly

@shashi-sah2003 (Contributor Author)


Then how should I go about this? Adjust the capacity and filters to match Redis?

@JyotinderSingh (Collaborator)


We don't need to match Redis output in this case.

@shashi-sah2003 (Contributor Author)


Yeah, also in the codebase the optimized formula is used to calculate the number of filters, which is derived from the errorRate; for now it has been set to 0.01.
We could also reduce the errorRate to 0.001, which gives a lower false positive rate but requires more space and more filters. What's our aim for now: a lower false positive rate, or less space?
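For rough numbers, back-of-the-envelope with the standard formulas (bits per element = -ln p / (ln 2)^2, hash functions k = -log2 p): p = 0.01 needs about 9.6 bits per element and 7 hash functions, while p = 0.001 needs about 14.4 bits per element and 10 hash functions, i.e. roughly 50% more space for the tighter error rate.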

@JyotinderSingh (Collaborator)


That could be a separate discussion; it would depend on a lot of factors which are unclear for now.
