
4 billion records max? #38

Open
Kleissner opened this issue Jan 17, 2021 · 9 comments

Comments

@Kleissner

I just realized that index.numKeys is a 32-bit uint, and there's MaxKeys = math.MaxUint32 😲

I think it would make sense to change it to 64-bit (is there any reason not to support a 64-bit record count?). I assume this would break existing databases, but it still seems necessary.

At the very least, I would suggest clearly stating this limitation in the readme.

Our use case is to store billions of records. We've already reached 2 billion records with Pogreb, which means we'll hit the current upper limit in a matter of weeks 😢

@akrylysov
Owner

Unfortunately, just changing the constant to math.MaxUint64 won't work. Pogreb uses a 32-bit hash function. Storing more than math.MaxUint32 keys without changing the hash function to a 64-bit version would result in a high rate of hash collisions and poor performance. Changing the hash function to a 64-bit version would require changing the internal bucket structure and would add 4 bytes of disk space overhead for each key in the database. I'll consider changing it in the future.

Even storing a billion keys with a 32-bit hash function is not great. The closer to 4 billion you get, the more hash collisions you'll see.

For now, I would recommend sharding the database - running multiple databases.
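For anyone who needs this today, here is a minimal sketch of what sharding across several Pogreb databases could look like. Only pogreb.Open, Put, and Get come from the library; the shard count, directory layout, and FNV-1a shard selection are illustrative assumptions, not part of Pogreb.

```go
package main

import (
	"fmt"
	"hash/fnv"

	"github.com/akrylysov/pogreb"
)

// shardedDB spreads keys across several independent Pogreb databases so
// that no single database approaches the 4 billion key limit.
type shardedDB struct {
	shards []*pogreb.DB
}

// openSharded opens n Pogreb databases under dir (shard-000, shard-001, ...).
func openSharded(dir string, n int) (*shardedDB, error) {
	s := &shardedDB{shards: make([]*pogreb.DB, n)}
	for i := range s.shards {
		db, err := pogreb.Open(fmt.Sprintf("%s/shard-%03d", dir, i), nil)
		if err != nil {
			return nil, err
		}
		s.shards[i] = db
	}
	return s, nil
}

// shardFor picks a shard by hashing the key. FNV-1a here is independent
// of Pogreb's internal hash, so keys stay evenly distributed.
func (s *shardedDB) shardFor(key []byte) *pogreb.DB {
	h := fnv.New32a()
	h.Write(key)
	return s.shards[h.Sum32()%uint32(len(s.shards))]
}

func (s *shardedDB) Put(key, value []byte) error {
	return s.shardFor(key).Put(key, value)
}

func (s *shardedDB) Get(key []byte) ([]byte, error) {
	return s.shardFor(key).Get(key)
}
```

With 16 shards, for example, each database holds roughly 1/16 of the keys, which also keeps the per-database collision rate of the 32-bit hash low.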

Can you tell me more about how you use Pogreb? What is your typical access pattern? Is it write-heavy? What is your average key and value size?

@Kleissner
Author

Apologies for the delay. The use case is https://intelx.io: we store hashes of all of our records in a key-value database, which helps with some internal caching operations. The plan is to update the key-value store every 24 hours, so it would be "write-heavy once", then read-heavy.

We are still running into the other issues (the strange disk errors coming from NTFS), but I can handle/fix those myself.
If you could upgrade the code to support a 64-bit record count, that would be great; I believe many other people doing this kind of work would hit the 4 billion record limit fairly quickly as well.

For now I have shut down the key-value store, as we are dangerously close to 4 billion records and I'm worried about hash collisions and false-positive lookups.

@akrylysov
Owner

Thanks for the details! While the database will get slower as it approaches 4 billion keys, correctness is not affected, so you don't need to worry about false positives. After a hash lookup, Pogreb compares the key against the data in the WAL, so false positives are impossible.
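In other words, the 32-bit hash only narrows the search to a bucket of candidates; the full key stored on disk is the final arbiter. A toy model of the hash-then-verify idea (illustrative only, not Pogreb's actual structures):

```go
package main

import (
	"bytes"
	"fmt"
	"hash/fnv"
)

// A toy hash-then-verify map: buckets keyed by a 32-bit hash hold full
// key/value records, and Get compares the stored key byte-for-byte.
type record struct{ key, value []byte }

type toyIndex struct {
	buckets map[uint32][]record
}

func hash32(key []byte) uint32 {
	h := fnv.New32a()
	h.Write(key)
	return h.Sum32()
}

func (t *toyIndex) Put(key, value []byte) {
	h := hash32(key)
	t.buckets[h] = append(t.buckets[h], record{key, value})
}

func (t *toyIndex) Get(key []byte) ([]byte, bool) {
	// A colliding hash only means extra comparisons in this loop;
	// the full-key check below makes a false positive impossible.
	for _, r := range t.buckets[hash32(key)] {
		if bytes.Equal(r.key, key) {
			return r.value, true
		}
	}
	return nil, false
}

func main() {
	idx := &toyIndex{buckets: map[uint32][]record{}}
	idx.Put([]byte("a"), []byte("1"))
	v, ok := idx.Get([]byte("a"))
	fmt.Println(string(v), ok) // 1 true
}
```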

@Kleissner
Author

You can close all the issues that I opened. We stopped using Pogreb earlier this year when all those issues appeared.
Unfortunately, the 4 billion record limit is an absolute deal breaker for us (we now get 4+ billion new records per month).

The plan was to keep Pogreb running in parallel and switch over once the issues were solved, but since this hasn't happened, I have decided to switch to a different key-value database.

@derkan

derkan commented Mar 11, 2021

@Kleissner just curious, what are you using now?

@gnewton

gnewton commented May 13, 2021

Yes, the 4 billion record limit is a deal breaker for me as well. I was hoping to use this instead of Bolt, but now I can't. Any chance of changing this? It is a real limitation for people with a large number of items to manage.
BTW, this is very impressive work.

@Kleissner
Author

@derkan we have tried:

  • Postgres: Obviously overkill for just storing key-value pairs.
  • Badger: Buggy, crashes sometimes, corrupts the database. Updates break compatibility. Uses C code.
  • Bolt: No longer actively maintained; suffers from out-of-memory crashes and database corruption.
  • Bitcask: High memory usage (more than the data on disk).
  • Pogreb: Pure Go, but supports no more than 4 billion records.

We fell back to Bitcask, but have half abandoned the internal project altogether, since no suitable key-value database was found. Each new run takes a few weeks to rebuild the key-value database (since we have billions of records) and is therefore resource- and time-intensive.

@fahmifan

@Kleissner have you checked etcd-io/bbolt? It is a fork of Bolt and is still maintained by the etcd team.

@artjoma

artjoma commented Feb 28, 2024

@Kleissner just curious, what are you using now?

Take a look at PebbleDB; Ethereum's Geth uses it as its blockchain storage.
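For reference, a minimal Pebble usage sketch based on its documented Open/Set/Get API (the key and value here are made up for illustration):

```go
package main

import (
	"fmt"
	"log"

	"github.com/cockroachdb/pebble"
)

func main() {
	// Open (or create) a Pebble database in the "demo" directory.
	db, err := pebble.Open("demo", &pebble.Options{})
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Write a key durably (pebble.Sync syncs the WAL to disk).
	if err := db.Set([]byte("hash-of-record"), []byte("cached-value"), pebble.Sync); err != nil {
		log.Fatal(err)
	}

	// Read it back; the returned closer releases the value's memory.
	value, closer, err := db.Get([]byte("hash-of-record"))
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(value))
	closer.Close()
}
```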
