
4 billion records max? #38

Open
Kleissner opened this issue Jan 17, 2021 · 9 comments

@Kleissner

I just realized that index.numKeys is a 32-bit uint, and that MaxKeys = math.MaxUint32.

I think it would make sense to change it to 64-bit (is there any reason not to support a 64-bit number of records?). I assume this would break existing databases, but it still seems necessary.

At the very least, I'd suggest clearly stating this limitation in the readme.

Our use case is storing billions of records. We've already reached 2 billion records with Pogreb, which means we'll hit the current upper limit in a matter of weeks.
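
For anyone watching the same headroom, here is a minimal sketch of a check, assuming the Count method shown in Pogreb's README; the 90% threshold is an arbitrary illustration:

```go
package main

import (
	"fmt"
	"log"
	"math"

	"github.com/akrylysov/pogreb"
)

func main() {
	db, err := pogreb.Open("cache.pogreb", nil)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Count reports the number of stored keys; compare it against the
	// 32-bit cap discussed above to get an early warning before the limit.
	n := db.Count()
	if float64(n) > 0.9*math.MaxUint32 {
		fmt.Printf("warning: %d keys, within 10%% of the ~4.29B limit\n", n)
	}
}
```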

@akrylysov
Owner

Unfortunately, just changing the constant to math.MaxUint64 won't work. Pogreb uses a 32-bit hash function. Storing more than math.MaxUint32 keys without changing the hash function to a 64-bit version would result in a high rate of hash collisions and poor performance. Changing the hash function to a 64-bit version would require changing the internal bucket structure, and it would add 4 bytes of disk space overhead for each key in the database. I'll consider changing it in the future.

Even storing a billion keys with a 32-bit hash function is not great. The closer to 4 billion you get, the more hash collisions you'll see.
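
For a rough sense of scale, the expected number of colliding key pairs for n uniformly hashed keys over a given number of bits follows the standard birthday approximation n^2 / 2^(bits+1); a quick back-of-the-envelope in Go:

```go
package main

import (
	"fmt"
	"math"
)

// expectedCollisions approximates the number of colliding key pairs for
// n uniformly distributed hashes over the given number of bits
// (the birthday-problem estimate n^2 / 2^(bits+1)).
func expectedCollisions(n, bits float64) float64 {
	return n * n / (2 * math.Pow(2, bits))
}

func main() {
	for _, n := range []float64{1e9, 2e9, 4e9} {
		fmt.Printf("%.0e keys: ~%.1e collisions at 32 bits, ~%.1e at 64 bits\n",
			n, expectedCollisions(n, 32), expectedCollisions(n, 64))
	}
}
```

At 4 billion keys a 32-bit hash yields on the order of a billion colliding pairs, while a 64-bit hash would keep the expected count below one.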

For now, I would recommend sharding the database: running multiple Pogreb instances and routing each key to one of them.
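
A minimal sketch of that sharding approach, assuming Pogreb's documented Open/Put/Get API; the shard count, paths, and routing hash are illustrative choices, not part of the library:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"log"
	"os"

	"github.com/akrylysov/pogreb"
)

const numShards = 16 // arbitrary; size so each shard stays well below 4B keys

type shardedDB struct {
	shards [numShards]*pogreb.DB
}

func openSharded(dir string) (*shardedDB, error) {
	if err := os.MkdirAll(dir, 0o755); err != nil {
		return nil, err
	}
	s := &shardedDB{}
	for i := range s.shards {
		db, err := pogreb.Open(fmt.Sprintf("%s/shard-%02d", dir, i), nil)
		if err != nil {
			return nil, err
		}
		s.shards[i] = db
	}
	return s, nil
}

// pick routes a key to a shard; any stable hash works for routing.
func (s *shardedDB) pick(key []byte) *pogreb.DB {
	h := fnv.New32a()
	h.Write(key)
	return s.shards[h.Sum32()%numShards]
}

func (s *shardedDB) Put(key, value []byte) error    { return s.pick(key).Put(key, value) }
func (s *shardedDB) Get(key []byte) ([]byte, error) { return s.pick(key).Get(key) }

func main() {
	s, err := openSharded("data")
	if err != nil {
		log.Fatal(err)
	}
	if err := s.Put([]byte("hello"), []byte("world")); err != nil {
		log.Fatal(err)
	}
	v, err := s.Get([]byte("hello"))
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("%s\n", v)
}
```

Each shard then holds roughly 1/16 of the keys, keeping every instance comfortably below the 4-billion cap.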

Can you tell me more about how you use Pogreb? What is your typical access pattern? Is it write-heavy? What is your average key and value size?

@Kleissner
Author

Apologies for the delay. The use case is https://intelx.io: we store hashes of all of our records in a key-value database, which helps with some internal caching operations. The plan is to update the key-value store every 24 hours, so it would be write-heavy once per day and read-heavy the rest of the time.

We are still running into the other troubles (the weird disk errors coming from NTFS), but those I can handle or fix myself.
If you could upgrade the code to support a 64-bit number of records, that would be great; I believe many other people running these kinds of operations would hit the 4 billion record limit fairly quickly as well.

For now I have shut down the key-value store, as we are dangerously close to 4 billion records and I'm worried about hash collisions and false-positive lookups.

@akrylysov
Owner

Thanks for the details! While the database will get slower as it approaches 4 billion keys, correctness is not affected: after a hash lookup, Pogreb compares the key to the data in the WAL, so false positives are impossible.
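
In other words, the index behaves like the toy structure below (a simplified stand-in, not Pogreb's actual on-disk layout): the hash only narrows the search, and the full key is compared before anything is returned.

```go
package main

import (
	"bytes"
	"fmt"
	"hash/fnv"
)

// A toy index: buckets keyed by a 32-bit hash hold full (key, value)
// entries, so a hash collision only means an extra comparison,
// never a wrong answer.
type entry struct{ key, value []byte }

type toyIndex map[uint32][]entry

func hash32(key []byte) uint32 {
	h := fnv.New32a()
	h.Write(key)
	return h.Sum32()
}

func (idx toyIndex) put(key, value []byte) {
	h := hash32(key)
	idx[h] = append(idx[h], entry{key, value})
}

func (idx toyIndex) get(key []byte) ([]byte, bool) {
	for _, e := range idx[hash32(key)] {
		if bytes.Equal(e.key, key) { // full-key check rules out false positives
			return e.value, true
		}
	}
	return nil, false
}

func main() {
	idx := toyIndex{}
	idx.put([]byte("a"), []byte("1"))
	v, ok := idx.get([]byte("b")) // even if "b" collided with "a", ok is false
	fmt.Println(v, ok)
}
```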

@Kleissner
Author

You can close all the issues I opened. We stopped using Pogreb earlier this year when all those issues appeared.
Unfortunately, the 4 billion limit is an absolute deal breaker for us (we now get 4+ billion new records per month).

The plan was to keep Pogreb running in parallel and switch over once the issues had been solved, but since this hasn't been resolved, I've decided to switch to a different key-value database.

@derkan

derkan commented Mar 11, 2021

@Kleissner just curious, what are you using now?

@gnewton

gnewton commented May 13, 2021

Yes, the 4B record limit is a deal breaker for me too. I was hoping to use this instead of Bolt, but now I can't. Any chance of changing this? It is a real limitation for people with a large number of items to manage.
BTW, this is very impressive work.

@Kleissner
Author

@derkan we have tried:

  • Postgres: Obviously overkill for just storing key-value pairs.
  • Badger: Buggy, crashes sometimes, corrupts the database. Updates break compatibility. Uses C code.
  • Bolt: No longer actively maintained; suffers from out-of-memory crashes and database corruption.
  • Bitcask: High memory usage (more than the on-disk size).
  • Pogreb: Pure Go, but supports no more than 4 billion records.

We fell back to continuing with Bitcask, but have half abandoned the internal project altogether, since no suitable key-value database was found. Each new run takes a few weeks to rebuild the key-value database (we have billions of records) and is therefore resource- and time-intensive.

@fahmifan

@Kleissner have you checked etcd-io/bbolt? It's a fork of Bolt and is still maintained by the etcd team.
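
For reference, a minimal bbolt round trip, following the example in the etcd-io/bbolt README:

```go
package main

import (
	"fmt"
	"log"

	bolt "go.etcd.io/bbolt"
)

func main() {
	db, err := bolt.Open("records.db", 0o600, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// One read-write transaction: create the bucket and store a pair.
	err = db.Update(func(tx *bolt.Tx) error {
		b, err := tx.CreateBucketIfNotExists([]byte("hashes"))
		if err != nil {
			return err
		}
		return b.Put([]byte("key"), []byte("value"))
	})
	if err != nil {
		log.Fatal(err)
	}

	// Read it back in a read-only transaction.
	_ = db.View(func(tx *bolt.Tx) error {
		fmt.Printf("%s\n", tx.Bucket([]byte("hashes")).Get([]byte("key")))
		return nil
	})
}
```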

@artjoma

artjoma commented Feb 28, 2024

@Kleissner just curious, what are you using now?

Take a look at PebbleDB. Ethereum's Geth uses it as its blockchain storage.
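
A minimal Pebble round trip, following the example in the cockroachdb/pebble README:

```go
package main

import (
	"fmt"
	"log"

	"github.com/cockroachdb/pebble"
)

func main() {
	db, err := pebble.Open("demo", &pebble.Options{})
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	key := []byte("hello")
	// Set durably persists the pair; pebble.Sync forces an fsync.
	if err := db.Set(key, []byte("world"), pebble.Sync); err != nil {
		log.Fatal(err)
	}
	value, closer, err := db.Get(key)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("%s %s\n", key, value)
	// The returned closer guards the value's memory; release it when done.
	if err := closer.Close(); err != nil {
		log.Fatal(err)
	}
}
```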
