Discussion: Storage backend 2.0 #90

rkrzr · 2022-12-21T15:59:08Z

Intro

Our current file-based storage backend is starting to reach its scalability limits. During load spikes, when too many requests arrive at the same time, Icepeak now occasionally serves 503s (Service Unavailable: "Dropped Icepeak status update, the queue was full").
Icepeak is working as expected here: whenever its internal queue is full it should serve a 503. However, the goal of a new storage backend would be to increase the throughput that we can achieve so that we can handle a higher request load before having to serve 503s.
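To make the "queue full means 503" behaviour concrete, here is a minimal Python sketch of a bounded queue that rejects work when it is full. It is not Icepeak's actual code: the `enqueue_update` function, the capacity of 2, and the 202 success status are all assumptions chosen for illustration.

```python
import queue

# Hypothetical bounded update queue; the capacity of 2 is for illustration only.
updates = queue.Queue(maxsize=2)

def enqueue_update(update: dict) -> int:
    """Return an HTTP-style status: 202 if queued (assumed), 503 if the queue is full."""
    try:
        # put_nowait never blocks; it raises queue.Full when the queue is at capacity.
        updates.put_nowait(update)
        return 202
    except queue.Full:
        return 503

# Three updates arrive before anything drains the queue.
statuses = [enqueue_update({"n": i}) for i in range(3)]
```

A faster storage backend drains the queue sooner, so the same capacity absorbs larger spikes before the 503 branch is reached.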

There are a few requirements for a new storage backend:

- must be able to store JSON objects
- must allow us to subscribe to and update arbitrary paths in the JSON object
- should give us higher write throughput
- should not make read throughput worse
- should allow us to store larger datasets more efficiently
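As a rough illustration of what "update arbitrary paths in the JSON object" asks of a backend, the following Python sketch reads and writes a nested value by path. The helper names `get_path` and `put_path` are hypothetical and not part of Icepeak; subscriptions are left out.

```python
from typing import Any, Sequence

def get_path(doc: Any, path: Sequence[str]) -> Any:
    """Return the value at `path`, or None if any segment is missing."""
    node = doc
    for key in path:
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node

def put_path(doc: dict, path: Sequence[str], value: Any) -> None:
    """Set `value` at `path`, creating intermediate objects as needed."""
    if not path:
        raise ValueError("path must not be empty")
    node = doc
    for key in path[:-1]:
        child = node.get(key)
        if not isinstance(child, dict):
            # Overwrite non-object values that sit in the way of the path.
            child = {}
            node[key] = child
        node = child
    node[path[-1]] = value

doc = {}
put_path(doc, ["users", "alice", "status"], "online")
```

Any candidate backend has to support these two operations efficiently for both small leaves and large subtrees.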

Additionally, it would be nice if we can keep the current "zero configuration" approach, where you don't have to set up any external services first, and don't have to configure any schemas first, but can simply start using Icepeak right away.

Storage options

There are various existing storage backends that we could use here. First, there are embedded key/value stores like LevelDB or RocksDB. Second, there are embedded databases like SQLite or DuckDB. Third, it would also be possible to extend our current file-based system by splitting it up into multiple files (e.g. sharding on the top-level keys).
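The third option, sharding on the top-level keys, could look roughly like the Python sketch below: one file per top-level key, so that an update rewrites only the affected file. The function names are hypothetical, and the sketch assumes that top-level keys are safe to use as file names.

```python
import json
import tempfile
from pathlib import Path

def write_shards(doc: dict, directory: Path) -> None:
    """Write one JSON file per top-level key (keys assumed to be valid file names)."""
    for key, value in doc.items():
        (directory / f"{key}.json").write_text(json.dumps(value))

def update_shard(directory: Path, key: str, value) -> None:
    """Rewrite only the shard for `key`; the other files stay untouched."""
    (directory / f"{key}.json").write_text(json.dumps(value))

def read_all(directory: Path) -> dict:
    """Reassemble the full document from all shard files."""
    return {p.stem: json.loads(p.read_text()) for p in sorted(directory.glob("*.json"))}

with tempfile.TemporaryDirectory() as tmp:
    shard_dir = Path(tmp)
    write_shards({"users": {"alice": 1}, "rooms": {"lobby": 2}}, shard_dir)
    update_shard(shard_dir, "rooms", {"lobby": 3})
    restored = read_all(shard_dir)
```

This only helps when writes are spread over many top-level keys; a single hot key still rewrites its whole shard.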

Note: I won't consider non-embeddable systems like Redis and Postgres here. The extra complexity of having to set them up and configure them runs counter to Icepeak's current simplicity and zero-config nature. In the unlikely case that we ever need to scale Icepeak further than what we can do with an embedded datastore, we can reconsider.

Each of these options has various pros and cons. I will discuss two of them below.

Option 1: RocksDB

RocksDB is a storage engine with a key/value interface, where keys and values are arbitrary byte streams. It is a C++ library. It was developed at Facebook based on LevelDB and provides backwards-compatible support for LevelDB APIs.

Pros:

- Simple K/V store where keys and values are arbitrary byte streams
- Embeddable: we can easily ship it together with Icepeak and it won't need any configuration
- There are Haskell bindings for it
- Supports transactions
- Supports prefix seek, which would allow us to optimize a get on a JSON path

Cons:

- Encoding a JSON value into a K/V structure requires more implementation work, and it's not obvious what the best way to do it is
- There is a trade-off between storing a larger JSON object as a single value vs. breaking it up into individual k/v pairs for each leaf value. The former gives faster reads when you want to read a whole value, at the cost of always having to write the whole value when you change any part of it. The latter gives fine-grained reads and writes, at the cost of expensive reads and writes when you want to read or write a larger JSON object, since each leaf value has to be read or written individually.
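The "one k/v pair per leaf" encoding and the prefix read it enables can be sketched in Python as follows. A sorted list stands in for the ordered keyspace of a store like RocksDB; the `/` separator, the treatment of arrays as leaves, and the function names are assumptions made for this sketch, not a proposed design.

```python
import bisect
import json

SEP = "/"  # assumed path separator; real keys containing "/" would need escaping

def flatten(value, prefix=""):
    """Yield (path, encoded leaf) pairs; arrays and empty objects count as leaves."""
    if isinstance(value, dict) and value:
        for key, child in value.items():
            yield from flatten(child, f"{prefix}{SEP}{key}")
    else:
        yield prefix, json.dumps(value)

def prefix_scan(sorted_items, prefix):
    """Return all pairs stored beneath `prefix`, emulating a prefix seek."""
    start = prefix + SEP
    # (start,) sorts before every (key, value) pair whose key is >= start.
    i = bisect.bisect_left(sorted_items, (start,))
    found = []
    while i < len(sorted_items) and sorted_items[i][0].startswith(start):
        found.append(sorted_items[i])
        i += 1
    return found

doc = {
    "users": {"alice": {"age": 30, "status": "online"}, "bob": {"age": 25}},
    "version": 2,
}
store = sorted(flatten(doc))
alice = prefix_scan(store, "/users/alice")
```

Here updating Alice's status touches one pair, while reading the whole `users` subtree touches three; with a single-value encoding the costs are reversed.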

One nice thing about using a K/V store is that we could abstract it behind a simple type class for get/put/delete, which could then have several instances, e.g. for LevelDB and RocksDB. This would allow us to easily benchmark the different implementations.
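The abstraction could look roughly like this, sketched in Python with a `Protocol` standing in for the Haskell type class. The `KVStore` interface and the `InMemoryStore` instance are hypothetical; real instances would wrap the LevelDB or RocksDB bindings.

```python
from typing import Dict, Optional, Protocol

class KVStore(Protocol):
    """Minimal get/put/delete interface over byte keys and values."""
    def get(self, key: bytes) -> Optional[bytes]: ...
    def put(self, key: bytes, value: bytes) -> None: ...
    def delete(self, key: bytes) -> None: ...

class InMemoryStore:
    """Dictionary-backed instance, useful as a benchmark baseline."""
    def __init__(self) -> None:
        self._data: Dict[bytes, bytes] = {}

    def get(self, key: bytes) -> Optional[bytes]:
        return self._data.get(key)

    def put(self, key: bytes, value: bytes) -> None:
        self._data[key] = value

    def delete(self, key: bytes) -> None:
        self._data.pop(key, None)

def roundtrip(store: KVStore) -> Optional[bytes]:
    """Backend-agnostic code: works with any object that satisfies KVStore."""
    store.put(b"/users/alice", b'{"age": 30}')
    value = store.get(b"/users/alice")
    store.delete(b"/users/alice")
    return value

memory_store = InMemoryStore()
result = roundtrip(memory_store)
```

A benchmark harness written against `KVStore` could then be run unchanged against each backend.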

Option 2: SQLite

SQLite is a C-language library that implements a small, fast, self-contained, high-reliability, full-featured, SQL database engine.

Pros:

- Embeddable: we can easily ship it together with Icepeak and it won't need any configuration
- Full power of SQL to model our data, including foreign keys to ensure data integrity
- Full power of SQL to query our data, with full transaction support
- Haskell bindings are available
- Supports indexes to speed up queries
- It's easy to inspect the data, e.g. for debugging
- Support for JSON functions and operators
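As a small illustration of the JSON functions, the Python sketch below stores a whole document in one row and reads and updates a nested path with `json_extract` and `json_set`. The `docs` table is an assumption made for this sketch, and it requires a SQLite build that includes the JSON functions (the default in recent versions).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical single-table layout: the whole JSON document in one text column.
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT NOT NULL)")
conn.execute(
    "INSERT INTO docs (id, body) VALUES (1, ?)",
    ('{"users": {"alice": {"status": "offline"}}}',),
)

# Update one nested path in place; SQLite rewrites the stored document.
conn.execute(
    "UPDATE docs SET body = json_set(body, '$.users.alice.status', 'online') WHERE id = 1"
)

# Read the same path back without parsing the document in application code.
status = conn.execute(
    "SELECT json_extract(body, '$.users.alice.status') FROM docs WHERE id = 1"
).fetchone()[0]
```

Note that this layout still writes the full document on every update, so it mainly helps with querying, not with write throughput.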

Cons:

- Using indexes requires some knowledge of the schema. However, since Icepeak supports free-form JSON objects we likely won't be able to add any indexes automatically. This would still be a manual task/optimization that a user could take advantage of themselves.
- We would need to generate a SQL schema that is flexible enough to store any JSON object. We could not take advantage of data-specific domain knowledge (as long as we want to keep the zero-config requirement).
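One possible shape for such a generic schema is a single table with one row per leaf, keyed by path, as in the Python sketch below. The `leaves` table, the `/` separator, and the helper names are assumptions for illustration, not a worked-out design.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
# Schema-agnostic layout: one row per JSON leaf, keyed by its full path.
conn.execute("CREATE TABLE leaves (path TEXT PRIMARY KEY, value TEXT NOT NULL)")

def put_leaf(path: str, value) -> None:
    """Insert or overwrite a single leaf inside its own transaction."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO leaves (path, value) VALUES (?, ?)",
            (path, json.dumps(value)),
        )

def get_subtree(prefix: str) -> dict:
    """Fetch every leaf beneath `prefix` using a range scan on the primary key."""
    # "0" is the character right after "/" in ASCII, so [prefix + "/", prefix + "0")
    # covers exactly the keys that start with prefix + "/".
    rows = conn.execute(
        "SELECT path, value FROM leaves WHERE path >= ? AND path < ? ORDER BY path",
        (prefix + "/", prefix + "0"),
    ).fetchall()
    return {path: json.loads(value) for path, value in rows}

put_leaf("/users/alice/age", 30)
put_leaf("/users/alice/status", "online")
put_leaf("/users/bob/age", 25)
put_leaf("/users/alice/status", "away")  # overwrites the earlier status
alice = get_subtree("/users/alice")
```

The only index here is the primary key on `path`, which needs no knowledge of the user's data and so stays compatible with the zero-config goal.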

Conclusion

In my view, an embedded data store like RocksDB or SQLite makes the most sense for the second iteration of Icepeak's storage backend. My preference would probably be SQLite, since it gives us all of the power of SQL, both for modeling the data and for querying it.

ReinierMaas added the enhancement label Dec 28, 2022
