Discussion: Storage backend 2.0 #90

Open
rkrzr opened this issue Dec 21, 2022 · 0 comments

Intro

Our current file-based storage backend is starting to reach its scalability limits. Icepeak now occasionally serves 503s (Service Unavailable: "Dropped Icepeak status update, the queue was full") when too many requests arrive at the same time during load spikes.
Icepeak is working as expected here: whenever its internal queue is full it should serve a 503. However, the goal of a new storage backend would be to increase the throughput that we can achieve so that we can handle a higher request load before having to serve 503s.
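
To make the behaviour above concrete, here is a minimal sketch of a bounded queue that rejects work once it is full. The capacity, the `enqueue_update` helper and the 202 success status are assumptions for illustration; this is not Icepeak's actual code.

```python
import queue

# Hypothetical capacity; the real limit is whatever Icepeak is configured with.
QUEUE_CAPACITY = 2

updates = queue.Queue(maxsize=QUEUE_CAPACITY)

def enqueue_update(path, value):
    """Try to enqueue an update and return an HTTP-style status code."""
    try:
        updates.put_nowait((path, value))
    except queue.Full:
        return 503  # queue is full: drop the update and tell the client
    return 202  # accepted for later processing (this status is an assumption)

# Nothing drains the queue here, so the third update is rejected.
statuses = [enqueue_update(["a"], i) for i in range(3)]
print(statuses)  # [202, 202, 503]
```

A faster storage backend drains the queue sooner, so the same capacity absorbs a larger spike before the 503 branch is hit.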

There are a few requirements for a new storage backend:

  • must be able to store JSON objects
  • must allow us to subscribe to and update arbitrary paths in the JSON object
  • should give us higher write throughput
  • should not make read throughput worse
  • should allow us to store larger datasets more efficiently

Additionally, it would be nice if we could keep the current "zero configuration" approach, where you don't have to set up any external services or configure any schemas first, but can simply start using Icepeak right away.

Storage options

There are various existing storage backends that we could use here. First, there are embedded key/value stores like LevelDB or RocksDB. Second, there are embedded databases like SQLite or DuckDB. Third, it would also be possible to extend our current file-based system by splitting it up into multiple files (e.g. sharding on the top-level keys).
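
As a rough illustration of the third idea, sharding on top-level keys could look like the sketch below. The file layout (one `<key>.json` file per top-level key, with the key percent-encoded) and all function names are assumptions, not an existing Icepeak format.

```python
import json
import tempfile
from pathlib import Path
from urllib.parse import quote

def shard_path(root, key):
    # Percent-encode the key so that "/" and similar characters are safe in a file name.
    return Path(root) / (quote(key, safe="") + ".json")

def write_sharded(root, document):
    """Write each top-level key of `document` to its own file."""
    for key, value in document.items():
        shard_path(root, key).write_text(json.dumps(value))

def update_top_level(root, key, value):
    """Rewrite only the shard that holds `key`; other shards are untouched."""
    shard_path(root, key).write_text(json.dumps(value))

def read_top_level(root, key):
    return json.loads(shard_path(root, key).read_text())

with tempfile.TemporaryDirectory() as root:
    write_sharded(root, {"users": {"alice": 1}, "jobs": {"j1": "queued"}})
    update_top_level(root, "jobs", {"j1": "done"})
    users = read_top_level(root, "users")
    jobs = read_top_level(root, "jobs")
    shard_files = sorted(p.name for p in Path(root).iterdir())
```

The benefit is that a write only serializes one shard instead of the whole document. The sketch ignores atomic writes and very long keys, both of which a real implementation would have to handle.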


Note: I won't consider non-embeddable systems like Redis and Postgres here. The extra complexity of having to set them up and configure them runs counter to Icepeak's current simplicity and zero-config nature. In the unlikely case that we ever need to scale Icepeak further than what we can do with an embedded datastore, we can reconsider.


Each of these options has various pros and cons. I will discuss two of them below.

Option 1: RocksDB

RocksDB is a storage engine with a key/value interface, where keys and values are arbitrary byte streams. It is a C++ library. It was developed at Facebook based on LevelDB and provides backwards-compatible support for LevelDB APIs.

Pros:

  • Simple K/V store where keys and values are arbitrary byte streams
  • Embeddable: we can easily ship it together with Icepeak and it won't need any configuration
  • There are Haskell bindings for it
  • Supports transactions
  • Supports prefix seek, which would allow us to optimize a JSON path get

Cons:

  • Encoding a JSON value into a K/V structure requires more implementation work and it's not obvious what the best way to do it is
  • There is a trade-off between storing a larger JSON object as a single value vs. breaking it up into individual k/v pairs for each leaf value. The former comes with faster reads when you want to read a whole value (at the cost of always having to write the whole value when you change any part of it), while the latter comes with fine-grained reads and writes, at the cost of expensive reads and writes when you want to read/write a larger JSON object (since each leaf value has to be read or written individually).
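
To illustrate the leaf-per-key encoding and why prefix seek helps, here is a minimal sketch. It uses a plain dict plus a sorted key list as a stand-in for an ordered K/V store. The key format (path components each followed by `/`), the function names and the handling of arrays as single leaves are all assumptions, not a settled design.

```python
import json
from bisect import bisect_left

SEP = "/"  # assumption: object keys never contain this character

def encode_key(path):
    # ("users", "alice") -> "users/alice/"; the trailing separator keeps
    # "users/" from also matching a sibling key such as "usersx/".
    return "".join(name + SEP for name in path)

def flatten(value, path=()):
    """Yield (key, encoded leaf) pairs for every leaf under `value`."""
    if isinstance(value, dict) and value:
        for name, child in value.items():
            yield from flatten(child, path + (name,))
    else:
        # Scalars, arrays and empty objects are stored as one JSON-encoded leaf.
        yield encode_key(path), json.dumps(value)

def put_json(store, path, value):
    """Replace the subtree at `path` with `value`."""
    prefix = encode_key(path)
    for key in [k for k in store if k.startswith(prefix)]:
        del store[key]
    for i in range(len(path)):
        # An ancestor that used to be a leaf becomes an object.
        store.pop(encode_key(path[:i]), None)
    store.update(flatten(value, path))

def get_json(store, path):
    """Rebuild the JSON value at `path` with a single prefix scan."""
    prefix = encode_key(path)
    keys = sorted(store)  # an ordered store keeps its keys sorted already
    result, found = {}, False
    for key in keys[bisect_left(keys, prefix):]:
        if not key.startswith(prefix):
            break  # keys sharing a prefix are contiguous, so we can stop
        found = True
        names = key[len(prefix):].split(SEP)[:-1]
        leaf = json.loads(store[key])
        if not names:
            return leaf  # `path` itself is a leaf
        node = result
        for name in names[:-1]:
            node = node.setdefault(name, {})
        node[names[-1]] = leaf
    if not found:
        raise KeyError(path)
    return result

store = {}
put_json(store, (), {"users": {"alice": {"age": 30}, "bob": {"age": 25}}, "usersx": 1})
put_json(store, ("users", "bob", "age"), 26)  # touches exactly one key
subtree = get_json(store, ("users",))
```

Updating one leaf writes a single key, whereas reading `("users",)` has to visit every leaf below it. That is the trade-off described above.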

One nice thing about using a K/V store would be that we could abstract over it with a simple type class for get/put/delete, which could then have several instances, e.g. for LevelDB and RocksDB. This would allow us to easily benchmark the different implementations.
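
A minimal sketch of that abstraction, written in Python for brevity with an abstract base class playing the role of the type class. The names `KVStore`, `InMemoryStore` and `time_writes` are hypothetical, and the in-memory instance only stands in for real LevelDB or RocksDB bindings.

```python
import time
from abc import ABC, abstractmethod
from typing import Optional

class KVStore(ABC):
    """The get/put/delete interface that every backend would implement."""

    @abstractmethod
    def get(self, key: bytes) -> Optional[bytes]: ...

    @abstractmethod
    def put(self, key: bytes, value: bytes) -> None: ...

    @abstractmethod
    def delete(self, key: bytes) -> None: ...

class InMemoryStore(KVStore):
    """Placeholder instance; a RocksDB-backed one would have the same shape."""

    def __init__(self) -> None:
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def put(self, key, value):
        self._data[key] = value

    def delete(self, key):
        self._data.pop(key, None)

def time_writes(store: KVStore, count: int) -> float:
    """Time `count` small writes against any backend, in seconds."""
    start = time.perf_counter()
    for i in range(count):
        store.put(str(i).encode(), b"x")
    return time.perf_counter() - start

store = InMemoryStore()
elapsed = time_writes(store, 1000)
```

Because the benchmark only depends on the interface, comparing backends would amount to passing a different instance to `time_writes`.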

Option 2: SQLite

SQLite is a C-language library that implements a small, fast, self-contained, high-reliability, full-featured, SQL database engine.

Pros:

  • Embeddable: we can easily ship it together with Icepeak and it won't need any configuration
  • Full power of SQL to model our data, including foreign keys to ensure data integrity
  • Full power of SQL to query our data, with full transaction support
  • Haskell bindings are available
  • Supports indexes to speed up queries
  • It's easy to inspect the data for e.g. debugging
  • Support for JSON functions and operators

Cons:

  • Using indexes requires some knowledge of the schema. However, since Icepeak supports free-form JSON objects we likely won't be able to add any indexes automatically. This would still be a manual task/optimization that a user could take advantage of themselves.
  • We would need to generate a SQL schema that is flexible enough to store any JSON object. We could not take advantage of data-specific domain knowledge (as long as we want to keep the zero-config requirement).
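
As one possible shape for such a generic schema, the sketch below stores one row per JSON leaf, keyed by its path, and replaces a subtree inside a single transaction. The table name `leaf`, the path format and the helper names are assumptions for illustration. The `substr` filter is the simplest correct prefix match, not the fastest one.

```python
import json
import sqlite3

SEP = "/"  # assumption: object keys never contain this character

def encode_path(path):
    # ("users", "bob") -> "users/bob/"
    return "".join(name + SEP for name in path)

def leaves(value, path=()):
    """Yield (path, JSON-encoded value) rows for every leaf under `value`."""
    if isinstance(value, dict) and value:
        for name, child in value.items():
            yield from leaves(child, path + (name,))
    else:
        yield encode_path(path), json.dumps(value)

def open_store():
    conn = sqlite3.connect(":memory:")
    # One schema for any JSON object: no knowledge of the data is required.
    conn.execute("CREATE TABLE leaf (path TEXT PRIMARY KEY, value TEXT NOT NULL)")
    return conn

def put_json(conn, path, value):
    """Replace the subtree at `path` atomically."""
    prefix = encode_path(path)
    ancestors = [(encode_path(path[:i]),) for i in range(len(path))]
    with conn:  # one transaction: commit on success, roll back on error
        conn.execute("DELETE FROM leaf WHERE substr(path, 1, ?) = ?",
                     (len(prefix), prefix))
        # An ancestor that used to be a leaf becomes an object.
        conn.executemany("DELETE FROM leaf WHERE path = ?", ancestors)
        conn.executemany("INSERT INTO leaf (path, value) VALUES (?, ?)",
                         leaves(value, path))

def get_leaves(conn, path):
    """Return {leaf path: decoded value} for everything under `path`."""
    prefix = encode_path(path)
    rows = conn.execute(
        "SELECT path, value FROM leaf WHERE substr(path, 1, ?) = ? ORDER BY path",
        (len(prefix), prefix))
    return {p: json.loads(v) for p, v in rows}

conn = open_store()
put_json(conn, (), {"users": {"alice": {"age": 30}, "bob": {"age": 25}}, "usersx": 1})
put_json(conn, ("users", "bob"), {"age": 26, "admin": True})
under_users = get_leaves(conn, ("users",))
```

The primary key on `path` is the only index this schema can provide without domain knowledge. Anything more specific would remain a manual optimization, as noted above.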

Conclusion

In my view an embedded data store like RocksDB or SQLite makes the most sense for the second iteration of Icepeak's storage backend. My preference would probably be for SQLite since it gives us all of the power of SQL, both for modeling the data, and for querying it.
