
chore(vrl): remote store/enrichment table RFC #20495

Open · wants to merge 2 commits into master

Conversation

lsampras

chore(vrl): remote store/enrichment table RFC

Readable

External stores support for enrichment discussed here: #17195

@lsampras
Author

@jszwedko Can you take a look at this?

@jszwedko
Member

Thanks @lsampras ! I was OOO last week, but we'll give this a review this week.

@StephenWakely
Contributor

Thank you for writing this. It is definitely something that could be quite useful.

The biggest showstopper to getting this implemented is the lack of async within VRL. Any IO operations would halt processing on that thread, which would be terrible for performance. One of the guiding principles behind Vector is that there should be no performance footguns.

Existing enrichment tables work by loading the data into memory at startup. This would not be doable for these proposals given the need to also write to the database.

I think there are several possible ways to move forward with this.

  1. VRL is updated to allow functions to be async. This would be a fairly large undertaking, and in the past there has been little appetite for it. But in light of this proposal it may be worth revisiting if there is enough demand for it.
  2. The initial data could be loaded into memory. On updates, a writer thread is signaled; it updates the disk storage in the background, updates the memory store, and rebuilds the indexes, and once all the updates are complete the old store is swapped for the new one. This would work with two caveats: writes would not be visible straight away due to the time taken to build the indexes, and a data store could not be shared by multiple Vector instances, since there would be no way to signal to the other instances that they need to re-read from the store. This could potentially be made to work, but it's clunky (a rough sketch of this pattern follows the list).
  3. Something else that I haven't thought of. Very open to other suggestions!
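
A minimal sketch of what option 2's writer-thread-and-swap pattern could look like, using only the standard library; the types and names here are illustrative, not Vector internals:

```rust
use std::collections::HashMap;
use std::sync::{mpsc, Arc, RwLock};
use std::thread;
use std::time::Duration;

// Hypothetical shared store: readers always see a complete snapshot, while a
// background writer rebuilds the table off the hot path and swaps it in.
type Table = HashMap<String, String>;

fn spawn_writer(shared: Arc<RwLock<Arc<Table>>>) -> mpsc::Sender<(String, String)> {
    let (tx, rx) = mpsc::channel::<(String, String)>();
    thread::spawn(move || {
        for (key, value) in rx {
            // 1. Persist the update to disk / the backing store here (omitted).
            // 2. Rebuild the in-memory table and any indexes in the background.
            let mut next: Table = shared.read().unwrap().as_ref().clone();
            next.insert(key, value);
            // 3. Swap the finished table in; readers pick it up on their next lookup.
            *shared.write().unwrap() = Arc::new(next);
        }
    });
    tx
}

fn main() {
    let shared = Arc::new(RwLock::new(Arc::new(Table::new())));
    let writes = spawn_writer(Arc::clone(&shared));
    writes.send(("user_1".into(), "premium".into())).unwrap();
    // Reads never block on the write path, but may briefly see stale data.
    thread::sleep(Duration::from_millis(50));
    let snapshot = shared.read().unwrap().clone();
    println!("{:?}", snapshot.get("user_1"));
}
```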

@lukesteensen
Member

The main problem with any kind of async VRL function is unpredictable and likely extremely poor performance. Normal programming languages can just say that this is the responsibility of the user to manage, but the purpose of VRL is to be reliably fast, deterministic, etc. Strong constraints on what you're able to do are how we achieve that (see the lack of even things like basic loops).

I would actually suggest a different pattern for this kind of read/write use case. Instead of a single instance of VRL that is trying to do both, I would split the reads and writes into different pipelines. The write path, rather than needing to write to an enrichment table, would do the appropriate transformations and then write to the datastore via a normal sink. The read path could then use an enrichment table to read from that same datastore. This would require support for enrichment tables backed by remote datastores, but that's a pretty natural extension of the existing feature. One important caveat is that we'd still require there to be some level of local caching of queries to the remote datastore, because we do not want to block event processing in VRL on network calls.
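
To illustrate the caching caveat, here is a minimal sketch of a read-through cache in front of a remote datastore; `fetch_remote` is a hypothetical stand-in for whatever client a remote-backed enrichment table would actually use:

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

// Hypothetical local cache in front of a remote datastore.
struct CachedTable {
    entries: HashMap<String, (String, Instant)>,
    ttl: Duration,
}

impl CachedTable {
    fn new(ttl: Duration) -> Self {
        Self { entries: HashMap::new(), ttl }
    }

    fn lookup(&mut self, key: &str) -> String {
        if let Some((value, fetched)) = self.entries.get(key) {
            // Serve from cache while the entry is fresh; no network call on the hot path.
            if fetched.elapsed() < self.ttl {
                return value.clone();
            }
        }
        // Cache miss or stale entry: the only point that touches the network,
        // and exactly the work that must stay off the per-event path in VRL.
        let value = fetch_remote(key);
        self.entries.insert(key.to_string(), (value.clone(), Instant::now()));
        value
    }
}

// Placeholder for a real datastore client; purely illustrative.
fn fetch_remote(key: &str) -> String {
    format!("value-for-{key}")
}

fn main() {
    let mut table = CachedTable::new(Duration::from_secs(60));
    // The first lookup hits the "remote" store; the second is served from cache.
    println!("{}", table.lookup("user_1"));
    println!("{}", table.lookup("user_1"));
}
```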

If you need data that is perfectly up to date and not subject to stale caches like above, the other option would be to keep everything local to the Vector instance. We could do this by introducing something very similar to the reduce transform that exposes its state as an enrichment table. This would allow updating the state, but would not need to be async or blocking in any way because it would only work on local data. If persistence is required, it could support periodic checkpointing to disk or object storage, or anything else, as long as it is not in the hot path of data processing.
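
A rough sketch of that kind of local-state transform, with purely illustrative names (none of these are existing Vector APIs): state is updated per event in memory, read synchronously like an enrichment table, and checkpointed on a timer outside the hot path.

```rust
use std::collections::HashMap;
use std::fs;
use std::io::Write;

// Hypothetical reduce-style transform state.
struct LocalState {
    table: HashMap<String, u64>,
}

impl LocalState {
    fn new() -> Self {
        Self { table: HashMap::new() }
    }

    // Called per event; no IO, so it never blocks event processing.
    fn update(&mut self, key: &str) {
        *self.table.entry(key.to_string()).or_insert(0) += 1;
    }

    // Enrichment-style lookup against local data only.
    fn lookup(&self, key: &str) -> Option<u64> {
        self.table.get(key).copied()
    }

    // Periodic checkpoint to disk, driven by a timer rather than per event.
    fn checkpoint(&self, path: &str) -> std::io::Result<()> {
        let mut file = fs::File::create(path)?;
        for (key, count) in &self.table {
            writeln!(file, "{key}\t{count}")?;
        }
        Ok(())
    }
}

fn main() -> std::io::Result<()> {
    let mut state = LocalState::new();
    state.update("user_1");
    state.update("user_1");
    println!("{:?}", state.lookup("user_1"));
    state.checkpoint("/tmp/local_state.tsv")
}
```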

Finally, the escape hatch option is always Lua. You can do pretty much anything you want there and simply deal with the performance consequences as they come up.

@lsampras
Author

@lukesteensen @StephenWakely thanks for your reviews.
I gathered the following points from the discussion above:

  1. Currently, VRL is meant for purely compute-heavy transforms; adding IO tasks would cause performance problems and additional footguns.
  2. This feature deviates significantly from the VRL & enrichment table principles and can cause poor performance.
  3. There isn't any way to avoid the async nature here.
    1. Even if we add caching, the data would need to be fetched the first time.
    2. I believe users should be given info about whether saving to an external data store succeeded (this could even be incorporated in the end-to-end acknowledgements).

For now, I can think of two ways to solve this:

  1. Add additional transforms with the sole responsibility of getting/setting data in external stores (a rough sketch follows this list).
    1. These transforms could fetch data and populate it in the event, or write data from an event key to an external store.
    2. This would clearly segregate the IO-heavy transforms that can become a bottleneck and make any problems caused by network/IO tasks easy to spot.
  2. Update VRL to allow async functions (maybe opt-in).
    1. VRL already supports concurrent execution; adding async support would allow it to sustain high throughput even with IO calls.
    2. VRL would also provide better ergonomics for error handling and for chaining IO calls.
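
A purely illustrative sketch of option 1, showing what dedicated get/set transforms might look like; `Event`, `fetch_remote`, and `write_remote` are hypothetical stand-ins, not existing Vector components:

```rust
use std::collections::HashMap;

// Hypothetical event shape; a real transform would use Vector's event types.
type Event = HashMap<String, String>;

// Get path: fetch a value from the external store and attach it to the event.
fn get_transform(mut event: Event) -> Event {
    if let Some(user_id) = event.get("user_id").cloned() {
        event.insert("plan".to_string(), fetch_remote(&user_id));
    }
    event
}

// Set path: persist a field from the event to the external store. Success or
// failure here is where end-to-end acknowledgements could hook in.
fn set_transform(event: &Event) {
    if let (Some(key), Some(value)) = (event.get("user_id"), event.get("plan")) {
        write_remote(key, value);
    }
}

// Placeholder store client calls; purely illustrative.
fn fetch_remote(_key: &str) -> String {
    "premium".to_string()
}

fn write_remote(_key: &str, _value: &str) {}

fn main() {
    let mut event = Event::new();
    event.insert("user_id".to_string(), "user_1".to_string());
    let enriched = get_transform(event);
    set_transform(&enriched);
    println!("{:?}", enriched.get("plan"));
}
```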

Would we be fine with either of these approaches? While Lua serves as an escape hatch, having some sort of native support would add more safety and better performance here.
