
Multiple vector stores and concurrency #76

Open · atljoseph opened this issue May 15, 2024 · 7 comments
@atljoseph commented May 15, 2024

Curious about the expected performance under concurrent loads and/or with multiple persistent (long-term) and in-memory (short-term) vector stores being queried simultaneously. Not that I have something to throw at it that would test its limits; am just curious. Trying to make an informed decision for maybe a 10+ year runway on upgrades. Other folks might be curious too. This project is vastly underrated in my opinion.

@philippgille (Owner)

Hi, thanks for the question!

expected performance under concurrent loads and/or with multiple persistent (long term) and in-memory (short term) vector stores being simultaneously queried

For in-memory:

The chromem-go DB contains multiple "collections", and querying takes place per collection. So when concurrent queries target separate collections, they don't affect each other. A write can be in progress on one collection while you still query the other.
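
A minimal, simplified sketch of that layout (collection names and documents are illustrative; NewEmbeddingFuncDefault assumes OPENAI_API_KEY is set):

```go
package main

import (
	"context"
	"fmt"

	"github.com/philippgille/chromem-go"
)

func main() {
	ctx := context.Background()
	db := chromem.NewDB()
	embed := chromem.NewEmbeddingFuncDefault() // assumes OPENAI_API_KEY is set

	// Two independent collections: a write to one doesn't block queries on the other.
	longTerm, err := db.CreateCollection("long-term", nil, embed)
	if err != nil {
		panic(err)
	}
	shortTerm, err := db.CreateCollection("short-term", nil, embed)
	if err != nil {
		panic(err)
	}
	_ = shortTerm // e.g. fed from the current conversation

	err = longTerm.AddDocuments(ctx, []chromem.Document{
		{ID: "1", Content: "The sky is blue because of Rayleigh scattering."},
	}, 1)
	if err != nil {
		panic(err)
	}

	res, err := longTerm.Query(ctx, "Why is the sky blue?", 1, nil, nil)
	if err != nil {
		panic(err)
	}
	fmt.Println(res[0].Content)
}
```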

If the concurrent queries target the same collection, it should also be fine as long as no write happens in between, because a read lock is used to prevent data races. Multiple concurrent reads can all access the in-memory data structure at the same time. But if a write operation asks for the lock, ongoing reads are allowed to finish and then the write gets the exclusive lock. A query issued at that point has to wait until the write is finished, so there can be a delay.
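
For illustration only, here's a toy model of that locking behavior with a sync.RWMutex; this is not chromem-go's actual code, just the semantics described above:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var mu sync.RWMutex
	var wg sync.WaitGroup

	// Several concurrent "queries" share the read lock and run in parallel.
	for i := 0; i < 3; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			mu.RLock()
			defer mu.RUnlock()
			time.Sleep(10 * time.Millisecond) // simulated query work
			fmt.Println("query", i, "done")
		}(i)
	}

	// A "write" asks for the exclusive lock: it waits for the ongoing reads,
	// and queries arriving after it wait until the write has finished.
	wg.Add(1)
	go func() {
		defer wg.Done()
		mu.Lock()
		defer mu.Unlock()
		fmt.Println("write done")
	}()

	wg.Wait()
}
```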

Querying currently uses a number of goroutines matching the number of CPU threads. As goroutines are "cheap" / "green threads", running concurrent queries (and thus more goroutines than you have CPU threads) doesn't add much overhead, but the CPU is still shared, so you might see a performance decrease. I haven't measured this yet, but could do it if it's a blocking question for you before adopting chromem-go.
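
Issuing concurrent queries from the caller side is just plain goroutines; here's a rough sketch using golang.org/x/sync/errgroup (collection and query texts are placeholders):

```go
import (
	"context"

	"github.com/philippgille/chromem-go"
	"golang.org/x/sync/errgroup"
)

// queryAll runs several queries against the same collection concurrently.
// Reads only take a shared lock, so they don't block each other.
func queryAll(ctx context.Context, c *chromem.Collection, queries []string) ([][]chromem.Result, error) {
	results := make([][]chromem.Result, len(queries))
	g, ctx := errgroup.WithContext(ctx)
	for i, q := range queries {
		i, q := i, q // capture loop variables (needed before Go 1.22)
		g.Go(func() error {
			res, err := c.Query(ctx, q, 3, nil, nil)
			results[i] = res
			return err
		})
	}
	return results, g.Wait()
}
```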

So far I did one round of performance improvements in v0.5.0 for what I think is the most common use case (non-concurrent querying without metadata or document filters). Based on users' needs I can do more optimizations, for that or other use cases.

For persistence:

Currently the persistence is a fairly naive implementation, with one file per document, while the data is also kept in memory. This means there is no performance penalty for querying, because the data isn't read from disk; only writes go to disk. Persistent data is only read on DB initialization (a chromem.NewPersistentDB() or DB.Import() call).
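
A simplified sketch of the persistent mode (check the repo for the exact signatures): documents are written to disk as they're added, while queries run against the in-memory copy.

```go
package main

import (
	"context"
	"fmt"

	"github.com/philippgille/chromem-go"
)

func main() {
	ctx := context.Background()

	// Existing data under ./db is read once here, at initialization.
	db, err := chromem.NewPersistentDB("./db", false)
	if err != nil {
		panic(err)
	}

	c, err := db.GetOrCreateCollection("knowledge-base", nil, chromem.NewEmbeddingFuncDefault())
	if err != nil {
		panic(err)
	}

	// Writes go to disk (currently one file per document); the data also stays in memory.
	err = c.AddDocuments(ctx, []chromem.Document{
		{ID: "1", Content: "chromem-go keeps all data in memory and persists writes to disk."},
	}, 1)
	if err != nil {
		panic(err)
	}

	// Queries never touch the disk.
	res, err := c.Query(ctx, "How does persistence work?", 1, nil, nil)
	if err != nil {
		panic(err)
	}
	fmt.Println(res[0].Content)
}
```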

I have plans for several enhancements around persistence. One is to use a write-ahead or append-only log, with one file per DB or collection instead of one per document; this will make concurrent writes more performant. Another is to offload document contents (not embeddings) to files instead of keeping them in memory. That will hugely reduce memory usage for the entire DB, and querying should stay almost as fast as before as long as no document filtering is used, because a disk read is only required for the final n documents. These would be optional features that users can opt in to.

For document filtering:

When you include filters on document content it slows down the query a lot, because I haven't done any optimization for that use case yet. There is probably some low-hanging fruit to improve its performance. The next level of performance for document content filtering could be achieved with something like roaring bitmaps, but I'd only look into that when users start voicing a need for it or I need it myself.
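
For reference, a content filter is passed as the whereDocument argument of Query; the filter values and collection here are just illustrative:

```go
import (
	"context"

	"github.com/philippgille/chromem-go"
)

// queryWithFilter sketches both filter kinds: the second map filters on
// metadata, the third on document content (the currently unoptimized path).
func queryWithFilter(ctx context.Context, c *chromem.Collection) ([]chromem.Result, error) {
	return c.Query(ctx, "how do vector stores work?", 5,
		map[string]string{"category": "docs"},          // where: metadata filter
		map[string]string{"$contains": "vector store"}, // whereDocument: content filter (slow)
	)
}
```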

maybe a 10+ year runway on upgrades

So far this is a side project and I maintain it while working a regular job. Thanks to the library being dependency-free there's less need for regular updates (library version bumps for security fixes or general improvements), so I'm confident I can maintain it for a while, but without any sponsorship I won't make promises. There's always the option to make someone from the community a maintainer, or to move the project to a GitHub organization with members from the community.

This project is vastly underrated in my opinion.

Thanks for the kind words! 🙂

I hope I was able to answer most of your questions. Otherwise feel free to ask in more detail, or to ask follow-up questions.

@philippgille (Owner)

Ah, P.S.: If we can land #48, it should lead to another performance improvement for queries.

@philippgille (Owner) commented May 17, 2024

P.P.S.: Another feature on the roadmap for persistence is to not be file-based at all, but allow the user to pass an implementation of any kind of key-value store, for example with https://github.com/philippgille/gokv, and store documents there. But just like with the write-ahead or append-only log this would only affect writes, not reads.
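
To be clear, that integration doesn't exist yet. The idea is that document writes could target gokv's small Store interface (sketched from memory below; see the gokv repo for the authoritative definition), which all of its backends implement:

```go
// The gokv.Store interface (from memory; see github.com/philippgille/gokv for
// the authoritative definition). Values are (de)serialized by the backend.
type Store interface {
	Set(k string, v interface{}) error
	Get(k string, v interface{}) (found bool, err error)
	Delete(k string) error
	Close() error
}

// Hypothetical: how a document write could be routed through such a store
// instead of writing one file per document. Purely illustrative.
func persistDocument(s Store, collection, id string, doc interface{}) error {
	return s.Set(collection+"/"+id, doc)
}
```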

@atljoseph (Author) commented May 18, 2024 via email

@atljoseph (Author) commented May 18, 2024 via email

@philippgille (Owner)

Am looking to make a knowledge base creation and curation app, with ability to query the data directly as well as have a chat with it…. All in golang. Am planning to go very deep with it. Happy to meetup and discuss if you’d like.

I've thought about creating something similar, without curation, but with pluggable data sources. Some apps exist in the space, like Danswer (in Python). Frameworks like LlamaIndex and Haystack (also both Python) show that this is a popular use case for vector stores.

That means I probably need to keep related source docs of different source mediums in separate collections with similar names to be able to query all the web-crawled material from a particular knowledge base as opposed to the GitHub code.

Do you mean the code example in the GitHub repo that uses "knowledge-base" as the collection name? Then yes, that's just a simplified example, and I would suggest using one collection per data source, as sketched below.
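
A sketch of that layout, with a shared name prefix per knowledge base so the related collections stay easy to find; the source names and the merging step are illustrative:

```go
import (
	"context"

	"github.com/philippgille/chromem-go"
)

// queryKnowledgeBase queries one collection per data source ("<kb>-web",
// "<kb>-github", ...) and merges the results; rank or deduplicate as needed.
func queryKnowledgeBase(ctx context.Context, db *chromem.DB, kb, query string, embed chromem.EmbeddingFunc) ([]chromem.Result, error) {
	var all []chromem.Result
	for _, source := range []string{"web", "github"} {
		c, err := db.GetOrCreateCollection(kb+"-"+source, map[string]string{"source": source}, embed)
		if err != nil {
			return nil, err
		}
		res, err := c.Query(ctx, query, 5, nil, nil)
		if err != nil {
			return nil, err
		}
		all = append(all, res...)
	}
	return all, nil
}
```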

@iwilltry42 (Contributor)

We're building something in the "knowledge base" realm as well over here: gptscript-ai/knowledge.
I gave a quick demo on it as well if you're curious: https://www.youtube.com/watch?v=4kdESyzw4kY
It's still pretty rough around the edges and I'm currently working on customizable ingestion/retrieval chains/flows that can be associated with knowledge bases.
