
Multiple vector stores and concurrency #76

Open · atljoseph opened this issue May 15, 2024 · 7 comments
@atljoseph commented May 15, 2024

Curious about the expected performance under concurrent loads and/or with multiple persistent (long-term) and in-memory (short-term) vector stores being queried simultaneously. Not that I have something to throw at it that would test its limits; am just curious. Trying to make an informed decision for maybe a 10+ year runway on upgrades. Other folks might be curious too. This project is vastly underrated in my opinion.

@philippgille (Owner)

Hi, thanks for the question!

expected performance under concurrent loads and/or with multiple persistent (long term) and in-memory (short term) vector stores being simultaneously queried

For in-memory:

The chromem-go DB contains multiple "collections", and querying takes place per collection. So when concurrent queries target separate collections, they don't affect each other. A write can be in progress on one collection while you still query the other.
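
A minimal, simplified sketch of that layout (collection names and documents are illustrative; NewEmbeddingFuncDefault assumes OPENAI_API_KEY is set):

```go
package main

import (
	"context"
	"fmt"

	"github.com/philippgille/chromem-go"
)

func main() {
	ctx := context.Background()
	db := chromem.NewDB()
	embed := chromem.NewEmbeddingFuncDefault() // assumes OPENAI_API_KEY is set

	// Two independent collections: a write to one doesn't block queries on the other.
	longTerm, err := db.CreateCollection("long-term", nil, embed)
	if err != nil {
		panic(err)
	}
	shortTerm, err := db.CreateCollection("short-term", nil, embed)
	if err != nil {
		panic(err)
	}
	_ = shortTerm // e.g. fed from the current conversation

	err = longTerm.AddDocuments(ctx, []chromem.Document{
		{ID: "1", Content: "The sky is blue because of Rayleigh scattering."},
	}, 1)
	if err != nil {
		panic(err)
	}

	res, err := longTerm.Query(ctx, "Why is the sky blue?", 1, nil, nil)
	if err != nil {
		panic(err)
	}
	fmt.Println(res[0].Content)
}
```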

If the concurrent queries target the same collection, it should also be fine as long as no write happens in between, because a read lock is used to prevent data races. Multiple concurrent reads can all access the in-memory data structure at the same time. But if a write operation asks for the lock, ongoing reads are allowed to finish and then the write gets the exclusive lock. A query issued at that point has to wait until the write is finished, so there can be a delay.
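
For illustration only, here's a toy model of that locking behavior with a sync.RWMutex; this is not chromem-go's actual code, just the semantics described above:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var mu sync.RWMutex
	var wg sync.WaitGroup

	// Several concurrent "queries" share the read lock and run in parallel.
	for i := 0; i < 3; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			mu.RLock()
			defer mu.RUnlock()
			time.Sleep(10 * time.Millisecond) // simulated query work
			fmt.Println("query", i, "done")
		}(i)
	}

	// A "write" asks for the exclusive lock: it waits for the ongoing reads,
	// and queries arriving after it wait until the write has finished.
	wg.Add(1)
	go func() {
		defer wg.Done()
		mu.Lock()
		defer mu.Unlock()
		fmt.Println("write done")
	}()

	wg.Wait()
}
```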

Querying currently uses a number of goroutines matching the number of CPU threads. As goroutines are "cheap" / "green threads", running concurrent queries (and thus more goroutines than you have CPU threads) doesn't add much overhead, but the CPU is still shared, so you might see a performance decrease. I haven't measured this yet, but could do it if it's a blocking question for you before adopting chromem-go.
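
Issuing concurrent queries from the caller side is just plain goroutines; here's a rough sketch using golang.org/x/sync/errgroup (collection and query texts are placeholders):

```go
import (
	"context"

	"github.com/philippgille/chromem-go"
	"golang.org/x/sync/errgroup"
)

// queryAll runs several queries against the same collection concurrently.
// Reads only take a shared lock, so they don't block each other.
func queryAll(ctx context.Context, c *chromem.Collection, queries []string) ([][]chromem.Result, error) {
	results := make([][]chromem.Result, len(queries))
	g, ctx := errgroup.WithContext(ctx)
	for i, q := range queries {
		i, q := i, q // capture loop variables (needed before Go 1.22)
		g.Go(func() error {
			res, err := c.Query(ctx, q, 3, nil, nil)
			results[i] = res
			return err
		})
	}
	return results, g.Wait()
}
```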

So far I did one round of performance improvements in v0.5.0 for what I think is the most common use case (non-concurrent querying without metadata or document filters). Based on users' needs I can do more optimizations, for that or other use cases.

For persistence:

Currently the persistence is a fairly naive implementation, with one file per document, while the data is also kept in memory. This means there is no performance penalty for querying, because the data isn't read from disk; only writes go to disk. Persistent data is only read on DB initialization (a chromem.NewPersistentDB() or DB.Import() call).
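
A simplified sketch of the persistent mode (check the repo for the exact signatures): documents are written to disk as they're added, while queries run against the in-memory copy.

```go
package main

import (
	"context"
	"fmt"

	"github.com/philippgille/chromem-go"
)

func main() {
	ctx := context.Background()

	// Existing data under ./db is read once here, at initialization.
	db, err := chromem.NewPersistentDB("./db", false)
	if err != nil {
		panic(err)
	}

	c, err := db.GetOrCreateCollection("knowledge-base", nil, chromem.NewEmbeddingFuncDefault())
	if err != nil {
		panic(err)
	}

	// Writes go to disk (currently one file per document); the data also stays in memory.
	err = c.AddDocuments(ctx, []chromem.Document{
		{ID: "1", Content: "chromem-go keeps all data in memory and persists writes to disk."},
	}, 1)
	if err != nil {
		panic(err)
	}

	// Queries never touch the disk.
	res, err := c.Query(ctx, "How does persistence work?", 1, nil, nil)
	if err != nil {
		panic(err)
	}
	fmt.Println(res[0].Content)
}
```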

I have plans for several enhancements around persistence. One is to use a write-ahead or append-only log, with one file per DB or collection instead of one per document; this will make concurrent writes more performant. Another is to offload document contents (not embeddings) to files instead of keeping them in memory. That will hugely reduce memory usage for the entire DB, and querying should stay almost as fast as before as long as no document filtering is used, because a disk read is only required for the final n documents. These would be optional features that users can opt in to.

For document filtering:

When you include filters on document content it slows down the query a lot, because I haven't done any optimization for that use case yet. There is probably some low-hanging fruit to improve its performance. The next level of performance for document content filtering could be achieved with something like roaring bitmaps, but I'd only look into that when users start voicing a need for it or I need it myself.
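
For reference, a content filter is passed as the whereDocument argument of Query; the filter values and collection here are just illustrative:

```go
import (
	"context"

	"github.com/philippgille/chromem-go"
)

// queryWithFilter sketches both filter kinds: the second map filters on
// metadata, the third on document content (the currently unoptimized path).
func queryWithFilter(ctx context.Context, c *chromem.Collection) ([]chromem.Result, error) {
	return c.Query(ctx, "how do vector stores work?", 5,
		map[string]string{"category": "docs"},          // where: metadata filter
		map[string]string{"$contains": "vector store"}, // whereDocument: content filter (slow)
	)
}
```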

maybe a 10+ year runway on upgrades

So far this is a side project and I maintain it while working a regular job. Thanks to the library being dependency-free there's less need for regular updates (library version bumps for security fixes or general improvements), so I'm confident I can maintain it for a while, but without any sponsorship I won't make promises. There's always the option to make someone from the community a maintainer, or to move the project to a GitHub organization with members from the community.

This project is vastly underrated in my opinion.

Thanks for the kind words! 🙂

I hope I was able to answer most of your questions. Otherwise feel free to ask in more detail, or to ask follow-up questions.

@philippgille (Owner)

Ah, P.S.: If we can land #48, it should lead to another performance improvement for queries.

@philippgille (Owner) commented May 17, 2024

P.P.S.: Another feature on the roadmap for persistence is to not be file-based at all, but allow the user to pass an implementation of any kind of key-value store, for example with https://github.com/philippgille/gokv, and store documents there. But just like with the write-ahead or append-only log this would only affect writes, not reads.
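
To be clear, that integration doesn't exist yet. The idea is that document writes could target gokv's small Store interface (sketched from memory below; see the gokv repo for the authoritative definition), which all of its backends implement:

```go
// The gokv.Store interface (from memory; see github.com/philippgille/gokv for
// the authoritative definition). Values are (de)serialized by the backend.
type Store interface {
	Set(k string, v interface{}) error
	Get(k string, v interface{}) (found bool, err error)
	Delete(k string) error
	Close() error
}

// Hypothetical: how a document write could be routed through such a store
// instead of writing one file per document. Purely illustrative.
func persistDocument(s Store, collection, id string, doc interface{}) error {
	return s.Set(collection+"/"+id, doc)
}
```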

@atljoseph (Author) commented May 18, 2024 via email

@atljoseph (Author) commented May 18, 2024 via email

@philippgille (Owner)

Am looking to make a knowledge base creation and curation app, with ability to query the data directly as well as have a chat with it…. All in golang. Am planning to go very deep with it. Happy to meetup and discuss if you’d like.

I've thought about creating something similar, without curation, but with pluggable data sources. Some apps exist in the space, like Danswer (in Python). Frameworks like LlamaIndex and Haystack (also both Python) show that this is a popular use case for vector stores.

That means I probably need to keep related source docs of different source mediums in separate collections with similar names to be able to query all the web-crawled material from a particular knowledge base as opposed to the GitHub code.

Do you mean the code example in the GitHub repo that uses "knowledge-base" as the collection name? Then yes, that's just a simplified example, and I would suggest using one collection per data source, as sketched below.
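
A sketch of that layout, with a shared name prefix per knowledge base so the related collections stay easy to find; the source names and the merging step are illustrative:

```go
import (
	"context"

	"github.com/philippgille/chromem-go"
)

// queryKnowledgeBase queries one collection per data source ("<kb>-web",
// "<kb>-github", ...) and merges the results; rank or deduplicate as needed.
func queryKnowledgeBase(ctx context.Context, db *chromem.DB, kb, query string, embed chromem.EmbeddingFunc) ([]chromem.Result, error) {
	var all []chromem.Result
	for _, source := range []string{"web", "github"} {
		c, err := db.GetOrCreateCollection(kb+"-"+source, map[string]string{"source": source}, embed)
		if err != nil {
			return nil, err
		}
		res, err := c.Query(ctx, query, 5, nil, nil)
		if err != nil {
			return nil, err
		}
		all = append(all, res...)
	}
	return all, nil
}
```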

@iwilltry42 (Contributor)

We're building something in the "knowledge base" realm as well over here: gptscript-ai/knowledge.
I gave a quick demo on it as well if you're curious: https://www.youtube.com/watch?v=4kdESyzw4kY
It's still pretty rough around the edges and I'm currently working on customizable ingestion/retrieval chains/flows that can be associated with knowledge bases.
