
[Bug]: Upserting the same data causes the SQLite db to grow by 50-100% #2143

Open
essenciary opened this issue May 6, 2024 · 3 comments
Labels
bug Something isn't working

Comments


essenciary commented May 6, 2024

What happened?

I'm using Chroma in a Python chat-type app to store what are essentially entities and to do RAG over a few hundred documents. The data is mostly static: it updates rarely, and when it does, only by a little (think a few new entities/keywords per hour and/or a couple of extra RAG articles per day). However, every time I run the import scripts, even at 1-minute intervals, the SQLite DB grows by 50-100%. For example:

  • run 1: from empty db to 35 MB
  • run 2 (a few minutes later): 62 MB
  • run 3 (a few mins later): 89 MB
  • run 4 (a few mins later): 113 MB

I haven't diffed the data since it comes from multiple sources, but I expect it was 99.99% identical on every import.

The problem is that the DB grows very fast (it reached 3 GB in production after a few days) and Chroma becomes unusable at that size (it saturates all CPU cores and never returns the data).


PS: judging by the growth rate, each run seems to add roughly the initial 35 MB.
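
A minimal sketch of the pattern that triggers this (the collection name, path, and data below are placeholders, not the real import script):

import os
import chromadb

# Placeholder repro: the exact same ids/documents are upserted on every run,
# yet chroma.sqlite3 keeps growing run after run.
client = chromadb.PersistentClient(path="./chroma_data")
collection = client.get_or_create_collection("articles")

for run in range(4):
    collection.upsert(
        ids=[str(i) for i in range(1000)],
        documents=[f"identical document {i}" for i in range(1000)],
    )
    size_mb = os.path.getsize("./chroma_data/chroma.sqlite3") / 1e6
    print(f"run {run + 1}: chroma.sqlite3 is {size_mb:.1f} MB")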

Versions

chromadb 0.4.24

python 3.10.10

LSB Version: :core-4.1-amd64:core-4.1-noarch
Distributor ID: CentOS
Description: CentOS Linux release 7.9.2009 (Core)
Release: 7.9.2009

Relevant log output

No response

@essenciary essenciary added the bug Something isn't working label May 6, 2024
essenciary (Author) commented May 6, 2024

Code:

for article in fetch_content_articles(content_type):
    sections = []
    try:
        sections = json.loads(article[3])
    except Exception:
        pass

    content_article = {
        "id": str(article[0]),  # chromadb expects a string, not an integer
        "documents": "".join([
            markdownify.markdownify('<h2>' + doc['sections_title'] + '</h2>' + doc['sections_content'])
            for doc in sections
        ]),
        "metadata": {
            "title": article[1],
            "slug": article[2],
            "image": article[4] or "",
            "updated_at": article[5].timestamp(),  # chromadb expects a timestamp, not a datetime object
            "article_preview": article[7] or "",
            "type": CONTENT_ARTICLES_TYPES[article[8]],
            "geo": "ie" if article[9] == 2 else "uk",
        },
    }

    embeddings.add(
        entity=collection_name,
        ids=[content_article['id']],
        items=[content_article['documents']],
        metadata=[content_article['metadata']],
    )

embeddings.add() is defined as:

def add(
    entity: str, ids: list[str], items: list[str], metadata: list[dict] | None = None
):
    try:
        get_collection(entity).upsert(
            ids=ids,
            documents=items,
            metadatas=metadata,
        )
    except Exception as ex:
        print(ex)
        pass

HammadB (Collaborator) commented May 6, 2024

I suspect that most of the expansion here is coming from the WAL. Unfortunately we don't have first-party support for cleaning the WAL right now, but @tazarov has some community-supported tools.

We hope to add this to the core API.
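
A quick way to check whether the WAL is what's taking up the space is something like the sketch below. It assumes the default single-file layout (chroma.sqlite3 in the persist directory) and the embeddings_queue table name, which is a chromadb 0.4.x implementation detail and may change:

import sqlite3

# Inspect chroma.sqlite3 directly; back it up and stop any running Chroma
# instance first. embeddings_queue is assumed to be the WAL table in 0.4.x.
conn = sqlite3.connect("./chroma_data/chroma.sqlite3")
rows = conn.execute("SELECT COUNT(*) FROM embeddings_queue").fetchone()[0]
print(f"WAL (embeddings_queue) rows: {rows}")
conn.close()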

tazarov (Contributor) commented May 7, 2024

@essenciary this is an explanation of how the WAL works - https://cookbook.chromadb.dev/core/advanced/wal/

And here's an explanation of how to prune (clean) it: https://cookbook.chromadb.dev/core/advanced/wal-pruning/. The tooling lives at https://github.com/amikos-tech/chromadb-ops.

⚠️ ALWAYS make backups 😄
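
For reference, the manual pruning described in the cookbook roughly boils down to the sketch below. This is a sketch only: prefer the chromadb-ops tooling above, which also makes sure the HNSW index segments are fully synced before dropping WAL rows, and the embeddings_queue table name is a 0.4.x implementation detail:

import sqlite3

# DANGER: rough sketch. Stop the Chroma instance and back up the persist
# directory before touching chroma.sqlite3 directly.
conn = sqlite3.connect("./chroma_data/chroma.sqlite3")
conn.execute("DELETE FROM embeddings_queue")  # drop the write-ahead log rows
conn.commit()
conn.execute("VACUUM")  # reclaim freed pages so the file actually shrinks
conn.close()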
