perf: small fix for the insert-method of pgvector #1004

Open
wants to merge 1 commit into
base: main


@ArneJanning ArneJanning commented Feb 14, 2024

Please describe the purpose of this pull request.
Thankfully, @sarahwooders provided a fix for #988, but it assumes a fixed chunk size of 1,000 for pg8000, which slows down ingestion significantly and is not future-proof.

So I added a small method to get the optimal chunk size based on the number of columns in the database.
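
A minimal sketch of the idea, not the code from this PR: assuming the driver caps the number of bind parameters per statement (32,767 is used here as an assumed limit, the signed 16-bit maximum) and each inserted row consumes one parameter per column, the largest safe chunk is the limit divided by the column count. The function name and the `max_params` default are illustrative.

```python
def optimal_chunk_size(num_columns: int, max_params: int = 32767) -> int:
    """Largest number of rows per INSERT that stays within max_params bind parameters.

    max_params is an assumed driver limit, not a value taken from this PR.
    """
    if num_columns <= 0:
        raise ValueError("num_columns must be positive")
    # Each row binds one parameter per column; always allow at least one row.
    return max(1, max_params // num_columns)
```

Under these assumptions, a table with 10 columns allows 3,276 rows per chunk instead of a fixed 1,000.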

How to test
memgpt load directory --name <some_name> --input-files=<very_large_input_files>

Have you tested this PR?
I tested it with several very large PDFs and TXTs (around 1,000 pages of scientific content).

Related issues or PRs
#988

Is your PR over 500 lines of code?
No.

@sarahwooders sarahwooders self-requested a review February 14, 2024 18:14
@sarahwooders sarahwooders changed the title small fix for the insert-method of pgvector perf: small fix for the insert-method of pgvector Feb 14, 2024
Collaborator

@sarahwooders sarahwooders left a comment

Thanks for adding this!

A minor nit - could you please set self.insert_chunk_size = self.get_optimal_chunk_size() in the class initialization, so we can avoid calling the function on every insert?
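
One way this suggestion could look, as a sketch only: `ChunkedInserter`, `MAX_PARAMS`, and `iter_chunks` are hypothetical names and not the project's connector class. The chunk size is computed once in `__init__` and reused by every insert.

```python
from typing import Iterator, List, Sequence


class ChunkedInserter:
    """Hypothetical stand-in for the storage connector; not the project's class."""

    MAX_PARAMS = 32767  # assumed driver limit on bind parameters per statement

    def __init__(self, column_names: Sequence[str]) -> None:
        self.column_names = list(column_names)
        # Computed once here instead of on every insert call.
        self.insert_chunk_size = self.get_optimal_chunk_size()

    def get_optimal_chunk_size(self) -> int:
        return max(1, self.MAX_PARAMS // max(1, len(self.column_names)))

    def iter_chunks(self, records: List[dict]) -> Iterator[List[dict]]:
        # Yield consecutive slices no longer than the cached chunk size.
        for start in range(0, len(records), self.insert_chunk_size):
            yield records[start:start + self.insert_chunk_size]
```

An insert method would then loop over `iter_chunks(records)` and issue one statement per chunk.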

Also, would you be able to write a test to ensure that the chunking is working properly? Unfortunately, all our tests use very small data, so I don't think this functionality will be covered. Maybe you could generate a long list of Passage objects and insert them?
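
A rough shape such a test might take, using stand-ins rather than the project's real classes: `FakePassage` and `split_into_chunks` are hypothetical, and a real test would insert into the database and compare the stored row count instead of only checking the split.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FakePassage:
    """Hypothetical stand-in for the project's Passage class."""
    id: int
    text: str


def split_into_chunks(items: List[FakePassage], chunk_size: int) -> List[List[FakePassage]]:
    # Hypothetical helper mirroring what a chunked insert would do.
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


def test_chunking_covers_all_passages() -> None:
    # Deliberately more passages than fit in a single chunk.
    passages = [FakePassage(id=i, text=f"passage {i}") for i in range(10_500)]
    chunks = split_into_chunks(passages, chunk_size=1_000)
    # No chunk exceeds the limit, and nothing is lost or reordered.
    assert all(len(chunk) <= 1_000 for chunk in chunks)
    assert [p.id for chunk in chunks for p in chunk] == list(range(10_500))
    assert len(chunks) == 11
```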

Also, could you please just run the formatter with poetry run black --check . -l 140?
