[Bug]: Knowledge base addition does not work. #2411

Open
aomeng1219 opened this issue Apr 7, 2024 · 4 comments
Labels
bug Something isn't working

Comments

@aomeng1219

What happened?

Running process, and uploading PDF files are errors, but when looking for the contents of these files, the reply is always unknown. It seems that nothing has been added to the knowledge base.

Relevant log output

No response

Twitter / LinkedIn details

No response

@aomeng1219 aomeng1219 added the bug Something isn't working label Apr 7, 2024

dosubot bot commented Apr 7, 2024

Hey there @aomeng1219! 👋 I'm Dosu, your friendly neighborhood bot here to lend a hand with bugs, answer your queries, and help you dive into contributing while we wait for a human maintainer to swing by. I'll be taking a look at your issue about PDF files not adding to the knowledge base in quivr and get back to you with a full answer shortly. Hang tight! 🚀

@aomeng1219
Author

Did the PDFs fail to produce text vectors during import?


dosubot bot commented Apr 7, 2024

Based on the information you've provided and the similar issues I found, it seems like the problem might be related to the state of the DocumentSerializable object when the first question is asked. If the document content is being read normally as per the system log, then the issue might be with how the content is being stored or retrieved.

One possible cause could be a race condition where the question is asked before the document content is fully processed and stored. This could explain why the second question can get a normal response, as by that time the processing might have completed.
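If a race condition is the cause, one way to rule it out is to wait until processing has finished before asking the first question. The following is only a sketch under that assumption: `wait_until_processed` and the `is_ready` callable are hypothetical names, and `is_ready` stands in for whatever check tells you the document's vectors have been stored.

```python
import time
from typing import Callable


def wait_until_processed(
    is_ready: Callable[[], bool],
    timeout: float = 30.0,
    interval: float = 0.5,
) -> bool:
    """Poll `is_ready` until it returns True or `timeout` seconds pass.

    `is_ready` is a hypothetical check supplied by the caller, for example
    a query that looks for the document's stored vectors.
    """
    deadline = time.monotonic() + timeout
    while True:
        if is_ready():
            return True
        if time.monotonic() >= deadline:
            # Processing did not finish in time; the caller can report this.
            return False
        time.sleep(interval)
```

If the first question still returns "unknown" after this returns True, a race condition is unlikely to be the explanation.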

To debug this, you could add some logging in the upload_file_storage function and the from_json method of the DocumentSerializable class to check the state of the document content at different stages. Here's an example of how you could do this:

import json

# Document, Field, Client and get_supabase_client are assumed to be imported
# already in the module where these definitions live.

def upload_file_storage(file, file_identifier: str):
    supabase_client: Client = get_supabase_client()
    response = supabase_client.storage.from_("quivr").upload(file_identifier, file)
    print(f"Upload response: {response}")  # Log the storage response
    return response

class DocumentSerializable(Document):
    page_content: str
    metadata: dict = Field(default_factory=dict)

    def to_json(self) -> str:
        return json.dumps({"page_content": self.page_content, "metadata": self.metadata})

    @classmethod
    def from_json(cls, json_str: str):
        data = json.loads(json_str)
        print(f"Data from JSON: {data}")  # Log the deserialized content
        return cls(page_content=data["page_content"], metadata=data["metadata"])

This should give you more information about what's happening when the document content is being processed.

If the issue persists, it would be helpful to know the exact version of Quivr you're using, as well as any error messages you're seeing in the console or logs. This will help narrow down the possible causes of the issue.

Regarding your question about the PDF files failing to produce a text vector during the import process, yes, this could indeed be the cause. The UnstructuredPDFLoader class from langchain_community.document_loaders is used to load and possibly parse the PDF files. If this loader fails to extract text from the PDFs correctly, it could prevent the generation of a text vector, which is essential for adding the content to the knowledge base.

To resolve this issue, you could:

  1. Verify the PDF Content: Ensure that the PDFs contain text that can be extracted. Some PDFs might have their content in images or in a format that the UnstructuredPDFLoader cannot parse.

  2. Update Dependencies: Make sure that all dependencies, especially those related to langchain_community.document_loaders, are up to date. An outdated dependency might lack the necessary functionality or fixes to parse certain PDFs.

  3. Debug the Loader: Add logging or debugging statements within the UnstructuredPDFLoader or the process_pdf function to check if the PDFs are being loaded correctly and if the text extraction process is completing successfully.

  4. Customize or Extend the Loader: If the UnstructuredPDFLoader is not suitable for your PDFs, consider customizing this loader or creating a new loader class that can handle your specific PDF format more effectively.

  5. Error Handling: Implement error handling in the process_pdf function to catch and log any exceptions during the PDF processing. This can help identify if the failure is due to an error in the text extraction process.
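Items 1 and 5 above can be sketched together as a small wrapper. This is a minimal illustration, not Quivr's actual code: `load_pdf_safely` and the `loader` callable are hypothetical names, with `loader` standing in for whatever extracts text from the file (for example, a function that wraps UnstructuredPDFLoader). It logs any exception and flags PDFs that yield no text.

```python
import logging
from typing import Callable, List, Optional

logger = logging.getLogger("pdf_import")


def load_pdf_safely(
    path: str, loader: Callable[[str], List[str]]
) -> Optional[List[str]]:
    """Run a text-extraction callable and report failures instead of hiding them.

    `loader` is a hypothetical stand-in: any callable that takes a file path
    and returns the extracted text chunks.
    """
    try:
        chunks = loader(path)
    except Exception:
        # Log the full traceback so the underlying error is visible.
        logger.exception("Text extraction failed for %s", path)
        return None

    # Keep only chunks that contain something other than whitespace.
    non_empty = [c for c in chunks if c and c.strip()]
    if not non_empty:
        # Typical for scanned PDFs whose content is images only.
        logger.warning("No extractable text in %s; nothing to embed", path)
        return None
    return non_empty
```

A return value of None means there is nothing to embed for that file, which would match the symptom of every question getting an "unknown" reply.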

I hope this helps! If you have any other questions or need further clarification, please let me know.


@aomeng1219
Author

I found the following error: "Error loading punkt: <urlopen error [Errno 99] Cannot"
