PDF files getting rejected in parse step #512

Open
BakedJesus opened this issue Mar 14, 2024 · 4 comments

Comments

@BakedJesus

Out of 4 files, 2 of my PDF files are being rejected. I've included one of the rejected files as a sample.

My setup:
Ubuntu 22.04 -> Anaconda python 3.10
docker compose up [mongo and milvus]

To reproduce, simply use library.add_files on my provided file.
sample_rejected.pdf

@doberst
Contributor

doberst commented Mar 14, 2024

@BakedJesus - we will take a look. It's a big file (500+ pages), but that should be fine. A couple of quick checks:

  1. Did you see an "encrypted file" readout in the screen display while parsing? That would be one explanation: our parser does not attempt to decrypt an encrypted PDF, and the file is skipped.
  2. Was there a segfault or other crash? (Unlikely, but not impossible: a book-sized file with a lot of images may have tripped something.)
  3. It looks like the file has a lot of scanned content. The parser does not apply OCR to that content, so it is also possible that much of the file was "skipped" and would need OCR to read the scanned images.

We will take a closer look at the file and come back to you.
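
The first and third checks can be approximated before a file ever reaches the parser. The sketch below is a rough heuristic working on raw bytes, not what the project's parser does internally: the function name `quick_pdf_checks` and the returned keys are made up for illustration. It relies on the facts that an encrypted PDF carries an `/Encrypt` entry in its trailer and that scanned pages are stored as image XObjects.

```python
import re
from pathlib import Path

# Image XObjects are declared as "/Subtype /Image" (the space is optional).
IMAGE_XOBJECT = re.compile(rb"/Subtype\s*/Image\b")


def quick_pdf_checks(data: bytes) -> dict:
    """Heuristic checks on raw PDF bytes (hypothetical helper, not a real PDF parser)."""
    return {
        # A PDF should announce itself with a "%PDF-" header near the start.
        "has_pdf_header": b"%PDF-" in data[:1024],
        # A missing end-of-file marker hints at a truncated file.
        "has_eof_marker": b"%%EOF" in data[-1024:],
        # Encrypted PDFs reference an encryption dictionary via /Encrypt.
        "looks_encrypted": b"/Encrypt" in data,
        # A large count suggests image-heavy or scanned content that needs OCR.
        "image_objects": len(IMAGE_XOBJECT.findall(data)),
    }


# Usage (the path is the sample attached to this issue):
# print(quick_pdf_checks(Path("sample_rejected.pdf").read_bytes()))
```

A byte scan like this can give false positives (for example, the string `/Encrypt` appearing inside uncompressed page text), so it is only a first filter before opening the file with a proper PDF library.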

@BakedJesus
Author

BakedJesus commented Mar 15, 2024

Hey @doberst, thanks for responding.

  1. No, the file wasn't encrypted, and there was no such readout.
  2. No crashes. It just printed a list of rejected files.
  3. Funnily enough, the one PDF it did manage to get was completely scanned, whereas the two it rejected were at least digital copies.

I'm pasting the output I get when I call the parser directly on the PDF directory:

summary: pdf_parser - total pdf files processed - 4
summary: pdf_parser - total input files received - 4
summary: pdf_parser - total blocks created - 2700
summary: pdf_parser - total images created - 0
summary: pdf_parser - total tables created - 24
summary: pdf_parser - total pages added - 651
summary: pdf_parser - PDF Processing - Finished - time elapsed - 3.291197
update: pdf_parser - Completed Parsing - processing time - 3.291197
{'processed_files': [*HERE IT LISTS ACCEPTED FILES*], 'rejected_files': [*HERE IT LISTS THE TWO REJECTED FILES*], 'duplicate_files': []}
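
For anyone comparing runs, the readout above can be turned into a dictionary of counters. This is a minimal sketch that assumes the `summary: pdf_parser - <label> - <number>` shape seen here holds for other runs too; `parse_summary` is a made-up helper name, not part of the library.

```python
import re

# Matches entries such as "summary: pdf_parser - total pages added - 651".
# The label may itself contain " - ", so it is matched lazily up to the
# last " - " that is followed by a number.
ENTRY = re.compile(
    r"(?:summary|update): pdf_parser - (.+?) - (\d+(?:\.\d+)?)(?=\s|$)"
)


def parse_summary(text: str) -> dict:
    """Collect the numeric counters from the parser readout (hypothetical helper)."""
    metrics = {}
    for label, raw in ENTRY.findall(text):
        metrics[label] = float(raw) if "." in raw else int(raw)
    return metrics


# The readout from this comment, copied as it was printed.
SAMPLE = (
    "summary: pdf_parser - total pdf files processed - 4 "
    "summary: pdf_parser - total input files received - 4 "
    "summary: pdf_parser - total blocks created - 2700 "
    "summary: pdf_parser - total images created - 0 "
    "summary: pdf_parser - total tables created - 24 "
    "summary: pdf_parser - total pages added - 651 "
    "summary: pdf_parser - PDF Processing - Finished - time elapsed - 3.291197 "
    "update: pdf_parser - Completed Parsing - processing time - 3.291197"
)

metrics = parse_summary(SAMPLE)
print(metrics)
```

Keeping the counters per run makes it easier to spot when the page or block totals change after a file is removed from the directory.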

Do you have any suggestion on how I can debug this issue?

@arekglowacki

Hi, I have an issue with pdf_parser as well, but a slightly different one. It does not reject the whole document, but it extracts just 16 pages out of a 248-page document. Currently there is no way to debug or investigate problems with pdf_parser (or I haven't found one). Would it be possible to share the source code of those binaries, or to allow falling back to a different implementation based on, for example, PyMuPDF? I have a custom tokenizer based on PyMuPDF, and it is able to read all 248 pages.
If I knew the expected contract for pdf_parser, I could submit a PR with a PyMuPDF implementation.
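
To make the fallback idea concrete, here is a rough sketch under a purely hypothetical contract in which every parser is a callable that takes a file path and returns a list of page texts. The real pdf_parser contract is not documented in this thread, so `PageParser`, `parse_with_fallback` and `pymupdf_pages` are illustrative names only. The PyMuPDF part uses `fitz.open` and `page.get_text()`, imported lazily so the rest runs without PyMuPDF installed.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical contract: a parser maps a file path to a list of page texts.
PageParser = Callable[[str], List[str]]


def pymupdf_pages(path: str) -> List[str]:
    """Candidate fallback parser built on PyMuPDF."""
    import fitz  # PyMuPDF; imported lazily so this module loads without it

    with fitz.open(path) as doc:
        return [page.get_text() for page in doc]


def parse_with_fallback(
    path: str,
    parsers: Sequence[Tuple[str, PageParser]],
    min_pages: int = 1,
) -> Tuple[Optional[str], List[str]]:
    """Try parsers in order; stop once one yields at least min_pages pages.

    If none reaches min_pages, the result with the most pages is returned.
    """
    best_name: Optional[str] = None
    best_pages: List[str] = []
    for name, parser in parsers:
        try:
            pages = parser(path)
        except Exception:
            continue  # a failing parser simply hands over to the next one
        if len(pages) > len(best_pages):
            best_name, best_pages = name, pages
        if len(best_pages) >= min_pages:
            break
    return best_name, best_pages


# Usage with stand-in parsers that mimic the 16-of-248 pages case:
def primary(path: str) -> List[str]:
    return ["text"] * 16


def fallback(path: str) -> List[str]:
    return ["text"] * 248


name, pages = parse_with_fallback(
    "book.pdf", [("primary", primary), ("fallback", fallback)], min_pages=248
)
print(name, len(pages))
```

Passing the expected page count as `min_pages` is one way to detect a partial extraction; it assumes the page count is known from another source.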

@MacOS
Contributor

MacOS commented Mar 16, 2024

🤔 @arekglowacki May I ask you to open a separate issue that we can categorize as a feature request? Thank you!
