PDF files getting rejected in parse step #512
Comments
@BakedJesus - we will take a look. It's a big file (500+ pages), but that should be fine. A couple of quick checks:
We will take a closer look at the file and come back to you.
Hey @doberst, thanks for responding.
I'm pasting the output I get when I call the parser directly on the PDF directory:
Do you have any suggestions on how I can debug this issue?
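One parser-independent way to start debugging is to check the rejected files for obvious structural problems before they reach the parser. The sketch below is only a rough sanity check using the Python standard library; the function name `quick_pdf_checks` is made up for illustration and is not part of llmware, and passing these checks does not mean the file is valid.

```python
from pathlib import Path


def quick_pdf_checks(path):
    """Return cheap, parser-independent observations about a PDF file.

    Hypothetical helper for illustration; not part of any library.
    """
    data = Path(path).read_bytes()
    return {
        "size_bytes": len(data),
        # PDFs normally start with "%PDF-"; many readers tolerate
        # some leading bytes, so look in the first 1024 bytes.
        "has_pdf_header": b"%PDF-" in data[:1024],
        # The end-of-file marker is expected near the end of the file.
        "has_eof_marker": b"%%EOF" in data[-1024:],
        # An /Encrypt entry hints that the document is encrypted.
        "mentions_encrypt": b"/Encrypt" in data,
    }
```

Running it on both the accepted and the rejected files and comparing the results may show whether the rejected ones share a property such as encryption or a truncated ending.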
Hi, I have an issue with pdf_parser as well, but a slightly different one. It does not reject the whole document, but it extracts only 16 pages of a 248-page document. I have not found a way to debug or investigate problems with pdf_parser. Would it be possible to share the source code of those binaries, or to allow falling back to a different implementation, for example one based on PyMuPDF? My custom tokenizer based on PyMuPDF reads all 248 pages.
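To make the fallback request concrete, here is a minimal sketch of the control flow, assuming each parser can be wrapped as a function that takes a file path and returns one text string per page. The names `parse_with_fallback`, `primary`, and `fallback` are hypothetical; nothing here is an existing llmware or PyMuPDF API.

```python
from typing import Callable, List

# Assumed shape of a parser wrapper: path in, list of per-page text out.
PageParser = Callable[[str], List[str]]


def parse_with_fallback(
    path: str,
    expected_pages: int,
    primary: PageParser,
    fallback: PageParser,
) -> List[str]:
    """Use `primary`; switch to `fallback` if it fails or returns too few pages."""
    try:
        pages = primary(path)
    except Exception:
        # Treat a crash or rejection the same as extracting nothing.
        pages = []

    if len(pages) >= expected_pages:
        return pages

    # Primary came up short: try the fallback and keep whichever got more pages.
    alt = fallback(path)
    return alt if len(alt) > len(pages) else pages
```

In the case described above, `primary` would wrap the bundled parser, `fallback` would wrap a PyMuPDF-based reader, and `expected_pages` would be 248.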
🤔 @arekglowacki May I ask you to open an issue that we can categorize as a feature request? Thank you!
Out of a total of 4 files, 2 of my PDF files are being rejected. I've included one of the files as a sample.
My setup:
Ubuntu 22.04 -> Anaconda Python 3.10
docker compose up [mongo and milvus]
To reproduce, simply use library.add_files on my provided file.
sample_rejected.pdf