Unfortunately, extracting structured information from a PDF is quite complex. A PDF only describes the final appearance of a document and has no concept of what role a piece of text plays. Even deciding where word breaks are is not straightforward. You can use a library like pdfminer.six to extract text, and from there you may be able to use a heuristic, such as checking for text set in a font that is larger than average and classifying it as a heading. You could also use a commercial PDF editor to create a Tagged PDF, in which each piece of text is labeled with a structural role, similar to HTML tags. pikepdf is a library that could be used for this sort of work, but it is domain-specific, complex work that needs to be fine-tuned for the incoming PDFs.
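As a rough illustration of the font-size heuristic described above, here is a minimal sketch. It assumes pdfminer.six is installed; the helper names `iter_lines_with_size` and `guess_headings` and the `ratio=1.2` threshold are made up for this example and will likely need tuning for real documents.

```python
from statistics import mean


def iter_lines_with_size(pdf_path):
    """Yield (text, mean_font_size) for each text line, read with pdfminer.six."""
    # Imported lazily so guess_headings() can be used without pdfminer.six.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTChar, LTTextContainer, LTTextLine

    for page_layout in extract_pages(pdf_path):
        for element in page_layout:
            if not isinstance(element, LTTextContainer):
                continue  # skip figures, lines, rectangles, etc.
            for line in element:
                if not isinstance(line, LTTextLine):
                    continue
                # LTChar objects carry the rendered font size of each glyph.
                sizes = [ch.size for ch in line if isinstance(ch, LTChar)]
                text = line.get_text().strip()
                if sizes and text:
                    yield text, mean(sizes)


def guess_headings(lines, ratio=1.2):
    """Return the texts whose font size exceeds the average size by `ratio`.

    The average is weighted by text length so that body text dominates it.
    `ratio` is an arbitrary threshold chosen for this sketch.
    """
    lines = list(lines)
    total_chars = sum(len(text) for text, _ in lines)
    if total_chars == 0:
        return []
    average = sum(len(text) * size for text, size in lines) / total_chars
    return [text for text, size in lines if size > average * ratio]


# Example usage (the path is a placeholder for your own file):
# for heading in guess_headings(iter_lines_with_size("sample.pdf")):
#     print(heading)
```

This only looks at font size, so it will miss headings that are set in bold at body size and may flag large text that is not a heading, such as a title page or pull quote.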
-
Hi,
Thank you for providing a beautiful library. I am new to pikepdf, and the library feels a bit complex to me.
I am trying to extract the heading names from a PDF file. I have gone through the documentation but could not find anything, so can someone please tell me whether there is a method or class available for this task?
The sample PDF file is attached below.
Thank you in advance.
Sample PDF file.pdf