How to extract heading names from PDF file ? #459

Laxmi530 · 2023-03-17T13:38:09Z

Laxmi530
Mar 17, 2023

Hi,
Thank you for providing a beautiful library. I am new to pikepdf, and the library feels a bit complex to me.
I am trying to extract the heading names from a PDF file. I have gone through the documentation but could not find anything, so can someone please tell me whether there is a method or class available for this task?
A sample PDF file is attached below.

Thank you in advance.

Sample PDF file.pdf

jbarlow83 · 2023-03-17T23:58:49Z

jbarlow83
Mar 17, 2023
Maintainer

Unfortunately, extracting structured information from a PDF is quite complex. PDF only describes the final appearance of a document and has no concept of what role a piece of text plays. Even deciding where word breaks are is not straightforward.

You can use a library like pdfminer.six to extract text, and from there you may be able to use a heuristic, such as checking for text set in a font that is larger than average, and classify that as a heading.
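A minimal sketch of that heuristic, not a definitive implementation. It assumes pdfminer.six is installed; the helper names `iter_lines_with_size` and `guess_headings`, the 1.2 ratio, and the use of the median rather than the mean as the "average" are all illustrative choices. `LTChar.size` is pdfminer.six's estimate of the glyph size, so treat it as approximate.

```python
from statistics import median


def iter_lines_with_size(pdf_path):
    """Yield (text, font_size) for each text line found by pdfminer.six."""
    # Imported lazily so guess_headings() can be tried without pdfminer.six.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTChar, LTTextContainer, LTTextLine

    for page_layout in extract_pages(pdf_path):
        for element in page_layout:
            if not isinstance(element, LTTextContainer):
                continue  # skip figures, rules, images, etc.
            for line in element:
                if not isinstance(line, LTTextLine):
                    continue
                sizes = [obj.size for obj in line if isinstance(obj, LTChar)]
                text = line.get_text().strip()
                if sizes and text:
                    # Use the largest glyph on the line as the line's size.
                    yield text, max(sizes)


def guess_headings(lines, ratio=1.2):
    """Return texts whose size exceeds ratio * the median line size."""
    lines = list(lines)
    if not lines:
        return []
    typical = median(size for _, size in lines)
    return [text for text, size in lines if size > ratio * typical]
```

Usage would be something like `guess_headings(iter_lines_with_size("sample.pdf"))`, where `sample.pdf` is a placeholder path. The ratio will likely need tuning per document, and headings that differ from body text only by weight (bold) rather than size will be missed.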

You could also use a commercial PDF editor to create a Tagged PDF, in which you label all PDF text as belonging to a class, similar to HTML.
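Before going down that route, it may be worth checking whether the file is already tagged. A rough sketch, assuming the usual Tagged PDF markers in the document catalog (a `/MarkInfo` dictionary with `/Marked` set to true, plus a `/StructTreeRoot` entry); the function names are hypothetical, and a file can carry these markers while still having poor or incomplete tags.

```python
def looks_tagged(catalog):
    """Return True if a document catalog carries the Tagged PDF markers."""
    mark_info = catalog.get("/MarkInfo")
    marked = bool(mark_info.get("/Marked", False)) if mark_info is not None else False
    return marked and "/StructTreeRoot" in catalog


def file_looks_tagged(pdf_path):
    """Open a PDF with pikepdf and inspect its catalog (pdf.Root)."""
    import pikepdf  # imported lazily so looks_tagged() works on plain dicts too

    with pikepdf.open(pdf_path) as pdf:
        return looks_tagged(pdf.Root)
```

If this returns True, the structure tree may already label headings, although reading it out is still manual work.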

pikepdf is a library that could be used for this sort of work, but it's domain-specific, complex work that needs to be fine-tuned for the incoming PDFs.

2 replies

Laxmi530 Mar 18, 2023
Author

@jbarlow83 Thanks for the response.
Can you please guide me on how to get the font name and font size using pikepdf?

Thank you

jbarlow83 Mar 18, 2023
Maintainer

I recommend using pdfminer.six instead for that task.

pikepdf is a different tool, in particular given its ability to modify PDFs in standards-compliant ways. pdfminer.six cannot output PDFs at all, but it is great at text extraction.
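For reference, a small sketch of how font names and sizes might be read with pdfminer.six, assuming it is installed. The names `font_usage` and `strip_subset_prefix` are made up for this example. pdfminer.six exposes `fontname` and `size` on each `LTChar`; the size is its own estimate and may not exactly match the nominal point size.

```python
from collections import Counter


def strip_subset_prefix(fontname):
    """Drop a six-letter subset tag: 'ABCDEF+Helvetica-Bold' -> 'Helvetica-Bold'."""
    prefix, sep, rest = fontname.partition("+")
    if sep and len(prefix) == 6 and prefix.isalpha() and prefix.isupper():
        return rest
    return fontname


def font_usage(pdf_path):
    """Count characters per (font name, rounded size) pair in a PDF."""
    # Imported lazily so strip_subset_prefix() is usable without pdfminer.six.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTChar, LTTextContainer, LTTextLine

    usage = Counter()
    for page_layout in extract_pages(pdf_path):
        for element in page_layout:
            if not isinstance(element, LTTextContainer):
                continue
            for line in element:
                if not isinstance(line, LTTextLine):
                    continue
                for obj in line:
                    if isinstance(obj, LTChar):
                        key = (strip_subset_prefix(obj.fontname), round(obj.size, 1))
                        usage[key] += 1
    return usage
```

Calling `font_usage("sample.pdf").most_common()` (with a placeholder path) should list the body font first in most documents, which makes less frequent, larger fonts easy to spot.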
