PyMuPdf Hierarchal Headings #35

mingzhang798 · 2024-04-26T08:04:42Z

Description

Could you integrate PyMuPDF's pdf4llm.to_markdown() so that the parsed PDF comes out more hierarchical? For example, ("##", "Header 1") would represent a first-level heading, ("###", "Header 2") a second-level heading, ("####", "Header 3") a third-level heading, and so on. The output could then be split with LangChain's MarkdownHeaderTextSplitter().
link: https://python.langchain.com/docs/modules/data_connection/document_transformers/markdown_header_metadata/
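
To make the request concrete, here is a minimal stdlib-only sketch of the kind of header-based splitting described above. It is not LangChain's or PyMuPDF's code: `split_by_headers` is a hypothetical helper, and the sample Markdown is an assumed stand-in for what a hierarchical `to_markdown()` output might look like. Only the `("##", "Header 1")` style tuples come from the request itself.

```python
import re

# Header markers and metadata keys, taken from the tuples in the request.
HEADERS_TO_SPLIT_ON = [("##", "Header 1"), ("###", "Header 2"), ("####", "Header 3")]


def split_by_headers(markdown_text, headers=HEADERS_TO_SPLIT_ON):
    """Hypothetical helper: split Markdown into chunks tagged with their heading path."""
    # Map each marker to (nesting depth, metadata key).
    levels = {marker: (depth, name) for depth, (marker, name) in enumerate(headers)}
    chunks = []
    metadata = {}
    body = []

    def flush():
        # Emit the text collected so far under the current heading path.
        text = "\n".join(body).strip()
        if text:
            chunks.append({"metadata": dict(metadata), "content": text})
        body.clear()

    for line in markdown_text.splitlines():
        match = re.match(r"^(#+)\s+(.*\S)\s*$", line)
        if match and match.group(1) in levels:
            flush()
            depth, name = levels[match.group(1)]
            # A new heading replaces any heading at the same or a deeper level.
            for other_depth, other_name in levels.values():
                if other_depth >= depth:
                    metadata.pop(other_name, None)
            metadata[name] = match.group(2)
        else:
            body.append(line)
    flush()
    return chunks


# Assumed example of hierarchical Markdown output (not real parser output).
sample = """## Introduction
Intro text.
### Background
Background text.
## Methods
Methods text.
"""

chunks = split_by_headers(sample)
for chunk in chunks:
    print(chunk["metadata"], "->", chunk["content"])
```

Each chunk carries the headings it sits under, which is the metadata a header-aware splitter relies on. With flat output that has no `##`/`###` markers, everything would land in a single chunk with empty metadata.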

Filimoa · 2024-04-28T17:47:58Z

Could you provide some examples of before and after?

Filimoa · 2024-06-04T14:57:22Z

Closing due to inactivity

Filimoa changed the title ~~suggestion~~ PyMuPdf Hierarchal Headings Apr 28, 2024

Filimoa added the enhancement label Apr 28, 2024

Filimoa closed this as completed Jun 4, 2024
