PyMuPdf Hierarchal Headings #35

Closed
mingzhang798 opened this issue Apr 26, 2024 · 2 comments
Labels
enhancement New feature or request

Comments

@mingzhang798

Description

Could you integrate PyMuPDF's pdf4llm.to_markdown() so that the parsed PDF comes out with a clearer heading hierarchy? For example, "##" would mark a first-level heading ("Header 1"), "###" a second-level heading ("Header 2"), "####" a third-level heading ("Header 3"), and so on. The resulting Markdown could then be split with LangChain's MarkdownHeaderTextSplitter().
Link: https://python.langchain.com/docs/modules/data_connection/document_transformers/markdown_header_metadata/
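
To make the request concrete, here is a rough, standard-library-only sketch of the kind of header-based splitting the hierarchical Markdown is meant to enable. It is an illustrative stand-in, not LangChain's implementation and not PyMuPDF's output format: the function name `split_on_headers` and the chunk layout are hypothetical, and only the ("##", "Header 1") style mapping comes from the description above. It also assumes headings never appear inside fenced code blocks.

```python
import re

# (marker, metadata key) pairs, following the mapping proposed in this issue.
HEADERS_TO_SPLIT_ON = [("##", "Header 1"), ("###", "Header 2"), ("####", "Header 3")]


def split_on_headers(markdown, headers=HEADERS_TO_SPLIT_ON):
    """Split Markdown into chunks, each tagged with the headings above it.

    Hypothetical helper: a simplified stand-in for a header-aware splitter.
    """
    levels = {marker: (len(marker), name) for marker, name in headers}
    chunks, current, body = [], {}, []

    def flush():
        # Emit the text collected so far, with a copy of the active headings.
        text = "\n".join(body).strip()
        if text:
            chunks.append({"metadata": dict(current), "content": text})
        body.clear()

    for line in markdown.splitlines():
        match = re.match(r"^(#{1,6})\s+(.*)$", line)
        if match and match.group(1) in levels:
            flush()
            depth, name = levels[match.group(1)]
            # A new heading replaces any heading at the same or a deeper level.
            for other_depth, other_name in levels.values():
                if other_depth >= depth:
                    current.pop(other_name, None)
            current[name] = match.group(2).strip()
        else:
            body.append(line)
    flush()
    return chunks


if __name__ == "__main__":
    sample = "## Intro\nhello\n### Background\nworld\n## Methods\nsteps\n"
    for chunk in split_on_headers(sample):
        print(chunk["metadata"], "->", chunk["content"])
```

With consistent heading levels in the Markdown, each chunk carries the chain of headings it sits under, which is the metadata a header-aware splitter would attach.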

@Filimoa
Owner

Filimoa commented Apr 28, 2024

Could you provide some examples of before and after?

Filimoa changed the title from "suggestion" to "PyMuPdf Hierarchal Headings" on Apr 28, 2024
Filimoa added the "enhancement" label on Apr 28, 2024
@Filimoa
Owner

Filimoa commented Jun 4, 2024

Closing due to inactivity

Filimoa closed this as completed on Jun 4, 2024