How to consolidate pages from a PDF document loaded with PyMuPDFReader #11557

vecorro · 2024-03-02T01:48:29Z


Hi,

I loaded multiple PDFs using PyMuPDFReader because it is faster than the default reader, but it returns each page of a PDF as a separate Document object. How can I merge all the pages of each PDF into a single document?

dosubot[bot] (bot) · 2024-03-02T01:48:39Z

Hey @vecorro! Great to see you back. Hope you're doing well!

To aggregate all the pages of each PDF after loading them with PyMuPDFReader, you can modify the load method so that it concatenates the text of all pages into a single Document object instead of creating one Document object per page. Here's how you can do it:

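# Module-level imports that the method below relies on.
from pathlib import Path
from typing import Dict, List, Optional, Union

from llama_index.core.schema import Document


def load(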
    self,
    file_path: Union[Path, str],
    metadata: bool = True,
    extra_info: Optional[Dict] = None,
) -> List[Document]:
    """Loads documents from PDF file as a single aggregated document."""
    import fitz

    if not isinstance(file_path, (str, Path)):
        raise TypeError("file_path must be a string or Path.")

    doc = fitz.open(file_path)

    if extra_info and not isinstance(extra_info, dict):
        raise TypeError("extra_info must be a dictionary.")

    aggregated_text = ""
    for page in doc:
        aggregated_text += page.get_text()

    if metadata:
        if not extra_info:
            extra_info = {}
        extra_info["total_pages"] = len(doc)
        extra_info["file_path"] = str(file_path)

    return [
        Document(
            text=aggregated_text,  # Document text should be a str, so no .encode()
            extra_info=extra_info or {}
        )
    ]

This modification changes the behavior of the load method to aggregate all page texts into a single string, which is then used to create a single Document object. The extra_info dictionary is updated with metadata if required, and the method returns a list containing this single Document object.

You can find the original load method in the PyMuPDFReader class in the LlamaIndex repository. Please note that this modification should be done in your local copy of the class and not in the repository itself.

Sources

llama-index-integrations/readers/llama-index-readers-file/llama_index/readers/file/pymu_pdf/base.py

About Dosu

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.



vecorro · 2024-03-02T01:58:29Z


Thanks,

I can't use this solution, because any update to LlamaIndex (and updates happen almost every day) would overwrite the customization. Is there a way to do this after the Document objects have already been created?
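
One possible way to merge after loading, without touching the library, is to group the page-level documents by the file they came from and join their text. The sketch below is only an illustration under stated assumptions: `PageDoc` and `merge_pages` are hypothetical names, `PageDoc` is a stand-in for a LlamaIndex Document that exposes `text` and `metadata`, each page is assumed to carry a `file_path` metadata entry (as the load method above sets when metadata is enabled), and the per-page `source` key is assumed to hold the page number.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PageDoc:
    """Hypothetical stand-in for a page-level Document (text + metadata)."""

    text: str
    metadata: Dict[str, object] = field(default_factory=dict)


def merge_pages(pages: List[PageDoc], key: str = "file_path") -> List[PageDoc]:
    """Group page-level documents by metadata[key] and join their text."""
    grouped: Dict[object, List[PageDoc]] = defaultdict(list)
    for page in pages:
        # Pages are assumed to arrive in reading order, as a loader returns them.
        grouped[page.metadata.get(key)].append(page)

    merged = []
    for group in grouped.values():
        # Keep file-level metadata; drop the assumed per-page "source" entry.
        meta = {k: v for k, v in group[0].metadata.items() if k != "source"}
        merged.append(PageDoc(text="\n".join(p.text for p in group), metadata=meta))
    return merged


pages = [
    PageDoc("a1", {"file_path": "a.pdf", "source": "1"}),
    PageDoc("a2", {"file_path": "a.pdf", "source": "2"}),
    PageDoc("b1", {"file_path": "b.pdf", "source": "1"}),
]
merged = merge_pages(pages)
```

With real page documents, the same grouping should carry over by building a Document from the joined text and the kept metadata in place of `PageDoc`. Because this runs on the reader's output, a library update would not overwrite it.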
