Replies: 2 comments
Hey @vecorro! Great to see you back. Hope you're doing well!

To aggregate all the pages of each PDF document together after loading them with PyMuPDFReader, you can modify the `load` method so that it builds a single document per file:

```python
from pathlib import Path
from typing import Dict, List, Optional, Union

# Document is the LlamaIndex document class; its import path depends on the installed version.


def load(
    self,
    file_path: Union[Path, str],
    metadata: bool = True,
    extra_info: Optional[Dict] = None,
) -> List[Document]:
    """Loads documents from PDF file as a single aggregated document."""
    import fitz

    # Validate the arguments before opening the file.
    if not isinstance(file_path, (str, Path)):
        raise TypeError("file_path must be a string or Path.")
    if extra_info and not isinstance(extra_info, dict):
        raise TypeError("extra_info must be a dictionary.")

    doc = fitz.open(file_path)

    # Concatenate the text of every page into one string.
    aggregated_text = "".join(page.get_text() for page in doc)

    if metadata:
        if not extra_info:
            extra_info = {}
        extra_info["total_pages"] = len(doc)
        extra_info["file_path"] = str(file_path)

    return [
        Document(
            text=aggregated_text,  # passed as str rather than encoded bytes
            extra_info=extra_info or {},
        )
    ]
```

This modification changes the behavior of the `load` method: it returns one aggregated document per PDF instead of one document per page.
Thanks, but I can't use this solution: any update to LlamaIndex (and updates happen almost every day) would overwrite this customization. Is there a way to do it after the Document objects have already been created?
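One possible way to do this after loading, without touching the library code, is to group the page-level documents by their source file and join the text. The sketch below is only an illustration under assumptions: it assumes each page document exposes `text` and a `metadata` dict containing a `file_path` entry (as the snippet above sets), and `PageDoc` and `merge_pages` are hypothetical names, not part of LlamaIndex.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List


@dataclass
class PageDoc:
    """Hypothetical stand-in for a LlamaIndex Document (text + metadata)."""
    text: str
    metadata: Dict = field(default_factory=dict)


def merge_pages(
    docs: Iterable,
    make_doc: Callable = PageDoc,
    key: str = "file_path",
) -> List:
    """Group page-level documents by metadata[key] and join their text."""
    grouped = defaultdict(list)
    for doc in docs:
        # Assumption: every page carries the source file under metadata[key].
        grouped[doc.metadata.get(key)].append(doc)

    merged = []
    for path, pages in grouped.items():
        # Pages are joined in the order they were loaded.
        text = "\n".join(page.text for page in pages)
        merged.append(
            make_doc(text=text, metadata={key: path, "total_pages": len(pages)})
        )
    return merged


# Small demonstration with stand-in page documents.
pages = [
    PageDoc("a1", {"file_path": "a.pdf"}),
    PageDoc("a2", {"file_path": "a.pdf"}),
    PageDoc("b1", {"file_path": "b.pdf"}),
]
merged = merge_pages(pages)
```

With real documents, passing the LlamaIndex `Document` class as `make_doc` should work if it accepts `text` and `metadata` keyword arguments in your installed version; that is worth checking, since the API changes between releases. Because the helper lives in your own code, a library update would not overwrite it.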
Hi,
I loaded multiple PDFs using PyMuPDFReader because it is faster than the default reader, but it returns each page of a PDF as a separate Document object. How can I aggregate all the pages of each document into one?