[Feature Request] Efficient and Parallel Processing of PDF Files for Embedding Task #297

statefb · 2024-05-07T05:46:12Z

Describe the solution you'd like

A solution that allows for efficient and parallel processing of PDF files during the embedding task. This solution should consider the following aspects:

- Differential Processing: Instead of processing all PDF files from scratch, the solution should identify and process only the modified or new files, reducing redundant computation.
- Parallel Processing: The solution should process multiple PDF files concurrently to speed up the embedding task.

Why the solution is needed

Currently, processing 100 PDF files of 10 to 100 pages each takes more than two hours. The current approach reprocesses every file in a single sequential batch: it does not skip files that are unchanged since the last run, and it does not process files in parallel.

By implementing differential processing and parallel processing, the overall processing time can be significantly reduced, leading to improved efficiency and faster turnaround times.

Additional context

The current approach processes all PDF files from scratch, even if some files have not been modified since the last processing.
Parallel processing capabilities are currently not utilized, resulting in sequential processing of PDF files, which can be time-consuming for large numbers of files.

Possible solutions

Hash-based Differential Processing:
- Store the hash values of the processed PDF files along with their embedding data.
- During the embedding task, calculate the hash values of the PDF files and compare them with the stored hash values.
- Process only the files whose hash values have changed, indicating a modification or a new file.
Temporary Table for Differential Processing:
- Create a temporary table to store the update information for modified or new PDF files during the embedding task.
- After processing all files, perform a batch update on the main table using the data from the temporary table.
Parallel Processing:
- Split the PDF files into smaller batches and process them concurrently, leveraging the available computing resources.
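
To make the first two ideas concrete, here is a minimal sketch of hash-based differential processing combined with a temporary staging table. It is only an illustration under stated assumptions, not the project's actual implementation: the `documents` table, its columns, the `sync_embeddings` function, and the `embed` callback are all hypothetical, and SQLite stands in for whatever database the project really uses.

```python
import hashlib
import sqlite3
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large PDFs are not loaded into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def init_db(conn: sqlite3.Connection) -> None:
    # Hypothetical main table: one row per processed file, keyed by path.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS documents ("
        "path TEXT PRIMARY KEY, sha256 TEXT NOT NULL, embedding BLOB)"
    )


def sync_embeddings(conn, pdf_paths, embed):
    """Embed only new or modified files; return the paths that were processed.

    `embed` is a hypothetical callback that takes a Path and returns bytes.
    """
    # Hashes recorded during the previous run.
    stored = dict(conn.execute("SELECT path, sha256 FROM documents"))

    # Stage updates in a temporary table instead of touching the main table.
    conn.execute("DROP TABLE IF EXISTS temp.staging")
    conn.execute(
        "CREATE TEMP TABLE staging ("
        "path TEXT PRIMARY KEY, sha256 TEXT NOT NULL, embedding BLOB)"
    )

    processed = []
    for path in pdf_paths:
        digest = file_sha256(path)
        if stored.get(str(path)) == digest:
            continue  # unchanged since the last run, skip it
        conn.execute(
            "INSERT INTO temp.staging VALUES (?, ?, ?)",
            (str(path), digest, embed(path)),
        )
        processed.append(str(path))

    # Apply all staged rows to the main table in one batch update.
    conn.execute(
        "INSERT OR REPLACE INTO documents "
        "SELECT path, sha256, embedding FROM temp.staging"
    )
    conn.execute("DROP TABLE temp.staging")
    conn.commit()
    return processed
```

On a second run over the same files, `sync_embeddings` would return an empty list, and only a file whose contents changed would be embedded again. This sketch does not handle deleted files; those would need a separate cleanup step.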

By combining these solutions, we can achieve efficient differential processing and parallel processing for the embedding task, significantly reducing the overall processing time for large numbers of PDF files.
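
For the parallel part, a rough sketch of batch-wise concurrent processing could look like the following. The names `batched`, `embed_batch`, and `embed_in_parallel` are hypothetical, and `embed_batch` is only a placeholder for the real parsing and embedding work. The sketch uses a thread pool on the assumption that the embedding step is dominated by network calls to an embedding API; if PDF parsing turns out to be CPU-bound, `concurrent.futures.ProcessPoolExecutor` may be the better fit.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def batched(items, size):
    """Yield consecutive lists of at most `size` items."""
    it = iter(items)
    while batch := list(islice(it, size)):
        yield batch


def embed_batch(paths):
    """Hypothetical worker: a stand-in for parsing and embedding one batch.

    Here it only returns (path, length of the path string) pairs.
    """
    return [(p, len(p)) for p in paths]


def embed_in_parallel(paths, batch_size=10, max_workers=4, worker=embed_batch):
    """Split `paths` into batches and run `worker` on them concurrently."""
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in the order the batches were submitted.
        for batch_result in pool.map(worker, batched(paths, batch_size)):
            results.extend(batch_result)
    return results
```

Feeding this function only the files that differential processing flagged as new or modified would combine both ideas: fewer files to process, and the remaining ones handled concurrently.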

statefb changed the title ~~[Feature Request]~~ [Feature Request] Efficient and Parallel Processing of PDF Files for Embedding Task May 7, 2024

statefb added the enhancement (New feature or request) label May 7, 2024

fsatsuki mentioned this issue May 10, 2024

Partition pdf #301

Merged

statefb closed this as completed May 29, 2024
