
[Feature Request] Efficient and Parallel Processing of PDF Files for Embedding Task #297

Closed
statefb opened this issue May 7, 2024 · 0 comments
Labels
enhancement New feature or request

Contributor

statefb commented May 7, 2024

Describe the solution you'd like

The embedding task should process PDF files efficiently and in parallel. The solution should cover the following aspects:

  1. Differential Processing: Instead of processing all PDF files from scratch, the solution should be able to identify and process only the modified or new files, reducing redundant computations.

  2. Parallel Processing: The solution should leverage parallel processing capabilities to speed up the embedding task for multiple PDF files simultaneously.

Why the solution is needed

Currently, processing 100 PDF files of 10 to 100 pages each takes more than two hours. The current approach reprocesses every file in a single sequential batch: it neither skips unchanged files nor processes files in parallel.

Implementing differential processing and parallel processing should significantly reduce the overall processing time and shorten turnaround.

Additional context

  • The current approach processes all PDF files from scratch, even if some files have not been modified since the last processing.
  • Files are processed sequentially rather than in parallel, which is time-consuming for large numbers of files.

Possible solutions

  1. Hash-based Differential Processing:

    • Store the hash values of the processed PDF files along with their embedding data.
    • During the embedding task, calculate the hash values of the PDF files and compare them with the stored hash values.
    • Process only the files whose hash values have changed, indicating a modification or a new file.
  2. Temporary Table for Differential Processing:

    • Create a temporary table to store the update information for modified or new PDF files during the embedding task.
    • After processing all files, perform a batch update on the main table using the data from the temporary table.
  3. Parallel Processing:

    • Split the PDF files into smaller batches and process them concurrently, leveraging the available computing resources.
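
As a rough sketch of option 1 (not the project's actual code), change detection could look like the following. The function names `file_sha256` and `select_changed`, and the idea of keeping stored hashes in a plain dict keyed by path, are assumptions made for illustration; in practice the hashes would live next to the embedding data.

```python
import hashlib
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large PDFs are not loaded into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def select_changed(paths, stored_hashes):
    """Return {path: new_hash} for files that are new or whose content changed.

    `stored_hashes` maps path -> hash recorded at the last embedding run
    (hypothetical structure; a database column would work the same way).
    """
    changed = {}
    for path in paths:
        current = file_sha256(Path(path))
        if stored_hashes.get(str(path)) != current:
            changed[str(path)] = current
    return changed
```

Only the files returned by `select_changed` would need to be embedded again; the returned hashes can be written back once embedding succeeds.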

Combining these solutions would give the embedding task both differential and parallel processing, significantly reducing the overall processing time for large numbers of PDF files.
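
A minimal sketch of how options 2 and 3 might fit together, under several assumptions: `embed_pdf` is a hypothetical placeholder for the real parse-and-embed step, the `documents` table and its columns are invented for the example, and SQLite stands in for whatever database the project actually uses. Threads are used here on the assumption that the embedding step is I/O-bound; a process pool would be the usual swap for CPU-bound parsing.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor


def embed_pdf(path: str) -> str:
    """Hypothetical placeholder for the real parse-and-embed step."""
    return f"embedding-of-{path}"


def process_changed(conn, changed, max_workers=4):
    """Embed changed files concurrently, stage the results, then merge in one batch.

    `changed` maps file path -> new content hash.
    """
    paths = list(changed)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order, so results line up with `paths`.
        embeddings = list(pool.map(embed_pdf, paths))

    rows = [(p, changed[p], e) for p, e in zip(paths, embeddings)]

    # Temporary staging table: visible only to this connection.
    conn.execute(
        "CREATE TEMP TABLE IF NOT EXISTS staging "
        "(path TEXT PRIMARY KEY, hash TEXT, embedding TEXT)"
    )
    with conn:  # commit on success, roll back on error
        conn.execute("DELETE FROM staging")
        conn.executemany("INSERT INTO staging VALUES (?, ?, ?)", rows)
        # Batch update of the main table from the staged rows.
        conn.execute(
            "INSERT OR REPLACE INTO documents (path, hash, embedding) "
            "SELECT path, hash, embedding FROM staging"
        )
    return len(rows)
```

Because the main table is only touched in the final statement, a failure while embedding leaves previously stored embeddings and hashes untouched.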

statefb changed the title from "[Feature Request]" to "[Feature Request] Efficient and Parallel Processing of PDF Files for Embedding Task" on May 7, 2024
statefb added the enhancement (New feature or request) label on May 7, 2024
fsatsuki mentioned this issue on May 10, 2024
statefb closed this as completed on May 29, 2024