Add parallelization and caching to image download #512

Open
AndyTheFactory opened this issue Oct 24, 2023 · 0 comments
Labels
enhancement New feature or request PR-verify Has a PR, must be checked

Comments

@AndyTheFactory
Owner

Issue by dhgelling
Fri Apr 30 12:25:12 2021
Originally opened as codelucas/newspaper#882


In my usage, a big speed bottleneck is the sequential downloading of images from an article when finding the top image.
The current implementation tries to download only part of each image where possible, but it opens a new session for every image and downloads images one at a time, even though each download can take up to a second depending on the website. What's more, when scraping multiple articles from the same website, the same images are downloaded repeatedly, and the chosen top image is requested twice: once when searching for the top image, and once when checking requirements.

This PR attempts to fix this by starting multiple image downloads in parallel and by caching downloaded images for up to 5 hours. This way, the time taken to scrape one article can be reduced from more than 30 seconds to 2-3 seconds, and scraping multiple articles from the same site will download fewer images.
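To make the idea concrete, here is a minimal sketch of parallel downloads combined with a time-limited cache. It is not the code from the PR: `ImageCache`, `fetch_images`, and the `fetch` callable are hypothetical names chosen for illustration, and the only detail taken from the description above is the 5-hour expiry.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, Optional, Tuple

CACHE_TTL_SECONDS = 5 * 60 * 60  # "up to 5 hours", as described in the issue


class ImageCache:
    """Hypothetical thread-safe in-memory cache keyed by URL, with expiry."""

    def __init__(self, ttl: float = CACHE_TTL_SECONDS,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl
        self._clock = clock
        self._lock = threading.Lock()
        self._entries: Dict[str, Tuple[float, bytes]] = {}

    def get(self, url: str) -> Optional[bytes]:
        with self._lock:
            entry = self._entries.get(url)
            if entry is None:
                return None
            stored_at, data = entry
            if self._clock() - stored_at > self._ttl:
                # Entry is too old: drop it and report a miss.
                del self._entries[url]
                return None
            return data

    def put(self, url: str, data: bytes) -> None:
        with self._lock:
            self._entries[url] = (self._clock(), data)


def fetch_images(urls: Iterable[str],
                 fetch: Callable[[str], bytes],
                 cache: ImageCache,
                 max_workers: int = 8) -> Dict[str, Optional[bytes]]:
    """Return {url: image bytes, or None on failure}.

    Cached URLs are served from the cache; the rest are fetched in parallel.
    """
    unique = list(dict.fromkeys(urls))  # de-duplicate, keep order
    results: Dict[str, Optional[bytes]] = {}
    missing = []
    for url in unique:
        data = cache.get(url)
        if data is None:
            missing.append(url)
        else:
            results[url] = data

    def fetch_one(url: str) -> Optional[bytes]:
        try:
            return fetch(url)
        except Exception:
            # One failed image should not abort the whole batch.
            return None

    if missing:
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            for url, data in zip(missing, pool.map(fetch_one, missing)):
                results[url] = data
                if data is not None:
                    cache.put(url, data)
    return results
```

With the `requests` library, `fetch` could be something like `lambda url: session.get(url, timeout=10).content` on one shared `requests.Session()`, which would also avoid opening a new session per image. Whether sharing one session across threads is acceptable is an assumption to verify for your setup.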

The downside is that streaming downloads do not work with the parallel implementation. Streaming didn't improve the download time by much, however, so the main cost is the larger amount of data transferred.
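For context on what is given up, the sketch below illustrates why a streamed, partial download transfers less data: the dimensions of an image can often be read from its first few bytes. This is only an illustration of the general idea, not newspaper's actual code. `read_prefix` and `png_size` are hypothetical helpers, and only the PNG case is handled.

```python
import struct
from typing import Iterable, Optional, Tuple

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
PNG_HEADER_BYTES = 24  # signature (8) + IHDR length/type (8) + width/height (8)


def read_prefix(chunks: Iterable[bytes], limit: int) -> bytes:
    """Consume chunks only until `limit` bytes are available, then stop."""
    buf = bytearray()
    for chunk in chunks:
        buf.extend(chunk)
        if len(buf) >= limit:
            break  # the rest of the stream is never read
    return bytes(buf[:limit])


def png_size(prefix: bytes) -> Optional[Tuple[int, int]]:
    """Return (width, height) from a PNG header, or None if not a PNG."""
    if (len(prefix) < PNG_HEADER_BYTES
            or not prefix.startswith(PNG_SIGNATURE)
            or prefix[12:16] != b"IHDR"):
        return None
    width, height = struct.unpack(">II", prefix[16:24])
    return width, height
```

With `requests`, the chunks could come from `requests.get(url, stream=True).iter_content(chunk_size=1024)`, so an image that is too small to qualify can be rejected after the first chunk. A non-streaming parallel download always fetches the full body.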


dhgelling included the following code: https://github.com/codelucas/newspaper/pull/882/commits

@AndyTheFactory AndyTheFactory added enhancement New feature or request PR-verify Has a PR, must be checked labels Oct 30, 2023