Add parallellization and caching to image download #882

dhgelling · 2021-04-30T12:25:12Z

In my usage, a big speed bottleneck is the sequential downloading of images from an article when finding the top image.
While the current implementation attempts to download only partial images where possible, it opens a new session for each image and does not download images in parallel, even though each download can take up to a second, depending on the website. What's more, when scraping multiple articles from the same website, the same images are downloaded multiple times, and the chosen top image is requested twice: once when searching for the top image, and once when checking requirements.

This PR attempts to fix this by starting multiple image downloads in parallel and caching downloaded images for up to 5 hours. This way, the time taken to scrape one article can be reduced from >30 seconds to 2-3 seconds, and scraping multiple articles from the same site downloads fewer images.
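For readers who want a picture of the idea, here is a minimal sketch of parallel downloads combined with a time-limited cache. It is not the code from this PR: `TTLImageCache`, `fetch_images`, and the injected `download` callable are hypothetical names used only for illustration, and the cache is a small stdlib stand-in for the `cachetools` dependency the PR adds.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

CACHE_TTL_SECONDS = 5 * 60 * 60  # "up to 5 hours", as stated in the description


class TTLImageCache:
    """Hypothetical thread-safe cache whose entries expire after `ttl` seconds."""

    def __init__(self, ttl, clock=time.monotonic):
        self._ttl = ttl
        self._clock = clock
        self._lock = threading.Lock()
        self._entries = {}  # url -> (expiry time, image data)

    def get(self, url):
        with self._lock:
            entry = self._entries.get(url)
            if entry is None:
                return None
            expires_at, data = entry
            if self._clock() >= expires_at:
                del self._entries[url]  # entry is stale, drop it
                return None
            return data

    def put(self, url, data):
        with self._lock:
            self._entries[url] = (self._clock() + self._ttl, data)


def fetch_images(urls, download, cache, max_workers=8):
    """Return {url: data}; only uncached URLs are downloaded, in parallel.

    `download` is an assumed callable taking a URL and returning bytes.
    Any exception it raises propagates to the caller.
    """
    results = {}
    missing = []
    for url in dict.fromkeys(urls):  # de-duplicate while keeping order
        data = cache.get(url)
        if data is None:
            missing.append(url)
        else:
            results[url] = data
    if missing:
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            for url, data in zip(missing, pool.map(download, missing)):
                cache.put(url, data)
                results[url] = data
    return results
```

With a shared `TTLImageCache(CACHE_TTL_SECONDS)`, a second call for the same URLs (for example when checking requirements on the chosen top image, or when scraping another article from the same site) is served from the cache instead of the network.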

The downside is that streamed (partial) downloading does not work with the parallel implementation. Streaming didn't improve the download time by much, however, so the main downside is the larger amount of data transferred.
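To make the trade-off concrete, the following sketch shows what a streamed, partial read looks like in general: only a bounded prefix of the body is consumed, which saves bandwidth but not necessarily much time. The helper name `read_prefix` and its parameters are assumptions for illustration; the library's actual streaming code may look quite different.

```python
import io


def read_prefix(stream, max_bytes, chunk_size=1024):
    """Read at most `max_bytes` from a binary stream, one chunk at a time."""
    chunks = []
    remaining = max_bytes
    while remaining > 0:
        chunk = stream.read(min(chunk_size, remaining))
        if not chunk:  # end of stream reached before the limit
            break
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


# An in-memory stream stands in for an HTTP response body here.
body = io.BytesIO(b"x" * 5000)
prefix = read_prefix(body, max_bytes=2048)
```

In this example only 2048 of the 5000 bytes are read; the parallel implementation described above would transfer the whole body instead.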

dhgelling added 2 commits April 30, 2021 14:14

Add parallellization and caching to image download

557cb98

add dependency cachetools to requirements.txt

8668850

AndyTheFactory mentioned this pull request Oct 24, 2023

Add parallellization and caching to image download AndyTheFactory/newspaper4k#512

Open
