
The request queue scans all 450k (99.999% of which are done towards the end) requests for each iteration #2406

Open
zopieux opened this issue Apr 7, 2024 · 2 comments
Labels
bug Something isn't working. t-c&c Team covering store and finance matters. t-console Issues with this label are in the ownership of the console team. t-tooling Issues with this label are in the ownership of the tooling team.

Comments


zopieux commented Apr 7, 2024

Which package is this bug report for? If unsure which one to select, leave blank

@crawlee/playwright (PlaywrightCrawler), but the request queue itself is generic. This concerns request queue V1.

Issue description

With a config like this:

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
  requestHandlerTimeoutSecs: 10,
  navigationTimeoutSecs: 10,
  useSessionPool: true,
  persistCookiesPerSession: true,
  sessionPoolOptions: {
    maxPoolSize: 3,
    sessionOptions: {
      // Important: recycle sessions often to spread the load across all proxies.
      maxUsageCount: 2,
    },
  },
  minConcurrency: 4,
  maxConcurrency: 6,
  maxRequestsPerMinute: 200,
  maxRequestRetries: 5,
});

Late into the crawl:

$ find datasets -type f | wc -l
446234
$ find request_queues -type f | wc -l
447994

but with very few pending requests remaining (<10), Crawlee behaves strangely. It idles for minutes at a time, only outputting the occasional AutoscaledPool metrics with {"currentConcurrency":0,"desiredConcurrency":6}, yet none of the limits are reached:

"isSystemIdle":true
"memInfo":{"isOverloaded":false,"limitRatio":0.2,"actualRatio":0}
"eventLoopInfo":{"isOverloaded":false,"limitRatio":0.6,"actualRatio":0.249}
"cpuInfo":{"isOverloaded":false,"limitRatio":0.4,"actualRatio":0}
"clientInfo":{"isOverloaded":false,"limitRatio":0.3,"actualRatio":0}}}

After a while it finally processes the few requests that are actually pending, then quickly goes back to idling, and the cycle repeats, so progress is very slow. None of the requests are failing, and the per-minute limit is not reached either:

"requestsFinishedPerMinute":125  // < 200

What I have observed that might help debug this:

  • While actually making requests, the crawlee node subprocesses use a reasonable amount of CPU.
  • While doing nothing in between, however, one of the crawlee node subprocesses sits at 200% (2 full cores) and another at 100% (1 full core) for the entire idle duration.

The behaviour is the same whether I run it under tsx or compile it first with tsc.

Any idea why this is happening? How can I configure the crawler so that concurrency does not collapse? minConcurrency seems to be ignored or overridden by some internal mechanism, or, more likely given the CPU usage, by some very CPU-intensive processing that is O(n²) or worse, where n is the number of queued requests including the finished ones, making Crawlee slower and slower as the scrape progresses. Thanks!

Per my follow-up comment below, this happens because each iteration (once it is done with the pending requests) scans the entire request_queues directory, which involves creating the lock and reading each file and/or writing it back. In a large crawl like mine (450k requests), that is roughly 1M disk operations just to collect the 1 to 10 newly queued requests, which completely defeats the concurrency.

I would suggest keeping at least the uniqueKeys of done (non-failed, non-pending) requests in an in-memory hashmap so the directory does not have to be scanned over and over; those requests generally make up the vast majority of the queue.
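Roughly what I have in mind (just a sketch of the idea, not a patch against Crawlee's actual storage internals; the class and method names below are made up):

// Sketch of the suggestion: keep the uniqueKeys of already-handled requests
// in memory so that deduplication is an O(1) Set lookup instead of a scan
// over every JSON file in request_queues. Names are illustrative only.
class HandledKeyCache {
  private readonly handled = new Set<string>();

  // Record a request as done once it has been processed successfully.
  markHandled(uniqueKey: string): void {
    this.handled.add(uniqueKey);
  }

  // O(1) membership test; no disk access, no lock files.
  isHandled(uniqueKey: string): boolean {
    return this.handled.has(uniqueKey);
  }
}

Storing ~450k short keys this way costs only a few tens of MB of heap.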

Code sample

No response

Package version

crawlee@3.8.2

Node.js version

v20.11.1

Operating system

No response

Apify platform

  • Tick me if you encountered this issue on the Apify platform

I have tested this on the next release

No response

Other context

No response

@zopieux zopieux added the bug Something isn't working. label Apr 7, 2024

zopieux commented Apr 7, 2024

Uhh, I think I got it. On a whim I decided to monitor file accesses to the request_queues directory, and it turns out Crawlee does a full open/read/write of every single JSON file in there, as well as its .lock, including the ones that are already done. That's 447,994 × N file operations per cycle, and each cycle gets shorter and shorter as we reach deeper pages with only a few new links discovered per cycle.

N is some internal number around 2, meaning it takes a good million file operations before Crawlee can start the next iteration at my scale. I moved request_queues to a ramfs, but it barely helps; it only spares my SSD some wear, I guess :-)
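For anyone wanting to reproduce the observation, something like this rough Node watcher is enough to see the churn (a sketch only; the path assumes Crawlee's default local storage layout, adjust it if CRAWLEE_STORAGE_DIR points elsewhere):

// Count file-system events in the request queue directory to make the
// per-cycle rescanning visible during the "idle" phases.
import { watch } from 'node:fs';

let events = 0;
watch('./storage/request_queues/default', () => {
  events += 1;
});

// Report how many events fired in each one-second window.
setInterval(() => {
  console.log(`request_queues events in the last second: ${events}`);
  events = 0;
}, 1000);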

I see that a V2 request queue is slated for a future release. Will this new request queue move the done requests to a less I/O-intensive place, e.g. an in-memory hashmap (set) of the uniqueKeys? Storing 400k keys in a hashmap is peanuts, and it would help tremendously with performance (and disk wear!).

@zopieux zopieux changed the title PlaywrightCrawler grinding to a halt (concurrency: 0) while using 3 CPU cores, ignoring minConcurrency The request queue scans all 450k (99.999% of which are done towards the end) requests for each iteration Apr 7, 2024

zopieux commented Apr 7, 2024

I was hoping I could solve this with:

import { MemoryStorage } from '@crawlee/memory-storage';
import { PlaywrightCrawler, RequestQueue } from 'crawlee';

const memoryStorage = new MemoryStorage({ persistStorage: true, writeMetadata: false });
const requestQueue = await RequestQueue.open(null, { storageClient: memoryStorage });
const crawler = new PlaywrightCrawler({
  requestQueue,
});

but sadly persistStorage: true does not mean what I thought ("read everything at init, save everything at exit"); instead, it still does a full scan of the persistence directory on every cycle.

But at this stage of my scrape (towards the very end) I obviously cannot afford to start from scratch and never persist the state: the list of already-scraped URLs is very important.
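One stopgap that might work (a sketch only, untested; the record key and helper functions below are made up) would be to keep the handled uniqueKeys in a plain in-memory Set, persist that Set through a KeyValueStore record, and consult it before enqueueing:

// Possible workaround sketch, not verified: track handled uniqueKeys in
// memory and persist them as a single KeyValueStore record, so dedup checks
// never touch the request_queues directory.
import { KeyValueStore } from 'crawlee';

const store = await KeyValueStore.open();
const handled = new Set<string>(
  (await store.getValue<string[]>('HANDLED_UNIQUE_KEYS')) ?? [],
);

// Check before enqueueing a newly discovered URL.
export const alreadyHandled = (uniqueKey: string): boolean => handled.has(uniqueKey);

// Call from the request handler after a request finishes successfully.
export const markHandled = (uniqueKey: string): void => {
  handled.add(uniqueKey);
};

// Flush the set every 30 seconds so a crash loses at most a little progress.
setInterval(() => {
  void store.setValue('HANDLED_UNIQUE_KEYS', [...handled]);
}, 30_000);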

@mtrunkat mtrunkat added t-tooling Issues with this label are in the ownership of the tooling team. t-console Issues with this label are in the ownership of the console team. t-c&c Team covering store and finance matters. labels Apr 10, 2024