How do I correctly configure crawlers to run simultaneously? #1240
Unanswered · windbridges asked this question in Q&A
Replies: 1 comment
Yes, all of them will write to the same file without taking any lock or using a semaphore. You can solve this by creating different key names for the state, by changing the |
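A minimal sketch of the suggested fix, assuming the session pool's `persistStateKey` option (whose default is `SDK_SESSION_POOL_STATE`) is the setting being referred to: give each crawler a distinct key so their persisted state files in the default key-value store no longer collide. The helper name `sessionPoolOptionsFor` is hypothetical.

```javascript
// Hedged sketch: derive a unique session-pool state key per crawler so that
// concurrently running crawlers persist their state to separate JSON files
// instead of all writing SDK_SESSION_POOL_STATE.json.
function sessionPoolOptionsFor(crawlerName) {
    return {
        // Suffixing the (assumed) default key with the crawler's name keeps
        // each pool's persisted state separate in the key-value store.
        persistStateKey: `SDK_SESSION_POOL_STATE_${crawlerName}`,
    };
}

// Possible usage with an Apify SDK crawler (shape of the options object is
// an assumption; check the SessionPool docs for your SDK version):
// const crawler = new Apify.CheerioCrawler({
//     useSessionPool: true,
//     sessionPoolOptions: sessionPoolOptionsFor('crawlerA'),
//     // ...other crawler options
// });
```

With distinct keys, each crawler reads and writes only its own state file, so a half-written file from one crawler can no longer corrupt another's state.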
-
I'm using the Apify SDK inside a console application that runs multiple crawlers simultaneously in the same process. At first glance everything works fine, but it sometimes crashes with errors like:
Unexpected token { in JSON at position 430641
or
Unexpected end of JSON input
I guess the problem is that all crawlers store their session pool state in the same storage ([storage]/key_value_stores/default/SDK_SESSION_POOL_STATE.json), because I don't change their default settings. Due to the simultaneous access, the writes to the file sometimes overlap, the format integrity is broken, and the next read fails. The error then recurs if I rerun the application. If I manually delete this JSON file, the error disappears, so this file is the cause.
Can you please suggest ways to solve this problem?