
Is there a way to stop spider check duplicate with redis? #242

Open
milkeasd opened this issue Apr 2, 2022 · 6 comments

milkeasd commented Apr 2, 2022

My spider is extremely slow when run with scrapy-redis, because there is a big delay between the slave and the master. I want to reduce the communication to only fetching the start_urls periodically, or when all the start_urls are done. Is there any way to do so?

Moreover, I want to stop the duplicate check to reduce the number of connections.

But I can't change DUPEFILTER_CLASS to the scrapy default one; it raises an error.

Is there any other way to stop the duplicate check?

Or any ideas that could help speed up the process?

Thanks

LuckyPigeon (Collaborator) commented:

@Germey Any ideas?

LuckyPigeon (Collaborator) commented:

@milkeasd Could you provide the related code files?

LuckyPigeon (Collaborator) commented Apr 3, 2022

The way I see it, letting developers customize their communication rules and adding an option to disable DUPEFILTER_CLASS would be two great features. For the communication side, one knob already exists (a minimal sketch below, assuming the spider uses scrapy_redis.spiders.RedisSpider; verify the setting names against your installed scrapy-redis version): start URLs are popped from Redis in batches, and the batch size is configurable, so larger batches mean fewer round-trips.
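```python
# settings.py — a minimal sketch, assuming scrapy_redis's RedisSpider.
# Fewer, larger batches mean fewer Redis round-trips for start URLs.

SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Number of start URLs popped from Redis per fetch. When unset,
# scrapy-redis falls back to the CONCURRENT_REQUESTS value.
REDIS_START_URLS_BATCH_SIZE = 100
```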

LuckyPigeon (Collaborator) commented Apr 8, 2022

Germey (Collaborator) commented Apr 9, 2022

@milkeasd could you please provide your code or some sample code?

sify21 commented Jun 7, 2024

sify21 commented Jun 7, 2024

@LuckyPigeon it doesn't work. Setting DUPEFILTER_CLASS = "scrapy.dupefilters.BaseDupeFilter" raises this error:

```
builtins.AttributeError: type object 'BaseDupeFilter' has no attribute 'from_spider'
```

Maybe scrapy-redis should ship its own BaseDupeFilter that, like its RFPDupeFilter, implements:

```python
def from_spider(cls, spider):
```
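
As a stopgap, a subclass in your own project works too (a minimal sketch; NoOpDupeFilter and the module path are hypothetical names, and it assumes scrapy-redis's scheduler instantiates the dupefilter via from_spider):

```python
# myproject/dupefilters.py (hypothetical path), enabled in settings.py with:
# DUPEFILTER_CLASS = "myproject.dupefilters.NoOpDupeFilter"
from scrapy.dupefilters import BaseDupeFilter


class NoOpDupeFilter(BaseDupeFilter):
    """A dupefilter that never filters, so no Redis lookup per request."""

    @classmethod
    def from_spider(cls, spider):
        # scrapy-redis's Scheduler builds its dupefilter through
        # from_spider(), which scrapy's own BaseDupeFilter lacks.
        return cls()

    def request_seen(self, request):
        # Report every request as unseen; nothing is dropped as a duplicate.
        return False
```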

From scrapy's docs: https://doc.scrapy.org/en/latest/topics/settings.html#dupefilter-class

> You can disable filtering of duplicate requests by setting DUPEFILTER_CLASS to 'scrapy.dupefilters.BaseDupeFilter'. Be very careful about this however, because you can get into crawling loops. It’s usually a better idea to set the dont_filter parameter to True on the specific Request that should not be filtered.
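
For reference, that per-request route looks like this (a minimal sketch; the spider and its parse_item callback are hypothetical). scrapy-redis's scheduler checks dont_filter before consulting the Redis-backed dupefilter, so these requests skip that round-trip:

```python
import scrapy


class ExampleSpider(scrapy.Spider):
    name = "example"

    def parse(self, response):
        for href in response.css("a::attr(href)").getall():
            # dont_filter=True bypasses the dupefilter for this request
            # only, avoiding the per-request Redis lookup.
            yield scrapy.Request(
                response.urljoin(href),
                callback=self.parse_item,
                dont_filter=True,
            )

    def parse_item(self, response):
        ...
```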
