12x times faster next_request()/spider_idle() logic #74

Open
77cc33 opened this issue Sep 19, 2016 · 7 comments

77cc33 commented Sep 19, 2016

Hello,

Here is a much faster way to fetch URLs from Redis, as it doesn't wait for the spider to go idle after each batch.

First, some benchmarks. Let's crawl links directly from a file with this simple spider:

import scrapy


class FileLinksSpider(scrapy.Spider):
    name = "file_links"
    allowed_domains = ["localhost"]

    def start_requests(self):
        for url in open('links.txt').readlines():
            url = url.strip()
            if url:
                yield scrapy.Request(url)

    def parse(self, response):
        # Minimal parsing work, just enough to touch the response.
        response.xpath('//b/text()').extract_first()

I got these results on my machine:

Crawled 10588 pages (at 10588 pages/min), scraped 0 items (at 0 items/min)
Crawled 21189 pages (at 10601 pages/min), scraped 0 items (at 0 items/min)

Now let's run a simple Redis spider with the same URLs imported into Redis, with REDIS_START_URLS_BATCH_SIZE = 16:

from scrapy_redis.spiders import RedisSpider


class RedisLinks(RedisSpider):
    name = 'redis_links'

    def parse(self, response):
        response.xpath('//b/text()').extract_first()
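
For reference, the URLs can be pushed into Redis with a small helper like this (a sketch using redis-py; the key assumes scrapy-redis's default '<spider name>:start_urls' pattern, so 'redis_links:start_urls' here):

# Sketch: seed the scrapy-redis start-urls list from links.txt.
# Assumes a local Redis instance and the default list-based start-urls key.
import redis

r = redis.StrictRedis(host='localhost', port=6379)
with open('links.txt') as f:
    for url in f:
        url = url.strip()
        if url:
            r.lpush('redis_links:start_urls', url)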

Here are the benchmark results:

Crawled 823 pages (at 823 pages/min), scraped 0 items (at 0 items/min)
Crawled 1625 pages (at 802 pages/min), scraped 0 items (at 0 items/min)

Now let's test my updated next_requests() function:

from twisted.internet import reactor
from twisted.internet.defer import Deferred


def sleep(secs):
    """Return a Deferred that fires after `secs` seconds."""
    d = Deferred()
    reactor.callLater(secs, d.callback, None)
    return d


class RedisMixin(object):

    # ... the rest of the mixin is unchanged ...

    def next_requests(self):
        """Returns a request to be scheduled or none."""
        pool_size = self.settings.getint('CONCURRENT_REQUESTS') + self.redis_batch_size
        use_set = self.settings.getbool('REDIS_START_URLS_AS_SET')

        fetch_one = self.server.spop if use_set else self.server.lpop

        urls_in_redis = self.server.scard if use_set else self.server.llen
        urls_in_work = lambda: (len(self.crawler.engine.slot.scheduler)
                                + len(self.crawler.engine.slot.inprogress))

        while urls_in_redis(self.redis_key):

            found = 0
            next_urls_cnt = pool_size - urls_in_work()

            for _ in range(next_urls_cnt):
                data = fetch_one(self.redis_key)
                if not data:
                    # Queue empty.
                    break
                req = self.make_request_from_data(data)
                if req:
                    yield req
                    found += 1
                else:
                    self.logger.debug("Request not made from data: %r", data)

            if found:
                self.logger.debug("Read %s requests from '%s'", found, self.redis_key)

            # Short pause between batches (the returned Deferred is not awaited here).
            sleep(0.01)

Crawled 11060 pages (at 11060 pages/min), scraped 0 items (at 0 items/min)
Crawled 22156 pages (at 11096 pages/min), scraped 0 items (at 0 items/min)

I also want to add that we could probably skip '+ self.redis_batch_size' and use just

pool_size = self.settings.getint('CONCURRENT_REQUESTS')

This way we wouldn't use any resources in the Scrapy scheduler, as all URLs would go into the in-progress queue from the start, but I haven't checked the Scrapy internals enough to be sure about that. We could also shorten this line:

urls_in_work = lambda: len(self.crawler.engine.slot.scheduler) + len(self.crawler.engine.slot.inprogress)

to

urls_in_work = lambda: len(self.crawler.engine.slot.inprogress)

since len() on the scheduler actually takes some time; it's not just returning the length of a simple variable.
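
A minimal sketch of what that simplified variant could look like, reusing the sleep() helper above (untested; the assumption that the in-progress count alone is enough to bound the pool is exactly the part I haven't verified):

    def next_requests(self):
        """Simplified variant: bound the pool by CONCURRENT_REQUESTS only."""
        pool_size = self.settings.getint('CONCURRENT_REQUESTS')
        use_set = self.settings.getbool('REDIS_START_URLS_AS_SET')

        fetch_one = self.server.spop if use_set else self.server.lpop
        urls_in_redis = self.server.scard if use_set else self.server.llen
        # Count only requests the downloader already has in progress.
        urls_in_work = lambda: len(self.crawler.engine.slot.inprogress)

        while urls_in_redis(self.redis_key):
            for _ in range(pool_size - urls_in_work()):
                data = fetch_one(self.redis_key)
                if not data:
                    # Queue empty.
                    break
                req = self.make_request_from_data(data)
                if req:
                    yield req
                else:
                    self.logger.debug("Request not made from data: %r", data)
            sleep(0.01)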

I hope this code, or at least its idea, makes it into master.

@rmax rmax added the feature label Dec 17, 2016
@sandeepsingh

Hi, thanks for this code, but after changing next_requests to this, my spiders still go idle and get no requests at all from the Redis start_urls list. They keep reporting (0 pages/min) for hours and eventually crash. Please help me figure out how to solve this, thanks.

ghost commented Jan 4, 2017

@77cc33 Excellent idea! Thanks.

77cc33 commented Feb 9, 2017

@sandeepsingh you need to make sure that you also add the global sleep function:

def sleep(secs):
    d = Deferred()
    reactor.callLater(secs, d.callback, None)
    return d

@rmax rmax self-assigned this Mar 2, 2017

kazuar commented May 16, 2017

@77cc33 I'm curious how you were able to get to 10588 pages/min.
I'm trying to crawl about 10K different domains and was only able to reach 1800 pages/min.

Any pointers or hints on how to configure scrapy / scrapy-redis to maximize pages per minute while still keeping politeness?


77cc33 commented May 16, 2017

Those were tests against localhost, so you're right, it's a little bit synthetic.

In your case, with tons of different domains, just split your domain list into chunks and start a new Scrapy spider process for each chunk. In Scrapy the most CPU-consuming task is usually page parsing, so to get better speed you need to spread that work across different CPU cores. With scrapy-redis you just start more copies of the same spider in parallel and you'll get N * 1800 pages/min, where N is the number of processes you start. Usually it's good to keep N equal to or below the number of CPU cores on your server.
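
A rough sketch of that setup (assuming all processes read from the same Redis start-urls key, so the URL list is effectively split between them; the spider name and process count are illustrative):

# Sketch: start one scrapy process per CPU core; each spider pops URLs
# from the same Redis key, so the work is divided automatically.
import multiprocessing
import subprocess

def main():
    n = multiprocessing.cpu_count()
    procs = [subprocess.Popen(['scrapy', 'crawl', 'redis_links'])
             for _ in range(n)]
    for p in procs:
        p.wait()

if __name__ == '__main__':
    main()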


kazuar commented May 17, 2017

Got it, thanks for the answer and tips!

@gameboyring

@sandeepsingh you need to make sure that you also add the global sleep function:

def sleep(secs):
    d = Deferred()
    reactor.callLater(secs, d.callback, None)
    return d
Running this, I get:

    d = Deferred()
NameError: name 'Deferred' is not defined

Where does Deferred() come from?
