12x times faster next_request()/spider_idle() logic #74

Open
77cc33 opened this issue Sep 19, 2016 · 7 comments

77cc33 commented Sep 19, 2016

Hello,

Here is a much faster way to fetch URLs from Redis, as it doesn't wait for the spider to go idle after each batch.

First, some benchmarks. Let's crawl links directly from a file with this simple spider:

import scrapy


class FileLinksSpider(scrapy.Spider):
    name = "file_links"
    allowed_domains = ["localhost"]

    def start_requests(self):
        for url in open('links.txt').readlines():
            url = url.strip()
            if url:
                yield scrapy.Request(url)

    def parse(self, response):
        # Minimal parsing work, just enough to touch the response.
        response.xpath('//b/text()').extract_first()

I got these results on my machine:

Crawled 10588 pages (at 10588 pages/min), scraped 0 items (at 0 items/min)
Crawled 21189 pages (at 10601 pages/min), scraped 0 items (at 0 items/min)

Now let's run a simple Redis spider with the same URLs imported into Redis, with REDIS_START_URLS_BATCH_SIZE = 16:

from scrapy_redis.spiders import RedisSpider


class RedisLinks(RedisSpider):
    name = 'redis_links'

    def parse(self, response):
        response.xpath('//b/text()').extract_first()
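
For reference, the URLs can be pushed into Redis with a small helper like this (a sketch using redis-py; the key assumes scrapy-redis's default '<spider name>:start_urls' pattern, so 'redis_links:start_urls' here):

# Sketch: seed the scrapy-redis start-urls list from links.txt.
# Assumes a local Redis instance and the default list-based start-urls key.
import redis

r = redis.StrictRedis(host='localhost', port=6379)
with open('links.txt') as f:
    for url in f:
        url = url.strip()
        if url:
            r.lpush('redis_links:start_urls', url)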

Here are the benchmark results:

Crawled 823 pages (at 823 pages/min), scraped 0 items (at 0 items/min)
Crawled 1625 pages (at 802 pages/min), scraped 0 items (at 0 items/min)

Now let's test my updated next_requests() function:

from twisted.internet import reactor
from twisted.internet.defer import Deferred


def sleep(secs):
    """Return a Deferred that fires after `secs` seconds."""
    d = Deferred()
    reactor.callLater(secs, d.callback, None)
    return d


class RedisMixin(object):

    # ... the rest of the mixin is unchanged ...

    def next_requests(self):
        """Returns a request to be scheduled or none."""
        pool_size = self.settings.getint('CONCURRENT_REQUESTS') + self.redis_batch_size
        use_set = self.settings.getbool('REDIS_START_URLS_AS_SET')

        fetch_one = self.server.spop if use_set else self.server.lpop

        urls_in_redis = self.server.scard if use_set else self.server.llen
        urls_in_work = lambda: (len(self.crawler.engine.slot.scheduler)
                                + len(self.crawler.engine.slot.inprogress))

        while urls_in_redis(self.redis_key):

            found = 0
            next_urls_cnt = pool_size - urls_in_work()

            for _ in range(next_urls_cnt):
                data = fetch_one(self.redis_key)
                if not data:
                    # Queue empty.
                    break
                req = self.make_request_from_data(data)
                if req:
                    yield req
                    found += 1
                else:
                    self.logger.debug("Request not made from data: %r", data)

            if found:
                self.logger.debug("Read %s requests from '%s'", found, self.redis_key)

            # Short pause between batches (the returned Deferred is not awaited here).
            sleep(0.01)

Crawled 11060 pages (at 11060 pages/min), scraped 0 items (at 0 items/min)
Crawled 22156 pages (at 11096 pages/min), scraped 0 items (at 0 items/min)

I also want to add that we could probably skip '+ self.redis_batch_size' and use just

pool_size = self.settings.getint('CONCURRENT_REQUESTS')

This way we wouldn't use any resources in the Scrapy scheduler, as all URLs would go into the in-progress queue from the start, but I haven't checked the Scrapy internals enough to be sure about that. We could also shorten this line:

urls_in_work = lambda: len(self.crawler.engine.slot.scheduler) + len(self.crawler.engine.slot.inprogress)

to

urls_in_work = lambda: len(self.crawler.engine.slot.inprogress)

since len() on the scheduler actually takes some time; it's not just returning the length of a simple variable.
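
A minimal sketch of what that simplified variant could look like, reusing the sleep() helper above (untested; the assumption that the in-progress count alone is enough to bound the pool is exactly the part I haven't verified):

    def next_requests(self):
        """Simplified variant: bound the pool by CONCURRENT_REQUESTS only."""
        pool_size = self.settings.getint('CONCURRENT_REQUESTS')
        use_set = self.settings.getbool('REDIS_START_URLS_AS_SET')

        fetch_one = self.server.spop if use_set else self.server.lpop
        urls_in_redis = self.server.scard if use_set else self.server.llen
        # Count only requests the downloader already has in progress.
        urls_in_work = lambda: len(self.crawler.engine.slot.inprogress)

        while urls_in_redis(self.redis_key):
            for _ in range(pool_size - urls_in_work()):
                data = fetch_one(self.redis_key)
                if not data:
                    # Queue empty.
                    break
                req = self.make_request_from_data(data)
                if req:
                    yield req
                else:
                    self.logger.debug("Request not made from data: %r", data)
            sleep(0.01)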

I hope this code, or at least its idea, makes it into master.

@rmax rmax added the feature label Dec 17, 2016
@sandeepsingh

Hi, thanks for this code, but after changing next_requests to this, my spiders still go idle and get no requests at all from the Redis start_urls list. They keep reporting (0 pages/min) for hours and eventually crash. Please help me figure out how to solve this, thanks.

ghost commented Jan 4, 2017

@77cc33 Excellent idea! Thanks.

77cc33 commented Feb 9, 2017

@sandeepsingh you need to make sure that you also add the global sleep function:

def sleep(secs):
    d = Deferred()
    reactor.callLater(secs, d.callback, None)
    return d

@rmax rmax self-assigned this Mar 2, 2017

kazuar commented May 16, 2017

@77cc33 I'm curious how you were able to get to 10588 pages/min.
I'm trying to crawl about 10K different domains and was only able to reach 1800 pages/min.

Any pointers or hints on how to configure scrapy / scrapy-redis to maximize pages per minute while still keeping politeness?


77cc33 commented May 16, 2017

Those were tests against localhost, so you're right, it's a little bit synthetic.

In your case, with tons of different domains, just split your domain list into chunks and start a new Scrapy spider process for each chunk. In Scrapy the most CPU-consuming task is usually page parsing, so to get better speed you need to spread that work across different CPU cores. With scrapy-redis you just start more copies of the same spider in parallel and you'll get N * 1800 pages/min, where N is the number of processes you start. Usually it's good to keep N equal to or below the number of CPU cores on your server.
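
A rough sketch of that setup (assuming all processes read from the same Redis start-urls key, so the URL list is effectively split between them; the spider name and process count are illustrative):

# Sketch: start one scrapy process per CPU core; each spider pops URLs
# from the same Redis key, so the work is divided automatically.
import multiprocessing
import subprocess

def main():
    n = multiprocessing.cpu_count()
    procs = [subprocess.Popen(['scrapy', 'crawl', 'redis_links'])
             for _ in range(n)]
    for p in procs:
        p.wait()

if __name__ == '__main__':
    main()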


kazuar commented May 17, 2017

Got it, thanks for the answer and tips!

@gameboyring

@sandeepsingh you need to make sure that you also add the global sleep function:

def sleep(secs):
    d = Deferred()
    reactor.callLater(secs, d.callback, None)
    return d
Running this, I get:

    d = Deferred()
NameError: name 'Deferred' is not defined

Where does Deferred() come from?
