
When does the crawler stop #182

Open
forgeries opened this issue Dec 25, 2020 · 4 comments

Comments

@forgeries

When does the crawler stop?

@mkdir700

This is my code; it is an extension.

from scrapy.exceptions import NotConfigured
from twisted.internet import task
from scrapy import signals


class AutoCloseSpider(object):
    """
    scrapy_redis扩展插件
    
    Parameters
    ----------
    CLOSE_SPIDER_INTERVAL : float
    
    ZERO_THRESHOLD : int
    """
    
    def __init__(self, crawler, stats, interval=60.0, threshold=3):
        self.crawler = crawler
        self.stats = stats
        self.interval = interval
        self.threshold = threshold
        self.task = None
    
    @classmethod
    def from_crawler(cls, crawler):
        interval = crawler.settings.getfloat('CLOSE_SPIDER_INTERVAL')
        threshold = crawler.settings.getint('ZERO_THRESHOLD')
        
        if not interval and not threshold:
            raise NotConfigured
        
        stats = crawler.stats
        o = cls(crawler, stats, interval, threshold)
        crawler.signals.connect(o.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(o.spider_closed, signal=signals.spider_closed)
        return o
    
    def spider_opened(self, spider):
        # Request count observed at the previous check
        self.request_count_prev = 0
        # Consecutive checks with no new requests (starts at -1 to allow one
        # grace interval after startup)
        self.zero_count = -1
        self.task = task.LoopingCall(self.increment, spider)
        self.task.start(self.interval)
    
    def increment(self, spider):
        # Total number of requests the downloader has made so far
        request_count = self.stats.get_value('downloader/request_count', 0)
        # Requests made since the previous check
        inc = request_count - self.request_count_prev
        # Remember the current total for the next check (without this the
        # increment would never drop back to zero once crawling has started)
        self.request_count_prev = request_count
        
        if inc == 0:
            self.zero_count += 1
        else:
            self.zero_count = 0
        
        # If there were no new requests for `threshold` consecutive checks,
        # close the spider.
        if self.zero_count >= self.threshold:
            self.crawler.engine.close_spider(spider, 'closespider_zerocount')
    
    def spider_closed(self, spider, reason):
        if self.task and self.task.running:
            self.task.stop()
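
For reference, an extension like this is normally enabled through the EXTENSIONS setting. A minimal sketch, assuming the class above lives in a hypothetical myproject/extensions.py module:

# settings.py -- enable the extension (the module path is an assumption)
EXTENSIONS = {
    'myproject.extensions.AutoCloseSpider': 500,
}

# Settings read by AutoCloseSpider.from_crawler()
CLOSE_SPIDER_INTERVAL = 60.0   # seconds between checks of downloader/request_count
ZERO_THRESHOLD = 3             # consecutive zero-increment checks before closing

The reason string passed to close_spider ('closespider_zerocount') then shows up as finish_reason in the crawl stats.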

@liuyuer

liuyuer commented Apr 20, 2021

> This is my code; it is an extension.
>
> (extension code quoted verbatim from the comment above)
How do I restart it after it has stopped, if I want to work on a new target?

@rmax
Owner

rmax commented Apr 20, 2021

@liuyuer could you expand on your use case?

I like to recycle processes so memory doesn't pile up over time. You could make your crawler close after being idle for some time or after reaching a certain threshold (e.g. domains scraped, memory usage, etc.) and have an external process that monitors that at least X crawlers are running.
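
As a rough illustration of that last part (purely a sketch, with a hypothetical spider name and process count), an external supervisor that keeps X crawl processes alive could look like this:

# supervisor.py -- keep NUM_CRAWLERS `scrapy crawl` processes running and start
# a fresh one whenever a process exits (idle-closed, recycled, crashed, ...).
import subprocess
import time

NUM_CRAWLERS = 2                              # assumed number of parallel crawlers
CRAWL_CMD = ['scrapy', 'crawl', 'myspider']   # hypothetical spider name


def main():
    procs = [subprocess.Popen(CRAWL_CMD) for _ in range(NUM_CRAWLERS)]
    while True:
        for i, proc in enumerate(procs):
            if proc.poll() is not None:                 # the process has exited
                procs[i] = subprocess.Popen(CRAWL_CMD)  # recycle it with a fresh process
        time.sleep(10)


if __name__ == '__main__':
    main()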

@liuyuer

liuyuer commented Apr 20, 2021

> @liuyuer could you expand on your use case?
>
> I like to recycle processes so memory doesn't pile up over time. You could make your crawler close after being idle for some time or after reaching a certain threshold (e.g. domains scraped, memory usage, etc.) and have an external process that monitors that at least X crawlers are running.

My use case is:

  1. The crawler runs as a service; when it reaches a threshold, it can stop itself with self.crawler.engine.close_spider.
  2. The crawler should restart when it receives a new target to work on.

My problem was:

  1. The crawler could not restart after it was stopped by self.crawler.engine.close_spider.
  2. I need to clean up the redis keys so that the results do not get mixed up (see the sketch after this comment).

What I did:

  1. I am using Scrapydo to take care of the new process, so I can restart Scrapy (not scrapy-redis) in a new process.

@rmax I am not sure if that is the correct way to handle it. I also worry about the memory issue. If you could share more about how you recycle processes and clean up, that would be very helpful.
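
For the redis cleanup mentioned above, a minimal sketch, assuming scrapy-redis's default key patterns and a hypothetical spider name and redis URL, would be to delete the spider's keys and then push the new target onto the start_urls list before starting a fresh run:

# reset_spider.py -- clear scrapy-redis state and queue a new target.
# Key patterns assume the scrapy-redis defaults ('<spider>:requests',
# '<spider>:dupefilter', '<spider>:start_urls'); names below are assumptions.
import redis

SPIDER_NAME = 'myspider'
REDIS_URL = 'redis://localhost:6379/0'


def reset_spider(new_target_url):
    r = redis.from_url(REDIS_URL)
    # Drop leftover requests and the dupefilter so old results don't mix in.
    r.delete(f'{SPIDER_NAME}:requests',
             f'{SPIDER_NAME}:dupefilter',
             f'{SPIDER_NAME}:start_urls')
    # Queue the new target; a freshly started scrapy-redis spider will pick it up.
    r.lpush(f'{SPIDER_NAME}:start_urls', new_target_url)


if __name__ == '__main__':
    reset_spider('https://example.com/')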
