How does the CrawlSpider work? #284

Open
Chris8080 opened this issue Jun 26, 2023 · 0 comments

Comments

Description

Hello,

I'm trying to figure out how this actually works.
So far, I've connected my spider to Redis with three test domains.
When I start the spider, I can see the first hit to each of the websites.

What I don't understand now is:
How are the URLs that the LinkExtractor finds fed back into Redis?

And I assume my crawler is being "stopped" at:
`domain = kwargs.pop('domain', '')`
kwargs is always an empty dict. Where is it supposed to come from?

It seems like I'm initializing self.allowed_domains with an empty list of domains, so the crawler can't start.
How to do it right?
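
For context, my understanding (possibly wrong) is that spider kwargs are normally supplied with `-a` on the command line (e.g. `scrapy crawl redis_my_crawler -a domain=example.com`) or programmatically when scheduling the crawl, roughly like this (domains here are placeholders, not my real test domains):

```python
# Rough sketch of how I think the domain kwarg would normally be passed in.
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
# Extra keyword arguments to crawl() are forwarded to the spider's __init__.
process.crawl(MyCrawlerSpider, domain="example.com,example.org")
process.start()
```

If I start the spider without anything like that, kwargs would of course stay empty, which matches what I'm seeing. Here is my current spider: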

```python
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule
from scrapy_redis.spiders import RedisCrawlSpider


class MyCrawlerSpider(RedisCrawlSpider):
    """Spider that reads urls from a redis queue (mycrawler:start_urls)."""
    name = "redis_my_crawler"

    redis_key = 'mycrawler:start_urls'

    rules = (
        Rule(LinkExtractor(), follow=True, process_links="filter_links"),
        Rule(LinkExtractor(), callback='parse_page', follow=True, process_links="filter_links"),
    )

    def __init__(self, *args, **kwargs):
        # Dynamically define the allowed domains list.
        print('Init')
        print(args)
        print(kwargs)
        domain = kwargs.pop('domain', '')
        print(domain)
        # Use a list, not a bare filter object: a filter iterator would be
        # exhausted after its first use in filter_links().
        self.allowed_domains = list(filter(None, domain.split(',')))
        print(self.allowed_domains)
        super(MyCrawlerSpider, self).__init__(*args, **kwargs)

    def filter_links(self, links):
        # Note the trailing comma: ('news') would just be the string 'news',
        # and any(...) would then iterate over its single characters.
        allowed_strings = ('news',)
        allowed_links = []
        for link in links:
            if (any(s in link.url.lower() for s in allowed_strings)
                    and any(domain in link.url for domain in self.allowed_domains)):
                print(link)
                allowed_links.append(link)

        return allowed_links

    def parse_page(self, response):
        print(response.url)
        return None
```
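
For completeness, this is roughly how I seed the queue with my three test domains (host, port and URLs are placeholders for my actual setup):

```python
# Push start URLs into the Redis list that redis_key above points at.
import redis

r = redis.Redis(host="localhost", port=6379)
for url in ("https://example.com", "https://example.org", "https://example.net"):
    r.lpush("mycrawler:start_urls", url)
```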
