Native support for X-Crawlera-Session #27

Open

dchaplinsky opened this issue Oct 29, 2016 · 11 comments

dchaplinsky commented Oct 29, 2016

It'd be great if the plugin could be configured to use/re-use Crawlera's session mechanism.

Because managing it in spiders like this:

        if 'X-Crawlera-Session' in response.headers and response.headers['X-Crawlera-Session'] != self.session_id:
            self.session_id = response.headers['X-Crawlera-Session']
            logger.debug(
                "Got new session id from Crawlera: {}".format(self.session_id))

is a little bit ugly

eLRuLL (Member) commented Oct 29, 2016

You mean so you don't have to include X-Crawlera-Session yourself in the next request? The example you are showing doesn't seem to do that; it is rather a failsafe for when you lose the Crawlera session for reasons external to Scrapy (but internal to Crawlera itself).

On the other hand, I don't think Crawlera returns a different session by itself, so I can't think of a way this could happen. Could you maybe go deeper into your example?

dchaplinsky (Author) commented:

Here is how I do it:

import scrapy


class FoobarSpider(scrapy.Spider):
    name = "foobar"

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Ask Crawlera to create a new session on the first request.
        self.session_id = "create"

    def parse(self, response):
        # Remember the session id that Crawlera assigned to us.
        if 'X-Crawlera-Session' in response.headers:
            self.session_id = response.headers['X-Crawlera-Session']

        new_url = ...

        # Re-send the session id so the next request goes through the same node.
        yield scrapy.Request(
            new_url,
            headers={
                'X-Crawlera-Max-Retries': 5,
                'X-Crawlera-Session': self.session_id,
            }
        )

Also, I need to watch whether a request returned 503 (which in my case means the server banned me) and then set self.session_id = "create" again.

Instead of that, I'd like to do something like:
CRAWLERA_ENABLED = True
CRAWLERA_SESSIONS = True
CRAWLERA_RESTART_SESSION_ON = [503]
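
For illustration, here is a rough sketch of what such a middleware could look like. Nothing here exists in scrapy-crawlera; the settings names above and everything in the sketch are hypothetical.

# Hypothetical sketch only: neither this middleware nor the settings it reads
# exist in scrapy-crawlera; they illustrate the behaviour proposed above.
from scrapy.exceptions import NotConfigured


class CrawleraSessionMiddleware:
    def __init__(self, restart_on):
        self.restart_on = restart_on
        self.session_id = b"create"  # ask Crawlera for a fresh session first

    @classmethod
    def from_crawler(cls, crawler):
        if not crawler.settings.getbool("CRAWLERA_SESSIONS"):
            raise NotConfigured
        restart_on = crawler.settings.getlist("CRAWLERA_RESTART_SESSION_ON") or [503]
        return cls([int(code) for code in restart_on])

    def process_request(self, request, spider):
        # Attach the current session id to every outgoing request.
        request.headers["X-Crawlera-Session"] = self.session_id

    def process_response(self, request, response, spider):
        if response.status in self.restart_on:
            # The session looks banned; ask for a new one on the next request.
            self.session_id = b"create"
        elif b"X-Crawlera-Session" in response.headers:
            # Remember the session id Crawlera assigned to us.
            self.session_id = response.headers[b"X-Crawlera-Session"]
        return response

It would of course also need to be enabled in DOWNLOADER_MIDDLEWARES alongside the Crawlera middleware; that wiring is omitted here.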

eLRuLL (Member) commented Nov 17, 2016

@scrapy-plugins/core any ideas on enabling this in the middleware? I am not really a big fan of using Crawlera sessions with Scrapy, especially when only dealing with one session.

redapple (Contributor) commented:

@dchaplinsky , @eLRuLL : I believe this can indeed be done at the middleware level, perhaps with the same design as the revamped robotstxt middleware, i.e. returning a Deferred from process_request if robots.txt has not been seen yet for a domain.

Crawlera has a /sessions (POST) endpoint to ask for a new session.
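
For illustration, a rough sketch of asking that endpoint for a session with requests; the host/port and the convention of using the API key as the basic-auth username (with an empty password) are assumptions here, so check the Crawlera docs for the exact details.

# Rough sketch: request a new session id from the /sessions endpoint.
# The host/port and the basic-auth convention below are assumptions.
import requests

CRAWLERA_URL = "http://proxy.crawlera.com:8010"  # assumed default host/port
CRAWLERA_APIKEY = "<your API key>"

response = requests.post(f"{CRAWLERA_URL}/sessions", auth=(CRAWLERA_APIKEY, ""))
response.raise_for_status()
session_id = response.text  # to be sent back as the X-Crawlera-Session header
print("New Crawlera session:", session_id)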

dchaplinsky (Author) commented:

Well, I'm also not a big fan of using sessions, but for some sites it's the only option I have.

eLRuLL (Member) commented Dec 24, 2018

This is a very old issue, but I would like to revisit it to see if we should work on it. I really see two problems here:

  1. Keep passing the Crawlera session header to subsequent requests: I don't think this problem should be handled by this middleware; it would be better to create some kind of StickyHeadersMiddleware that keeps passing any kind of request header to subsequent requests, regardless of whether it is Crawlera-related or not (a rough sketch of that idea follows at the end of this comment).

  2. Create a new session when the previous session was banned: This I see as being very specific to this particular problem, and I don't recommend creating a new Crawlera session once the previous one was banned.

    Crawlera sessions only make sense when we need to crawl a site from one specific proxy, commonly also with the same cookies. This also means that the requests should come one after the other, which makes the spider very slow (for that particular request chain).

    For example, let's say we have a request chain of 5 requests (one request after the other), and the Crawlera session fails on the 3rd request. What should we do at that point? Ask for a new Crawlera session and retry the 3rd request? To me it feels that we would have to start over from the 1st request to have consistent behaviour, because restarting at any other point with a new Crawlera session means we never needed a Crawlera session in the first place.

@scrapy-plugins/core Could someone please share some opinions on this particular problem (mostly point 2)? Thanks.
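
To make point 1 more concrete, here is a rough sketch of what such a StickyHeadersMiddleware could look like as a Scrapy spider middleware; the class and the STICKY_HEADERS setting are hypothetical and do not exist in Scrapy or scrapy-crawlera today.

# Hypothetical sketch of the StickyHeadersMiddleware idea from point 1:
# copy a configurable set of headers from each response onto the requests
# yielded by its callback. Neither the class nor STICKY_HEADERS exist today.
from scrapy import Request


class StickyHeadersMiddleware:
    def __init__(self, sticky_headers):
        self.sticky_headers = sticky_headers

    @classmethod
    def from_crawler(cls, crawler):
        headers = crawler.settings.getlist("STICKY_HEADERS", ["X-Crawlera-Session"])
        return cls(headers)

    def process_spider_output(self, response, result, spider):
        for request_or_item in result:
            if isinstance(request_or_item, Request):
                for header in self.sticky_headers:
                    # Propagate the header unless the new request already sets it.
                    if header in response.headers and header not in request_or_item.headers:
                        request_or_item.headers[header] = response.headers[header]
            yield request_or_item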

immerrr commented Dec 25, 2018

Re: StickyHeadersMiddleware: not all headers should be sticky; there is a set of headers that usual browsers send, so I believe there is a place for some IAmABrowser middleware, where one would generate and store the standard headers to use with the session, bringing the site-facing behaviour as close to browser-like as possible.

Re: slowness: it depends on the behaviour the scraper wants to follow. If it is to mimic user behaviour, then yes, one request should come after the other, but there are websites that run a lot of IP/cookie-sensitive XHR requests: naturally, the scraper is free to run those faster, one after the other or even concurrently, again to follow closely what happens in the real world.

Re: resetting
I had success implementing a sticky Crawlera sessions middleware a while ago while broad-crawling a specific site. It would first send you a rather simple JS challenge, and after you passed it, as long as you kept the IP address, cookies and headers the same, you could do a few hundred requests. Then the site would get suspicious again and send the "session" to a reCAPTCHA, at which point I just reset the session, started over and relied on a sizeable Crawlera pool to cycle nodes.

Re: ordering
When banned, I believe you might want to reset the request sequence back to the latest URL that a user could reasonably access directly, that is, to follow what would happen if a human wanted to click through a list of URLs: continuing mid-way through some cursor-based pagination might not work, but restarting at a URL available via Google or some sort of site-internal catalogue should be fine. So, going back to the broad crawler, I had the crawler restart at the website's index page and continue from there until the next ban.

eLRuLL pinned this issue Dec 25, 2018
brooj095 commented:

@dchaplinsky did you find a solution along the lines of the robotstxt middleware design? Could you post it?

I am looking to develop a solution for the same issue, but I am having trouble passing the session_id flag/variable between the middleware and the spider.
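
For what it's worth, a rough sketch of one way to share the session id: keep it as an attribute on the spider that the downloader middleware reads and writes. All names here are hypothetical.

# Hypothetical sketch: share the session id between the middleware and the
# spider through a plain spider attribute; none of these names are real APIs.
class SessionAwareMiddleware:
    def process_request(self, request, spider):
        # Read the current session id off the spider, if it has one yet.
        session_id = getattr(spider, "session_id", b"create")
        request.headers["X-Crawlera-Session"] = session_id

    def process_response(self, request, response, spider):
        if b"X-Crawlera-Session" in response.headers:
            # Write it back so the spider's callbacks can also see it.
            spider.session_id = response.headers[b"X-Crawlera-Session"]
        return response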

dchaplinsky (Author) commented:

@brooj095 , to be honest, I barely remember the issue I've been working on.

Gallaecio (Contributor) commented:

Going back to @eLRuLL’s comment, I don’t see how we could improve things for point 2, so I think we should instead work on point 1, which is made of two parts:

  1. Add sticky header support to Scrapy. We’ve recently received a pull request for sticky meta keys, and I think a similar implementation for headers would make sense. See scrapy/scrapy#3770 (add sticky meta spider middleware).

  2. Once that is implemented, improve the scrapy-crawlera documentation to suggest taking advantage of that Scrapy feature. Even though there would be no scrapy-crawlera code for it, I think scrapy-crawlera’s documentation is the perfect place to cover this information.

Gallaecio (Contributor) commented:

Should we consider #85 a fix for this?
