Native support for X-Crawlera-Session #27

Open

dchaplinsky opened this issue Oct 29, 2016 · 11 comments

dchaplinsky commented Oct 29, 2016

It'd be great if the plugin could be configured to use/re-use Crawlera's session mechanism.

Because managing it in spiders like this:

        if 'X-Crawlera-Session' in response.headers and response.headers['X-Crawlera-Session'] != self.session_id:
            self.session_id = response.headers['X-Crawlera-Session']
            logger.debug(
                "Got new session id from Crawlera: {}".format(self.session_id))

is a little bit ugly

eLRuLL (Member) commented Oct 29, 2016

You mean so you don't have to include X-Crawlera-Session yourself in the next request? The example you are showing doesn't seem to do that; it is rather a failsafe for when you lose the Crawlera session for reasons external to Scrapy (but internal to Crawlera itself).

On the other hand, I don't think Crawlera returns a different session by itself, so I can't think of a way this could happen. Could you maybe go deeper into your example?

dchaplinsky (Author) commented:

Here is how I do it:

import scrapy


class FoobarSpider(scrapy.Spider):
    name = "foobar"

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Ask Crawlera to create a new session on the first request.
        self.session_id = "create"

    def parse(self, response):
        # Remember the session id that Crawlera assigned to us.
        if 'X-Crawlera-Session' in response.headers:
            self.session_id = response.headers['X-Crawlera-Session']

        new_url = ...

        # Re-send the session id so the next request goes through the same node.
        yield scrapy.Request(
            new_url,
            headers={
                'X-Crawlera-Max-Retries': 5,
                'X-Crawlera-Session': self.session_id,
            }
        )

Also, I need to watch whether a request returned 503 (which in my case means the server banned me) and then set self.session_id = "create" again.

Instead of that, I'd like to do something like:
CRAWLERA_ENABLED = True
CRAWLERA_SESSIONS = True
CRAWLERA_RESTART_SESSION_ON = [503]
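
For illustration, here is a rough sketch of what such a middleware could look like. Nothing here exists in scrapy-crawlera; the settings names above and everything in the sketch are hypothetical.

# Hypothetical sketch only: neither this middleware nor the settings it reads
# exist in scrapy-crawlera; they illustrate the behaviour proposed above.
from scrapy.exceptions import NotConfigured


class CrawleraSessionMiddleware:
    def __init__(self, restart_on):
        self.restart_on = restart_on
        self.session_id = b"create"  # ask Crawlera for a fresh session first

    @classmethod
    def from_crawler(cls, crawler):
        if not crawler.settings.getbool("CRAWLERA_SESSIONS"):
            raise NotConfigured
        restart_on = crawler.settings.getlist("CRAWLERA_RESTART_SESSION_ON") or [503]
        return cls([int(code) for code in restart_on])

    def process_request(self, request, spider):
        # Attach the current session id to every outgoing request.
        request.headers["X-Crawlera-Session"] = self.session_id

    def process_response(self, request, response, spider):
        if response.status in self.restart_on:
            # The session looks banned; ask for a new one on the next request.
            self.session_id = b"create"
        elif b"X-Crawlera-Session" in response.headers:
            # Remember the session id Crawlera assigned to us.
            self.session_id = response.headers[b"X-Crawlera-Session"]
        return response

It would of course also need to be enabled in DOWNLOADER_MIDDLEWARES alongside the Crawlera middleware; that wiring is omitted here.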

eLRuLL (Member) commented Nov 17, 2016

@scrapy-plugins/core any ideas on enabling this in the middleware? I am not really a big fan of using Crawlera sessions with Scrapy, especially when only dealing with one session.

redapple (Contributor) commented:

@dchaplinsky , @eLRuLL : I believe this can indeed be done at the middleware level, perhaps with the same design as the revamped robotstxt middleware, i.e. returning a Deferred from process_request if robots.txt has not been seen yet for a domain.

Crawlera has a /sessions (POST) endpoint to ask for a new session.
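
For illustration, a rough sketch of asking that endpoint for a session with requests; the host/port and the convention of using the API key as the basic-auth username (with an empty password) are assumptions here, so check the Crawlera docs for the exact details.

# Rough sketch: request a new session id from the /sessions endpoint.
# The host/port and the basic-auth convention below are assumptions.
import requests

CRAWLERA_URL = "http://proxy.crawlera.com:8010"  # assumed default host/port
CRAWLERA_APIKEY = "<your API key>"

response = requests.post(f"{CRAWLERA_URL}/sessions", auth=(CRAWLERA_APIKEY, ""))
response.raise_for_status()
session_id = response.text  # to be sent back as the X-Crawlera-Session header
print("New Crawlera session:", session_id)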

dchaplinsky (Author) commented:

Well, I'm also not a big fan of using sessions, but for some sites it's the only option I have.

eLRuLL (Member) commented Dec 24, 2018

This is a very old issue, but I would like to revisit it to see if we should work on it. I really see two problems here:

  1. Keep passing the Crawlera session header to subsequent requests: I don't think this problem should be handled by this middleware; it would be better to create some kind of StickyHeadersMiddleware that keeps passing any kind of request header to subsequent requests, regardless of whether it is Crawlera-related or not (a rough sketch of that idea follows at the end of this comment).

  2. Create a new session when the previous session was banned: This I see as being very specific to this particular problem, and I don't recommend creating a new Crawlera session once the previous one was banned.

    Crawlera sessions only make sense when we need to crawl a site from one specific proxy, commonly also with the same cookies. This also means that the requests should come one after the other, which makes the spider very slow (for that particular request chain).

    For example, let's say we have a request chain of 5 requests (one request after the other), and the Crawlera session fails on the 3rd request. What should we do at that point? Ask for a new Crawlera session and retry the 3rd request? To me it feels that we would have to start over from the 1st request to have consistent behaviour, because restarting at any other point with a new Crawlera session means we never needed a Crawlera session in the first place.

@scrapy-plugins/core Could someone please share some opinions on this particular problem (mostly point 2)? Thanks.
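
To make point 1 more concrete, here is a rough sketch of what such a StickyHeadersMiddleware could look like as a Scrapy spider middleware; the class and the STICKY_HEADERS setting are hypothetical and do not exist in Scrapy or scrapy-crawlera today.

# Hypothetical sketch of the StickyHeadersMiddleware idea from point 1:
# copy a configurable set of headers from each response onto the requests
# yielded by its callback. Neither the class nor STICKY_HEADERS exist today.
from scrapy import Request


class StickyHeadersMiddleware:
    def __init__(self, sticky_headers):
        self.sticky_headers = sticky_headers

    @classmethod
    def from_crawler(cls, crawler):
        headers = crawler.settings.getlist("STICKY_HEADERS", ["X-Crawlera-Session"])
        return cls(headers)

    def process_spider_output(self, response, result, spider):
        for request_or_item in result:
            if isinstance(request_or_item, Request):
                for header in self.sticky_headers:
                    # Propagate the header unless the new request already sets it.
                    if header in response.headers and header not in request_or_item.headers:
                        request_or_item.headers[header] = response.headers[header]
            yield request_or_item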

immerrr commented Dec 25, 2018

Re: StickyHeadersMiddleware: not all headers should be sticky; there is a set of headers that usual browsers send, so I believe there is a place for some IAmABrowser middleware, where one would generate and store the standard headers to use with the session, bringing the site-facing behaviour as close to browser-like as possible.

Re: slowness: it depends on the behaviour the scraper wants to follow. If it is to mimic user behaviour, then yes, one request should come after the other, but there are websites that run a lot of IP/cookie-sensitive XHR requests: naturally, the scraper is free to run those faster, one after the other or even concurrently, again to follow closely what happens in the real world.

Re: resetting
I had success implementing a sticky Crawlera sessions middleware a while ago while broad-crawling a specific site. It would first send you a rather simple JS challenge, and after you passed it, as long as you kept the IP address, cookies and headers the same, you could do a few hundred requests. Then the site would get suspicious again and send the "session" to a reCAPTCHA, at which point I just reset the session, started over and relied on a sizeable Crawlera pool to cycle nodes.

Re: ordering
When banned, I believe you might want to reset the request sequence back to the latest URL that a user could reasonably access directly, that is, to follow what would happen if a human wanted to click through a list of URLs: continuing mid-way through some cursor-based pagination might not work, but restarting at a URL available via Google or some sort of site-internal catalogue should be fine. So, going back to the broad crawler, I had the crawler restart at the website's index page and continue from there until the next ban.

eLRuLL pinned this issue Dec 25, 2018
brooj095 commented:

@dchaplinsky did you find a solution along the lines of the robotstxt middleware design? Could you post it?

I am looking to develop a solution for the same issue, but I am having trouble passing the session_id flag/variable between the middleware and the spider.
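
For what it's worth, a rough sketch of one way to share the session id: keep it as an attribute on the spider that the downloader middleware reads and writes. All names here are hypothetical.

# Hypothetical sketch: share the session id between the middleware and the
# spider through a plain spider attribute; none of these names are real APIs.
class SessionAwareMiddleware:
    def process_request(self, request, spider):
        # Read the current session id off the spider, if it has one yet.
        session_id = getattr(spider, "session_id", b"create")
        request.headers["X-Crawlera-Session"] = session_id

    def process_response(self, request, response, spider):
        if b"X-Crawlera-Session" in response.headers:
            # Write it back so the spider's callbacks can also see it.
            spider.session_id = response.headers[b"X-Crawlera-Session"]
        return response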

dchaplinsky (Author) commented:

@brooj095 , to be honest, I barely remember the issue I've been working on.

Gallaecio (Contributor) commented:

Going back to @eLRuLL’s comment, I don’t see how we could improve things for point 2, so I think we should instead work on point 1, which is made of two parts:

  1. Add sticky header support to Scrapy. We’ve recently received a pull request for sticky meta keys, and I think a similar implementation for headers would make sense. See scrapy/scrapy#3770 (add sticky meta spider middleware).

  2. Once that is implemented, improve the scrapy-crawlera documentation to suggest taking advantage of that Scrapy feature. Even though there would be no scrapy-crawlera code for it, I think scrapy-crawlera’s documentation is the perfect place to cover this information.

Gallaecio (Contributor) commented:

Should we consider #85 a fix for this?
