Scrapy "session" extension #3258
Comments
This is a great idea, although adding another argument to the parse callback signature should not be done, because you would have to edit the inner workings of Scrapy to allow what you are suggesting; it's not as simple as creating a new Spider class. I've thought about building a set of middlewares to do what you are trying to do. You must use `meta` to implement this, and I don't think there is any other way to do it. In fact, `meta` is so important to Scrapy that a lot of the default middlewares touch the Request/Response `meta` to implement their logic. I think the best approach would be to make a `SessionSpider` with a few extra helper methods that can create `Session` instances that you can later pass on to plain `Request` instances. Something like calling …
@dmsolow thanks for the clear description. But I'm still wondering what the main advantage of the new `SessionSpider` and session concepts is. In my understanding, a session is backed by a cookiejar, and a session variable is just a cookiejar index, right? In our project we have a spider middleware that populates the request's cookiejar index from the response's cookiejar index, which ensures each new request uses the same session as the response it came from.
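A minimal sketch of the kind of spider middleware described above. The name `StickyCookiejarMiddleware` is hypothetical (the actual implementation from that project is not shown in this thread); it simply copies the `cookiejar` meta key from the response's request onto every outgoing request that does not already set one:

```python
# Hypothetical sketch: propagate the "cookiejar" meta key from a response
# to the requests generated by its callback, so follow-up requests stay
# in the same session. Names are assumptions, not an existing Scrapy API.
class StickyCookiejarMiddleware:
    def process_spider_output(self, response, result, spider):
        jar = response.meta.get("cookiejar")
        for item in result:
            # Only touch Request-like objects (things with a meta dict);
            # scraped items pass through unchanged. setdefault preserves
            # an explicitly chosen cookiejar on the new request.
            if jar is not None and hasattr(item, "meta"):
                item.meta.setdefault("cookiejar", jar)
            yield item
```

This would be enabled through `SPIDER_MIDDLEWARES` in the project settings like any other spider middleware.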
#3563 (comment) is an idea in a similar direction.
Is there any progress on this? I am also running into the session problem.
@lycanthropes There is currently no one working on this feature. |
There is actually an official solution to this; I found it yesterday.
@lycanthropes Which solution? |
That is not exactly what the original suggestion is about. If you read the original suggestion above carefully, you’ll see it mentions that solution already (“Instead of passing a cookie jar ID”). |
Do you mean the comment from @lucywang000?
I mean the issue description. |
I'd like to add some notes from an internal discussion with @raphapassini about sessions here too:
I don't know this area particularly well, but maybe a Scheduler could be a good place to start implementing this? I've worked on solutions that wrapped callbacks to juggle per-session request queues, but there were significant difficulties due to callbacks never running (because of dupe filtering, unexpected errors, etc.) and sessions getting into an indeterminate state.
Thoughts:
@ThomasAitken
That is not true. `CookiesMiddleware` is basically a wrapper around a dictionary of `CookieJar` objects built on the Python standard library's `http.cookiejar` module.
It is possible to reach it from a spider, although unfortunately only through internal attributes:

```python
class MySpider(scrapy.Spider):
    def start_requests(self):
        downloader_middlewares = self.crawler.engine.downloader.middleware.middlewares
        self.cookie_middleware = [
            middleware for middleware in downloader_middlewares
            if "CookiesMiddleware" in str(type(middleware))
        ][0]
```

With direct access to the middleware instance you can inspect or manipulate its cookiejars directly.
Unfortunately this is true. By default Scrapy uses a single cookiejar for all requests (the one stored under the `None` key), and jars are kept for the lifetime of the crawl.
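To illustrate the point about the single default cookiejar, here is a minimal stand-in for the jar-selection logic, using the stdlib `http.cookiejar.CookieJar` instead of Scrapy's wrapper class. Requests without a `"cookiejar"` meta key all map to the same default jar under the `None` key:

```python
from collections import defaultdict
from http.cookiejar import CookieJar

# Stand-in for how CookiesMiddleware picks a jar: one dict of jars,
# keyed by the request's "cookiejar" meta value (None by default).
# Jars are created on first use and never dropped.
jars = defaultdict(CookieJar)

def jar_for(meta):
    """Return the cookiejar that would handle a request with this meta."""
    return jars[meta.get("cookiejar")]
```

Every request that omits the `cookiejar` key therefore shares one session, while distinct key values get distinct jars.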
Some proxy providers already include session handling as a service in addition to scraping proxies. In that case, only the proxy handling from that list is required from the Scrapy user. For the remaining cases, I agree the idea is relevant:

```python
from w3lib.http import basic_auth_header

PROFILES = [
    {"proxy": ["proxy_url", basic_auth_header("username", "password")], "user-agent": "MY USER AGENT"},
    {"proxy": ["proxy_url", basic_auth_header("username", "password")], "user-agent": "MY USER AGENT"},
]
```

In order to bind a proxy address to a cookiejar, it is enough to use the same key value for both.
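One way to sketch the "same key for both" idea: use the profile's list index as both the profile selector and the cookiejar key. The helper `profile_meta_headers` below is hypothetical; the resulting dicts would be passed to `scrapy.Request(url, meta=meta, headers=headers)`:

```python
def profile_meta_headers(idx, profiles):
    """Build meta/headers for a request pinned to profile number idx.

    Using idx as both the cookiejar key and the profile index binds the
    proxy (and user agent) to that cookiejar, as described above.
    """
    profile = profiles[idx]
    proxy_url, proxy_auth = profile["proxy"]
    meta = {
        "cookiejar": idx,   # same key selects the session...
        "proxy": proxy_url, # ...and the proxy bound to it
    }
    headers = {
        "Proxy-Authorization": proxy_auth,
        "User-Agent": profile["user-agent"],
    }
    return meta, headers
```

All requests built with the same index then share one proxy, one user agent, and one cookiejar.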
Thanks for your feedback. Yes, I understand that the Scrapy cookiejar is a wrapper around the stdlib `http.cookiejar`, and yes, you are right that you can use that trick to access the cookie middleware. But my API offers much nicer syntax, convenient methods for inspecting specific cookies, and functionality to "refresh" sessions. As for the "profiles" aspect, you are right that you can bind proxy addresses to a cookiejar as in that code sample, but there are other benefits to the way I have set things up. It is simply a convenient way of setting up something that is normally quite difficult in Scrapy.
I'm interested in modifying Scrapy spider behavior slightly to add some custom functionality and avoid messing around with the `meta` dictionary so much. Basically, the implementation I'm thinking of will be an abstract subclass of `scrapy.Spider` which I will call `SessionSpider`. The primary differences will be:

- Instead of the normal spider parse callback signature `(self, response)`, `SessionSpider` will have `(self, session, response)` callbacks. The `session` argument will be some kind of `Session` object that at least keeps track of cookies (and possibly proxies and certain headers). This will require a change in how the cookie middleware works: instead of passing a cookiejar ID, the session will keep track of cookies directly. As a side note, does the default cookie middleware ever drop cookiejars? I could be missing something, but it looks to me like they stay around forever. This would be a problem for my spiders because I want them to run "forever" on an unbounded list of URLs.
- A `SessionSpider` callback that wants to create requests with the same session will generate requests using a `session.Request` factory method that returns a `scrapy.Request`. This method will take care of merging session variables into the new request.

I'm hoping to implement most of the features I want by having the `Session` object do the `meta` manipulation behind the scenes, so that `SessionSpider` subclasses don't have to touch `meta` as much. However, I will also have to modify or add middleware, since I want to change how cookiejars are passed around.

I thought I would post this here just to see what thoughts people have. Is this a bad idea? Has it been tried before? Any issues I might run into? I see that this kind of thing has been discussed before: #1878
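A rough sketch of what the proposed `Session` object might look like. The class, its attributes, and the `request_kwargs` helper are all hypothetical names based on the description above, not an existing Scrapy API; a real `session.Request` factory would simply call `scrapy.Request(url, **session.request_kwargs(...))`:

```python
import itertools

class Session:
    """Hypothetical session object: one cookiejar key plus sticky
    proxy/header state, merged into each request it creates."""

    _ids = itertools.count()

    def __init__(self, proxy=None, headers=None):
        self.id = next(self._ids)  # doubles as the cookiejar key
        self.proxy = proxy
        self.headers = dict(headers or {})

    def request_kwargs(self, **kwargs):
        """Merge session state into keyword arguments for scrapy.Request,
        doing the meta manipulation behind the scenes."""
        meta = dict(kwargs.pop("meta", {}), cookiejar=self.id)
        if self.proxy:
            meta["proxy"] = self.proxy
        # Per-request headers override the session's sticky headers.
        headers = {**self.headers, **kwargs.pop("headers", {})}
        return dict(kwargs, meta=meta, headers=headers)
```

The "refresh" functionality mentioned earlier in the thread could then be a method that allocates a new `id`, which effectively starts a fresh cookiejar.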