docs: scrapy-vs-crawlee blog #2431
Two things here.
Links:
This comparison is weird because it compares apples and oranges:
Plus Crawlee can just as easily work with large scale projects. Let's keep it factual and logical.
While this could be a good intro to a generic blog, I think it feels out of place for the Crawlee blog. Let's make it much more personalized. Something along the lines of:
I don't expect you to use this verbatim; I haven't given it much thought. I just wanted to accentuate three things:
Again, let's not reuse marketing claims like "supports efficient scraping from large-scale websites", and stay focused on facts.
Let's not use passive voice unless needed.
If we're making a claim like that, we should provide evidence, or at least say that we will explain it later in the text.
/evidence
Would be good to introduce some methodology for how we selected and compared the features.
A bit more detail about how the plugin works would be nice.
Do we have code tabs in the blog? I think this is a good opportunity to showcase how similar the interfaces are.
Crawlee doesn't really adjust the number of crawler instances; it increases the number of requests that are processed concurrently (in parallel) within one crawler.
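To make the distinction concrete, here is a minimal, library-free sketch of one pool processing several requests at once. It is a conceptual illustration only: `runPool`, `Task`, and `fakeRequests` are hypothetical names, the concurrency limit is fixed rather than adjusted at runtime, and none of this is Crawlee's implementation or something a Crawlee user would need to write.

```typescript
// Conceptual sketch only: hypothetical names, not Crawlee's API or internals.
type Task = () => Promise<void>;

// Runs all tasks inside ONE pool, never more than `maxConcurrency` at a time.
// Returns the highest number of tasks that were in flight simultaneously.
async function runPool(tasks: Task[], maxConcurrency: number): Promise<number> {
    let running = 0;
    let peak = 0;
    let next = 0;

    // Each worker keeps pulling the next pending task until none are left.
    async function worker(): Promise<void> {
        while (next < tasks.length) {
            const task = tasks[next++];
            running++;
            peak = Math.max(peak, running);
            try {
                await task();
            } finally {
                running--;
            }
        }
    }

    const workerCount = Math.min(maxConcurrency, tasks.length);
    await Promise.all(Array.from({ length: workerCount }, () => worker()));
    return peak;
}

// Ten simulated requests, each taking about 10 ms.
const fakeRequests: Task[] = Array.from({ length: 10 }, () => () =>
    new Promise<void>((resolve) => setTimeout(resolve, 10)));

runPool(fakeRequests, 3).then((peak) => {
    console.log(`peak concurrency: ${peak}`); // prints "peak concurrency: 3"
});
```

The point of the sketch is that there is a single pool (one crawler) and the only thing that varies is how many requests are in flight.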
Let's remove the example, because while this is how AutoscaledPool works internally, no user ever needs to use this code. It's used internally by the Crawlers.
This comparison is weird, because it does not really compare the things mentioned in the Scrapy section. It doesn't mention LIFO / FIFO, how to do it, etc. On the other hand, it mentions resuming interrupted tasks (do you mean retrying failed requests?), but there's no mention of that in the Scrapy part, which is not fair towards Scrapy.
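For reference, the LIFO / FIFO difference comes down to which end of the pending-request queue the crawler reads from. The sketch below is library-free and uses hypothetical names (`crawlOrder`, `site`); it is neither Scrapy nor Crawlee code, just an illustration of how the two orderings change the visit order on a tiny made-up site.

```typescript
// Conceptual sketch only: hypothetical names, not Scrapy's or Crawlee's API.
type Order = 'fifo' | 'lifo';

// Walks a link graph and returns URLs in the order they would be visited.
function crawlOrder(
    links: Record<string, string[]>,
    start: string,
    order: Order,
): string[] {
    const pending: string[] = [start];
    const seen = new Set<string>([start]);
    const visited: string[] = [];

    while (pending.length > 0) {
        // FIFO takes the oldest pending request, LIFO takes the newest.
        const url = order === 'fifo' ? pending.shift()! : pending.pop()!;
        visited.push(url);
        for (const link of links[url] ?? []) {
            if (!seen.has(link)) {
                seen.add(link);
                pending.push(link);
            }
        }
    }
    return visited;
}

// A made-up site: the home page links to two sections with their own pages.
const site: Record<string, string[]> = {
    '/': ['/a', '/b'],
    '/a': ['/a/1', '/a/2'],
    '/b': ['/b/1'],
};

console.log(crawlOrder(site, '/', 'fifo')); // [ '/', '/a', '/b', '/a/1', '/a/2', '/b/1' ]
console.log(crawlOrder(site, '/', 'lifo')); // [ '/', '/b', '/b/1', '/a', '/a/2', '/a/1' ]
```

FIFO yields a breadth-first visit order and LIFO a depth-first one, which is the behavior a fair comparison would describe for both tools.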
This is not the optimal way of using proxies. In reality, you will just pass `proxyConfiguration` to a Crawler instance and it will use it automatically and rotate proxies based on SessionPool rules.
I think code examples could help spice this up. It's kinda shallow.
If we're mentioning specific classes and guides in Crawlee, it would be fair to give a bit more detail about Scrapy as well.
And as I mentioned before, make it clear that fingerprinting works even in CheerioCrawler and the other HTTP Crawlers.
It's important to say that you don't need to, though. Most projects don't use a custom error handler. Plus, we also have some custom errors that can be used to control the flow of the program.
The example doesn't really show how to do error handling; it just displays the interface.
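As a rough idea of what a more concrete example could demonstrate, the sketch below shows the general flow: a failing request is retried a limited number of times, while a dedicated error type stops retries immediately. It is library-free and every name in it (`NonRetryable`, `processWithRetries`, `flakyHandler`, `Outcome`) is hypothetical; it is not Crawlee's API, and a blog example should use the real crawler options instead.

```typescript
// Conceptual sketch only: hypothetical names, not Crawlee's API.

// Throwing this error type signals that retrying the request is pointless.
class NonRetryable extends Error {}

interface Outcome {
    url: string;
    attempts: number;
    ok: boolean;
    error?: string;
}

// Calls `handler` until it succeeds, retries run out, or it throws NonRetryable.
async function processWithRetries(
    url: string,
    handler: (url: string, attempt: number) => Promise<void>,
    maxRetries: number,
): Promise<Outcome> {
    let attempts = 0;
    while (true) {
        attempts++;
        try {
            await handler(url, attempts);
            return { url, attempts, ok: true };
        } catch (err) {
            const error = err instanceof Error ? err : new Error(String(err));
            const retriesLeft = attempts <= maxRetries;
            if (error instanceof NonRetryable || !retriesLeft) {
                // This is where a "failed request" hook would be invoked.
                return { url, attempts, ok: false, error: error.message };
            }
        }
    }
}

// Simulated handler: fails twice with a transient error, then succeeds,
// except for URLs ending in "/gone", which fail permanently.
async function flakyHandler(url: string, attempt: number): Promise<void> {
    if (url.endsWith('/gone')) throw new NonRetryable('404: page no longer exists');
    if (attempt < 3) throw new Error('temporary network error');
}

processWithRetries('https://example.com/ok', flakyHandler, 3)
    .then((outcome) => console.log(outcome.ok, outcome.attempts)); // true 3
processWithRetries('https://example.com/gone', flakyHandler, 3)
    .then((outcome) => console.log(outcome.ok, outcome.attempts)); // false 1
```

The transient failure recovers without any custom handling, and the permanent failure is reported once without wasting retries.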