docs: scrapy-vs-crawlee blog #2431

Merged
merged 15 commits into from May 15, 2024

Collaborator

@souravjain540 souravjain540 commented Apr 23, 2024

@B4nan please don't merge it yet.

@mnmkng will review it first.

Member

@mnmkng mnmkng left a comment

Thanks for the draft Saurav! 🎉 I left comments where appropriate, plus some general comments here below.

  • I like the interlinking with Apify. I think it makes a lot of sense for SEO. We just have to make sure we're not overly pushy and only using links to Apify where it makes sense.
  • code blocks should have language info, so that they get syntax highlighting
  • there are two repeating themes in the content:
    • objective comparison - when we mention a feature or capability of X, we should also mention it for Y. We should always strive to be maximally objective and fair to both parties.
    • trying it out - From many parts of the text it's clear that you haven't actually used the features you're talking about. Hands-on experience with what you're writing about is an absolutely necessary step in writing good developer content.

Now, I understand that you wanted to have the result fast, but don't worry about taking longer to learn both Crawlee and Scrapy and then produce high-quality content. It's an investment in the future, because each subsequent article will be easier to write once you actually understand the libraries and can code with them.

Comment on lines 13 to 19
[Web scraping](https://blog.apify.com/what-is-web-scraping/) is the process of extracting and collecting data automatically from websites. Companies use web scraping for various use cases ranging from making data-driven decisions to [feeding LLMs efficient data](https://blog.apify.com/webscraping-ai-data-for-llms/).

Sometimes, extracting data from complex websites becomes hard, and we have to use various tools and libraries to overcome problems like queue management, error handling, etc.

Two such tools that make the lives of thousands of web scraping developers easy are [Scrapy](https://blog.apify.com/web-scraping-with-scrapy/) and [Crawlee](https://crawlee.dev/). Scrapy can extract data from static websites and can work with large-scale projects, on the other hand, Crawlee has a single interface for HTTP and headless browser crawling.

We believe there are a lot of things that we can compare between Scrapy and Crawlee. This article will be the first part of a series comparing Scrapy and Crawlee on various parameters. In this article, we will go over all the features that both libraries provide.
Member

While this could be a good intro to a generic blog, I think it feels out of place for the Crawlee blog. Let's make it much more personalized. Something along the lines of:

Hey Crawlee community members, we're back with another blog post and this time, we will take a look at comparing Crawlee to Scrapy, one of the oldest and most popular web scraping libraries in the world. When does it make sense to use Crawlee? And when should you consider using Scrapy instead? Let's dive in.

...

I don't expect you to use this verbatim, I haven't given it tons of thought. I just wanted to accentuate three things:

  • don't make the Crawlee blog posts sound like generic SEO blog posts; make them feel at home on the Crawlee blog, and make the readers feel like they're part of a community
  • it's ok to be opinionated, but also to give credit to competitors
  • no need to spend time on fluff, we can move to the main message faster. For example, see https://docusaurus.io/blog/releases/3.2, they literally use one sentence as intro in most of their blogs. I think we can be a bit more friendly and conversational, but we should still strive to be concise and to the point.


Sometimes, extracting data from complex websites becomes hard, and we have to use various tools and libraries to overcome problems like queue management, error handling, etc.

Two such tools that make the lives of thousands of web scraping developers easy are [Scrapy](https://blog.apify.com/web-scraping-with-scrapy/) and [Crawlee](https://crawlee.dev/). Scrapy can extract data from static websites and can work with large-scale projects, on the other hand, Crawlee has a single interface for HTTP and headless browser crawling.
Member

Two things here.

Links:

  • crawlee does not need a link to itself IMO
  • linking to an Apify article about Scrapy feels dishonest. The link should go to Scrapy directly, if we don't want to look like cheesy marketers.

This comparison is weird because it compares apples and oranges:

Scrapy can extract data from static websites and can work with large-scale projects, on the other hand, Crawlee has a single interface for HTTP and headless browser crawling.

Plus Crawlee can just as easily work with large scale projects. Let's keep it factual and logical.


## Introduction:

Scrapy is an open-source Python-based web scraping framework that extracts data from websites. It supports efficient scraping from large-scale websites. In Scrapy, spiders are created, which are nothing but autonomous scripts to download and process web content. Limitations include not working well with JavaScript heavy websites.
Member

Again, let's not reuse marketing claims like "supports efficient scraping from large-scale websites", and stay focused on facts.

In Scrapy, spiders are created,

Let's not use passive voice unless needed.

Limitations include not working well with JavaScript heavy websites.

If we're making a claim like that, we should provide evidence, or at least say that we will explain it later in the text.


## Language and development environments:

Regarding languages and development environments, Scrapy is written in Python, making it easier for the data science community to integrate it with various tools with Python. While Scrapy offers very detailed documentation, for first-timers, sometimes it's a little difficult to start with Scrapy.
Member

for first-timers, sometimes it's a little difficult to start with Scrapy.

/evidence


On the other hand, Crawlee is one of the few web scraping and automation libraries that supports [JavaScript](https://blog.apify.com/tag/javascript/) and [TypeScript](https://blog.apify.com/tag/typescript/). Crawlee also offers Crawlee CLI, which makes it [easy to start](https://crawlee.dev/docs/quick-start#installation-with-crawlee-cli) with Crawlee for the Node.js developers.

## Feature Comparison
Member

Would be good to introduce some methodology for how we selected and compared the features.

Comment on lines 120 to 121
In Scrapy, handling anti-blocking strategies like IP rotation, user-agent rotation, custom solutions via middleware, and plugins are needed.
Crawlee provides HTTP crawling and [browser fingerprints](https://crawlee.dev/docs/guides/avoid-blocking) with zero configuration necessary, fingerprints are enabled by default and available in `PlaywrightCrawler` and `PuppeteerCrawler`.
Member

If we're mentioning specific classes and guides in crawlee, it would be fair to give a bit more detail about Scrapy as well.

And as I mentioned before, make it clear that fingerprinting works even in CheerioCrawler and the other HTTP crawlers.

Comment on lines 131 to 141
```
const crawler = new PuppeteerCrawler({
    // ...
    errorHandler: async ({ page, log }, error) => {
        // ...
    },
    requestHandler: async ({ session, page}) => {
        // ...
    },
});
```
Member

The example does not really show an actual example of how to do error handling. Just displays the interface.
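
A minimal sketch of what a fuller example might look like, written as a plain function so it can run without a crawler. The `request.noRetry` flag, the `log.warning` method, the 404 rule, and the stand-in `fakeContext` are assumptions for illustration and are not taken from the draft; in a real project the function would be passed as the `errorHandler` option shown in the snippet above.

```javascript
// Sketch only: a handler shaped like the `errorHandler` option in the draft snippet.
// It receives the crawling context and the error that caused the request to fail.
const errorHandler = ({ request, log }, error) => {
    log.warning(`${request.url} failed: ${error.message}`);

    // Hypothetical policy: a 404 is unlikely to succeed on retry,
    // so mark the request as not worth retrying (assumed `noRetry` flag).
    if (error.message.includes('404')) {
        request.noRetry = true;
    }
};

// Stand-in context (not a real crawler) so the handler can be exercised on its own.
const messages = [];
const fakeContext = {
    request: { url: 'https://example.com/missing', noRetry: false },
    log: { warning: (msg) => messages.push(msg) },
};

errorHandler(fakeContext, new Error('Received 404 status code'));
```

The point of the sketch is that the handler makes a decision (log, then skip the retry) instead of only showing the function signature.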


In Scrapy, you can handle errors using middleware as well as [signals](https://docs.scrapy.org/en/latest/topics/signals.html). There are also [exceptions](https://docs.scrapy.org/en/latest/topics/exceptions.html) like `IgnoreRequest`, which can be raised by Scheduler or any downloader middleware to indicate that the request should be ignored. Similarly, `CloseSpider` can be raised by a spider callback to close the spider.

In Crawlee, you can set up your own `ErrorHandler` like this:
Member

It's important to say that you don't need to though. Most projects don't use a custom error handler. Plus we also have some custom errors that can be used to handle flow of the program.

@souravjain540
Collaborator Author

@mnmkng I updated the draft with the new changes :)

@souravjain540 souravjain540 requested a review from mnmkng May 7, 2024 05:40
Member

@mnmkng mnmkng left a comment

Hey Saurav, thanks for the changes. They are big steps in the right direction. Good job! But we'll have to iron out a few wrinkles before we can publish this.

I see two main issues with it:

Always strive to be objective

Some sections are better, some are worse, but I see that you're still trying to promote Crawlee in the blog. Don't do it. This is the Crawlee blog, so we have to be extremely careful not to antagonize Python devs or Scrapy devs. Maybe they're long-time Scrapy users and they're checking this to see if they could use Crawlee in some JS project. We have to be fair and objective in our comparisons, focus on facts, and refrain from using adjectives that glorify Crawlee or make it sound like it's better than Scrapy in some way. If Crawlee can do something and Scrapy can't, say exactly that, not that Crawlee is better because of it, or simpler, or whatever else. The readers can figure that out for themselves if you present them with all the facts. They're devs.

Compare apples to apples

When you're making comparisons and you choose some feature(s) to talk about, you should compare them with the exact same feature(s) of the other library. If you show how to do FIFO in Scrapy, you should show how to do FIFO in Crawlee. When you show how to set request retries in one, show it for the other as well. Basically, whenever you show an example for one library, show an example of how to do the exact same thing with the other one, if possible. If not, say that the feature isn't available. When the examples show different actions, it becomes impossible to compare them, which makes the comparison less useful.

Both frameworks can handle a wide range of scraping tasks, and the best choice will depend on specific technical needs like language preference, project requirements, ease of use, etc.

If you are comfortable with Python and want to work only with it, go with Scrapy. It has very detailed documentation, and it is one of the oldest and most stable libraries in the space, but if you want to explore or are comfortable working with TypeScript or JavaScript, our recommendation is Crawlee. With all the valuable features like a single interface for HTTP requests and headless browsing, making it work well with JavaScript-heavy websites and autoscaling and fingerprint support, it is the best choice for scraping anything and everything from the internet.
Member

it is the best choice for scraping anything and everything from the internet

That's too bold. And in general, let's make objective recommendations in this section, based on the analysis we did and the features.

Member

@B4nan B4nan left a comment

left a few comments about the code example

@souravjain540 souravjain540 changed the title from "docs: adding first draft of the blog" to "docs: scrapy-vs-crawlee blog" May 13, 2024
@souravjain540 souravjain540 requested a review from mnmkng May 13, 2024 05:17
Member

@mnmkng mnmkng left a comment

Just a few minor nitpicks. We're getting there 👏

I noticed some code style issues. Are you lint:fixing the examples?

After you make the changes, please send it to Dave or Theo for editing, and we can release it after that.

Member

@davidjohnbarton davidjohnbarton left a comment

Various changes.

souravjain540 and others added 3 commits May 15, 2024 17:51
Co-authored-by: davidjohnbarton <41335923+davidjohnbarton@users.noreply.github.com>
Co-authored-by: davidjohnbarton <41335923+davidjohnbarton@users.noreply.github.com>
Member

@davidjohnbarton davidjohnbarton left a comment

Some more changes

Co-authored-by: davidjohnbarton <41335923+davidjohnbarton@users.noreply.github.com>
Member

@davidjohnbarton davidjohnbarton left a comment

I think that should be it.

Co-authored-by: davidjohnbarton <41335923+davidjohnbarton@users.noreply.github.com>
@souravjain540
Collaborator Author

@B4nan, all good to go! :)

Member

@B4nan B4nan left a comment

left a few code style notes, let's resolve them before we merge

@souravjain540
Collaborator Author

@B4nan done!

Member

@B4nan B4nan left a comment

image

@B4nan B4nan merged commit 38c0942 into apify:master May 15, 2024
9 checks passed
4 participants