
docs: scrapy-vs-crawlee blog #2431

Status: Merged (15 commits, May 15, 2024)

File changed: website/blog/2024/04-23-scrapy-vs-crawlee/index.md (160 additions, 0 deletions)
---
slug: scrapy-vs-crawlee
title: 'Scrapy vs. Crawlee'
description: 'Which web scraping library is best for you?'
image: TBD
author: Saurav Jain
authorTitle: Developer Community Manager
authorURL: https://github.com/souravjain540
authorImageURL: https://avatars.githubusercontent.com/u/53312820?v=4
authorTwitter: sauain
---

[Web scraping](https://blog.apify.com/what-is-web-scraping/) is the process of extracting and collecting data automatically from websites. Companies use web scraping for various use cases ranging from making data-driven decisions to [feeding LLMs efficient data](https://blog.apify.com/webscraping-ai-data-for-llms/).

Sometimes, extracting data from complex websites becomes hard, and we have to use various tools and libraries to overcome problems like queue management, error handling, etc.

Two such tools that make the lives of thousands of web scraping developers easy are [Scrapy](https://blog.apify.com/web-scraping-with-scrapy/) and [Crawlee](https://crawlee.dev/). Scrapy can extract data from static websites and can work with large-scale projects; Crawlee, on the other hand, has a single interface for HTTP and headless browser crawling.
**Reviewer comment (Member):**

Two things here.

Links:

- crawlee does not need a link to itself IMO
- linking to an Apify article about Scrapy feels dishonest. The link should go to Scrapy directly, if we don't want to look like cheesy marketers.

This comparison is weird because it compares apples and oranges:

> Scrapy can extract data from static websites and can work with large-scale projects, on the other hand, Crawlee has a single interface for HTTP and headless browser crawling.

Plus Crawlee can just as easily work with large scale projects. Let's keep it factual and logical.


We believe there is a lot to compare between Scrapy and Crawlee. This article is the first in a series comparing the two libraries on various parameters. In this article, we will go over all the features that both libraries provide.
**Reviewer comment (Member):**

While this could be a good intro to a generic blog, I think it feels out of place for the Crawlee blog. Let's make it much more personalized. Something along the lines of:

> Hey Crawlee community members, we're back with another blog post and this time, we will take a look at comparing Crawlee to Scrapy, one of the oldest and most popular web scraping libraries in the world. When does it make sense to use Crawlee? And when should you consider using Scrapy instead? Let's dive in.
>
> ...

I don't expect you to use this verbatim, I haven't given it tons of thought. I just wanted to accentuate three things:

- don't make the Crawlee blog posts sound like generic SEO blog posts, make them home at the Crawlee blog, and make the readers feel like they're part of a community
- it's ok to be opinionated, but also to give credit to competitors
- no need to spend time on fluff, we can move to the main message faster. For example, see https://docusaurus.io/blog/releases/3.2, they literally use one sentence as intro in most of their blogs. I think we can be a bit more friendly and conversational, but we should still strive to be concise and to the point.


## Introduction

Scrapy is an open-source Python-based web scraping framework that extracts data from websites. It supports efficient scraping from large-scale websites. In Scrapy, you create spiders: autonomous scripts that download and process web content. Its limitations include not working well with JavaScript-heavy websites.
**Reviewer comment (Member):**

Again, let's not reuse marketing claims like "supports efficient scraping from large-scale websites", and stay focused on facts.

> In Scrapy, spiders are created,

Let's not use passive voice unless needed.

> Limitations include not working well with JavaScript heavy websites.

If we're making a claim like that, we should provide evidence, or at least say that we will explain it later in the text.


Crawlee is also an open-source library that originated as [Apify SDK](https://docs.apify.com/sdk/js/). It is a modern web scraping library for JavaScript and TypeScript. It supports both traditional HTTP requests and headless browser environments, making it well suited to scraping JavaScript-heavy websites.
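To give a first taste of the library, here is a minimal sketch of a Crawlee crawler, loosely based on the quick-start example from the Crawlee docs (treat it as illustrative rather than canonical):

```
import { CheerioCrawler } from 'crawlee';

// A minimal crawler: fetches pages over plain HTTP
// and parses the HTML with Cheerio.
const crawler = new CheerioCrawler({
    requestHandler: async ({ request, $, enqueueLinks }) => {
        console.log(`Title of ${request.url}: ${$('title').text()}`);
        await enqueueLinks(); // follow the links found on the page
    },
});

await crawler.run(['https://crawlee.dev']);
```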

## Language and development environments

Regarding languages and development environments, Scrapy is written in Python, making it easier for the data science community to integrate it with various Python tools. While Scrapy offers very detailed documentation, for first-timers, sometimes it's a little difficult to start with Scrapy.
**Reviewer comment (Member):**

> for first-timers, sometimes it's a little difficult to start with Scrapy.

/evidence


On the other hand, Crawlee is one of the few web scraping and automation libraries that supports [JavaScript](https://blog.apify.com/tag/javascript/) and [TypeScript](https://blog.apify.com/tag/typescript/). Crawlee also offers the Crawlee CLI, which makes it [easy to start](https://crawlee.dev/docs/quick-start#installation-with-crawlee-cli) with Crawlee for Node.js developers.

## Feature Comparison
**Reviewer comment (Member):**

Would be good to introduce some methodology for how we selected and compared the features.


### Headless Browsing

Scrapy does not support headless browsers natively, but it supports them with its plugin system, one of the best examples of which is its [Playwright plugin](https://github.com/scrapy-plugins/scrapy-playwright/tree/main).
**Reviewer comment (Member):**

A bit more detail into how the plugin works would be nice.


Crawlee, on the other hand, offers a unified interface for HTTP requests and [headless browsing](https://crawlee.dev/docs/guides/javascript-rendering#headless-browsers) using [Puppeteer](https://blog.apify.com/puppeteer-web-scraping-tutorial/) or [Playwright](https://github.com/microsoft/playwright). This integration allows developers to easily switch between simple HTTP scraping and complex browser-based scraping within the same framework, simplifying the handling of dynamic JavaScript content.
**Reviewer comment (Member):**

Do we have code tabs in blog? I think this is a good opportunity to showcase how similar the interfaces are.
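To illustrate how similar the two interfaces are, here is a sketch (assuming the standard `CheerioCrawler` and `PlaywrightCrawler` classes): switching from plain HTTP crawling to headless browsing mostly amounts to swapping the crawler class, while the overall handler structure stays the same:

```
import { CheerioCrawler, PlaywrightCrawler } from 'crawlee';

// HTTP crawling: fast, no browser involved.
const httpCrawler = new CheerioCrawler({
    requestHandler: async ({ request, $ }) => {
        console.log(`${request.url}: ${$('title').text()}`);
    },
});

// Headless browser crawling: same structure, but the handler
// receives a Playwright page instead of a Cheerio handle.
const browserCrawler = new PlaywrightCrawler({
    requestHandler: async ({ request, page }) => {
        console.log(`${request.url}: ${await page.title()}`);
    },
});
```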


### Autoscaling Support

Scrapy does not have built-in autoscaling capabilities, but autoscaling can be achieved using external services like Scrapyd or by deploying in a distributed manner with Scrapy Cluster.

Crawlee has [built-in autoscaling](https://crawlee.dev/api/core/class/AutoscaledPool) with `AutoscaledPool`, which automatically adjusts the number of running crawler instances based on CPU and memory usage, optimizing resource allocation.
**Reviewer comment (Member):**

Crawlee doesn't really adjust the number of crawler instances; it increases the number of requests that are processed concurrently (in parallel) within one crawler.


Example usage:

```
import { AutoscaledPool } from 'crawlee';

const pool = new AutoscaledPool({
    maxConcurrency: 50,
    runTaskFunction: async () => {
        // Run some resource-intensive asynchronous operation here.
    },
    isTaskReadyFunction: async () => {
        // Tell the pool whether more tasks are ready to be processed.
        // Return true or false.
    },
    isFinishedFunction: async () => {
        // Tell the pool whether it should finish
        // or wait for more tasks to become available.
        // Return true or false.
    },
});

await pool.run();
```

**Reviewer comment (Member):** Let's remove the example, because while this is how AutoscaledPool works internally, no user ever needs to use this code. It's used internally by the Crawlers.
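As the comment above points out, crawlers drive `AutoscaledPool` internally, so in everyday code you typically just bound the concurrency through the crawler options. A minimal sketch (option names per the Crawlee docs):

```
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    // The crawler scales concurrency between these bounds automatically,
    // based on available CPU and memory.
    minConcurrency: 10,
    maxConcurrency: 50,
    requestHandler: async ({ request }) => {
        // process request.url here
    },
});
```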

### Queue Management

Scrapy supports both breadth-first and depth-first crawling strategies using a disk-based queuing system. By default, it uses a LIFO queue for the pending requests, which means it crawls in depth-first order. If you want to use breadth-first order instead, you can simply do it by changing these settings:

```
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = "scrapy.squeues.PickleFifoDiskQueue"
SCHEDULER_MEMORY_QUEUE = "scrapy.squeues.FifoMemoryQueue"
```

Crawlee offers [advanced queue management](https://crawlee.dev/api/core/class/RequestQueue) through `RequestQueue` that automatically handles persistence and can resume interrupted tasks, which is suitable for long-term and large-scale crawls.
**Reviewer comment (Member):**

This comparison is weird, because it does not really compare the things mentioned in the Scrapy section. It doesn't mention LIFO / FIFO, how to do it, etc. On the other hand, it mentions resuming interrupted tasks (do you mean retrying failed requests?), but there's no mention of that in the Scrapy part, which is not fair towards Scrapy.
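For illustration, here is a sketch of explicit `RequestQueue` usage (in most projects the crawler opens and manages the default queue for you, so this is rarely needed):

```
import { CheerioCrawler, RequestQueue } from 'crawlee';

// Open the default request queue. Its state persists to storage,
// so an interrupted crawl can pick up where it left off.
const requestQueue = await RequestQueue.open();
await requestQueue.addRequest({ url: 'https://crawlee.dev' });

const crawler = new CheerioCrawler({
    requestQueue,
    requestHandler: async ({ enqueueLinks }) => {
        await enqueueLinks(); // newly discovered links go to the same queue
    },
});

await crawler.run();
```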


### CLI Support

Scrapy has a [powerful command-line interface](https://docs.scrapy.org/en/latest/topics/commands.html#command-line-tool) that offers functionalities like starting a project, generating spiders, and controlling the crawling process.

The Scrapy CLI comes bundled with the Scrapy installation. Just run this command, and you are good to go:


`pip install scrapy`

Crawlee also [includes a CLI tool](https://crawlee.dev/docs/quick-start#installation-with-crawlee-cli) (`crawlee-cli`) that facilitates project setup, crawler creation, and execution, streamlining the development process for users familiar with Node.js environments. To scaffold a new project, run:


`npx crawlee create my-crawler`

### Proxy Rotation and Storage Management

Scrapy handles proxy rotation via [custom middleware](https://pypi.org/project/scrapy-rotating-proxies/) or plugins, which requires additional development effort. You have to install the `scrapy-rotating-proxies` package using pip. You can then add your proxies to the `ROTATING_PROXY_LIST` setting or point `ROTATING_PROXY_LIST_PATH` to a file with your proxies.

In Crawlee, you can [use your own proxy servers](https://crawlee.dev/docs/guides/proxy-management) or proxy servers acquired from third-party providers. If you already have your proxy URLs, you can start using them as easily as this:

```
import { ProxyConfiguration } from 'crawlee';

const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: [
        'http://proxy-1.com',
        'http://proxy-2.com',
    ],
});
const proxyUrl = await proxyConfiguration.newUrl();
```

**Reviewer comment (Member):** This is not the optimal way of using proxies. In reality, you will just pass proxyConfiguration to a Crawler instance and it will use it automatically and rotate proxies based on SessionPool rules.
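Following the suggestion above, the more idiomatic pattern is to hand the `ProxyConfiguration` to a crawler and let it rotate proxies automatically. A sketch:

```
import { CheerioCrawler, ProxyConfiguration } from 'crawlee';

const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: ['http://proxy-1.com', 'http://proxy-2.com'],
});

const crawler = new CheerioCrawler({
    proxyConfiguration, // the crawler rotates proxies for you
    requestHandler: async ({ request, proxyInfo }) => {
        console.log(`${request.url} fetched via ${proxyInfo?.url}`);
    },
});
```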

### Data Storage

Scrapy provides data pipelines, allowing easy integration with various storage solutions (local files, databases, cloud services) through custom item pipelines.

Crawlee has [simple data storage](https://blog.apify.com/crawlee-data-storage-types/) solutions and can be extended with custom plugins for storing data in multiple formats and locations.
**Reviewer comment (Member):**

I think code examples could help spice this up. It's kinda shallow.
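As a small illustration (a sketch using Crawlee's default storages; the class names are from the Crawlee API):

```
import { Dataset, KeyValueStore } from 'crawlee';

// Append one row per scraped page; rows are persisted
// as JSON files in the local storage directory by default.
await Dataset.pushData({ url: 'https://example.com', title: 'Example' });

// Store arbitrary values, e.g. screenshots or crawl statistics.
await KeyValueStore.setValue('crawl-stats', { pagesScraped: 1 });
```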


### Anti-blocking and Fingerprints

In Scrapy, anti-blocking strategies like IP rotation and user-agent rotation require custom solutions via middleware and plugins.

Crawlee provides HTTP crawling and [browser fingerprints](https://crawlee.dev/docs/guides/avoid-blocking) with zero configuration necessary: fingerprints are enabled by default and available in `PlaywrightCrawler` and `PuppeteerCrawler`.
**Reviewer comment (Member):**

If we're mentioning specific classes and guides in crawlee, it would be fair to give a bit more detail about Scrapy as well.

And as I mentioned before, make it clear that fingerprinting works even in CheerioCrawler and the other HTTP Crawlers
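To make that concrete, here is a sketch of constraining the generated fingerprints (fingerprints are on by default, so this is only needed to narrow them down; option names are taken from the Crawlee browser-pool docs and should be double-checked):

```
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    browserPoolOptions: {
        useFingerprints: true, // already the default
        fingerprintOptions: {
            fingerprintGeneratorOptions: {
                browsers: ['chrome'],
                devices: ['desktop'],
                operatingSystems: ['linux'],
            },
        },
    },
    requestHandler: async ({ page }) => {
        // crawl with a consistent, human-like fingerprint
    },
});
```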


### Error handling

Both libraries support error-handling practices like automatic retries, logging, and custom error handling.

In Scrapy, you can handle errors using middleware as well as [signals](https://docs.scrapy.org/en/latest/topics/signals.html). There are also [exceptions](https://docs.scrapy.org/en/latest/topics/exceptions.html) like `IgnoreRequest`, which can be raised by Scheduler or any downloader middleware to indicate that the request should be ignored. Similarly, `CloseSpider` can be raised by a spider callback to close the spider.

In Crawlee, you can set up your own `errorHandler` like this:
**Reviewer comment (Member):**

It's important to say that you don't need to though. Most projects don't use a custom error handler. Plus we also have some custom errors that can be used to handle flow of the program.


```
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    // ...
    errorHandler: async ({ page, log }, error) => {
        // ...
    },
    requestHandler: async ({ session, page }) => {
        // ...
    },
});
```
**Reviewer comment (Member):**

The example does not really show an actual example of how to do error handling. Just displays the interface.
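Picking up on the comment above, here is a sketch of an `errorHandler` that actually does something: it retires a possibly blocked session before the failed request is retried (the exact error check is hypothetical; adapt it to your target site):

```
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    // Runs before a failed request is retried.
    errorHandler: async ({ request, session, log }, error) => {
        // Treat a blocked response as a session problem: retire the
        // session so the retry runs with a fresh one (and a fresh proxy).
        if (error.message.includes('blocked')) {
            session?.retire();
        }
        log.warning(`Request ${request.url} failed: ${error.message}`);
    },
    requestHandler: async ({ page, request }) => {
        // normal scraping logic; throwing here triggers errorHandler
    },
});
```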


### Deployment using Docker

Scrapy can be containerized using Docker, though it typically requires manual setup to create Dockerfiles and configure environments. Crawlee, on the other hand, includes [ready-to-use Docker configurations](https://crawlee.dev/docs/guides/docker-images), making deployment straightforward across various environments without additional configuration.

## Community

Both of the projects are open source. Scrapy benefits from a large and well-established community. It has been around since 2008 and has garnered significant attention and usage among developers, particularly those in the Python ecosystem.

Crawlee started its journey as Apify SDK in 2021. It now has more than [12,000 stars on GitHub](https://github.com/apify/crawlee) and a community of more than 7,000 developers in its [Discord community](https://apify.com/discord), and it is widely used by the TypeScript and JavaScript communities.


## Conclusion

Both frameworks can handle a wide range of scraping tasks, and the best choice will depend on specific technical needs like language preference, project requirements, ease of use, etc.

As promised, this is just the first of many articles comparing Scrapy and Crawlee. In the upcoming articles, you will learn about every specific technical detail in more depth.

Meanwhile, if you want to learn more about Crawlee, you can visit this article to learn how to scrape Amazon products using Crawlee.