
docs: scrapy-vs-crawlee blog #2431

Status: Merged (15 commits, May 15, 2024)

File changed: website/blog/2024/04-23-scrapy-vs-crawlee/index.md (160 additions, 0 deletions)
---
slug: scrapy-vs-crawlee
title: 'Scrapy vs. Crawlee'
description: 'Which web scraping library is best for you?'
image: TBD
author: Saurav Jain
authorTitle: Developer Community Manager
authorURL: https://github.com/souravjain540
authorImageURL: https://avatars.githubusercontent.com/u/53312820?v=4
authorTwitter: sauain
---

[Web scraping](https://blog.apify.com/what-is-web-scraping/) is the process of extracting and collecting data automatically from websites. Companies use web scraping for various use cases ranging from making data-driven decisions to [feeding LLMs efficient data](https://blog.apify.com/webscraping-ai-data-for-llms/).

Sometimes, extracting data from complex websites becomes hard, and we have to use various tools and libraries to overcome problems like queue management, error handling, etc.

Two such tools that make the lives of thousands of web scraping developers easy are [Scrapy](https://blog.apify.com/web-scraping-with-scrapy/) and [Crawlee](https://crawlee.dev/). Scrapy can extract data from static websites and can work with large-scale projects; Crawlee, on the other hand, has a single interface for HTTP and headless browser crawling.
**Reviewer comment (Member):**

Two things here.

Links:

- crawlee does not need a link to itself IMO
- linking to an Apify article about Scrapy feels dishonest. The link should go to Scrapy directly, if we don't want to look like cheesy marketers.

This comparison is weird because it compares apples and oranges:

> Scrapy can extract data from static websites and can work with large-scale projects, on the other hand, Crawlee has a single interface for HTTP and headless browser crawling.

Plus Crawlee can just as easily work with large scale projects. Let's keep it factual and logical.


We believe there is a lot to compare between Scrapy and Crawlee. This article is the first in a series comparing the two libraries on various parameters. In this article, we will go over all the features that both libraries provide.
**Reviewer comment (Member):**

While this could be a good intro to a generic blog, I think it feels out of place for the Crawlee blog. Let's make it much more personalized. Something along the lines of:

> Hey Crawlee community members, we're back with another blog post and this time, we will take a look at comparing Crawlee to Scrapy, one of the oldest and most popular web scraping libraries in the world. When does it make sense to use Crawlee? And when should you consider using Scrapy instead? Let's dive in.
>
> ...

I don't expect you to use this verbatim, I haven't given it tons of thought. I just wanted to accentuate three things:

- don't make the Crawlee blog posts sound like generic SEO blog posts, make them home at the Crawlee blog, and make the readers feel like they're part of a community
- it's ok to be opinionated, but also to give credit to competitors
- no need to spend time on fluff, we can move to the main message faster. For example, see https://docusaurus.io/blog/releases/3.2, they literally use one sentence as intro in most of their blogs. I think we can be a bit more friendly and conversational, but we should still strive to be concise and to the point.


## Introduction

Scrapy is an open-source Python-based web scraping framework that extracts data from websites. It supports efficient scraping from large-scale websites. In Scrapy, you create spiders: autonomous scripts that download and process web content. Its limitations include not working well with JavaScript-heavy websites.
**Reviewer comment (Member):**

Again, let's not reuse marketing claims like "supports efficient scraping from large-scale websites", and stay focused on facts.

> In Scrapy, spiders are created,

Let's not use passive voice unless needed.

> Limitations include not working well with JavaScript heavy websites.

If we're making a claim like that, we should provide evidence, or at least say that we will explain it later in the text.


Crawlee is also an open-source library that originated as [Apify SDK](https://docs.apify.com/sdk/js/). It is a modern web scraping library for JavaScript and TypeScript. It supports both traditional HTTP requests and headless browser environments, making it well suited to scraping JavaScript-heavy websites.
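To give a first taste of the library, here is a minimal sketch of a Crawlee crawler, loosely based on the quick-start example from the Crawlee docs (treat it as illustrative rather than canonical):

```
import { CheerioCrawler } from 'crawlee';

// A minimal crawler: fetches pages over plain HTTP
// and parses the HTML with Cheerio.
const crawler = new CheerioCrawler({
    requestHandler: async ({ request, $, enqueueLinks }) => {
        console.log(`Title of ${request.url}: ${$('title').text()}`);
        await enqueueLinks(); // follow the links found on the page
    },
});

await crawler.run(['https://crawlee.dev']);
```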

## Language and development environments

Regarding languages and development environments, Scrapy is written in Python, making it easier for the data science community to integrate it with various Python tools. While Scrapy offers very detailed documentation, for first-timers, sometimes it's a little difficult to start with Scrapy.
**Reviewer comment (Member):**

> for first-timers, sometimes it's a little difficult to start with Scrapy.

/evidence


On the other hand, Crawlee is one of the few web scraping and automation libraries that supports [JavaScript](https://blog.apify.com/tag/javascript/) and [TypeScript](https://blog.apify.com/tag/typescript/). Crawlee also offers the Crawlee CLI, which makes it [easy to start](https://crawlee.dev/docs/quick-start#installation-with-crawlee-cli) with Crawlee for Node.js developers.

## Feature Comparison
**Reviewer comment (Member):**

Would be good to introduce some methodology for how we selected and compared the features.


### Headless Browsing

Scrapy does not support headless browsers natively, but it supports them with its plugin system, one of the best examples of which is its [Playwright plugin](https://github.com/scrapy-plugins/scrapy-playwright/tree/main).
**Reviewer comment (Member):**

A bit more detail into how the plugin works would be nice.


Crawlee, on the other hand, offers a unified interface for HTTP requests and [headless browsing](https://crawlee.dev/docs/guides/javascript-rendering#headless-browsers) using [Puppeteer](https://blog.apify.com/puppeteer-web-scraping-tutorial/) or [Playwright](https://github.com/microsoft/playwright). This integration allows developers to easily switch between simple HTTP scraping and complex browser-based scraping within the same framework, simplifying the handling of dynamic JavaScript content.
**Reviewer comment (Member):**

Do we have code tabs in blog? I think this is a good opportunity to showcase how similar the interfaces are.
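To illustrate how similar the two interfaces are, here is a sketch (assuming the standard `CheerioCrawler` and `PlaywrightCrawler` classes): switching from plain HTTP crawling to headless browsing mostly amounts to swapping the crawler class, while the overall handler structure stays the same:

```
import { CheerioCrawler, PlaywrightCrawler } from 'crawlee';

// HTTP crawling: fast, no browser involved.
const httpCrawler = new CheerioCrawler({
    requestHandler: async ({ request, $ }) => {
        console.log(`${request.url}: ${$('title').text()}`);
    },
});

// Headless browser crawling: same structure, but the handler
// receives a Playwright page instead of a Cheerio handle.
const browserCrawler = new PlaywrightCrawler({
    requestHandler: async ({ request, page }) => {
        console.log(`${request.url}: ${await page.title()}`);
    },
});
```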


### Autoscaling Support

Scrapy does not have built-in autoscaling capabilities, but autoscaling can be achieved using external services like Scrapyd or by deploying in a distributed manner with Scrapy Cluster.

Crawlee has [built-in autoscaling](https://crawlee.dev/api/core/class/AutoscaledPool) with `AutoscaledPool`, which automatically adjusts the number of running crawler instances based on CPU and memory usage, optimizing resource allocation.
**Reviewer comment (Member):**

Crawlee doesn't really adjust the number of crawler instances; it increases the number of requests that are processed concurrently (in parallel) within one crawler.


Example usage:

```
import { AutoscaledPool } from 'crawlee';

const pool = new AutoscaledPool({
    maxConcurrency: 50,
    runTaskFunction: async () => {
        // Run some resource-intensive asynchronous operation here.
    },
    isTaskReadyFunction: async () => {
        // Tell the pool whether more tasks are ready to be processed.
        // Return true or false.
    },
    isFinishedFunction: async () => {
        // Tell the pool whether it should finish
        // or wait for more tasks to become available.
        // Return true or false.
    },
});

await pool.run();
```

**Reviewer comment (Member):** Let's remove the example, because while this is how AutoscaledPool works internally, no user ever needs to use this code. It's used internally by the Crawlers.
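As the comment above points out, crawlers drive `AutoscaledPool` internally, so in everyday code you typically just bound the concurrency through the crawler options. A minimal sketch (option names per the Crawlee docs):

```
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    // The crawler scales concurrency between these bounds automatically,
    // based on available CPU and memory.
    minConcurrency: 10,
    maxConcurrency: 50,
    requestHandler: async ({ request }) => {
        // process request.url here
    },
});
```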

### Queue Management

Scrapy supports both breadth-first and depth-first crawling strategies using a disk-based queuing system. By default, it uses a LIFO queue for the pending requests, which means it crawls in depth-first order. If you want to use breadth-first order instead, you can simply do it by changing these settings:

```
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = "scrapy.squeues.PickleFifoDiskQueue"
SCHEDULER_MEMORY_QUEUE = "scrapy.squeues.FifoMemoryQueue"
```

Crawlee offers [advanced queue management](https://crawlee.dev/api/core/class/RequestQueue) through `RequestQueue` that automatically handles persistence and can resume interrupted tasks, which is suitable for long-term and large-scale crawls.
**Reviewer comment (Member):**

This comparison is weird, because it does not really compare the things mentioned in the Scrapy section. It doesn't mention LIFO / FIFO, how to do it, etc. On the other hand, it mentions resuming interrupted tasks (do you mean retrying failed requests?), but there's no mention of that in the Scrapy part, which is not fair towards Scrapy.
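For illustration, here is a sketch of explicit `RequestQueue` usage (in most projects the crawler opens and manages the default queue for you, so this is rarely needed):

```
import { CheerioCrawler, RequestQueue } from 'crawlee';

// Open the default request queue. Its state persists to storage,
// so an interrupted crawl can pick up where it left off.
const requestQueue = await RequestQueue.open();
await requestQueue.addRequest({ url: 'https://crawlee.dev' });

const crawler = new CheerioCrawler({
    requestQueue,
    requestHandler: async ({ enqueueLinks }) => {
        await enqueueLinks(); // newly discovered links go to the same queue
    },
});

await crawler.run();
```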


### CLI Support

Scrapy has a [powerful command-line interface](https://docs.scrapy.org/en/latest/topics/commands.html#command-line-tool) that offers functionalities like starting a project, generating spiders, and controlling the crawling process.

The Scrapy CLI comes bundled with the Scrapy installation. Just run this command, and you are good to go:


`pip install scrapy`

Crawlee also [includes a CLI tool](https://crawlee.dev/docs/quick-start#installation-with-crawlee-cli) (`crawlee-cli`) that facilitates project setup, crawler creation, and execution, streamlining the development process for users familiar with Node.js environments. To scaffold a new project, run:


`npx crawlee create my-crawler`

### Proxy Rotation and Storage Management

Scrapy handles proxy rotation via [custom middleware](https://pypi.org/project/scrapy-rotating-proxies/) or plugins, which requires additional development effort. You have to install the `scrapy-rotating-proxies` package using pip. You can then add your proxies to the `ROTATING_PROXY_LIST` setting or point `ROTATING_PROXY_LIST_PATH` to a file with your proxies.

In Crawlee, you can [use your own proxy servers](https://crawlee.dev/docs/guides/proxy-management) or proxy servers acquired from third-party providers. If you already have your proxy URLs, you can start using them as easily as this:

```
import { ProxyConfiguration } from 'crawlee';

const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: [
        'http://proxy-1.com',
        'http://proxy-2.com',
    ],
});
const proxyUrl = await proxyConfiguration.newUrl();
```

**Reviewer comment (Member):** This is not the optimal way of using proxies. In reality, you will just pass proxyConfiguration to a Crawler instance and it will use it automatically and rotate proxies based on SessionPool rules.
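Following the suggestion above, the more idiomatic pattern is to hand the `ProxyConfiguration` to a crawler and let it rotate proxies automatically. A sketch:

```
import { CheerioCrawler, ProxyConfiguration } from 'crawlee';

const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: ['http://proxy-1.com', 'http://proxy-2.com'],
});

const crawler = new CheerioCrawler({
    proxyConfiguration, // the crawler rotates proxies for you
    requestHandler: async ({ request, proxyInfo }) => {
        console.log(`${request.url} fetched via ${proxyInfo?.url}`);
    },
});
```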

### Data Storage

Scrapy provides data pipelines, allowing easy integration with various storage solutions (local files, databases, cloud services) through custom item pipelines.

Crawlee has [simple data storage](https://blog.apify.com/crawlee-data-storage-types/) solutions and can be extended with custom plugins for storing data in multiple formats and locations.
**Reviewer comment (Member):**

I think code examples could help spice this up. It's kinda shallow.
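As a small illustration (a sketch using Crawlee's default storages; the class names are from the Crawlee API):

```
import { Dataset, KeyValueStore } from 'crawlee';

// Append one row per scraped page; rows are persisted
// as JSON files in the local storage directory by default.
await Dataset.pushData({ url: 'https://example.com', title: 'Example' });

// Store arbitrary values, e.g. screenshots or crawl statistics.
await KeyValueStore.setValue('crawl-stats', { pagesScraped: 1 });
```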


### Anti-blocking and Fingerprints

In Scrapy, anti-blocking strategies like IP rotation and user-agent rotation require custom solutions via middleware and plugins.

Crawlee provides HTTP crawling and [browser fingerprints](https://crawlee.dev/docs/guides/avoid-blocking) with zero configuration necessary: fingerprints are enabled by default and available in `PlaywrightCrawler` and `PuppeteerCrawler`.
**Reviewer comment (Member):**

If we're mentioning specific classes and guides in crawlee, it would be fair to give a bit more detail about Scrapy as well.

And as I mentioned before, make it clear that fingerprinting works even in CheerioCrawler and the other HTTP Crawlers
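To make that concrete, here is a sketch of constraining the generated fingerprints (fingerprints are on by default, so this is only needed to narrow them down; option names are taken from the Crawlee browser-pool docs and should be double-checked):

```
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    browserPoolOptions: {
        useFingerprints: true, // already the default
        fingerprintOptions: {
            fingerprintGeneratorOptions: {
                browsers: ['chrome'],
                devices: ['desktop'],
                operatingSystems: ['linux'],
            },
        },
    },
    requestHandler: async ({ page }) => {
        // crawl with a consistent, human-like fingerprint
    },
});
```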


### Error handling

Both libraries support error-handling practices like automatic retries, logging, and custom error handling.

In Scrapy, you can handle errors using middleware as well as [signals](https://docs.scrapy.org/en/latest/topics/signals.html). There are also [exceptions](https://docs.scrapy.org/en/latest/topics/exceptions.html) like `IgnoreRequest`, which can be raised by Scheduler or any downloader middleware to indicate that the request should be ignored. Similarly, `CloseSpider` can be raised by a spider callback to close the spider.

In Crawlee, you can set up your own `errorHandler` like this:
**Reviewer comment (Member):**

It's important to say that you don't need to though. Most projects don't use a custom error handler. Plus we also have some custom errors that can be used to handle flow of the program.


```
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    // ...
    errorHandler: async ({ page, log }, error) => {
        // ...
    },
    requestHandler: async ({ session, page }) => {
        // ...
    },
});
```
**Reviewer comment (Member):**

The example does not really show an actual example of how to do error handling. Just displays the interface.
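Picking up on the comment above, here is a sketch of an `errorHandler` that actually does something: it retires a possibly blocked session before the failed request is retried (the exact error check is hypothetical; adapt it to your target site):

```
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    // Runs before a failed request is retried.
    errorHandler: async ({ request, session, log }, error) => {
        // Treat a blocked response as a session problem: retire the
        // session so the retry runs with a fresh one (and a fresh proxy).
        if (error.message.includes('blocked')) {
            session?.retire();
        }
        log.warning(`Request ${request.url} failed: ${error.message}`);
    },
    requestHandler: async ({ page, request }) => {
        // normal scraping logic; throwing here triggers errorHandler
    },
});
```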


### Deployment using Docker

Scrapy can be containerized using Docker, though it typically requires manual setup to create Dockerfiles and configure environments. Crawlee, on the other hand, includes [ready-to-use Docker configurations](https://crawlee.dev/docs/guides/docker-images), making deployment straightforward across various environments without additional configuration.

## Community

Both of the projects are open source. Scrapy benefits from a large and well-established community. It has been around since 2008 and has garnered significant attention and usage among developers, particularly those in the Python ecosystem.

Crawlee started its journey as Apify SDK in 2021. It now has more than [12,000 stars on GitHub](https://github.com/apify/crawlee) and a community of more than 7,000 developers in its [Discord community](https://apify.com/discord), and it is widely used by the TypeScript and JavaScript communities.


## Conclusion

Both frameworks can handle a wide range of scraping tasks, and the best choice will depend on specific technical needs like language preference, project requirements, ease of use, etc.

As promised, this is just the first of many articles comparing Scrapy and Crawlee. In the upcoming articles, you will learn about every specific technical detail in more depth.

Meanwhile, if you want to learn more about Crawlee, you can visit this article to learn how to scrape Amazon products using Crawlee.