Running many instances of one spider #255

Open
serpent213 opened this issue Apr 3, 2023 · 3 comments
Comments

@serpent213
Contributor

My application basically requires only one spider, but I would like to run many instances of it in parallel. I was assuming that would be possible using the crawl_id.

But now I'm not so sure anymore; the dispatching seems to be based mainly on the spider's name.

What would it take to make that work?

@oltarasenko
Collaborator

Hey @serpent213,

Indeed, Crawly is built around the spider names, and there is no easy way to switch to something else right now.

However, it may be the case that you don't need it. Let me try to explain my points here:

  1. A spider is not a stateful process; it's just a set of callbacks (see the sketch after this list). So there is no such thing as running 100 instances of a spider.
  2. When we talk about the stateful part, it all lives in the Manager, the RequestsStorage (scheduled requests), and the ItemsStorage (extracted items).
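
To make point 1 concrete, here is a minimal sketch of what a Crawly spider looks like. BroadCrawler is the spider name used later in this thread, the URLs are placeholders, and the exact callback set may differ slightly between Crawly versions:

defmodule BroadCrawler do
  use Crawly.Spider

  @impl Crawly.Spider
  def base_url(), do: "https://example.com"

  @impl Crawly.Spider
  def init(), do: [start_urls: ["https://example.com"]]

  @impl Crawly.Spider
  def parse_item(response) do
    # Extract items and follow-up requests from the response here;
    # the spider module itself holds no state between calls.
    %Crawly.ParsedItem{items: [], requests: []}
  end
end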

When I think about your case, as I understand it, you want a broad crawl that goes to multiple websites and extracts all the information from them. Probably there is some scheduler outside Crawly that just does something like:

Crawly.Engine.start_spider(BroadCrawler, start_urls: ["https://google.com"])
Crawly.Engine.start_spider(BroadCrawler, start_urls: ["https://openai.com"])
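
For those options to take effect, the spider has to read them in its init callback. A short sketch, assuming your Crawly version forwards start_spider/2 options to init/1 (worth checking against the docs of the version you run):

@impl Crawly.Spider
def init(opts) do
  # Fall back to a default start URL when none was passed to start_spider/2.
  [start_urls: Keyword.get(opts, :start_urls, ["https://example.com"])]
end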

Could it be that what you actually need is an API to add extra requests to an already running spider?

@serpent213
Contributor Author

> When I think about your case, as I understand it, you want a broad crawl that goes to multiple websites and extracts all the information from them. Probably there is some scheduler outside Crawly that just does something like:
>
> Crawly.Engine.start_spider(BroadCrawler, start_urls: ["https://google.com"])
> Crawly.Engine.start_spider(BroadCrawler, start_urls: ["https://openai.com"])

Exactly, that was my first attempt. Thank you for the inspiration, I will look into it!

@alexandargyurov

> an API to add extra requests to an already running spider

@oltarasenko How can I do this? I have a spider running and would like to add more URLs/requests for it to scrape.
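
For what it's worth, one possible workaround is to push requests straight into the running spider's RequestsStorage. This relies on Crawly internals rather than a documented public API for this purpose, so verify the calls against your Crawly version:

# Build a Crawly.Request from a plain URL and hand it to the spider's request storage.
# Assumes Crawly.Utils.request_from_url/1 and Crawly.RequestsStorage.store/2 behave as
# in recent Crawly versions; check the docs before relying on this.
request = Crawly.Utils.request_from_url("https://example.com/new-page")
Crawly.RequestsStorage.store(BroadCrawler, request)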
