Weird async behaviour - duplicates in responses #802

Open
AlexS778 opened this issue Jan 5, 2024 · 1 comment

Comments

AlexS778 commented Jan 5, 2024

Hello, I was recently using colly to crawl some pages and it was taking quite a long time, so I switched to async mode. In async mode I noticed a lot of duplicates in my results; in particular, the number of duplicates matched the number of threads I launched the crawler with.

Here is a quick example, adapted from the rate_limit example in the official repository: https://github.com/gocolly/colly/blob/master/_examples/rate_limit/rate_limit.go

func main() {
	url := "https://httpbin.org/delay/2"

	// Instantiate default collector
	c := colly.NewCollector(
		// Turn on asynchronous requests
		colly.Async(true),
	)

	// Start scraping in five threads on https://httpbin.org/delay/2
	for i := 0; i < 5; i++ {

		c.OnResponse(func(response *colly.Response) {
			fmt.Println(string(response.Body))
		})

		c.Visit(fmt.Sprintf("%s?n=%d", url, i))
	}
	// Wait until threads are finished
	c.Wait()
}

Running this code produces the following output:

A lot of text here with http body response
{
"args": {
  "n": "3"
}, 
"data": "", 
"files": {}, 
"form": {}, 
"headers": {
  "Accept": "*/*", 
  "Accept-Encoding": "gzip", 
  "Host": "httpbin.org", 
  "User-Agent": "colly - https://github.com/gocolly/colly/v2", 
  "X-Amzn-Trace-Id": "Root=1-659818d1-0ce769125429588340e95d6c"
}, 
"origin": "83.139.137.160", 
"url": "https://httpbin.org/delay/2?n=3"
}

{
"args": {
  "n": "3"
}, 
"data": "", 
"files": {}, 
"form": {}, 
"headers": {
  "Accept": "*/*", 
  "Accept-Encoding": "gzip", 
  "Host": "httpbin.org", 
  "User-Agent": "colly - https://github.com/gocolly/colly/v2", 
  "X-Amzn-Trace-Id": "Root=1-659818d1-0ce769125429588340e95d6c"
}, 
"origin": "83.139.137.160", 
"url": "https://httpbin.org/delay/2?n=3"
}

{
"args": {
  "n": "3"
}, 
"data": "", 
"files": {}, 
"form": {}, 
"headers": {
  "Accept": "*/*", 
  "Accept-Encoding": "gzip", 
  "Host": "httpbin.org", 
  "User-Agent": "colly - https://github.com/gocolly/colly/v2", 
  "X-Amzn-Trace-Id": "Root=1-659818d1-0ce769125429588340e95d6c"
}, 
"origin": "83.139.137.160", 
"url": "https://httpbin.org/delay/2?n=3"
}

{
"args": {
  "n": "1"
}, 
"data": "", 
"files": {}, 
"form": {}, 
"headers": {
  "Accept": "*/*", 
  "Accept-Encoding": "gzip", 
  "Host": "httpbin.org", 
  "User-Agent": "colly - https://github.com/gocolly/colly/v2", 
  "X-Amzn-Trace-Id": "Root=1-659818d1-41c8deb73f2c9a702e3a9fcd"
}, 
"origin": "83.139.137.160", 
"url": "https://httpbin.org/delay/2?n=1"
}

{
"args": {
  "n": "1"
}, 
"data": "", 
"files": {}, 
"form": {}, 
"headers": {
  "Accept": "*/*", 
  "Accept-Encoding": "gzip", 
  "Host": "httpbin.org", 
  "User-Agent": "colly - https://github.com/gocolly/colly/v2", 
  "X-Amzn-Trace-Id": "Root=1-659818d1-41c8deb73f2c9a702e3a9fcd"
}, 
"origin": "83.139.137.160", 
"url": "https://httpbin.org/delay/2?n=1"
}

{
"args": {
  "n": "1"
}, 
"data": "", 
"files": {}, 
"form": {}, 
"headers": {
  "Accept": "*/*", 
  "Accept-Encoding": "gzip", 
  "Host": "httpbin.org", 
  "User-Agent": "colly - https://github.com/gocolly/colly/v2", 
  "X-Amzn-Trace-Id": "Root=1-659818d1-41c8deb73f2c9a702e3a9fcd"
}, 
"origin": "83.139.137.160", 
"url": "https://httpbin.org/delay/2?n=1"
}

{
"args": {
  "n": "1"
}, 
"data": "", 
"files": {}, 
"form": {}, 
"headers": {
  "Accept": "*/*", 
  "Accept-Encoding": "gzip", 
  "Host": "httpbin.org", 
  "User-Agent": "colly - https://github.com/gocolly/colly/v2", 
  "X-Amzn-Trace-Id": "Root=1-659818d1-41c8deb73f2c9a702e3a9fcd"
}, 
"origin": "83.139.137.160", 
"url": "https://httpbin.org/delay/2?n=1"
}

{
"args": {
  "n": "1"
}, 
"data": "", 
"files": {}, 
"form": {}, 
"headers": {
  "Accept": "*/*", 
  "Accept-Encoding": "gzip", 
  "Host": "httpbin.org", 
  "User-Agent": "colly - https://github.com/gocolly/colly/v2", 
  "X-Amzn-Trace-Id": "Root=1-659818d1-41c8deb73f2c9a702e3a9fcd"
}, 
"origin": "83.139.137.160", 
"url": "https://httpbin.org/delay/2?n=1"
}

As you can see, the results contain duplicates. Maybe I'm doing something wrong or not setting up the crawler properly, but I highly doubt this is intended behaviour. Anyway, I would appreciate any help.

@hugokung

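This happens because c.OnResponse is called 5 times inside the loop. Each call appends its callback to c.responseCallbacks rather than replacing the previous one, so when a goroutine completes its request it executes every function in c.responseCallbacks, and each response body is printed once per registered callback. Registering the callback once, before the loop, avoids the duplicates.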


2 participants