Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bypass request delay when request is cancelled #619

Open
wants to merge 7 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
6 changes: 3 additions & 3 deletions extensions/random_user_agent.go
Original file line number Diff line number Diff line change
Expand Up @@ -264,10 +264,10 @@ func genChromeUA() string {
// -> "User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.72 Safari/537.36 Edg/90.0.818.39"
func genEdgeUA() string {
version := edgeVersions[rand.Intn(len(edgeVersions))]
chrome_version := strings.Split(version, ",")[0]
edge_version := strings.Split(version, ",")[1]
chromeVersion := strings.Split(version, ",")[0]
edgeVersion := strings.Split(version, ",")[1]
os := osStrings[rand.Intn(len(osStrings))]
return fmt.Sprintf("Mozilla/5.0 (%s) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/%s Safari/537.36 Edg/%s", os, chrome_version, edge_version)
return fmt.Sprintf("Mozilla/5.0 (%s) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/%s Safari/537.36 Edg/%s", os, chromeVersion, edgeVersion)
}

// Generates Opera Browser User-Agent (Desktop)
Expand Down
8 changes: 7 additions & 1 deletion http_backend.go
Original file line number Diff line number Diff line change
Expand Up @@ -18,6 +18,7 @@ import (
"crypto/sha1"
"encoding/gob"
"encoding/hex"
"errors"
"io"
"io/ioutil"
"math/rand"
Expand Down Expand Up @@ -170,7 +171,12 @@ func (h *httpBackend) Cache(request *http.Request, bodySize int, checkHeadersFun
func (h *httpBackend) Do(request *http.Request, bodySize int, checkHeadersFunc checkHeadersFunc) (*Response, error) {
r := h.GetMatchingRule(request.URL.Host)
if r != nil {
r.waitChan <- true
select {
case r.waitChan <- true:
case <-request.Context().Done():
return nil, errors.New("context canceled; early return")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe earlier version of this PR returned ctx.Err() instead. Why did you decide to return a new error instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that was the case. Usually i would use wrap from github.com/pkg/errors and just add to the error but the library is not in use here.
My first try at writing the code I was asserting against the time that it took to run all the calls and check if they return asap after the cancel; worked fine locally but fail on CI and yea, check like that is really unreliable.
So my second attempt its based on the error message; to check if the error was the normal error from the request or if it was the early return; I had to make the early return message different from ctx.Err() cause that is what we get from the http client that tries to run with a cancelled request.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the late response. I don't think tests are good enough reason to return different errors (depending on where it happened) when context is cancelled. Are you aware of any other library that does not return ctx.Err() unchanged?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No worries, any suggestions on how to get the test done?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My first try at writing the code I was asserting against the time that it took to run all the calls and check if they return asap after the cancel; worked fine locally but fail on CI and yea, check like that is really unreliable.

I suppose you could make a test like this:

  • Make the test HTTP handler sleep before returning the response, maybe a second or more.
  • Set low rate limit, say, 1 req/second.
  • Spawn lots of goroutines. Doesn't have to be too much, though, maybe 20-30 or so is enough.
  • Cancel context.
  • Assert that cancellation does happen in 5 seconds. This value seems high enough so the test would be reliable, but low enough so the test won't accidentally pass without the fix.

What do you think?

}

defer func(r *LimitRule) {
randomDelay := time.Duration(0)
if r.RandomDelay != 0 {
Expand Down
95 changes: 95 additions & 0 deletions http_backend_test.go
Original file line number Diff line number Diff line change
@@ -0,0 +1,95 @@
package colly

import (
"context"
"fmt"
"math/rand"
"net/http"
"net/http/cookiejar"
"net/http/httptest"
"strconv"
"strings"
"sync"
"testing"
"time"
)

func TestHTTPBackendDoCancelation(t *testing.T) {
// hardcoded fixed to ensure that p, n, c are the same making the test more reliable
rand.Seed(2927496558871806)

// rand up to 10 to not extend the test duration too much
p := 1 + rand.Intn(5) // p: parallel requests
n := p + p*rand.Intn(10) // n: after n, cancel will be called; ensure 1 calls per worker + rand
c := n + p*2 + rand.Intn(10) // c: total number of calls; ensure 2 calls per worker after cancel is called + rand

ts := httptest.NewServer(http.HandlerFunc(func(rw http.ResponseWriter, r *http.Request) {
fmt.Fprint(rw, "OK")
}))
defer ts.Close()

checkHeadersFunc := func(req *http.Request, statusCode int, header http.Header) bool { return true }

ctx, cancel := context.WithCancel(context.Background())
defer cancel()

backend := &httpBackend{}
jar, _ := cookiejar.New(nil)
backend.Init(jar)
limit := &LimitRule{
DomainRegexp: ".*",
Parallelism: p,
Delay: 5 * time.Millisecond,
}
backend.Limit(limit)

var wg sync.WaitGroup
wg.Add(c)

out := make(chan error)

for i := 0; i < c; i++ {
go func(i int) {
defer wg.Done()
req, _ := http.NewRequest("GET", ts.URL+"/"+strconv.Itoa(i), nil)
req = req.WithContext(ctx)

_, err := backend.Do(req, 0, checkHeadersFunc)
out <- err
}(i)
}

go func() {
wg.Wait()
close(out)
}()

i := 0
nonEarlyCount := 0
for err := range out {
i++
if i == n {
cancel()
}

if i <= n {
if err != nil {
t.Errorf("no error was expected for the first %d responses; error: %q", n, err)
}
} else {
errStr := ""
if err != nil {
errStr = err.Error()
}

// non early returns are allowed up to the number of maximum allowed concurrent requests;
// bacause those requests could be already running when cancel was called
if !strings.Contains(errStr, "early return") {
if nonEarlyCount > p {
t.Errorf("count of non early return is above the number of maximum allowed concurrent requests; p: %d; n: %d; c: %d", p, n, c)
}
nonEarlyCount++
}
}
}
}