Optimize SchemeWithRetries timeout handling #2787

Closed
wants to merge 2 commits

Conversation


@HouseK HouseK commented Oct 20, 2023

Fixes #2786

Two changes:

- We limit the timeout of the context passed to `Fetch` to the time until the next backoff.
- We sleep only up to this timeout, subtracting the duration that `Fetch` ran for.

We expect this change to behave exactly the same for quickly failing
`Fetch` calls, but to retry faster for `Fetch` calls that time out.
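
A minimal sketch of what the proposed loop looks like, assuming the cenkalti/backoff/v4 package already used by the existing code; the standalone helper `fetchWithRetries` and its signature are illustrative, not the actual `SchemeWithRetries.Fetch` code:

```go
package retry

import (
	"context"
	"io"
	"net/url"
	"time"

	"github.com/cenkalti/backoff/v4"
)

// fetchWithRetries is an illustrative stand-in for SchemeWithRetries.Fetch.
func fetchWithRetries(ctx context.Context, fetch func(context.Context, *url.URL) (io.ReaderAt, error), u *url.URL) (io.ReaderAt, error) {
	back := backoff.WithContext(backoff.NewExponentialBackOff(), ctx)

	var r io.ReaderAt
	var err error
	for d := back.NextBackOff(); d != backoff.Stop; d = back.NextBackOff() {
		start := time.Now()

		// Change 1: limit the per-attempt context to the time until the
		// next retry, so a hanging fetch gets canceled instead of running on.
		attemptCtx, cancel := context.WithTimeout(ctx, d)
		r, err = fetch(attemptCtx, u)
		cancel()
		if err == nil {
			return r, nil
		}

		// Change 2: sleep only for the part of the backoff window that the
		// fetch attempt did not already consume.
		if remaining := d - time.Since(start); remaining > 0 {
			time.Sleep(remaining)
		}
	}
	return nil, err
}
```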

Signed-off-by: Kobe Housen <kobehousen@google.com>

codecov bot commented Oct 20, 2023

Codecov Report

Attention: 2 lines in your changes are missing coverage. Please review.

Files Coverage Δ
pkg/curl/schemes.go 64.51% <90.90%> (+2.01%) ⬆️

... and 5 files with indirect coverage changes


@rminnich rminnich requested a review from 10000TB October 20, 2023 18:00
Member

@rminnich rminnich left a comment


The != vs < is the one I worry about, but I noticed a few other things that bug me a bit.

var r io.ReaderAt
// Note: err uses the scope outside the for loop.
r, err = s.Scheme.Fetch(ctx, u)
for d := back.NextBackOff(); d != backoff.Stop; d = back.NextBackOff() {
Member


Should this be `<`? I had a problem in a timeout scheme once with `!=`, when the timer skipped over the stop value and kept running.

I worry about direct comparison on anything related to time.

Author


Stop is a magical value that does not represent time.

https://pkg.go.dev/github.com/cenkalti/backoff/v4#Stop

I think direct comparison is the better option here.
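
For reference, a small illustration of why equality is the idiomatic check here; the constant below is quoted from the backoff/v4 source, and the loop body is only a placeholder:

```go
// From github.com/cenkalti/backoff/v4:
//
//	const Stop time.Duration = -1
//
// NextBackOff either returns a non-negative wait duration or exactly Stop,
// so this comparison cannot "skip over" the sentinel the way a running
// timer can skip over a deadline.
for d := back.NextBackOff(); d != backoff.Stop; d = back.NextBackOff() {
	time.Sleep(d) // placeholder: wait, then try again
}
```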

@@ -347,7 +346,13 @@ func (s *SchemeWithRetries) Fetch(ctx context.Context, u *url.URL) (io.ReaderAt,
if s.DoRetry != nil && !s.DoRetry(u, err) {
Member


I'm always a little concerned when I see these nil comparisons; they make me feel like the design is a bit off, but that's not your code ...

Author


A type like Optional[T] would likely make it more obvious, but I think the != nil here is used to make DoRetry an optional value: when it is nil, we retry until success.
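
As an illustration of the optional-callback pattern being discussed (the DoRetry field exists in the current code; the struct and the `shouldRetry` helper below are simplified, hypothetical sketches):

```go
package retry

import "net/url"

// SchemeWithRetries, reduced to the field under discussion: DoRetry, when
// non-nil, lets the caller veto further retries for a given error; nil means
// retry until the backoff gives up or a fetch succeeds.
type SchemeWithRetries struct {
	DoRetry func(u *url.URL, err error) bool
}

// shouldRetry shows how the nil check turns DoRetry into an optional hook.
func (s *SchemeWithRetries) shouldRetry(u *url.URL, err error) bool {
	if s.DoRetry != nil && !s.DoRetry(u, err) {
		return false // the caller explicitly declined another attempt
	}
	return true // no hook configured, or the hook allowed a retry
}
```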

var r io.Reader
// Note: err uses the scope outside the for loop.
r, err = s.Scheme.FetchWithoutCache(ctx, u)
for d := back.NextBackOff(); d != backoff.Stop; d = back.NextBackOff() {
Member


Same question: `<` instead of `!=`? Also, not your code, but these duplicated code blocks hint that this could be done differently. Not germane to this PR.
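
Not part of this PR, but one way the duplication between Fetch and FetchWithoutCache could be factored out is a shared helper; the `retry` function and its signature below are hypothetical:

```go
package retry

import (
	"time"

	"github.com/cenkalti/backoff/v4"
)

// retry runs fn under the given backoff policy and returns its last result.
// Fetch and FetchWithoutCache could both delegate to it, one instantiated
// with io.ReaderAt and the other with io.Reader.
func retry[T any](back backoff.BackOff, fn func() (T, error)) (T, error) {
	var result T
	var err error
	for d := back.NextBackOff(); d != backoff.Stop; d = back.NextBackOff() {
		if result, err = fn(); err == nil {
			return result, nil
		}
		time.Sleep(d)
	}
	return result, err
}
```

A call site would then look like `return retry(back, func() (io.ReaderAt, error) { return s.Scheme.Fetch(ctx, u) })`.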

@rminnich rminnich added the Awaiting author Waiting for new changes or feedback for author. label Oct 20, 2023
Member

10000TB commented Oct 20, 2023

Looks like this PR compensates for the time spent fetching by using part of the next wait time.

If the fetch runs all the way to the deadline, then it looks like we won't wait at all; we go straight to the next loop iteration and immediately start fetching again.

Could you share more details / motivation for this proposed behavior?

I think there is no single wait behavior that will work perfectly for all scenarios. I am concerned this proposal might be inspired by a specific case you are dealing with and could cause regressions for more general cases.

Member

@10000TB 10000TB left a comment


Left a comment; I hope to understand the motivation.

Author

HouseK commented Oct 21, 2023

> Looks like this PR compensates for the time spent fetching by using part of the next wait time.
>
> If the fetch runs all the way to the deadline, then it looks like we won't wait at all; we go straight to the next loop iteration and immediately start fetching again.
>
> Could you share more details / motivation for this proposed behavior?
>
> I think there is no single wait behavior that will work perfectly for all scenarios. I am concerned this proposal might be inspired by a specific case you are dealing with and could cause regressions for more general cases.

This indeed comes from a problem, or rather a non-optimal situation, that I ran into myself. I did, however, try to keep the general case in mind.

In my scenario, the call to Fetch would time out on the TCP dial every time. The dial timeout was intentionally set to a low value, for reasons that were not entirely clear to me.

My reasoning for why this change would be OK is as follows.

Most failures fail fast, and for those this change has close to zero impact. It would be a tiny fraction faster, but not noticeably so.

For failures that take as long as the duration until the next backoff, I assume that this is still a single attempt with a long timeout that at some point started to hang, which means it was effectively already sleeping and not doing anything. Sleeping even longer would go against the intent of the configuration.

Additionally, passing a limited timeout to Fetch here means that failing calls get properly canceled, which might even reduce resource usage when cancellation is properly implemented in the Fetch call.
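
To illustrate that last point, a minimal sketch of a Fetch implementation that honors cancellation, assuming an HTTP-backed scheme; `fetchHTTP` is hypothetical and not the real scheme code:

```go
package retry

import (
	"bytes"
	"context"
	"io"
	"net/http"
	"net/url"
)

// fetchHTTP aborts the request and releases its resources as soon as the
// per-attempt context passed in by the retry loop expires.
func fetchHTTP(ctx context.Context, u *url.URL) (io.ReaderAt, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, u.String(), nil)
	if err != nil {
		return nil, err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	// Reading the body also fails promptly once the context is canceled,
	// because the aborted request closes the underlying connection.
	b, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, err
	}
	return bytes.NewReader(b), nil
}
```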

Member

10000TB commented Oct 23, 2023

Thanks. To clarify, can you also share how the reduced wait helps the internal case?

Author

HouseK commented Oct 24, 2023

The reduced wait time would not actually have much impact on the internal case.

I only proposed this change because I think it would make the behavior of SchemeWithRetries more predictable.

After having thought about this and discussed with a colleague, I'm less convinced about this change.

So if you are not convinced now that this change is an improvement, let's not do it.

Member

10000TB commented Nov 2, 2023

Thank you for the update, Kobe! No objection to an improvement that solves one more case in general.

Given that none of us are convinced, and there is no positive test result backing the change, I am going to close this PR for now.

@10000TB 10000TB closed this Nov 2, 2023
Labels
Awaiting author Waiting for new changes or feedback for author.
Development

Successfully merging this pull request may close these issues.

SchemeWithRetries.Fetch improved timeout handling
3 participants