Optimize SchemeWithRetries timeout handling #2787
Conversation
Two changes:
- We limit the timeout of the context passed to `Fetch` to the time until the next backoff.
- We only sleep up to this timeout, subtracting the duration that `Fetch` ran for.

We expect this change to behave exactly the same for quickly failing `Fetch` calls, but to retry faster for `Fetch` calls that time out.

Signed-off-by: Kobe Housen <kobehousen@google.com>
The `!=` vs `<` comparison is the one I worry about, but I noticed a few other things that bug me a bit.
var r io.ReaderAt
// Note: err uses the scope outside the for loop.
r, err = s.Scheme.Fetch(ctx, u)
for d := back.NextBackOff(); d != backoff.Stop; d = back.NextBackOff() {
Should this be `<`? I once had a problem in a timeout scheme with `!=`, where the timer skipped over the stop value and kept running.
I worry about direct comparison on anything related to time.
Stop is a magical value that does not represent time.
https://pkg.go.dev/github.com/cenkalti/backoff/v4#Stop
I think direct comparison is the better option here.
@@ -347,7 +346,13 @@ func (s *SchemeWithRetries) Fetch(ctx context.Context, u *url.URL) (io.ReaderAt,
if s.DoRetry != nil && !s.DoRetry(u, err) {
I'm always a little concerned when I see these nil comparisons; it makes me feel like the design is a bit off, but that's not your code ...
A type like Optional[T] would likely make it more obvious, but I think the `!= nil` here is used to make DoRetry an optional value: when nil, we retry until success.
var r io.Reader
// Note: err uses the scope outside the for loop.
r, err = s.Scheme.FetchWithoutCache(ctx, u)
for d := back.NextBackOff(); d != backoff.Stop; d = back.NextBackOff() {
Same question: `<` instead of `!=`? Also, not your code, but these duplicated code blocks hint that this could be done differently. Not germane to this PR.
It looks like this PR compensates for the fetch time by taking it out of the next wait interval. If the fetch runs all the way to its deadline, then it looks like we won't wait at all: we go straight to the next loop iteration and immediately start fetching again. Could you share more details on the motivation for this proposed behavior? I think there is no single wait behavior that works perfectly for all scenarios. I am concerned this proposal might be inspired by a specific case you are dealing with and could cause regressions for more general cases.
Left a comment; hope to understand the motivation.

This indeed comes from a problem, or rather a non-optimal situation, that I ran into myself, though I did try to keep the general case in mind. In my scenario, the call to Fetch would time out on the TCP dial every time; the dial timeout was intentionally set to a low number for reasons that were not very clear to me. My reasoning for why this change would be OK is as follows. Most failures will fail fast, for which this change has near-zero impact; it would be a tiny fraction faster, but not noticeably so. For a failure that takes as long as the duration until the next backoff, I assume it is still a single attempt with a long timeout that at some point started to hang, which means it was effectively already sleeping and not doing anything; sleeping longer on top of that would go against the intent of the configuration. Additionally, passing a limited timeout to Fetch means that failing calls get properly canceled, which might even reduce resource usage when cancellation is properly implemented in the Fetch call.
Thanks. To clarify, can you also share how the reduced wait helps the internal case?
The reduced wait time would not actually have much impact on the internal case; I only proposed this change because I think it would make the behavior of SchemeWithRetries more predictable. Having thought about this and discussed it with a colleague, I'm less convinced about this change. So if you are not convinced that it is an improvement, let's not do it.
Thank you for the update, Kobe! No objection to an improvement that solves one more case in general. Given that none of us are convinced, and there is no positive test result backing the change, I am going to adjourn this PR for now.
Fixes #2786