Optimize SchemeWithRetries timeout handling #2787

Closed
wants to merge 2 commits

Conversation


@HouseK HouseK commented Oct 20, 2023

Fixes #2786

Two changes:

- We limit the timeout of the context passed to `Fetch` to the time until the next backoff.
- We sleep only up to this timeout, subtracting the duration that `Fetch` ran for.

We expect this change to behave exactly the same for quickly failing
`Fetch` calls, but to retry faster for `Fetch` calls that time out.
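
A minimal sketch of what the proposed loop looks like, assuming the cenkalti/backoff/v4 package already used by the existing code; the standalone helper `fetchWithRetries` and its signature are illustrative, not the actual `SchemeWithRetries.Fetch` code:

```go
package retry

import (
	"context"
	"io"
	"net/url"
	"time"

	"github.com/cenkalti/backoff/v4"
)

// fetchWithRetries is an illustrative stand-in for SchemeWithRetries.Fetch.
func fetchWithRetries(ctx context.Context, fetch func(context.Context, *url.URL) (io.ReaderAt, error), u *url.URL) (io.ReaderAt, error) {
	back := backoff.WithContext(backoff.NewExponentialBackOff(), ctx)

	var r io.ReaderAt
	var err error
	for d := back.NextBackOff(); d != backoff.Stop; d = back.NextBackOff() {
		start := time.Now()

		// Change 1: limit the per-attempt context to the time until the
		// next retry, so a hanging fetch gets canceled instead of running on.
		attemptCtx, cancel := context.WithTimeout(ctx, d)
		r, err = fetch(attemptCtx, u)
		cancel()
		if err == nil {
			return r, nil
		}

		// Change 2: sleep only for the part of the backoff window that the
		// fetch attempt did not already consume.
		if remaining := d - time.Since(start); remaining > 0 {
			time.Sleep(remaining)
		}
	}
	return nil, err
}
```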

Signed-off-by: Kobe Housen <kobehousen@google.com>

codecov bot commented Oct 20, 2023

Codecov Report

Attention: 2 lines in your changes are missing coverage. Please review.

Files Coverage Δ
pkg/curl/schemes.go 64.51% <90.90%> (+2.01%) ⬆️

... and 5 files with indirect coverage changes


@rminnich rminnich requested a review from 10000TB October 20, 2023 18:00
Member

@rminnich rminnich left a comment


The != vs < is the one I worry about, but I noticed a few other things that bug me a bit.

var r io.ReaderAt
// Note: err uses the scope outside the for loop.
r, err = s.Scheme.Fetch(ctx, u)
for d := back.NextBackOff(); d != backoff.Stop; d = back.NextBackOff() {
Member


Should this be `<`? I had a problem in a timeout scheme once with `!=`, when the timer skipped over the stop value and kept running.

I worry about direct comparison on anything related to time.

Author


Stop is a magical value that does not represent time.

https://pkg.go.dev/github.com/cenkalti/backoff/v4#Stop

I think direct comparison is the better option here.
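
For reference, a small illustration of why equality is the idiomatic check here; the constant below is quoted from the backoff/v4 source, and the loop body is only a placeholder:

```go
// From github.com/cenkalti/backoff/v4:
//
//	const Stop time.Duration = -1
//
// NextBackOff either returns a non-negative wait duration or exactly Stop,
// so this comparison cannot "skip over" the sentinel the way a running
// timer can skip over a deadline.
for d := back.NextBackOff(); d != backoff.Stop; d = back.NextBackOff() {
	time.Sleep(d) // placeholder: wait, then try again
}
```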

@@ -347,7 +346,13 @@ func (s *SchemeWithRetries) Fetch(ctx context.Context, u *url.URL) (io.ReaderAt,
if s.DoRetry != nil && !s.DoRetry(u, err) {
Member


I'm always a little concerned when I see these nil comparisons; they make me feel like the design is a bit off, but that's not your code ...

Author


A type like Optional[T] would likely make it more obvious, but I think the != nil here is used to make DoRetry an optional value: when it is nil, we retry until success.
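
As an illustration of the optional-callback pattern being discussed (the DoRetry field exists in the current code; the struct and the `shouldRetry` helper below are simplified, hypothetical sketches):

```go
package retry

import "net/url"

// SchemeWithRetries, reduced to the field under discussion: DoRetry, when
// non-nil, lets the caller veto further retries for a given error; nil means
// retry until the backoff gives up or a fetch succeeds.
type SchemeWithRetries struct {
	DoRetry func(u *url.URL, err error) bool
}

// shouldRetry shows how the nil check turns DoRetry into an optional hook.
func (s *SchemeWithRetries) shouldRetry(u *url.URL, err error) bool {
	if s.DoRetry != nil && !s.DoRetry(u, err) {
		return false // the caller explicitly declined another attempt
	}
	return true // no hook configured, or the hook allowed a retry
}
```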

var r io.Reader
// Note: err uses the scope outside the for loop.
r, err = s.Scheme.FetchWithoutCache(ctx, u)
for d := back.NextBackOff(); d != backoff.Stop; d = back.NextBackOff() {
Member


Same question: `<` instead of `!=`? Also, not your code, but these duplicated code blocks hint that this could be done differently. Not germane to this PR.
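
Not part of this PR, but one way the duplication between Fetch and FetchWithoutCache could be factored out is a shared helper; the `retry` function and its signature below are hypothetical:

```go
package retry

import (
	"time"

	"github.com/cenkalti/backoff/v4"
)

// retry runs fn under the given backoff policy and returns its last result.
// Fetch and FetchWithoutCache could both delegate to it, one instantiated
// with io.ReaderAt and the other with io.Reader.
func retry[T any](back backoff.BackOff, fn func() (T, error)) (T, error) {
	var result T
	var err error
	for d := back.NextBackOff(); d != backoff.Stop; d = back.NextBackOff() {
		if result, err = fn(); err == nil {
			return result, nil
		}
		time.Sleep(d)
	}
	return result, err
}
```

A call site would then look like `return retry(back, func() (io.ReaderAt, error) { return s.Scheme.Fetch(ctx, u) })`.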

@rminnich rminnich added the Awaiting author Waiting for new changes or feedback for author. label Oct 20, 2023
Member

10000TB commented Oct 20, 2023

Looks like this PR compensates for the time spent fetching by using part of the next wait time.

If the fetch runs all the way to the deadline, then it looks like we won't wait at all; we go straight to the next loop iteration and immediately start fetching again.

Could you share more details / motivation for this proposed behavior?

I think there is no single wait behavior that will work perfectly for all scenarios. I am concerned this proposal might be inspired by a specific case you are dealing with and could cause regressions for more general cases.

Member

@10000TB 10000TB left a comment


Left a comment; I hope to understand the motivation.

Author

HouseK commented Oct 21, 2023

> Looks like this PR compensates for the time spent fetching by using part of the next wait time.
>
> If the fetch runs all the way to the deadline, then it looks like we won't wait at all; we go straight to the next loop iteration and immediately start fetching again.
>
> Could you share more details / motivation for this proposed behavior?
>
> I think there is no single wait behavior that will work perfectly for all scenarios. I am concerned this proposal might be inspired by a specific case you are dealing with and could cause regressions for more general cases.

This indeed comes from a problem, or rather a non-optimal situation, that I ran into myself. I did, however, try to keep the general case in mind.

In my scenario, the call to Fetch would time out on the TCP dial every time. The dial timeout was intentionally set to a low value, for reasons that were not entirely clear to me.

My reasoning for why this change would be OK is as follows.

Most failures fail fast, and for those this change has close to zero impact. It would be a tiny fraction faster, but not noticeably so.

For failures that take as long as the duration until the next backoff, I assume that this is still a single attempt with a long timeout that at some point started to hang, which means it was effectively already sleeping and not doing anything. Sleeping even longer would go against the intent of the configuration.

Additionally, passing a limited timeout to Fetch here means that failing calls get properly canceled, which might even reduce resource usage when cancellation is properly implemented in the Fetch call.
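
To illustrate that last point, a minimal sketch of a Fetch implementation that honors cancellation, assuming an HTTP-backed scheme; `fetchHTTP` is hypothetical and not the real scheme code:

```go
package retry

import (
	"bytes"
	"context"
	"io"
	"net/http"
	"net/url"
)

// fetchHTTP aborts the request and releases its resources as soon as the
// per-attempt context passed in by the retry loop expires.
func fetchHTTP(ctx context.Context, u *url.URL) (io.ReaderAt, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, u.String(), nil)
	if err != nil {
		return nil, err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	// Reading the body also fails promptly once the context is canceled,
	// because the aborted request closes the underlying connection.
	b, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, err
	}
	return bytes.NewReader(b), nil
}
```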

Member

10000TB commented Oct 23, 2023

Thanks. To clarify, can you also share how the reduced wait helps the internal case?

Author

HouseK commented Oct 24, 2023

The reduced wait time would not actually have much impact on the internal case.

I only proposed this change because I think it would make the behavior of SchemeWithRetries more predictable.

After having thought about this and discussed with a colleague, I'm less convinced about this change.

So if you are not convinced now that this change is an improvement, let's not do it.

Member

10000TB commented Nov 2, 2023

Thank you for the update, Kobe! No objection to an improvement that solves one more case in general.

Given that none of us are convinced, and there is no positive test result backing the change, I am going to close this PR for now.

@10000TB 10000TB closed this Nov 2, 2023
Labels
Awaiting author Waiting for new changes or feedback for author.
Development

Successfully merging this pull request may close these issues.

SchemeWithRetries.Fetch improved timeout handling
3 participants