
[ES-1122621] fix receive connection re-establish taking too long (5m) #36

Merged · 3 commits merged into databricks:db_main on May 14, 2024

Conversation

@jnyi (Collaborator) commented May 13, 2024

We've seen some 503 errors when quorum is supposed to be met: the ingestor is in a running state, yet the router still can't talk to it. I was able to reproduce this behavior locally:

[Screenshot 2024-05-10 at 10:33:42 PM]

I've introduced a unit test to mimic the prod environment (a sketch of the DNS flip in step 4 follows the checklist below):

  1. Set up 2 ingestors: one at a fixed IP (ip1), one behind a DNS name that resolves to ip2
  2. Verify the router can write data to both ingestors when quorum == 2
  3. Shut down the ingestor at ip2; write requests start to fail
  4. Spawn another ingestor at ip3 and bind it to the DNS name that previously resolved to ip2, simulating an ingestor restart
  5. OLD behavior (without the handler.go changes): the unit test does not succeed, even after a while
  6. NEW behavior (with the handler.go changes): the unit test succeeds quickly
  • I added a CHANGELOG entry for this change.
  • Change is not relevant to the end user.
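
A minimal sketch of how the DNS flip in step 4 can be simulated in a Go test using gRPC's manual resolver. This is not the PR's TestIngestorRestart (which uses the receive handler's own test harness); the scheme name, target, and addresses are illustrative:

package receive_test

import (
	"testing"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	"google.golang.org/grpc/resolver"
	"google.golang.org/grpc/resolver/manual"
)

func TestDNSFlipSketch(t *testing.T) {
	// A manual resolver lets the test decide what the "DNS" name resolves to,
	// instead of relying on real DNS.
	r := manual.NewBuilderWithScheme("ingestor")
	r.InitialState(resolver.State{Addresses: []resolver.Address{{Addr: "127.0.0.2:10901"}}}) // ip2

	conn, err := grpc.NewClient(
		"ingestor:///receiver-1",
		grpc.WithResolvers(r),
		grpc.WithTransportCredentials(insecure.NewCredentials()),
	)
	if err != nil {
		t.Fatal(err)
	}
	defer conn.Close()
	conn.Connect() // kick the channel out of idle so the resolver gets built
	// (a real test issues writes here, which also forces name resolution)

	// ... route writes through this connection until they start failing (ip2 is down) ...

	// Simulate the ingestor restarting behind the same DNS name on a new IP (ip3).
	r.UpdateState(resolver.State{Addresses: []resolver.Address{{Addr: "127.0.0.3:10901"}}})

	// ... assert that writes recover quickly instead of waiting out the idle/backoff window ...
}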

Changes

Verification

Before the change:
[Screenshot 2024-05-13 at 9:45:09 PM]

After the change, with the rollout operator reduced to a 5s delay between StatefulSets, we still see a small number of 503s, but the M3 write coordinator's retries should absorb them; with a slightly longer delay such as 1m, there should be no 503s at all:

[Screenshot 2024-05-13 at 11:43:39 PM]

Signed-off-by: Yi Jin <yi.jin@databricks.com>
@jnyi changed the title from "[ES-1122621] close conn if backoff retries too many times" to "[ES-1122621] fix receive connection re-establish issue" on May 14, 2024
Signed-off-by: Yi Jin <yi.jin@databricks.com>

@hczhu-db (Collaborator) left a comment

awesome debugging.

Comment on lines 1309 to 1316
if err := p.closeUnlocked(addr); err != nil {
	return err
}

Collaborator:

Should it remember the error, but continue and try to close all connections?

@jnyi (Collaborator Author):

yep, we should do this.

		return c, nil
	}
	level.Debug(p.logger).Log("msg", "dialing peer", "addr", addr)
	conn, err := p.dialer(ctx, addr, p.dialOpts...)

Collaborator:

I searched the repo. This is the only place where p.dialer is called, which means the handler never explicitly re-establishes a connection to a DNS name. Before your change, the write errors eventually die down, so the gRPC framework must eventually pick up the correct IP and re-establish the connection under the hood, but the delay is quite long and causes many write errors.
I'm concerned there might be a reason the gRPC call site (receive handler) doesn't explicitly dial again. Should we ask the community?
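
For context on the "under the hood" reconnection: an existing *grpc.ClientConn exposes its connectivity state, so a call site can nudge it to reconnect without dialing a brand-new connection. A minimal sketch of that alternative (not what this PR ships; only the grpc-go methods are real, the wrapper function and package are illustrative):

package receive

import (
	"context"

	"google.golang.org/grpc"
	"google.golang.org/grpc/connectivity"
)

// nudgeConnection asks an existing client connection to leave IDLE or
// TRANSIENT_FAILURE and reconnect, then waits (bounded by ctx) for the
// state to move, instead of re-dialing.
func nudgeConnection(ctx context.Context, cc *grpc.ClientConn) connectivity.State {
	state := cc.GetState()
	if state == connectivity.Idle || state == connectivity.TransientFailure {
		cc.Connect() // triggers a (re)connection attempt on the existing channel
	}
	cc.WaitForStateChange(ctx, state) // returns early if ctx expires
	return cc.GetState()
}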

Collaborator:

I suspect the default gRPC dial options in Thanos are not ideal. Note that Unavailable (503) is retriable.

@jnyi (Collaborator Author):

It seems the delay is intentionally long, probably to avoid busy reconnection during normal operation:

[Screenshot 2024-05-13 at 8:34:52 PM]
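
For reference: if the long delay in question is gRPC's connection backoff (the library default, backoff.DefaultConfig, is 1s base delay, 1.6x multiplier, 0.2 jitter, 120s max delay), tightening it would look roughly like the dial option below. The values are illustrative only, and the PR ultimately adjusts the idle timeout instead:

package receive

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/backoff"
)

// tighterReconnectOption returns a dial option with a more aggressive
// reconnect backoff than grpc-go's backoff.DefaultConfig.
func tighterReconnectOption() grpc.DialOption {
	return grpc.WithConnectParams(grpc.ConnectParams{
		Backoff: backoff.Config{
			BaseDelay:  200 * time.Millisecond,
			Multiplier: 1.6,
			Jitter:     0.2,
			MaxDelay:   5 * time.Second,
		},
		MinConnectTimeout: 5 * time.Second,
	})
}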

@jnyi (Collaborator Author):

Actually, the default idle timeout of 5 minutes is the root cause; I was able to pass the unit tests with a 1-second idle timeout.
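
The knob being discussed is the client channel's idle timeout, exposed in grpc-go as grpc.WithIdleTimeout; the diff below wires it to conf.maxBackoff so it can be tuned. A minimal sketch with an illustrative hard-coded value:

package receive

import (
	"time"

	"google.golang.org/grpc"
)

// routerDialOptions returns router-side dial options with a short channel
// idle timeout: after that long without RPC activity the channel goes idle,
// and the next RPC re-resolves the target and reconnects, which picks up a
// restarted ingestor's new IP sooner. The 5s constant is illustrative only.
func routerDialOptions() []grpc.DialOption {
	return []grpc.DialOption{
		grpc.WithIdleTimeout(5 * time.Second),
	}
}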

if !ok {
	p.connections[addr] = newPeerWorker(conn, p.forwardDelay, p.asyncForwardWorkersCount)
} else {
	p.connections[addr].cc = conn

Collaborator:

What if, in reality (in contrast to the unit test), re-dialing still sees the stale IP because of some weird interaction between gRPC's DNS resolver and Kubernetes networking (CoreDNS)?

@jnyi (Collaborator Author) commented May 14, 2024:

plan to revert this one.

@@ -1735,3 +1785,91 @@ func TestHandlerFlippingHashrings(t *testing.T) {
	cancel()
	wg.Wait()
}

func TestIngestorRestart(t *testing.T) {

Collaborator:

Awesome unit test!

@jnyi jnyi requested a review from hczhu-db May 14, 2024 05:34
@@ -158,6 +158,7 @@ func runReceive(
		dialOpts = append(dialOpts, grpc.WithDefaultCallOptions(grpc.UseCompressor(conf.compression)))
	}
	if receiveMode == receive.RouterOnly {
		dialOpts = append(dialOpts, grpc.WithIdleTimeout(time.Duration(*conf.maxBackoff)))

Collaborator:

Is there a cmd flag to change conf.maxBackoff? If not, it would be better to add one so that we can tune it later without rebuilding an image.

@jnyi (Collaborator Author):

Yes, it is from a cmd flag.

defer p.m.Unlock()
var err error
for addr := range p.connections {
	err = p.closeUnlocked(addr)

Collaborator:

Should be

if er := p.closeUnlocked(addr); er != nil {
  err = er
}

@jnyi (Collaborator Author):

Updated to use multi-errors.
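
A sketch of the multi-error version using the standard library's errors.Join; the types here are trimmed stand-ins for the handler's real ones (names and fields are illustrative, based on the snippets in this thread):

package receive

import (
	"errors"
	"io"
	"sync"
)

// Trimmed-down stand-ins for the handler's types; only what closeAll needs.
type peerWorker struct{ cc io.Closer } // cc is a *grpc.ClientConn in the real code

type peerGroup struct {
	m           sync.Mutex
	connections map[string]*peerWorker
}

// closeUnlocked closes and forgets one peer; the caller must hold p.m.
func (p *peerGroup) closeUnlocked(addr string) error {
	w, ok := p.connections[addr]
	if !ok {
		return nil
	}
	delete(p.connections, addr)
	return w.cc.Close()
}

// closeAll tears down every peer connection, collecting every failure with
// errors.Join instead of stopping at the first error.
func (p *peerGroup) closeAll() error {
	p.m.Lock()
	defer p.m.Unlock()

	var errs []error
	for addr := range p.connections {
		if err := p.closeUnlocked(addr); err != nil {
			errs = append(errs, err)
		}
	}
	return errors.Join(errs...) // nil when every close succeeded
}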

Signed-off-by: Yi Jin <yi.jin@databricks.com>
@jnyi jnyi requested a review from hczhu-db May 14, 2024 16:52
@jnyi changed the title from "[ES-1122621] fix receive connection re-establish issue" to "[ES-1122621] fix receive connection re-establish taking too long (5m)" on May 14, 2024
@jnyi jnyi merged commit 971761a into databricks:db_main May 14, 2024
12 checks passed