
[ES-1122621] fix receive connection re-establish taking too long (5m) #36

Merged · 3 commits merged into databricks:db_main on May 14, 2024

Conversation

@jnyi (Collaborator) commented May 13, 2024

We've seen some 503 errors when quorum is supposed to be met: the ingestor is in a running state, yet the router still can't talk to it. I was able to reproduce this behavior locally:

[Screenshot 2024-05-10 at 10:33:42 PM]

I've introduced a unit test to mimic the prod environment (a sketch of the DNS flip in step 4 follows the checklist below):

  1. Set up 2 ingestors: one at a fixed IP (ip1), one behind a DNS name that resolves to ip2
  2. Verify the router can write data to both ingestors when quorum == 2
  3. Shut down the ingestor at ip2; write requests start to fail
  4. Spawn another ingestor at ip3 and bind it to the DNS name that previously resolved to ip2, simulating an ingestor restart
  5. OLD behavior (without the handler.go changes): the unit test does not succeed, even after a while
  6. NEW behavior (with the handler.go changes): the unit test succeeds quickly
  • I added a CHANGELOG entry for this change.
  • Change is not relevant to the end user.
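
A minimal sketch of how the DNS flip in step 4 can be simulated in a Go test using gRPC's manual resolver. This is not the PR's TestIngestorRestart (which uses the receive handler's own test harness); the scheme name, target, and addresses are illustrative:

package receive_test

import (
	"testing"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	"google.golang.org/grpc/resolver"
	"google.golang.org/grpc/resolver/manual"
)

func TestDNSFlipSketch(t *testing.T) {
	// A manual resolver lets the test decide what the "DNS" name resolves to,
	// instead of relying on real DNS.
	r := manual.NewBuilderWithScheme("ingestor")
	r.InitialState(resolver.State{Addresses: []resolver.Address{{Addr: "127.0.0.2:10901"}}}) // ip2

	conn, err := grpc.NewClient(
		"ingestor:///receiver-1",
		grpc.WithResolvers(r),
		grpc.WithTransportCredentials(insecure.NewCredentials()),
	)
	if err != nil {
		t.Fatal(err)
	}
	defer conn.Close()
	conn.Connect() // kick the channel out of idle so the resolver gets built
	// (a real test issues writes here, which also forces name resolution)

	// ... route writes through this connection until they start failing (ip2 is down) ...

	// Simulate the ingestor restarting behind the same DNS name on a new IP (ip3).
	r.UpdateState(resolver.State{Addresses: []resolver.Address{{Addr: "127.0.0.3:10901"}}})

	// ... assert that writes recover quickly instead of waiting out the idle/backoff window ...
}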

Changes

Verification

Before the change:
[Screenshot 2024-05-13 at 9:45:09 PM]

After the change, with the rollout operator reduced to a 5s delay between StatefulSets, we still see a small number of 503s, but the M3 write coordinator's retries should absorb them; with a slightly longer delay such as 1m, there should be no 503s at all:

[Screenshot 2024-05-13 at 11:43:39 PM]

Signed-off-by: Yi Jin <yi.jin@databricks.com>
@jnyi changed the title from "[ES-1122621] close conn if backoff retries too many times" to "[ES-1122621] fix receive connection re-establish issue" on May 14, 2024
Signed-off-by: Yi Jin <yi.jin@databricks.com>

@hczhu-db (Collaborator) left a comment

awesome debugging.

Comment on lines 1309 to 1316
if err := p.closeUnlocked(addr); err != nil {
	return err
}

Collaborator:

Should it remember the error, but continue and try to close all connections?

@jnyi (Collaborator Author):

yep, we should do this.

		return c, nil
	}
	level.Debug(p.logger).Log("msg", "dialing peer", "addr", addr)
	conn, err := p.dialer(ctx, addr, p.dialOpts...)

Collaborator:

I searched the repo. This is the only place where p.dialer is called, which means the handler never explicitly re-establishes a connection to a DNS name. Before your change, the write errors eventually die down, so the gRPC framework must eventually pick up the correct IP and re-establish the connection under the hood, but the delay is quite long and causes many write errors.
I'm concerned there might be a reason the gRPC call site (receive handler) doesn't explicitly dial again. Should we ask the community?
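
For context on the "under the hood" reconnection: an existing *grpc.ClientConn exposes its connectivity state, so a call site can nudge it to reconnect without dialing a brand-new connection. A minimal sketch of that alternative (not what this PR ships; only the grpc-go methods are real, the wrapper function and package are illustrative):

package receive

import (
	"context"

	"google.golang.org/grpc"
	"google.golang.org/grpc/connectivity"
)

// nudgeConnection asks an existing client connection to leave IDLE or
// TRANSIENT_FAILURE and reconnect, then waits (bounded by ctx) for the
// state to move, instead of re-dialing.
func nudgeConnection(ctx context.Context, cc *grpc.ClientConn) connectivity.State {
	state := cc.GetState()
	if state == connectivity.Idle || state == connectivity.TransientFailure {
		cc.Connect() // triggers a (re)connection attempt on the existing channel
	}
	cc.WaitForStateChange(ctx, state) // returns early if ctx expires
	return cc.GetState()
}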

Collaborator:

I suspect the default gRPC dial options in Thanos are not ideal. Note that Unavailable (503) is retriable.

@jnyi (Collaborator Author):

It seems the delay is intentionally long, probably to avoid busy reconnection during normal operation:

[Screenshot 2024-05-13 at 8:34:52 PM]
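
For reference: if the long delay in question is gRPC's connection backoff (the library default, backoff.DefaultConfig, is 1s base delay, 1.6x multiplier, 0.2 jitter, 120s max delay), tightening it would look roughly like the dial option below. The values are illustrative only, and the PR ultimately adjusts the idle timeout instead:

package receive

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/backoff"
)

// tighterReconnectOption returns a dial option with a more aggressive
// reconnect backoff than grpc-go's backoff.DefaultConfig.
func tighterReconnectOption() grpc.DialOption {
	return grpc.WithConnectParams(grpc.ConnectParams{
		Backoff: backoff.Config{
			BaseDelay:  200 * time.Millisecond,
			Multiplier: 1.6,
			Jitter:     0.2,
			MaxDelay:   5 * time.Second,
		},
		MinConnectTimeout: 5 * time.Second,
	})
}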

@jnyi (Collaborator Author):

Actually, the default idle timeout of 5 minutes is the root cause; I was able to pass the unit tests with a 1-second idle timeout.
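
The knob being discussed is the client channel's idle timeout, exposed in grpc-go as grpc.WithIdleTimeout; the diff below wires it to conf.maxBackoff so it can be tuned. A minimal sketch with an illustrative hard-coded value:

package receive

import (
	"time"

	"google.golang.org/grpc"
)

// routerDialOptions returns router-side dial options with a short channel
// idle timeout: after that long without RPC activity the channel goes idle,
// and the next RPC re-resolves the target and reconnects, which picks up a
// restarted ingestor's new IP sooner. The 5s constant is illustrative only.
func routerDialOptions() []grpc.DialOption {
	return []grpc.DialOption{
		grpc.WithIdleTimeout(5 * time.Second),
	}
}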

if !ok {
	p.connections[addr] = newPeerWorker(conn, p.forwardDelay, p.asyncForwardWorkersCount)
} else {
	p.connections[addr].cc = conn

Collaborator:

What if, in reality (in contrast to the unit test), re-dialing still sees the stale IP because of some weird interaction between gRPC's DNS resolver and Kubernetes networking (CoreDNS)?

@jnyi (Collaborator Author) commented May 14, 2024:

plan to revert this one.

@@ -1735,3 +1785,91 @@ func TestHandlerFlippingHashrings(t *testing.T) {
	cancel()
	wg.Wait()
}

func TestIngestorRestart(t *testing.T) {

Collaborator:

Awesome unit test!

@jnyi jnyi requested a review from hczhu-db May 14, 2024 05:34
@@ -158,6 +158,7 @@ func runReceive(
		dialOpts = append(dialOpts, grpc.WithDefaultCallOptions(grpc.UseCompressor(conf.compression)))
	}
	if receiveMode == receive.RouterOnly {
		dialOpts = append(dialOpts, grpc.WithIdleTimeout(time.Duration(*conf.maxBackoff)))

Collaborator:

Is there a cmd flag to change conf.maxBackoff? If not, it would be better to add one so that we can tune it later without rebuilding an image.

@jnyi (Collaborator Author):

Yes, it is from a cmd flag.

defer p.m.Unlock()
var err error
for addr := range p.connections {
	err = p.closeUnlocked(addr)

Collaborator:

Should be

if er := p.closeUnlocked(addr); er != nil {
  err = er
}

@jnyi (Collaborator Author):

Updated to use multi-errors.
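
A sketch of the multi-error version using the standard library's errors.Join; the types here are trimmed stand-ins for the handler's real ones (names and fields are illustrative, based on the snippets in this thread):

package receive

import (
	"errors"
	"io"
	"sync"
)

// Trimmed-down stand-ins for the handler's types; only what closeAll needs.
type peerWorker struct{ cc io.Closer } // cc is a *grpc.ClientConn in the real code

type peerGroup struct {
	m           sync.Mutex
	connections map[string]*peerWorker
}

// closeUnlocked closes and forgets one peer; the caller must hold p.m.
func (p *peerGroup) closeUnlocked(addr string) error {
	w, ok := p.connections[addr]
	if !ok {
		return nil
	}
	delete(p.connections, addr)
	return w.cc.Close()
}

// closeAll tears down every peer connection, collecting every failure with
// errors.Join instead of stopping at the first error.
func (p *peerGroup) closeAll() error {
	p.m.Lock()
	defer p.m.Unlock()

	var errs []error
	for addr := range p.connections {
		if err := p.closeUnlocked(addr); err != nil {
			errs = append(errs, err)
		}
	}
	return errors.Join(errs...) // nil when every close succeeded
}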

Signed-off-by: Yi Jin <yi.jin@databricks.com>
@jnyi jnyi requested a review from hczhu-db May 14, 2024 16:52
@jnyi changed the title from "[ES-1122621] fix receive connection re-establish issue" to "[ES-1122621] fix receive connection re-establish taking too long (5m)" on May 14, 2024
@jnyi jnyi merged commit 971761a into databricks:db_main May 14, 2024
12 checks passed