
Failed to check key count on partition #130

Open
hulucc opened this issue Nov 30, 2021 · 17 comments
Assignees
buraksezer
Labels
question (Further information is requested)

Comments

@hulucc

hulucc commented Nov 30, 2021

(screenshot of the error attached)
v0.4.0: after 7 days of running, it started showing this error.

@buraksezer buraksezer self-assigned this Dec 1, 2021
@buraksezer buraksezer added the question label Dec 1, 2021
@ShawnHsiung
Contributor

ShawnHsiung commented Dec 5, 2021

Failed to check key count on partition: 254: read tcp 10.233.85.149:48176->10.233.85.149:3320: i/o timeout => distribute.go:65

Failed to check key count on partition: 160: context deadline exceeded => distribute.go:65

I hit the same log; it may be caused by a membership exception.

In my environment I deployed 2 pods in a Kubernetes cluster, but they did not form a cluster because of a weird exception: the pods' status showed Running, but one pod was actually broken (I could not kubectl exec into it, though docker exec still worked). I added a livenessProbe as a temporary workaround.

@daniel-yavorovich

The same issue from all members:

rpc-proxy-cacher-6c5d77d85-fp5fw rpc-proxy-cacher 2021/12/13 21:15:45 [ERROR] Failed to update routing table on 10.31.2.49:3320: EOF => update.go:58
rpc-proxy-cacher-6c5d77d85-fp5fw rpc-proxy-cacher 2021/12/13 21:15:48 [ERROR] Failed to update routing table on 10.31.2.130:3320: context deadline exceeded => update.go:58
rpc-proxy-cacher-6c5d77d85-fp5fw rpc-proxy-cacher 2021/12/13 21:15:48 [ERROR] Failed to update routing table on 10.31.2.20:3320: context deadline exceeded => update.go:58
rpc-proxy-cacher-6c5d77d85-fp5fw rpc-proxy-cacher 2021/12/13 21:15:48 [ERROR] Failed to update routing table on 10.31.2.122:3320: context deadline exceeded => update.go:58
rpc-proxy-cacher-6c5d77d85-fp5fw rpc-proxy-cacher 2021/12/13 21:15:48 [ERROR] Failed to update routing table on cluster: EOF => routingtable.go:238
rpc-proxy-cacher-6c5d77d85-fp5fw rpc-proxy-cacher 2021/12/13 21:15:49 [INFO] Routing table has been pushed by 10.31.5.66:3320 => operations.go:87

The problem is reproduced only under production traffic, so it is difficult to catch in synthetic tests.

@buraksezer
Owner

Hi all,

I'm so sorry for my late response. The context deadline exceeded string is raised by the following function.

func (c *Client) conn(addr string) (net.Conn, error) {
	p, err := c.pool(addr)
	if err != nil {
		return nil, err
	}

	var (
		ctx    context.Context
		cancel context.CancelFunc
	)
	if c.config.PoolTimeout > 0 {
		ctx, cancel = context.WithTimeout(context.Background(), c.config.PoolTimeout)
		defer cancel()
	} else {
		ctx = context.Background()
	}

	conn, err := p.Get(ctx)
	if err != nil {
		return nil, err
	}

	if c.config.HasTimeout() {
		// Wrap the net.Conn to implement timeout logic
		conn = NewConnWithTimeout(conn, c.config.ReadTimeout, c.config.WriteTimeout)
	}
	return conn, err
}

This function tries to acquire a new connection from the pool for the given address, but it cannot do that. In order to inspect the problem, you may want to:

  • Disable PoolTimeout,
  • Increase the number of allowed connections in the pool (see the sketch below).
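
For example, here is a minimal sketch of both knobs in embedded-member mode. It assumes the v0.4.x config API, where these settings live under Config.Client and map to the client.maxConn and client.poolTimeout directives in olricd.yaml; check config/client.go for the exact field names in your release.

package main

import (
	"log"

	"github.com/buraksezer/olric"
	"github.com/buraksezer/olric/config"
)

func main() {
	c := config.New("lan")
	if c.Client == nil {
		c.Client = &config.Client{}
	}
	// A zero PoolTimeout skips the context.WithTimeout branch in conn()
	// above, so pool acquisition waits without a deadline.
	c.Client.PoolTimeout = 0
	// Allow more pooled connections per address (maxConn).
	c.Client.MaxConn = 256

	db, err := olric.New(c)
	if err != nil {
		log.Fatal(err)
	}
	_ = db // call db.Start() as usual
}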

I think the actual problem is much deeper, because I see the following lines in the logs:

rpc-proxy-cacher-6c5d77d85-fp5fw rpc-proxy-cacher 2021/12/13 21:15:45 [ERROR] Failed to update routing table on 10.31.2.49:3320: EOF => update.go:58
...
rpc-proxy-cacher-6c5d77d85-fp5fw rpc-proxy-cacher 2021/12/13 21:15:48 [ERROR] Failed to update routing table on cluster: EOF => routingtable.go:238

and (from ShawnHsiung)

Failed to check key count on partition: 254: read tcp 10.233.85.149:48176->10.233.85.149:3320: i/o timeout => distribute.go:65

It seems that the network is unstable in these cases. Could you please give more information about your setup? That is Kubernetes, right?

Source:

ctx, cancel = context.WithTimeout(context.Background(), c.config.PoolTimeout)

@buraksezer
Owner

Hey,

I double-checked the suspected parts of the code. Everything looks fine to me. I'm going to try to create load locally and inject errors into the cluster. It'll take some time. Could you give some info about your setup and configuration? Could you please share the maxConn and poolTimeout directives from your config? How many members do you have in the cluster?

Your help is appreciated

@buraksezer
Owner

buraksezer commented Dec 20, 2021

Another question, does the cluster work seamlessly despite these errors?

buraksezer added a commit that referenced this issue Dec 20, 2021
@buraksezer
Owner

I just released v0.4.1. It improves error messages. I hope this version will increase our understanding of this problem.

See https://github.com/buraksezer/olric/releases/tag/v0.4.1

Probably there is nothing wrong with the code, but the network is unstable somehow. Because there is no proper and immediate way to close dead TCP sockets in the pool, Olric may sometimes try to use a dead TCP socket to send messages over the wire. This can explain unexpected EOF or broken pipe style error messages in the logs. Olric is designed to overcome this kind of problem. For example, the cluster coordinator constantly inspects the key counts on the partitions. If this check fails occasionally, that error should be no problem at all.

I need to know the answers to these questions to debug the problem:

  • Does the network between pods/containers/virtual machines function properly? How do you know?
  • Does the cluster continue to function properly despite these errors?

@hulucc
Author

hulucc commented Jan 4, 2022

Hey,

I double-checked the suspected parts of the code. Everything looks fine to me. I'm going to try to create load locally and inject errors into the cluster. It'll take some time. Could you give some info about your setup and configuration? Could you please share the maxConn and poolTimeout directives from your config? How many members do you have in the cluster?

Your help is appreciated

I think increasing maxConn and poolTimeout will just delay the problem, not solve it. I can reproduce the problem with a single instance, so I don't think it's a network instability problem. And yes, I'm on Kubernetes. v0.4.0-beta.2 was fine.

@hulucc
Author

hulucc commented Jan 4, 2022

I just released v0.4.1. It improves error messages. I hope this version will increase our understanding of this problem.

See https://github.com/buraksezer/olric/releases/tag/v0.4.1

Probably there is nothing wrong with the code, but the network is unstable somehow. Because there is no proper and immediate way to close dead TCP sockets in the pool, Olric may sometimes try to use a dead TCP socket to send messages over the wire. This can explain unexpected EOF or broken pipe style error messages in the logs. Olric is designed to overcome this kind of problem. For example, the cluster coordinator constantly inspects the key counts on the partitions. If this check fails occasionally, that error should be no problem at all.

I need to know the answers to these questions to debug the problem:

  • Does the network between pods/containers/virtual machines function properly? How do you know?
  • Does the cluster continue to function properly despite these errors?

The network may have been interrupted, but it was OK eventually and Olric could not recover (I know this because restarting the cluster fixes the problem). And the cluster cannot work properly while it is having these errors.

@buraksezer
Owner

@hulucc

There is a configuration option for TCP KeepAlives.

If you use the YAML file to configure the nodes:
https://github.com/buraksezer/olric/blob/v0.4.2/cmd/olricd/olricd.yaml#L15

For embedded-member mode:
https://github.com/buraksezer/olric/blob/v0.4.2/config/client.go#L55

It is 300s in the sample configuration file but there is no default value currently.

I think the root cause behind this story is buraksezer/connpool. It somehow fails to remove dead connections from the pool. I checked everything, but I cannot find any problem with it. We can also replace it with jackc/puddle.

You may try to reduce the TCP connection keep-alive. It seems that the Kubernetes networking layer drops connections without a proper termination phase.
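
For example, a minimal sketch for embedded-member mode. It assumes the v0.4.x config API, where config.Config.KeepAlivePeriod is the server-side knob and mirrors the keepAlivePeriod directive in the olricd.yaml linked above; the client pool has its own keep-alive setting in config/client.go.

package main

import (
	"time"

	"github.com/buraksezer/olric/config"
)

func main() {
	c := config.New("lan")
	// Server-side TCP keepalive. Zero (the current default) disables it;
	// 300s matches the sample olricd.yaml. Lower it if the Kubernetes
	// network layer silently drops idle connections sooner than that.
	c.KeepAlivePeriod = 300 * time.Second
	_ = c // pass to olric.New(c) as usual
}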

And the cluster cannot work properly while it is having these errors.

Failed to check key count on partition... is an unimportant error message. The coordinator node constantly tries to inspect the key count on the partitions. If one of the requests fails, it will try again shortly after.

Is there any error message in the logs? I need more details to debug that problem.

@buraksezer
Owner

I saw this function in the Spiracle repository: https://github.com/LilithGames/spiracle/blob/ed2af92da1d13f1afa11b6a354db33048930655b/infra/db/db.go#L25

If you still use the same function or similar one, config.Config.KeepAlivePeriod is zero. That means TCP keepalive is disabled by default. I think we need to set a default value.

Could you please try to set 300s or lower? This is how it works in the server code:

if s.config.KeepAlivePeriod != 0 {

This is how Redis handles keepalives: https://redis.io/topics/clients#tcp-keepalive
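
Roughly, the effect of that check is to turn on OS-level keepalive probes for each accepted connection. The snippet below only shows the general shape using the standard library; it is not a verbatim copy of Olric's server code, and configureKeepAlive is a hypothetical helper name.

package keepalive

import (
	"net"
	"time"
)

// configureKeepAlive mirrors the check above: when the period is non-zero,
// enable TCP keepalive probes so dead peers are detected even while the
// connection sits idle in a pool.
func configureKeepAlive(conn net.Conn, period time.Duration) {
	if period == 0 {
		return // keepalive disabled, as when KeepAlivePeriod is unset
	}
	if tc, ok := conn.(*net.TCPConn); ok {
		_ = tc.SetKeepAlive(true)
		_ = tc.SetKeepAlivePeriod(period)
	}
}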

@hulucc
Author

hulucc commented Jan 6, 2022

I saw this function in the Spiracle repository: https://github.com/LilithGames/spiracle/blob/ed2af92da1d13f1afa11b6a354db33048930655b/infra/db/db.go#L25

If you still use the same function or similar one, config.Config.KeepAlivePeriod is zero. That means TCP keepalive is disabled by default. I think we need to set a default value.

Could you please try to set 300s or lower? This is how it works in the server code:

if s.config.KeepAlivePeriod != 0 {

This is how Redis handles keepalives: https://redis.io/topics/clients#tcp-keepalive

OK, I will try this. However, I can get this error with a single embedded instance (all reads and writes come from the same process, and the port is exposed only for debugging). Does Olric keep some TCP connections to itself?

@ivano85

ivano85 commented Sep 29, 2022

Hello, I embedded Olric and I have the same issue. At the moment I'm using v0.4.5. This is my log:

2022/09/29 23:09:14 [ERROR] Failed to check key count on partition: 269: failed to acquire a TCP connection from the pool for: 192.168.1.2:3320 timeout exceeded => distribute.go:65
2022/09/29 23:09:17 [ERROR] Failed to check key count on partition: 270: failed to acquire a TCP connection from the pool for: 192.168.1.2:3320 timeout exceeded => distribute.go:65
2022/09/29 23:09:20 [ERROR] Failed to update routing table on 192.168.1.2:3320: failed to acquire a TCP connection from the pool for: 192.168.1.2:3320 timeout exceeded => update.go:58
2022/09/29 23:09:20 [ERROR] Failed to update routing table on cluster: failed to acquire a TCP connection from the pool for: 192.168.1.2:3320 timeout exceeded => routingtable.go:238

I don't understand why there are 270 partitions, because I just use a couple of maps with just a few keys each. Anyway, what happens here is that during this phase any operation seems to hang until the error appears for partition 270; then the pending operations continue and it starts again from the first partition.

But why does Olric need to open a TCP connection to itself?

This is my initialization code:

c := olric_conf.New("lan")
ctx, cancel := context.WithCancel(context.Background())
c.Started = func() {
    defer cancel()
}

olricSD := &OlricEmbeddedDiscovery{}
olricSD.olricConfig = c
sd := make(map[string]interface{})
sd["plugin"] = olricSD
c.ServiceDiscovery = sd

var err error
olricDb, err = olric.New(c)
if err != nil {
    log.Fatal(err)
}

go func() {
    err = olricDb.Start()
    if err != nil {
        fmt.Printf("********* Error starting Olric: %v\n", err)
        olricDb.Shutdown(ctx)
        cancel()
    }
}()

<-ctx.Done()

// Bootstrapping fs
stats, err := olricDb.Stats()
for id, m := range stats.ClusterMembers {
    fmt.Printf("%d -> %s\n", id, m.Name)
}

clusterUuid = stats.Member.ID

ownMap, err = olricDb.NewDMap(fmt.Sprintf("cluster-%d", stats.Member.ID))
if err != nil {
    return err
}
ownMap.Put("UUID", GetConfig().UUID)
ownMap.Put("node.uuid", GetConfig().UUID)
ownMap.Put("node.name", GetConfig().Name)

@buraksezer
Owner

Hello,

At the moment I'm using v0.4.5.

Please consider upgrading to the latest stable version. The connection pooling dependency has received a few bug fixes.

I don't understand why there are 270 partitions because I just use a few maps with just a few keys each.

Olric is a partitioned data store. That mechanism distributes all of your data among the cluster members. You can increase or decrease the partition count, but prime numbers are good for an even distribution.
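
For illustration, a minimal sketch assuming the v0.4.x embedded config: PartitionCount defaults to 271 (a prime), which is why the logs above count up to partition 270 (IDs start at 0).

package main

import "github.com/buraksezer/olric/config"

func main() {
	c := config.New("lan")
	// 271 is the default. Keep the value prime for an even key distribution;
	// a small, low-traffic cluster could use a smaller prime, e.g. 71.
	c.PartitionCount = 71
	_ = c // pass to olric.New(c) as usual
}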

But why does Olric need to open a TCP connection to itself?

It is not ideal; I know this, but it's completely unimportant. The coordinator node fetches all available nodes from the memberlist instance and sends a request to learn some statistical data from the members. The payload is quite low. I didn't want to add another code path to keep the complexity low. As I mentioned, the payload is only a few bytes long.

2022/09/29 23:09:14 [ERROR] Failed to check key count on partition: 269: failed to acquire a TCP connection from the pool for: 192.168.1.2:3320 timeout exceeded => distribute.go:65

Does Olric run in your local environment or on a Docker network? Something must be different in your environment. I need to know that difference to reproduce the problem on my side.

Did you check the previous messages?

@buraksezer
Owner

For example, do you use any firewall software which blocks Olric? Something should be different in your network.

@ivano85

ivano85 commented Sep 29, 2022

Sorry, maybe I should have mentioned that the software embedding Olric runs on a Raspberry Pi, and I'm testing it with up to three nodes (most of the time just one). All nodes run on my home Wi-Fi network (I suppose that causes most of the issues when multiple nodes are involved).

Unfortunately, the issue usually happens only after running for several days (as the issue's creator also reported), so it's very difficult to understand. To make it harder, it happens even when just one node runs all the time, so it doesn't seem to be directly related to network failures...

Anyway, I will wait for the issue to appear again so I can analyze it more accurately, because I'm dealing with other changes at the moment. I will let you know.

@buraksezer
Owner

Hello, I see. Olric v0.4.7 has this fix: buraksezer/connpool#4. This may fix the problem.

@ivano85

ivano85 commented Sep 29, 2022

It looks promising! I will test it in the next days and let you know 😉
