c-s load failed during cluster rolling restart - failed to get QUORUM, not enough replicas available #18647
Comments
That reactor stall is not new (see #13758 (comment) and https://github.com/scylladb/scylla-enterprise/issues/3963#issue-2161024203 - and I remember (but can't find right now!) more). |
@juliayakovlev - where's the kernel stack? |
@juliayakovlev - what encryption was configured here, btw? Client <-> server? server <-> server? both? |
If you mean the original file - it's in the node logs (https://cloudius-jenkins-test.s3.amazonaws.com/a6bbb535-3cf6-4f8b-b742-40ef856170ea/20240512_082401/db-cluster-a6bbb535.tar.gz) |
both |
I couldn't find a single kernel stack in the logs. All empty? |
The file is named "kallsyms_20240512_075635". In the node log I see only:
Not sure what it means |
this is exactly what I mean - I don't see any kernel stack. |
A run from last week passed this nemesis with success. @juliayakovlev, let's give it another run to see if it's reproducible. |
I hate this, I hate this. This isn't the first (or 100th) time we are debugging why queries failed for unclear reasons. But Scylla knows very well which replicas were available, which were queried, which failed, and what reasons they presented. Why can't we just make it tell us? |
Do you expect a log on the coordinator for every drop? |
Log? No, I would like more information to be added to the error returned to the client, more than just the number of replicas which failed. |
Also, if cassandra-stress retries each operation 10 times, it should print all 10 errors, not just the last one. It should also report the coordinator - the client knows which coordinator it picked. All we know from these errors is:

Wouldn't it be better if the error reports narrowed down the problem more? With this, we don't even know if the restarted node was a coordinator or a replica, let alone why the replica failed, or what's up with the third, uncontacted replica. |
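Purely to illustrate the kind of reporting asked for above (this is not cassandra-stress code; the contact point, keyspace, statement and retry count are placeholders), a Python sketch of a retry loop that keeps every error from every attempt and reports them all, instead of surfacing only the last one:

```python
# Hypothetical sketch, not cassandra-stress: collect the error from every
# retry attempt so that a final failure reports all of them, not just the last.
import time

from cassandra.cluster import Cluster

RETRIES = 10  # matches the retry count mentioned above

cluster = Cluster(["127.0.0.1"])          # placeholder contact point
session = cluster.connect("keyspace1")    # placeholder keyspace

def execute_with_full_error_report(statement):
    errors = []
    for attempt in range(RETRIES):
        try:
            return session.execute(statement)
        except Exception as exc:          # Unavailable, read/write timeouts, ...
            # Keep every failure instead of dropping the earlier ones.
            errors.append(f"attempt {attempt + 1}: {exc!r}")
            time.sleep(0.1)
    raise RuntimeError("all retries failed:\n" + "\n".join(errors))
```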
The current protocol does not provide more information - https://github.com/apache/cassandra/blob/6bae4f76fb043b4c3a3886178b5650b280e9a50b/doc/native_protocol_v4.spec#L1076 |
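For reference, a minimal sketch (assuming the DataStax Python driver; the contact point, keyspace and table are placeholders) of everything a client can actually extract from a v4 Unavailable error today - the consistency level and the required/alive replica counts, with no coordinator identity and no per-replica failure reasons:

```python
# Minimal sketch of the information carried by the native-protocol v4
# Unavailable error, as surfaced by the Python driver's Unavailable exception.
from cassandra import ConsistencyLevel, Unavailable
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["127.0.0.1"])          # placeholder contact point
session = cluster.connect("keyspace1")    # placeholder keyspace

stmt = SimpleStatement(
    "SELECT * FROM standard1 LIMIT 1",    # placeholder table
    consistency_level=ConsistencyLevel.QUORUM,
)

try:
    session.execute(stmt)
except Unavailable as exc:
    # These three fields are all the protocol exposes.
    print(f"consistency={exc.consistency}, "
          f"required_replicas={exc.required_replicas}, "
          f"alive_replicas={exc.alive_replicas}")
```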
The issue was not reproduced in https://argus.scylladb.com/test/98050732-dfe3-464c-a66a-f235bad30829/runs?additionalRuns[]=9adcc62d-4f9f-4b92-9316-87279f4c1b92 run |
One day we can improve the tools to provide more information; anyway, it's a side track to this issue. |
Try to reproduce scylladb/scylladb#18647 issue, run the test with ClusterRollingRestart nemesis only
Reproducer with rolling restart cluster nemesis only.

Packages
Scylla version:
Kernel Version:

Installation details
Cluster size: 6 nodes (i4i.4xlarge)
Scylla Nodes used in this run:
OS / Image:
Test:
Logs:
|
@juliayakovlev - anything relevant in the replica logs at the time of failure? |
@roydahan - please open a tracking issue for this. Sounds like an easy additional log in the stress tool (c-s?) that could help us. |
@kbr-scylla suspects it's a duplicate of #15899, let's fix #15899 and re-test this |
Issue was not reproduced with Scylla version |
I did not find anything new |
@juliayakovlev could you please also check if it reproduces on 5.4? |
@juliayakovlev wrote 2 days ago:
|
Sorry I missed it. In this case this is a regression and it is not a duplicate of #15899 (which, according to the report, happened way back in 5.1)! I think it's a major issue -- availability disruption during rolling restart. Giving it P1 priority and release blocker status.

Actually, I already have a suspicion about what the cause could be: removing wait-for-gossip-to-settle on node restart before "completing initialization" :( (cc @kostja @gleb-cloudius) 65cfb9b

We should retest with that final wait-for-gossip-to-settle restored (65cfb9b removed two waits -- I believe we only need the second one for preserving availability). If so, we should consider:
|
Modified original post (this is a regression) |
I think that servers that were considered restarted and joined the cluster do not
The gap grows and grows, until eventually there isn't enough capacity and we reach a timeout. It's probably not a regression, just an issue we may have in general; we need to research it.
|
Packages
Scylla version:
Kernel Version:

Installation details
Cluster size: 6 nodes (i4i.4xlarge)
Scylla Nodes used in this run:
OS / Image:
Test:
Logs:
|
@juliayakovlev @roydahan how does this rolling restart nemesis decide that it can restart the next node? |
once a node is listening on CQL, it moves to the next one |
Since a couple of days ago we verify the CQL port much more often (it was every 60 seconds, now every 5), so the issue could be emphasized. |
There was a long thread about it not being enough and the need for additional checks (scylladb/scylla-ccm#564 implemented this on CCM, and I believe there was a similar issue for dtest?). Specifically, ensure all OTHER nodes see that node as alive and owning its share of the ring? |
Yes, just checking CQL port is not enough. But it worked. We still need to figure out what changed. |
Before 65cfb9b, an open CQL port meant that gossip had settled. After this commit, it no longer does. |
I know :) But we did not confirm it yet. Also, why does gossip settling guarantee that all nodes see all other nodes as alive? Maybe it is just because it takes time, and it does not guarantee it in reality. |
That's my guess too -- there was no guarantee, but since wait-for-gossip-to-settle always took at least a few seconds, in practice the observable result was that all nodes saw this one as UP before continuing rolling restart on the next node. |
A strong sense of déjà vu here, around this question. But what's the next step? How can a user do a safe rolling restart with this version? |
The proper procedure for rolling restart was always to wait for the CQL port and wait for all nodes to see the restarted node as UP. |
BTW our docs are vague about it. Step 5 says
but it doesn't say that
And we have to admit that it's pretty inconvenient to have to connect to every node and execute |
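To make the missing check concrete, here is a sketch (a hypothetical helper, not CCM or SCT code; it assumes passwordless SSH and nodetool on every node) that polls nodetool status on all the other nodes until the restarted node is reported as UN everywhere, in addition to the CQL-port wait:

```python
# Sketch: after restarting a node, wait until every OTHER node reports the
# restarted node's IP as UN (Up/Normal) before restarting the next one.
import subprocess
import time

def node_is_un_everywhere(restarted_ip, other_nodes, timeout=300):
    deadline = time.time() + timeout
    while time.time() < deadline:
        views = []
        for host in other_nodes:
            # Assumes passwordless SSH to the cluster nodes.
            out = subprocess.run(
                ["ssh", host, "nodetool", "status"],
                capture_output=True, text=True, check=True,
            ).stdout
            views.append(any(
                line.startswith("UN") and restarted_ip in line
                for line in out.splitlines()
            ))
        if all(views):
            return True
        time.sleep(5)
    return False
```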
The node may do it itself before opening the CQL port, like it does with the shutdown notification, but this is not what "waiting for gossiper to settle" was doing, so this is a different feature request. |
Anyway, SCT in this case doesn't even do it on a single node. |
We can ask our Field engineers. @tarzanek could you help answer this? But I suspect the manual drain is redundant -- graceful shutdown should already drain automatically before stopping the process. |
It's in Siren's code. But this is an OSS issue, so I won't paste the link. Generally, we do. And we have a timeout between drain and restart too, btw. |
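For illustration only (this is not Siren's code; the host access, commands and pause length are assumptions), a sketch of the drain-then-restart sequence with a pause between the drain and the restart, as described above:

```python
# Sketch of a per-node restart with an explicit drain and a pause in between.
import subprocess
import time

def restart_node(host, pause_seconds=60):     # pause length is an arbitrary placeholder
    def run(*cmd):
        # Assumes passwordless SSH and sudo on the target node.
        subprocess.run(["ssh", host, *cmd], check=True)

    run("nodetool", "drain")                  # flush memtables, stop accepting traffic
    time.sleep(pause_seconds)                 # the "timeout between drain and restart"
    run("sudo", "systemctl", "stop", "scylla-server.service")
    run("sudo", "systemctl", "start", "scylla-server.service")
```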
Packages

Scylla version: 5.5.0~dev-20240510.28791aa2c1d3 with build-id 893c2a68becf3d3bcbbf076980b1b831b9b76e29
Kernel Version: 5.15.0-1060-aws
Issue description

Cassandra-stress load (writes and reads) failed while disrupt_rolling_restart_cluster - failed to get QUORUM, not enough replicas available.

This nemesis restarts Scylla on all nodes (one by one) by running sudo systemctl stop scylla-server.service and then sudo systemctl start scylla-server.service.

Nodes order to restart:

The load failures happened after longevity-tls-50gb-3d-master-db-node-a6bbb535-6 was restarted and initialisation was completed.

During Scylla start, very high foreground writes are observed on the longevity-tls-50gb-3d-master-db-node-a6bbb535-6 node. Writes started to fail while Scylla was stopping.

where the red line is the longevity-tls-50gb-3d-master-db-node-a6bbb535-6 node.

Reactor stalls (32ms) and kernel callstacks: kallsyms_20240512_075635_result.log
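For context, a minimal sketch (helper names and SSH access are assumptions; this is not the SCT nemesis code) of the restart sequence described above - stop and start scylla-server on each node in turn, with an open CQL port as the only readiness gate before moving to the next node:

```python
# Sketch of a node-by-node restart that waits only for the CQL port.
import socket
import subprocess
import time

def wait_for_cql(host, port=9042, timeout=600):
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            with socket.create_connection((host, port), timeout=5):
                return
        except OSError:
            time.sleep(5)
    raise TimeoutError(f"{host}:{port} did not open within {timeout}s")

def rolling_restart(nodes):
    for host in nodes:
        # Assumes passwordless SSH and sudo on the nodes.
        subprocess.run(["ssh", host, "sudo", "systemctl", "stop", "scylla-server.service"], check=True)
        subprocess.run(["ssh", host, "sudo", "systemctl", "start", "scylla-server.service"], check=True)
        wait_for_cql(host)   # the only readiness check before the next node
```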
Impact

Load failed.

How frequently does it reproduce?
Installation details

Cluster size: 6 nodes (i4i.4xlarge)
Scylla Nodes used in this run:

OS / Image: ami-0b7480423a402aa95 (aws: undefined_region)

Test: longevity-50gb-3days-test
Test id: a6bbb535-3cf6-4f8b-b742-40ef856170ea
Test name: scylla-master/tier1/longevity-50gb-3days-test
Test config file(s):

Logs and commands
$ hydra investigate show-monitor a6bbb535-3cf6-4f8b-b742-40ef856170ea
$ hydra investigate show-logs a6bbb535-3cf6-4f8b-b742-40ef856170ea
Logs:
Jenkins job URL
Argus