
nodetool repair command failed with exit code 3 during drop keyspace #18490

Open
aleksbykov opened this issue May 1, 2024 · 11 comments · May be fixed by #18752

@aleksbykov
Contributor

Packages

Scylla version: 5.5.0~dev-20240419.a5dae74aee4b with build-id 13e359437a7035b8b40bcd744a70b6abddbc491b

Kernel Version: 5.15.0-1060-aws

Issue description

  • This issue is a regression.
  • It is unknown if this issue is a regression.

The issue looks like a regression.

During the no_corrupt_repair nemesis, nodetool repair is started in the background, and keyspaces are dropped while it is running.

The keyspace drop failed with the following error:

Traceback (most recent call last):
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 1594, in disrupt_no_corrupt_repair
    session.execute(SimpleStatement(
  File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/common.py", line 1824, in execute_verbose
    return execute_orig(*args, **kwargs)
  File "cassandra/cluster.py", line 2699, in cassandra.cluster.Session.execute
  File "cassandra/cluster.py", line 5018, in cassandra.cluster.ResponseFuture.result
cassandra.InvalidRequest: Error from server: code=2200 [Invalid query] message="Cannot drop non existing table 'standard1' in keyspace 'drop_table_during_repair_ks_8'."

Similar to issue #18479, but in addition nodetool repair failed with the following error:

Command: '/usr/bin/nodetool  repair '

Exit code: 3

Stdout:

[2024-04-20 21:54:22,765] Repair session 5
[2024-04-20 21:54:25,387] Repair session 5 finished
[2024-04-20 21:54:25,392] Starting repair command #6, repairing 1 ranges for keyspace drop_table_during_repair_ks_6 (parallelism=SEQUENTIAL, full=true)
[2024-04-20 21:54:25,392] Repair session 6
[2024-04-20 21:54:27,813] Repair session 6 finished
[2024-04-20 21:54:27,817] Starting repair command #7, repairing 1 ranges for keyspace drop_table_during_repair_ks_9 (parallelism=SEQUENTIAL, full=true)
[2024-04-20 21:54:27,817] Repair session 7
[2024-04-20 21:54:30,438] Repair session 7 finished
[2024-04-20 21:54:30,440] Starting repair command #8, repairing 1 ranges for keyspace scylla_bench (parallelism=SEQUENTIAL, full=true)
[2024-04-20 21:54:30,440] Repair session 8

Stderr:

Repair session 8 failed

The following messages were found in the log of node 1:

2024-04-20T22:19:30.955+00:00 longevity-twcs-48h-master-db-node-e5517f29-1     !INFO | scylla[18995]:  [shard 0:stmt] migration_manager - Drop table 'drop_table_during_repair_ks_8.standard1'
2024-04-20T22:19:30.955+00:00 longevity-twcs-48h-master-db-node-e5517f29-1     !INFO | scylla[18995]:  [shard 0:main] schema_tables - Dropping drop_table_during_repair_ks_8.standard1 id=473b5c80-ff60-11ee-be67-1bff0286bc41 version=473b5c81-ff60-11ee-be67-1bff0286bc41
2024-04-20T22:19:30.955+00:00 longevity-twcs-48h-master-db-node-e5517f29-1     !INFO | scylla[18995]:  [shard 0:main] database - Dropping drop_table_during_repair_ks_8.standard1 with auto-snapshot
2024-04-20T22:19:30.955+00:00 longevity-twcs-48h-master-db-node-e5517f29-1     !INFO | scylla[18995]:  [shard 0:main] database - Truncating drop_table_during_repair_ks_8.standard1 with auto-snapshot


2024-04-20T22:20:09.706+00:00 longevity-twcs-48h-master-db-node-e5517f29-1     !INFO | scylla[18995]:  [shard 0:main] database - Truncated drop_table_during_repair_ks_8.standard1
2024-04-20T22:20:09.706+00:00 longevity-twcs-48h-master-db-node-e5517f29-1     !INFO | scylla[18995]:  [shard 0:main] compaction_manager - Stopping 1 tasks for 0 ongoing compactions for table drop_table_during_repair_ks_8.standard1 compaction_group=0 due to table removal
2024-04-20T22:20:09.706+00:00 longevity-twcs-48h-master-db-node-e5517f29-1     !INFO | scylla[18995]:  [shard 5:main] compaction_manager - Stopping 1 tasks for 0 ongoing compactions for table drop_table_during_repair_ks_8.standard1 compaction_group=0 due to table removal
2024-04-20T22:20:09.706+00:00 longevity-twcs-48h-master-db-node-e5517f29-1     !INFO | scylla[18995]:  [shard 2:main] compaction_manager - Stopping 1 tasks for 0 ongoing compactions for table drop_table_during_repair_ks_8.standard1 compaction_group=0 due to table removal
2024-04-20T22:20:09.706+00:00 longevity-twcs-48h-master-db-node-e5517f29-1     !INFO | scylla[18995]:  [shard 4:main] compaction_manager - Stopping 1 tasks for 0 ongoing compactions for table drop_table_during_repair_ks_8.standard1 compaction_group=0 due to table removal
2024-04-20T22:20:09.706+00:00 longevity-twcs-48h-master-db-node-e5517f29-1     !INFO | scylla[18995]:  [shard 1:main] compaction_manager - Stopping 1 tasks for 0 ongoing compactions for table drop_table_during_repair_ks_8.standard1 compaction_group=0 due to table removal
2024-04-20T22:20:09.707+00:00 longevity-twcs-48h-master-db-node-e5517f29-1     !INFO | scylla[18995]:  [shard 6:main] compaction_manager - Stopping 1 tasks for 0 ongoing compactions for table drop_table_during_repair_ks_8.standard1 compaction_group=0 due to table removal
2024-04-20T22:20:09.707+00:00 longevity-twcs-48h-master-db-node-e5517f29-1     !INFO | scylla[18995]:  [shard 2:comp] compaction - [Compact system.truncated 261a14c0-ff64-11ee-b6cb-e72f0264c52b] Compacted 2 sstables to [/var/lib/scylla/data/system/truncated-38c19fd0fb863310a4b70d0cc66628aa/me-3gfg_1q1l_3n6812s2nk97nu7p63-big-Data.db:level=0]. 28kB to 22kB (~80% of original) in 31ms = 908kB/s. ~256 total partitions merged to 2.
2024-04-20T22:20:27.705+00:00 longevity-twcs-48h-master-db-node-e5517f29-1   !NOTICE | sudo[29145]: scyllaadm : PWD=/home/scyllaadm ; USER=root ; COMMAND=/usr/bin/coredumpctl -q --json=short
2024-04-20T22:20:27.705+00:00 longevity-twcs-48h-master-db-node-e5517f29-1     !INFO | sudo[29145]: pam_unix(sudo:session): session opened for user root(uid=0) by (uid=1000)
2024-04-20T22:20:27.705+00:00 longevity-twcs-48h-master-db-node-e5517f29-1     !INFO | sudo[29145]: pam_unix(sudo:session): session closed for user root
2024-04-20T22:20:30.955+00:00 longevity-twcs-48h-master-db-node-e5517f29-1  !WARNING | scylla[18995]:  [shard 0:stmt] raft_group_registry - group [052c5d30-fec5-11ee-9de8-51e611cfd09d] raft operation [add_entry] timed out; timeout requested at [service/migration_manager.cc(972:109) `future<> service::migration_manager::announce_with_raft(std::vector<mutation>, group0_guard, std::string_view)`], original error raft::request_aborted (Request is aborted by a caller)
2024-04-20T22:20:32.955+00:00 longevity-twcs-48h-master-db-node-e5517f29-1     !INFO | scylla[18995]:  [shard 0:main] migration_manager - Gossiping my schema version 0ed04c6c-ff64-11ee-cfaa-7002afcf9bf5
2024-04-20T22:20:32.955+00:00 longevity-twcs-48h-master-db-node-e5517f29-1     !INFO | scylla[18995]:  [shard 0:main] schema_tables - Schema version changed to 0ed04c6c-ff64-11ee-cfaa-7002afcf9bf5
2024-04-20T22:20:37.705+00:00 longevity-twcs-48h-master-db-node-e5517f29-1     !INFO | scylla[18995]:  [shard 6:strm] repair - repair[53e665c5-7f37-4502-9837-020eabbdbb48]: Started to repair 2 out of 2 tables in keyspace=scylla_bench, table=test_counters, table_id=42977d90-fec9-11ee-8e05-18b903b642ca, repair_reason=repair
2024-04-20T22:20:38.455+00:00 longevity-twcs-48h-master-db-node-e5517f29-1     !INFO | scylla[18995]:  [shard 6:strm] repair - repair[53e665c5-7f37-4502-9837-020eabbdbb48]: stats: repair_reason=repair, keyspace=scylla_bench, tables={test, test_counters}, ranges_nr=798, round_nr=3799, round_nr_fast_path_already_synced=3628, round_nr_fast_path_same_combined_hashes=0, round_nr_slow_path=171, rpc_call_nr=21042, tx_hashes_nr=2107, rx_hashes_nr=29516253, duration=1567.6217 seconds, tx_row_nr=3863, rx_row_nr=2107, tx_row_bytes=1993308, rx_row_bytes=2075395, row_from_disk_bytes={{10.4.8.101, 40136524572}, {10.4.9.183, 42518831460}, {10.4.9.17, 61499014392}, {10.4.10.107, 40342853868}}, row_from_disk_nr={{10.4.8.101, 118068917}, {10.4.9.183, 124976285}, {10.4.9.17, 180975837}, {10.4.10.107, 118906823}}, row_from_disk_bytes_per_sec={{10.4.8.101, 24.4174}, {10.4.9.183, 25.8666}, {10.4.9.17, 37.4134}, {10.4.10.107, 24.5429}} MiB/s, row_from_disk_rows_per_sec={{10.4.8.101, 75317.2}, {10.4.9.183, 79723.5}, {10.4.9.17, 115446}, {10.4.10.107, 75851.7}} Rows/s, tx_row_nr_peer={{10.4.8.101, 1540}, {10.4.9.183, 1207}, {10.4.10.107, 1116}}, rx_row_nr_peer={{10.4.8.101, 1395}, {10.4.9.183, 208}, {10.4.10.107, 504}}

2024-04-20T22:38:00.455+00:00 longevity-twcs-48h-master-db-node-e5517f29-1  !WARNING | scylla[18995]:  [shard 0:strm] repair - repair[53e665c5-7f37-4502-9837-020eabbdbb48]: user-requested repair failed: std::runtime_error ({shard 5: std::runtime_error (repair[53e665c5-7f37-4502-9837-020eabbdbb48]: 0 out of 1596 ranges failed, keyspace=scylla_bench, tables={test, test_counters}, repair_reason=repair, nodes_down_during_repair={}, aborted_by_user=false, failed_because=service::raft_operation_timeout_error (group [052c5d30-fec5-11ee-9de8-51e611cfd09d] raft operation [read_barrier] timed out))})

Impact

Describe the impact this issue has on the user.

How frequently does it reproduce?

Describe the frequency with which this issue can be reproduced.

Installation details

Cluster size: 4 nodes (i3en.2xlarge)

Scylla Nodes used in this run:

  • longevity-twcs-48h-master-db-node-e5517f29-5 (34.240.166.199 | 10.4.10.107) (shards: 7)
  • longevity-twcs-48h-master-db-node-e5517f29-4 (52.210.172.137 | 10.4.9.76) (shards: 7)
  • longevity-twcs-48h-master-db-node-e5517f29-3 (54.75.94.63 | 10.4.8.101) (shards: 7)
  • longevity-twcs-48h-master-db-node-e5517f29-2 (18.201.32.18 | 10.4.9.183) (shards: 7)
  • longevity-twcs-48h-master-db-node-e5517f29-1 (54.194.79.51 | 10.4.9.17) (shards: 7)

OS / Image: ami-00f7c6bc1f946636b (aws: undefined_region)

Test: longevity-twcs-48h-test
Test id: e5517f29-ad39-4fcf-8203-8a8404416893
Test name: scylla-master/tier1/longevity-twcs-48h-test
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor e5517f29-ad39-4fcf-8203-8a8404416893
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs e5517f29-ad39-4fcf-8203-8a8404416893

Logs:

Jenkins job URL
Argus

@denesb
Contributor

denesb commented May 8, 2024

@asias @Deexie what raft operation times out during repair? Is this the raft barrier we execute to check whether the schema is present?

@Deexie
Contributor

Deexie commented May 8, 2024

@asias @Deexie what raft operation times out during repair? Is this the raft barrier we execute to check whether the schema is present?

Yes, it looks like it. That means the repair would have failed anyway, but probably with some meaningful, expected error. I think the raft error should be caught and the original one rethrown, right?

@denesb
Contributor

denesb commented May 8, 2024

Yes, it looks like it. That means the repair would have failed anyway, but probably with some meaningful, expected error.

Possibly yes.

I think the raft error should be caught and the original one rethrown, right?

The original error is probably due to the keyspace being dropped, in which case we don't want to raise it either.

@Deexie
Contributor

Deexie commented May 10, 2024

Yes, it looks like it. That means the repair would have failed anyway, but probably with some meaningful, expected error.

Possibly yes.

I think the raft error should be caught and the original one rethrown, right?

The original error is probably due to the keyspace being dropped, in which case we don't want to raise it either.

But if the barrier times out, we may still not be up to date and have no way to find that out.

@denesb
Contributor

denesb commented May 10, 2024

But if the barrier times out, we may still not be up to date and have no way to find that out.

Right, so I think we should swallow the barrier timing out and just work with what we have (the local schema) to check for a dropped keyspace/table.
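
To make the proposed control flow concrete, here is a minimal standalone sketch (plain C++, not actual Scylla/Seastar code): run the group0 read barrier and, if it times out, fall back to the node-local schema instead of failing the whole repair. The names table_sync_and_check and raft_operation_timeout_error are taken from this issue; run_read_barrier and table_exists_locally are hypothetical stand-ins for illustration only.

#include <iostream>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the error seen in the logs above.
struct raft_operation_timeout_error : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Hypothetical stand-in: trigger the group0/raft read barrier; throws on timeout.
void run_read_barrier() {
    throw raft_operation_timeout_error("raft operation [read_barrier] timed out");
}

// Hypothetical stand-in: consult only the schema this node already knows about.
bool table_exists_locally(const std::string& ks, const std::string& table) {
    return ks == "scylla_bench" && table == "test";
}

// Sketch of the proposed behaviour of table_sync_and_check: if the barrier
// times out, fall back to the local schema instead of failing the repair.
bool table_sync_and_check(const std::string& ks, const std::string& table) {
    try {
        run_read_barrier();   // synchronize schema with group0 first
    } catch (const raft_operation_timeout_error& e) {
        // Swallow the timeout and proceed with whatever schema we have locally.
        std::cerr << "read barrier timed out, using local schema: " << e.what() << "\n";
    }
    return table_exists_locally(ks, table);
}

int main() {
    std::cout << std::boolalpha << table_sync_and_check("scylla_bench", "test") << "\n";
}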

Deexie added a commit to Deexie/scylla that referenced this issue May 10, 2024
table_sync_and_check triggers read barrier to synchronize schema
and check whether a given table exists. If the barrier fails, e.g.
due to timeout, barrier's exception is propagated.

Swallow exception and fall back on local schema check in case of barrier
failure.

Fixes: scylladb#18490.
Deexie added a commit to Deexie/scylla that referenced this issue May 10, 2024
table_sync_and_check triggers read barrier to synchronize schema
and checks whether a given table exists. If the barrier times out,
service::raft_operation_timeout_error is propagated.

Swallow exception and fall back on local schema check in case of barrier
timeout.

Fixes: scylladb#18490.
@soyacz
Contributor

soyacz commented May 13, 2024

A read barrier timeout also happened when replacing a node during the add-drop column nemesis:

2024-05-12 12:33:46.575 <2024-05-12 12:33:46.481>: (DatabaseLogEvent Severity.ERROR) period_type=one-time event_id=28b31151-2096-4c2d-802c-9f41916770fb during_nemesis=TerminateAndReplaceNode,AddDropColumn: type=DATABASE_ERROR regex=(^ERROR|!\s*?ERR).*\[shard.*\] line_number=1526 node=parallel-topology-schema-changes-mu-db-node-073570cd-22
2024-05-12T12:33:46.481+00:00 parallel-topology-schema-changes-mu-db-node-073570cd-22      !ERR | scylla[6913]:  [shard 0:main] init - Startup failed: service::raft_operation_timeout_error (group [18715100-0fed-11ef-b82b-6db5dbca6ea9] raft operation [read_barrier] timed out)

Can it be the same?

Packages

Scylla version: 5.5.0~dev-20240510.28791aa2c1d3 with build-id 893c2a68becf3d3bcbbf076980b1b831b9b76e29

Kernel Version: 5.15.0-1060-aws

Installation details

Cluster size: 12 nodes (i3en.2xlarge)

Scylla Nodes used in this run:

  • parallel-topology-schema-changes-mu-db-node-073570cd-9 (35.179.74.159 | 10.3.11.84) (shards: 7)
  • parallel-topology-schema-changes-mu-db-node-073570cd-8 (18.170.45.142 | 10.3.10.116) (shards: 7)
  • parallel-topology-schema-changes-mu-db-node-073570cd-7 (13.40.167.163 | 10.3.8.204) (shards: 7)
  • parallel-topology-schema-changes-mu-db-node-073570cd-6 (54.217.131.124 | 10.4.9.104) (shards: 7)
  • parallel-topology-schema-changes-mu-db-node-073570cd-5 (52.214.96.24 | 10.4.8.209) (shards: 7)
  • parallel-topology-schema-changes-mu-db-node-073570cd-4 (52.211.178.138 | 10.4.10.241) (shards: 7)
  • parallel-topology-schema-changes-mu-db-node-073570cd-3 (54.171.53.112 | 10.4.8.213) (shards: 7)
  • parallel-topology-schema-changes-mu-db-node-073570cd-22 (3.249.54.85 | 10.4.8.19) (shards: 7)
  • parallel-topology-schema-changes-mu-db-node-073570cd-21 (35.177.195.76 | 10.3.8.19) (shards: 7)
  • parallel-topology-schema-changes-mu-db-node-073570cd-20 (3.252.75.186 | 10.4.10.186) (shards: 7)
  • parallel-topology-schema-changes-mu-db-node-073570cd-2 (54.247.18.121 | 10.4.11.145) (shards: 7)
  • parallel-topology-schema-changes-mu-db-node-073570cd-19 (13.40.16.198 | 10.3.10.45) (shards: 7)
  • parallel-topology-schema-changes-mu-db-node-073570cd-18 (18.171.247.192 | 10.3.11.53) (shards: 7)
  • parallel-topology-schema-changes-mu-db-node-073570cd-17 (18.171.205.64 | 10.3.10.204) (shards: 7)
  • parallel-topology-schema-changes-mu-db-node-073570cd-16 (35.178.49.116 | 10.3.8.209) (shards: 7)
  • parallel-topology-schema-changes-mu-db-node-073570cd-15 (34.244.129.171 | 10.4.8.157) (shards: 7)
  • parallel-topology-schema-changes-mu-db-node-073570cd-14 (3.255.158.84 | 10.4.8.131) (shards: 7)
  • parallel-topology-schema-changes-mu-db-node-073570cd-13 (3.254.85.235 | 10.4.11.100) (shards: 7)
  • parallel-topology-schema-changes-mu-db-node-073570cd-12 (3.10.9.135 | 10.3.9.139) (shards: 7)
  • parallel-topology-schema-changes-mu-db-node-073570cd-11 (18.134.7.128 | 10.3.10.108) (shards: 7)
  • parallel-topology-schema-changes-mu-db-node-073570cd-10 (18.170.50.64 | 10.3.8.100) (shards: 7)
  • parallel-topology-schema-changes-mu-db-node-073570cd-1 (3.253.60.148 | 10.4.8.4) (shards: 7)

OS / Image: ami-0b7480423a402aa95 ami-044c35ee6970271fc (aws: undefined_region)

Test: longevity-multidc-schema-topology-changes-12h-test
Test id: 073570cd-26dd-48aa-9601-5f141fd862ba
Test name: scylla-master/tier1/longevity-multidc-schema-topology-changes-12h-test
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor 073570cd-26dd-48aa-9601-5f141fd862ba
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs 073570cd-26dd-48aa-9601-5f141fd862ba

Logs:

Jenkins job URL
Argus

@denesb
Contributor

denesb commented May 14, 2024

Can it be the same?

Yes, it looks like it is the same. @tgrabiec said we cannot just swallow this error, so I'm not sure how we should proceed here. Do we need to investigate each such timeout?

@tgrabiec
Contributor

What's the timeout currently? Maybe it's too short.

@Deexie
Contributor

Deexie commented May 15, 2024

What's the timeout currently? Maybe it's too short.

It's 10 seconds.

@tgrabiec
Contributor

We've seen schema changes take more than that in the past, which could explain why the barrier times out.

What's the reasoning behind the 10s timeout? I think repair doesn't have to time out on the group0 barrier at all, provided that it reliably eventually completes. If the user wants to give up, they should cancel the repair.

@Deexie
Contributor

Deexie commented May 15, 2024

It was meant to be a fallback in case half of the cluster was down, but if we do not want to proceed with local data anyway, there is no point. I will remove it.

Deexie added a commit to Deexie/scylla that referenced this issue May 20, 2024
Delete 10s timeout from read barrier in table_sync_and_check,
so that the function always considers all previous group0 changes.

Fixes: scylladb#18490.
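
For contrast, a minimal standalone sketch of the approach described in this commit (again plain C++ with hypothetical stand-ins, not actual Scylla code): the read barrier is awaited with no deadline of its own, so the check always observes all previous group0 changes, and the only way to stop waiting early is to abort the repair itself.

#include <atomic>
#include <iostream>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the repair's abort flag (the real code would use an abort source).
std::atomic<bool> repair_aborted{false};

struct abort_requested_error : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Hypothetical stand-in: wait for the group0 read barrier with no timeout of its
// own; the only early exit is the repair being aborted by the user.
void run_read_barrier() {
    if (repair_aborted.load()) {
        throw abort_requested_error("repair aborted by user");
    }
    // ... otherwise block for as long as the barrier needs to complete ...
}

// Hypothetical stand-in: consult the (now fully synchronized) local schema.
bool table_exists_locally(const std::string& ks, const std::string& table) {
    return ks == "scylla_bench" && table == "test";
}

// Sketch of table_sync_and_check without the 10s deadline: every prior group0
// change is visible by the time the local schema is checked.
bool table_sync_and_check(const std::string& ks, const std::string& table) {
    run_read_barrier();
    return table_exists_locally(ks, table);
}

int main() {
    std::cout << std::boolalpha << table_sync_and_check("scylla_bench", "test") << "\n";
}

The design point is the one raised above: repair no longer converts a slow schema change into a failure; a long barrier just makes the repair take longer, and cancellation stays in the user's hands.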
@Deexie Deexie linked a pull request May 20, 2024 that will close this issue
@mykaul mykaul added this to the 6.0 milestone May 20, 2024
@denesb denesb modified the milestones: 6.0, 6.1 May 23, 2024