
nodetool repair command failed with exit code 3 during drop keyspace #18490

Open
aleksbykov opened this issue May 1, 2024 · 11 comments · May be fixed by #18752

@aleksbykov
Contributor

Packages

Scylla version: 5.5.0~dev-20240419.a5dae74aee4b with build-id 13e359437a7035b8b40bcd744a70b6abddbc491b

Kernel Version: 5.15.0-1060-aws

Issue description

  • This issue is a regression.
  • It is unknown if this issue is a regression.

The issue looks like a regression.

During the no_corrupt_repair nemesis, nodetool repair is started in the background, and keyspaces are dropped while it is running.

The keyspace drop failed with the following error:

Traceback (most recent call last):
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 1594, in disrupt_no_corrupt_repair
    session.execute(SimpleStatement(
  File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/common.py", line 1824, in execute_verbose
    return execute_orig(*args, **kwargs)
  File "cassandra/cluster.py", line 2699, in cassandra.cluster.Session.execute
  File "cassandra/cluster.py", line 5018, in cassandra.cluster.ResponseFuture.result
cassandra.InvalidRequest: Error from server: code=2200 [Invalid query] message="Cannot drop non existing table 'standard1' in keyspace 'drop_table_during_repair_ks_8'."

Similar to issue #18479, but in addition nodetool repair failed with the following error:

Command: '/usr/bin/nodetool  repair '

Exit code: 3

Stdout:

[2024-04-20 21:54:22,765] Repair session 5
[2024-04-20 21:54:25,387] Repair session 5 finished
[2024-04-20 21:54:25,392] Starting repair command #6, repairing 1 ranges for keyspace drop_table_during_repair_ks_6 (parallelism=SEQUENTIAL, full=true)
[2024-04-20 21:54:25,392] Repair session 6
[2024-04-20 21:54:27,813] Repair session 6 finished
[2024-04-20 21:54:27,817] Starting repair command #7, repairing 1 ranges for keyspace drop_table_during_repair_ks_9 (parallelism=SEQUENTIAL, full=true)
[2024-04-20 21:54:27,817] Repair session 7
[2024-04-20 21:54:30,438] Repair session 7 finished
[2024-04-20 21:54:30,440] Starting repair command #8, repairing 1 ranges for keyspace scylla_bench (parallelism=SEQUENTIAL, full=true)
[2024-04-20 21:54:30,440] Repair session 8

Stderr:

Repair session 8 failed

The following messages were found in the log of node 1:

2024-04-20T22:19:30.955+00:00 longevity-twcs-48h-master-db-node-e5517f29-1     !INFO | scylla[18995]:  [shard 0:stmt] migration_manager - Drop table 'drop_table_during_repair_ks_8.standard1'
2024-04-20T22:19:30.955+00:00 longevity-twcs-48h-master-db-node-e5517f29-1     !INFO | scylla[18995]:  [shard 0:main] schema_tables - Dropping drop_table_during_repair_ks_8.standard1 id=473b5c80-ff60-11ee-be67-1bff0286bc41 version=473b5c81-ff60-11ee-be67-1bff0286bc41
2024-04-20T22:19:30.955+00:00 longevity-twcs-48h-master-db-node-e5517f29-1     !INFO | scylla[18995]:  [shard 0:main] database - Dropping drop_table_during_repair_ks_8.standard1 with auto-snapshot
2024-04-20T22:19:30.955+00:00 longevity-twcs-48h-master-db-node-e5517f29-1     !INFO | scylla[18995]:  [shard 0:main] database - Truncating drop_table_during_repair_ks_8.standard1 with auto-snapshot


2024-04-20T22:20:09.706+00:00 longevity-twcs-48h-master-db-node-e5517f29-1     !INFO | scylla[18995]:  [shard 0:main] database - Truncated drop_table_during_repair_ks_8.standard1
2024-04-20T22:20:09.706+00:00 longevity-twcs-48h-master-db-node-e5517f29-1     !INFO | scylla[18995]:  [shard 0:main] compaction_manager - Stopping 1 tasks for 0 ongoing compactions for table drop_table_during_repair_ks_8.standard1 compaction_group=0 due to table removal
2024-04-20T22:20:09.706+00:00 longevity-twcs-48h-master-db-node-e5517f29-1     !INFO | scylla[18995]:  [shard 5:main] compaction_manager - Stopping 1 tasks for 0 ongoing compactions for table drop_table_during_repair_ks_8.standard1 compaction_group=0 due to table removal
2024-04-20T22:20:09.706+00:00 longevity-twcs-48h-master-db-node-e5517f29-1     !INFO | scylla[18995]:  [shard 2:main] compaction_manager - Stopping 1 tasks for 0 ongoing compactions for table drop_table_during_repair_ks_8.standard1 compaction_group=0 due to table removal
2024-04-20T22:20:09.706+00:00 longevity-twcs-48h-master-db-node-e5517f29-1     !INFO | scylla[18995]:  [shard 4:main] compaction_manager - Stopping 1 tasks for 0 ongoing compactions for table drop_table_during_repair_ks_8.standard1 compaction_group=0 due to table removal
2024-04-20T22:20:09.706+00:00 longevity-twcs-48h-master-db-node-e5517f29-1     !INFO | scylla[18995]:  [shard 1:main] compaction_manager - Stopping 1 tasks for 0 ongoing compactions for table drop_table_during_repair_ks_8.standard1 compaction_group=0 due to table removal
2024-04-20T22:20:09.707+00:00 longevity-twcs-48h-master-db-node-e5517f29-1     !INFO | scylla[18995]:  [shard 6:main] compaction_manager - Stopping 1 tasks for 0 ongoing compactions for table drop_table_during_repair_ks_8.standard1 compaction_group=0 due to table removal
2024-04-20T22:20:09.707+00:00 longevity-twcs-48h-master-db-node-e5517f29-1     !INFO | scylla[18995]:  [shard 2:comp] compaction - [Compact system.truncated 261a14c0-ff64-11ee-b6cb-e72f0264c52b] Compacted 2 sstables to [/var/lib/scylla/data/system/truncated-38c19fd0fb863310a4b70d0cc66628aa/me-3gfg_1q1l_3n6812s2nk97nu7p63-big-Data.db:level=0]. 28kB to 22kB (~80% of original) in 31ms = 908kB/s. ~256 total partitions merged to 2.
2024-04-20T22:20:27.705+00:00 longevity-twcs-48h-master-db-node-e5517f29-1   !NOTICE | sudo[29145]: scyllaadm : PWD=/home/scyllaadm ; USER=root ; COMMAND=/usr/bin/coredumpctl -q --json=short
2024-04-20T22:20:27.705+00:00 longevity-twcs-48h-master-db-node-e5517f29-1     !INFO | sudo[29145]: pam_unix(sudo:session): session opened for user root(uid=0) by (uid=1000)
2024-04-20T22:20:27.705+00:00 longevity-twcs-48h-master-db-node-e5517f29-1     !INFO | sudo[29145]: pam_unix(sudo:session): session closed for user root
2024-04-20T22:20:30.955+00:00 longevity-twcs-48h-master-db-node-e5517f29-1  !WARNING | scylla[18995]:  [shard 0:stmt] raft_group_registry - group [052c5d30-fec5-11ee-9de8-51e611cfd09d] raft operation [add_entry] timed out; timeout requested at [service/migration_manager.cc(972:109) `future<> service::migration_manager::announce_with_raft(std::vector<mutation>, group0_guard, std::string_view)`], original error raft::request_aborted (Request is aborted by a caller)
2024-04-20T22:20:32.955+00:00 longevity-twcs-48h-master-db-node-e5517f29-1     !INFO | scylla[18995]:  [shard 0:main] migration_manager - Gossiping my schema version 0ed04c6c-ff64-11ee-cfaa-7002afcf9bf5
2024-04-20T22:20:32.955+00:00 longevity-twcs-48h-master-db-node-e5517f29-1     !INFO | scylla[18995]:  [shard 0:main] schema_tables - Schema version changed to 0ed04c6c-ff64-11ee-cfaa-7002afcf9bf5
2024-04-20T22:20:37.705+00:00 longevity-twcs-48h-master-db-node-e5517f29-1     !INFO | scylla[18995]:  [shard 6:strm] repair - repair[53e665c5-7f37-4502-9837-020eabbdbb48]: Started to repair 2 out of 2 tables in keyspace=scylla_bench, table=test_counters, table_id=42977d90-fec9-11ee-8e05-18b903b642ca, repair_reason=repair
2024-04-20T22:20:38.455+00:00 longevity-twcs-48h-master-db-node-e5517f29-1     !INFO | scylla[18995]:  [shard 6:strm] repair - repair[53e665c5-7f37-4502-9837-020eabbdbb48]: stats: repair_reason=repair, keyspace=scylla_bench, tables={test, test_counters}, ranges_nr=798, round_nr=3799, round_nr_fast_path_already_synced=3628, round_nr_fast_path_same_combined_hashes=0, round_nr_slow_path=171, rpc_call_nr=21042, tx_hashes_nr=2107, rx_hashes_nr=29516253, duration=1567.6217 seconds, tx_row_nr=3863, rx_row_nr=2107, tx_row_bytes=1993308, rx_row_bytes=2075395, row_from_disk_bytes={{10.4.8.101, 40136524572}, {10.4.9.183, 42518831460}, {10.4.9.17, 61499014392}, {10.4.10.107, 40342853868}}, row_from_disk_nr={{10.4.8.101, 118068917}, {10.4.9.183, 124976285}, {10.4.9.17, 180975837}, {10.4.10.107, 118906823}}, row_from_disk_bytes_per_sec={{10.4.8.101, 24.4174}, {10.4.9.183, 25.8666}, {10.4.9.17, 37.4134}, {10.4.10.107, 24.5429}} MiB/s, row_from_disk_rows_per_sec={{10.4.8.101, 75317.2}, {10.4.9.183, 79723.5}, {10.4.9.17, 115446}, {10.4.10.107, 75851.7}} Rows/s, tx_row_nr_peer={{10.4.8.101, 1540}, {10.4.9.183, 1207}, {10.4.10.107, 1116}}, rx_row_nr_peer={{10.4.8.101, 1395}, {10.4.9.183, 208}, {10.4.10.107, 504}}

2024-04-20T22:38:00.455+00:00 longevity-twcs-48h-master-db-node-e5517f29-1  !WARNING | scylla[18995]:  [shard 0:strm] repair - repair[53e665c5-7f37-4502-9837-020eabbdbb48]: user-requested repair failed: std::runtime_error ({shard 5: std::runtime_error (repair[53e665c5-7f37-4502-9837-020eabbdbb48]: 0 out of 1596 ranges failed, keyspace=scylla_bench, tables={test, test_counters}, repair_reason=repair, nodes_down_during_repair={}, aborted_by_user=false, failed_because=service::raft_operation_timeout_error (group [052c5d30-fec5-11ee-9de8-51e611cfd09d] raft operation [read_barrier] timed out))})

Impact

Describe the impact this issue has on the user.

How frequently does it reproduce?

Describe the frequency with which this issue can be reproduced.

Installation details

Cluster size: 4 nodes (i3en.2xlarge)

Scylla Nodes used in this run:

  • longevity-twcs-48h-master-db-node-e5517f29-5 (34.240.166.199 | 10.4.10.107) (shards: 7)
  • longevity-twcs-48h-master-db-node-e5517f29-4 (52.210.172.137 | 10.4.9.76) (shards: 7)
  • longevity-twcs-48h-master-db-node-e5517f29-3 (54.75.94.63 | 10.4.8.101) (shards: 7)
  • longevity-twcs-48h-master-db-node-e5517f29-2 (18.201.32.18 | 10.4.9.183) (shards: 7)
  • longevity-twcs-48h-master-db-node-e5517f29-1 (54.194.79.51 | 10.4.9.17) (shards: 7)

OS / Image: ami-00f7c6bc1f946636b (aws: undefined_region)

Test: longevity-twcs-48h-test
Test id: e5517f29-ad39-4fcf-8203-8a8404416893
Test name: scylla-master/tier1/longevity-twcs-48h-test
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor e5517f29-ad39-4fcf-8203-8a8404416893
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs e5517f29-ad39-4fcf-8203-8a8404416893

Logs:

Jenkins job URL
Argus

@denesb
Contributor

denesb commented May 8, 2024

@asias @Deexie what raft operation times out during repair? Is this the raft barrier we execute to check whether the schema is present?

@Deexie
Contributor

Deexie commented May 8, 2024

@asias @Deexie what raft operation times out during repair? Is this the raft barrier we execute to check whether the schema is present?

Yes, it looks like it. That means the repair would have failed anyway, but probably with some meaningful, expected error. I think the raft error should be caught and the original one rethrown, right?

@denesb
Contributor

denesb commented May 8, 2024

Yes, it looks like it. That means the repair would have failed anyway, but probably with some meaningful, expected error.

Possibly yes.

I think the raft error should be caught and the original one rethrown, right?

The original error is probably due to the keyspace being dropped, in which case we don't want to raise it either.

@Deexie
Contributor

Deexie commented May 10, 2024

Yes, it looks like it. That means the repair would have failed anyway, but probably with some meaningful, expected error.

Possibly yes.

I think the raft error should be caught and the original one rethrown, right?

The original error is probably due to the keyspace being dropped, in which case we don't want to raise it either.

But if the barrier times out, we may still not be up to date and have no way to find that out.

@denesb
Contributor

denesb commented May 10, 2024

But if the barrier times out, we may still not be up to date and have no way to find that out.

Right, so I think we should swallow the barrier timing out and just work with what we have (the local schema) to check for a dropped keyspace/table.
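
To make the proposed control flow concrete, here is a minimal standalone sketch (plain C++, not actual Scylla/Seastar code): run the group0 read barrier and, if it times out, fall back to the node-local schema instead of failing the whole repair. The names table_sync_and_check and raft_operation_timeout_error are taken from this issue; run_read_barrier and table_exists_locally are hypothetical stand-ins for illustration only.

#include <iostream>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the error seen in the logs above.
struct raft_operation_timeout_error : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Hypothetical stand-in: trigger the group0/raft read barrier; throws on timeout.
void run_read_barrier() {
    throw raft_operation_timeout_error("raft operation [read_barrier] timed out");
}

// Hypothetical stand-in: consult only the schema this node already knows about.
bool table_exists_locally(const std::string& ks, const std::string& table) {
    return ks == "scylla_bench" && table == "test";
}

// Sketch of the proposed behaviour of table_sync_and_check: if the barrier
// times out, fall back to the local schema instead of failing the repair.
bool table_sync_and_check(const std::string& ks, const std::string& table) {
    try {
        run_read_barrier();   // synchronize schema with group0 first
    } catch (const raft_operation_timeout_error& e) {
        // Swallow the timeout and proceed with whatever schema we have locally.
        std::cerr << "read barrier timed out, using local schema: " << e.what() << "\n";
    }
    return table_exists_locally(ks, table);
}

int main() {
    std::cout << std::boolalpha << table_sync_and_check("scylla_bench", "test") << "\n";
}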

Deexie added a commit to Deexie/scylla that referenced this issue May 10, 2024
table_sync_and_check triggers read barrier to synchronize schema
and check whether a given table exists. If the barrier fails, e.g.
due to timeout, barrier's exception is propagated.

Swallow exception and fall back on local schema check in case of barrier
failure.

Fixes: scylladb#18490.
Deexie added a commit to Deexie/scylla that referenced this issue May 10, 2024
table_sync_and_check triggers read barrier to synchronize schema
and checks whether a given table exists. If the barrier times out,
service::raft_operation_timeout_error is propagated.

Swallow exception and fall back on local schema check in case of barrier
timeout.

Fixes: scylladb#18490.
@soyacz
Contributor

soyacz commented May 13, 2024

A read barrier timeout also happened when replacing a node during the add-drop column nemesis:

2024-05-12 12:33:46.575 <2024-05-12 12:33:46.481>: (DatabaseLogEvent Severity.ERROR) period_type=one-time event_id=28b31151-2096-4c2d-802c-9f41916770fb during_nemesis=TerminateAndReplaceNode,AddDropColumn: type=DATABASE_ERROR regex=(^ERROR|!\s*?ERR).*\[shard.*\] line_number=1526 node=parallel-topology-schema-changes-mu-db-node-073570cd-22
2024-05-12T12:33:46.481+00:00 parallel-topology-schema-changes-mu-db-node-073570cd-22      !ERR | scylla[6913]:  [shard 0:main] init - Startup failed: service::raft_operation_timeout_error (group [18715100-0fed-11ef-b82b-6db5dbca6ea9] raft operation [read_barrier] timed out)

Can it be the same?

Packages

Scylla version: 5.5.0~dev-20240510.28791aa2c1d3 with build-id 893c2a68becf3d3bcbbf076980b1b831b9b76e29

Kernel Version: 5.15.0-1060-aws

Installation details

Cluster size: 12 nodes (i3en.2xlarge)

Scylla Nodes used in this run:

  • parallel-topology-schema-changes-mu-db-node-073570cd-9 (35.179.74.159 | 10.3.11.84) (shards: 7)
  • parallel-topology-schema-changes-mu-db-node-073570cd-8 (18.170.45.142 | 10.3.10.116) (shards: 7)
  • parallel-topology-schema-changes-mu-db-node-073570cd-7 (13.40.167.163 | 10.3.8.204) (shards: 7)
  • parallel-topology-schema-changes-mu-db-node-073570cd-6 (54.217.131.124 | 10.4.9.104) (shards: 7)
  • parallel-topology-schema-changes-mu-db-node-073570cd-5 (52.214.96.24 | 10.4.8.209) (shards: 7)
  • parallel-topology-schema-changes-mu-db-node-073570cd-4 (52.211.178.138 | 10.4.10.241) (shards: 7)
  • parallel-topology-schema-changes-mu-db-node-073570cd-3 (54.171.53.112 | 10.4.8.213) (shards: 7)
  • parallel-topology-schema-changes-mu-db-node-073570cd-22 (3.249.54.85 | 10.4.8.19) (shards: 7)
  • parallel-topology-schema-changes-mu-db-node-073570cd-21 (35.177.195.76 | 10.3.8.19) (shards: 7)
  • parallel-topology-schema-changes-mu-db-node-073570cd-20 (3.252.75.186 | 10.4.10.186) (shards: 7)
  • parallel-topology-schema-changes-mu-db-node-073570cd-2 (54.247.18.121 | 10.4.11.145) (shards: 7)
  • parallel-topology-schema-changes-mu-db-node-073570cd-19 (13.40.16.198 | 10.3.10.45) (shards: 7)
  • parallel-topology-schema-changes-mu-db-node-073570cd-18 (18.171.247.192 | 10.3.11.53) (shards: 7)
  • parallel-topology-schema-changes-mu-db-node-073570cd-17 (18.171.205.64 | 10.3.10.204) (shards: 7)
  • parallel-topology-schema-changes-mu-db-node-073570cd-16 (35.178.49.116 | 10.3.8.209) (shards: 7)
  • parallel-topology-schema-changes-mu-db-node-073570cd-15 (34.244.129.171 | 10.4.8.157) (shards: 7)
  • parallel-topology-schema-changes-mu-db-node-073570cd-14 (3.255.158.84 | 10.4.8.131) (shards: 7)
  • parallel-topology-schema-changes-mu-db-node-073570cd-13 (3.254.85.235 | 10.4.11.100) (shards: 7)
  • parallel-topology-schema-changes-mu-db-node-073570cd-12 (3.10.9.135 | 10.3.9.139) (shards: 7)
  • parallel-topology-schema-changes-mu-db-node-073570cd-11 (18.134.7.128 | 10.3.10.108) (shards: 7)
  • parallel-topology-schema-changes-mu-db-node-073570cd-10 (18.170.50.64 | 10.3.8.100) (shards: 7)
  • parallel-topology-schema-changes-mu-db-node-073570cd-1 (3.253.60.148 | 10.4.8.4) (shards: 7)

OS / Image: ami-0b7480423a402aa95 ami-044c35ee6970271fc (aws: undefined_region)

Test: longevity-multidc-schema-topology-changes-12h-test
Test id: 073570cd-26dd-48aa-9601-5f141fd862ba
Test name: scylla-master/tier1/longevity-multidc-schema-topology-changes-12h-test
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor 073570cd-26dd-48aa-9601-5f141fd862ba
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs 073570cd-26dd-48aa-9601-5f141fd862ba

Logs:

Jenkins job URL
Argus

@denesb
Contributor

denesb commented May 14, 2024

Can it be the same?

Yes, it looks like it is the same. @tgrabiec said we cannot just swallow this error, so I'm not sure how we should proceed here. Do we need to investigate each such timeout?

@tgrabiec
Contributor

What's the timeout currently? Maybe it's too short.

@Deexie
Contributor

Deexie commented May 15, 2024

What's the timeout currently? Maybe it's too short.

It's 10 seconds.

@tgrabiec
Contributor

We've seen schema changes take more than that in the past, which could explain why the barrier times out.

What's the reasoning behind the 10s timeout? I think repair doesn't have to time out on the group0 barrier at all, provided that it reliably eventually completes. If the user wants to give up, they should cancel the repair.

@Deexie
Contributor

Deexie commented May 15, 2024

It was meant to be a fallback in case half of the cluster was down, but if we do not want to proceed with local data anyway, there is no point. I will remove it.

Deexie added a commit to Deexie/scylla that referenced this issue May 20, 2024
Delete 10s timeout from read barrier in table_sync_and_check,
so that the function always considers all previous group0 changes.

Fixes: scylladb#18490.
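
For contrast, a minimal standalone sketch of the approach described in this commit (again plain C++ with hypothetical stand-ins, not actual Scylla code): the read barrier is awaited with no deadline of its own, so the check always observes all previous group0 changes, and the only way to stop waiting early is to abort the repair itself.

#include <atomic>
#include <iostream>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the repair's abort flag (the real code would use an abort source).
std::atomic<bool> repair_aborted{false};

struct abort_requested_error : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Hypothetical stand-in: wait for the group0 read barrier with no timeout of its
// own; the only early exit is the repair being aborted by the user.
void run_read_barrier() {
    if (repair_aborted.load()) {
        throw abort_requested_error("repair aborted by user");
    }
    // ... otherwise block for as long as the barrier needs to complete ...
}

// Hypothetical stand-in: consult the (now fully synchronized) local schema.
bool table_exists_locally(const std::string& ks, const std::string& table) {
    return ks == "scylla_bench" && table == "test";
}

// Sketch of table_sync_and_check without the 10s deadline: every prior group0
// change is visible by the time the local schema is checked.
bool table_sync_and_check(const std::string& ks, const std::string& table) {
    run_read_barrier();
    return table_exists_locally(ks, table);
}

int main() {
    std::cout << std::boolalpha << table_sync_and_check("scylla_bench", "test") << "\n";
}

The design point is the one raised above: repair no longer converts a slow schema change into a failure; a long barrier just makes the repair take longer, and cancellation stays in the user's hands.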
@Deexie Deexie linked a pull request May 20, 2024 that will close this issue
@mykaul mykaul added this to the 6.0 milestone May 20, 2024
@denesb denesb modified the milestones: 6.0, 6.1 May 23, 2024