Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Raft topology error injections: a node rejected because of wait_for_ip_timeout is not failed to bootstrap and continue to run #18638

Open
enaydanov opened this issue May 13, 2024 · 0 comments

Comments

@enaydanov
Copy link
Contributor

enaydanov commented May 13, 2024

It's a failure of a synthetic test implemented as a part of "Randomized Failure Injection for Raft Based Topology" test effort: #16223

The idea of the test is to have a cluster, where one node is stressed with injections and failures and the rest of the cluster is used to make progress of the Raft state machine.

There is a no-op cluster event (30s sleep) and for stop_after_sending_join_node_request and stop_after_bootstrapping_initial_raft_configuration error injections the test is failing because node which SIGSTOPed is not joined the cluster or not failed to bootstrap after receiving SIGCONT:

scylla-5.log:

...
INFO  2024-05-13 03:18:00,246 [shard 0:strm] raft_group0 - setup_group0: joining group 0...
INFO  2024-05-13 03:18:00,247 [shard 0:strm] raft_group0 - server 73c9728d-a75e-46ca-99b0-9e16b09b722d found no local group 0. Discovering...
INFO  2024-05-13 03:18:00,252 [shard 0:strm] raft_group0 - server 73c9728d-a75e-46ca-99b0-9e16b09b722d found group 0 with group id 491c9e00-10d7-11ef-86a4-c6c672e9bd55, leader 04b853a5-a989-4233-9e4b-fc55b48bf69b
INFO  2024-05-13 03:18:00,252 [shard 0:strm] raft_topology - join: sending the join request to 127.108.75.2
INFO  2024-05-13 03:18:00,841 [shard 0:strm] raft_topology - join: request to join placed, waiting for the response from the topology coordinator
INFO  2024-05-13 03:18:31,447 [shard 0:strm] raft_group0 - Server 73c9728d-a75e-46ca-99b0-9e16b09b722d is starting group 0 with id 491c9e00-10d7-11ef-86a4-c6c672e9bd55
DEBUG 2024-05-13 03:18:31,454 [shard 0:strm] raft_topology - reload raft topology state
INFO  2024-05-13 03:18:31,463 [shard 0:strm] raft_group0 - Detected snapshot with index=0, id=ce9b42df-6b70-4e57-bac8-59c9511fa098, triggering new snapshot
...

Logs of other nodes flooded by cannot map 73c9728d-a75e-46ca-99b0-9e16b09b722d to ip, retrying messages:

scylla-1.log

...
INFO  2024-05-13 03:18:00,345 [shard 0:comp] compaction - [Compact system.topology 66fa0890-10d7-11ef-86a4-c6c672e9bd55] Compacting [/home/vagrant/PycharmProjects/scylladb/testlog/debug/scylla-1/data/system/topology-5be1feb3929e3df98da8d771c36129a0/me-3gg3_0960_1tsts21pmqq2ru9rh1-big-Data.db:level=0:origin=memtable,/home/vagrant/PycharmProjects/scylladb/testlog/debug/scylla-1/data/system/topology-5be1feb3929e3df98da8d771c36129a0/me-3gg3_095f_4ow8w21pmqq2ru9rh1-big-Data.db:level=0:origin=compaction]
DEBUG 2024-05-13 03:18:00,348 [shard 0:strm] raft_topology - reload raft topology state
DEBUG 2024-05-13 03:18:00,475 [shard 0:strm] raft_topology - topology_state_load: the last committed CDC generation ID: (2024/05/13 03:17:42, 5c3fec76-10d7-11ef-73a6-bf82cef18219)
DEBUG 2024-05-13 03:18:00,475 [shard 0:strm] raft_topology - topology coordinator fiber got an event
INFO  2024-05-13 03:18:00,501 [shard 0:strm] raft_topology - coordinator fiber found a node to work on id=73c9728d-a75e-46ca-99b0-9e16b09b722d state=none
INFO  2024-05-13 03:18:00,501 [shard 0:strm] raft_topology - executing global topology command wait_for_ip, excluded nodes: {73c9728d-a75e-46ca-99b0-9e16b09b722d}
DEBUG 2024-05-13 03:18:00,501 [shard 0:strm] raft_topology - send wait_for_ip command with term 1 and index 20 to 9b9f337a-54c2-41f1-8101-6915fb07d72f/127.108.75.3
DEBUG 2024-05-13 03:18:00,501 [shard 0:strm] raft_topology - send wait_for_ip command with term 1 and index 20 to 1e7165c0-3d65-4437-b759-919fa7096d69/127.108.75.4
DEBUG 2024-05-13 03:18:00,501 [shard 0:strm] raft_topology - send wait_for_ip command with term 1 and index 20 to 0fa67dfe-5d43-4114-9686-bd9595c9363a/127.108.75.1
DEBUG 2024-05-13 03:18:00,501 [shard 0:strm] raft_topology - topology cmd rpc wait_for_ip is called
DEBUG 2024-05-13 03:18:00,501 [shard 0:strm] raft_topology - send wait_for_ip command with term 1 and index 20 to 04b853a5-a989-4233-9e4b-fc55b48bf69b/127.108.75.2
DEBUG 2024-05-13 03:18:00,514 [shard 0:strm] raft_topology - Got raft_topology_cmd::wait_for_ip, new nodes [[73c9728d-a75e-46ca-99b0-9e16b09b722d]]
WARN  2024-05-13 03:18:00,514 [shard 0:strm] raft_topology - (rate limiting dropped 166 similar messages) cannot map 73c9728d-a75e-46ca-99b0-9e16b09b722d to ip, retrying.
INFO  2024-05-13 03:18:00,585 [shard 0:comp] compaction - [Compact system.topology 66fa0890-10d7-11ef-86a4-c6c672e9bd55] Compacted 2 sstables to [/home/vagrant/PycharmProjects/scylladb/testlog/debug/scylla-1/data/system/topology-5be1feb3929e3df98da8d771c36129a0/me-3gg3_0960_2gb0g21pmqq2ru9rh1-big-Data.db:level=0]. 21kB to 13kB (~64% of original) in 224ms = 96kB/s. ~256 total partitions merged to 1.
WARN  2024-05-13 03:18:01,518 [shard 0:strm] raft_topology - (rate limiting dropped 148 similar messages) cannot map 73c9728d-a75e-46ca-99b0-9e16b09b722d to ip, retrying.
WARN  2024-05-13 03:18:02,524 [shard 0:strm] raft_topology - (rate limiting dropped 170 similar messages) cannot map 73c9728d-a75e-46ca-99b0-9e16b09b722d to ip, retrying.
WARN  2024-05-13 03:18:03,529 [shard 0:strm] raft_topology - (rate limiting dropped 178 similar messages) cannot map 73c9728d-a75e-46ca-99b0-9e16b09b722d to ip, retrying.
WARN  2024-05-13 03:18:04,530 [shard 0:strm] raft_topology - (rate limiting dropped 181 similar messages) cannot map 73c9728d-a75e-46ca-99b0-9e16b09b722d to ip, retrying.
WARN  2024-05-13 03:18:05,530 [shard 0:strm] raft_topology - (rate limiting dropped 174 similar messages) cannot map 73c9728d-a75e-46ca-99b0-9e16b09b722d to ip, retrying.
WARN  2024-05-13 03:18:06,533 [shard 0:strm] raft_topology - (rate limiting dropped 178 similar messages) cannot map 73c9728d-a75e-46ca-99b0-9e16b09b722d to ip, retrying.
WARN  2024-05-13 03:18:07,535 [shard 0:strm] raft_topology - (rate limiting dropped 183 similar messages) cannot map 73c9728d-a75e-46ca-99b0-9e16b09b722d to ip, retrying.
WARN  2024-05-13 03:18:08,535 [shard 0:strm] raft_topology - (rate limiting dropped 183 similar messages) cannot map 73c9728d-a75e-46ca-99b0-9e16b09b722d to ip, retrying.
WARN  2024-05-13 03:18:09,540 [shard 0:strm] raft_topology - (rate limiting dropped 180 similar messages) cannot map 73c9728d-a75e-46ca-99b0-9e16b09b722d to ip, retrying.
WARN  2024-05-13 03:18:10,544 [shard 0:strm] raft_topology - (rate limiting dropped 184 similar messages) cannot map 73c9728d-a75e-46ca-99b0-9e16b09b722d to ip, retrying.
WARN  2024-05-13 03:18:11,547 [shard 0:strm] raft_topology - (rate limiting dropped 185 similar messages) cannot map 73c9728d-a75e-46ca-99b0-9e16b09b722d to ip, retrying.
INFO  2024-05-13 03:18:11,553 [shard 0:strm] raft_topology - raft topology: Refreshing table load stats for DC datacenter1 that has 4 endpoints
DEBUG 2024-05-13 03:18:11,616 [shard 0:strm] raft_topology - raft topology: Refreshed table load stats for DC datacenter1, table=633ed550-10d7-11ef-b7dc-281846dedb04, RF=3, size_in_bytes=0, split_ready_seq_number=-9223372036854775808
DEBUG 2024-05-13 03:18:11,616 [shard 0:strm] raft_topology - raft topology: Refreshed table load stats for all DC(s).
WARN  2024-05-13 03:18:12,551 [shard 0:strm] raft_topology - (rate limiting dropped 175 similar messages) cannot map 73c9728d-a75e-46ca-99b0-9e16b09b722d to ip, retrying.
WARN  2024-05-13 03:18:13,554 [shard 0:strm] raft_topology - (rate limiting dropped 184 similar messages) cannot map 73c9728d-a75e-46ca-99b0-9e16b09b722d to ip, retrying.
WARN  2024-05-13 03:18:14,559 [shard 0:strm] raft_topology - (rate limiting dropped 182 similar messages) cannot map 73c9728d-a75e-46ca-99b0-9e16b09b722d to ip, retrying.
WARN  2024-05-13 03:18:15,561 [shard 0:strm] raft_topology - (rate limiting dropped 179 similar messages) cannot map 73c9728d-a75e-46ca-99b0-9e16b09b722d to ip, retrying.
WARN  2024-05-13 03:18:16,564 [shard 0:strm] raft_topology - (rate limiting dropped 188 similar messages) cannot map 73c9728d-a75e-46ca-99b0-9e16b09b722d to ip, retrying.
WARN  2024-05-13 03:18:17,565 [shard 0:strm] raft_topology - (rate limiting dropped 170 similar messages) cannot map 73c9728d-a75e-46ca-99b0-9e16b09b722d to ip, retrying.
WARN  2024-05-13 03:18:18,567 [shard 0:strm] raft_topology - (rate limiting dropped 162 similar messages) cannot map 73c9728d-a75e-46ca-99b0-9e16b09b722d to ip, retrying.
WARN  2024-05-13 03:18:19,569 [shard 0:strm] raft_topology - (rate limiting dropped 172 similar messages) cannot map 73c9728d-a75e-46ca-99b0-9e16b09b722d to ip, retrying.
WARN  2024-05-13 03:18:20,573 [shard 0:strm] raft_topology - (rate limiting dropped 183 similar messages) cannot map 73c9728d-a75e-46ca-99b0-9e16b09b722d to ip, retrying.
WARN  2024-05-13 03:18:21,575 [shard 0:strm] raft_topology - (rate limiting dropped 180 similar messages) cannot map 73c9728d-a75e-46ca-99b0-9e16b09b722d to ip, retrying.
WARN  2024-05-13 03:18:22,580 [shard 0:strm] raft_topology - (rate limiting dropped 174 similar messages) cannot map 73c9728d-a75e-46ca-99b0-9e16b09b722d to ip, retrying.
WARN  2024-05-13 03:18:23,583 [shard 0:strm] raft_topology - (rate limiting dropped 183 similar messages) cannot map 73c9728d-a75e-46ca-99b0-9e16b09b722d to ip, retrying.
WARN  2024-05-13 03:18:24,587 [shard 0:strm] raft_topology - (rate limiting dropped 180 similar messages) cannot map 73c9728d-a75e-46ca-99b0-9e16b09b722d to ip, retrying.
WARN  2024-05-13 03:18:25,589 [shard 0:strm] raft_topology - (rate limiting dropped 179 similar messages) cannot map 73c9728d-a75e-46ca-99b0-9e16b09b722d to ip, retrying.
WARN  2024-05-13 03:18:26,590 [shard 0:strm] raft_topology - (rate limiting dropped 185 similar messages) cannot map 73c9728d-a75e-46ca-99b0-9e16b09b722d to ip, retrying.
WARN  2024-05-13 03:18:27,591 [shard 0:strm] raft_topology - (rate limiting dropped 180 similar messages) cannot map 73c9728d-a75e-46ca-99b0-9e16b09b722d to ip, retrying.
WARN  2024-05-13 03:18:28,595 [shard 0:strm] raft_topology - (rate limiting dropped 181 similar messages) cannot map 73c9728d-a75e-46ca-99b0-9e16b09b722d to ip, retrying.
WARN  2024-05-13 03:18:29,599 [shard 0:strm] raft_topology - (rate limiting dropped 182 similar messages) cannot map 73c9728d-a75e-46ca-99b0-9e16b09b722d to ip, retrying.
ERROR 2024-05-13 03:18:30,546 [shard 0:strm] raft_topology - raft_topology_cmd failed with: service::wait_for_ip_timeout (failed to obtain an IP for 73c9728d-a75e-46ca-99b0-9e16b09b722d in 30s)
WARN  2024-05-13 03:18:30,832 [shard 0:strm] raft_topology - raft_topology_cmd::command::wait_for_ip failed, error std::runtime_error (raft topology: exec_global_command(wait_for_ip) failed with std::runtime_error (failed status returned from 9b9f337a-54c2-41f1-8101-6915fb07d72f/127.108.75.3))
INFO  2024-05-13 03:18:30,836 [shard 0:strm] raft_topology - updating topology state: bootstrap: node rejected
DEBUG 2024-05-13 03:18:30,847 [shard 0:strm] raft_topology - reload raft topology state
DEBUG 2024-05-13 03:18:30,879 [shard 0:strm] raft_topology - topology_state_load: the last committed CDC generation ID: (2024/05/13 03:17:42, 5c3fec76-10d7-11ef-73a6-bf82cef18219)
INFO  2024-05-13 03:18:30,881 [shard 0:strm] raft_topology - rejected node moved to left state 73c9728d-a75e-46ca-99b0-9e16b09b722d
WARN  2024-05-13 03:18:30,881 [shard 0:strm] raft_topology - attempt to send rejection response to 73c9728d-a75e-46ca-99b0-9e16b09b722d failed. The node may hang. It's safe to shut it down manually now. Error: no ip address mapping for 73c9728d-a75e-46ca-99b0-9e16b09b722d
...

Have the same problem for add_new_node, decommission_node, and remove_node cluster events (same error injections) in my environment because these cluster events take more than 30s.

To reproduce these specific failures you need to checkout the PR and change CLUSTER_EVENTS and ERROR_INJECTIONS tuples (in test/topology_experimental_raft/cluster_events.py and test/topology_experimental_raft/error_injections.py files correspondingly) to run just required combination.

testlog.tar.gz

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant