You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
It's a failure of a synthetic test implemented as a part of "Randomized Failure Injection for Raft Based Topology" test effort: #16223
The idea of the test is to have a cluster, where one node is stressed with injections and failures and the rest of the cluster is used to make progress of the Raft state machine.
There is a no-op cluster event (30s sleep) and for stop_after_sending_join_node_request and stop_after_bootstrapping_initial_raft_configuration error injections the test is failing because node which SIGSTOPed is not joined the cluster or not failed to bootstrap after receiving SIGCONT:
scylla-5.log:
...
INFO 2024-05-13 03:18:00,246 [shard 0:strm] raft_group0 - setup_group0: joining group 0...
INFO 2024-05-13 03:18:00,247 [shard 0:strm] raft_group0 - server 73c9728d-a75e-46ca-99b0-9e16b09b722d found no local group 0. Discovering...
INFO 2024-05-13 03:18:00,252 [shard 0:strm] raft_group0 - server 73c9728d-a75e-46ca-99b0-9e16b09b722d found group 0 with group id 491c9e00-10d7-11ef-86a4-c6c672e9bd55, leader 04b853a5-a989-4233-9e4b-fc55b48bf69b
INFO 2024-05-13 03:18:00,252 [shard 0:strm] raft_topology - join: sending the join request to 127.108.75.2
INFO 2024-05-13 03:18:00,841 [shard 0:strm] raft_topology - join: request to join placed, waiting for the response from the topology coordinator
INFO 2024-05-13 03:18:31,447 [shard 0:strm] raft_group0 - Server 73c9728d-a75e-46ca-99b0-9e16b09b722d is starting group 0 with id 491c9e00-10d7-11ef-86a4-c6c672e9bd55
DEBUG 2024-05-13 03:18:31,454 [shard 0:strm] raft_topology - reload raft topology state
INFO 2024-05-13 03:18:31,463 [shard 0:strm] raft_group0 - Detected snapshot with index=0, id=ce9b42df-6b70-4e57-bac8-59c9511fa098, triggering new snapshot
...
Logs of other nodes flooded by cannot map 73c9728d-a75e-46ca-99b0-9e16b09b722d to ip, retrying messages:
scylla-1.log
...
INFO 2024-05-13 03:18:00,345 [shard 0:comp] compaction - [Compact system.topology 66fa0890-10d7-11ef-86a4-c6c672e9bd55] Compacting [/home/vagrant/PycharmProjects/scylladb/testlog/debug/scylla-1/data/system/topology-5be1feb3929e3df98da8d771c36129a0/me-3gg3_0960_1tsts21pmqq2ru9rh1-big-Data.db:level=0:origin=memtable,/home/vagrant/PycharmProjects/scylladb/testlog/debug/scylla-1/data/system/topology-5be1feb3929e3df98da8d771c36129a0/me-3gg3_095f_4ow8w21pmqq2ru9rh1-big-Data.db:level=0:origin=compaction]
DEBUG 2024-05-13 03:18:00,348 [shard 0:strm] raft_topology - reload raft topology state
DEBUG 2024-05-13 03:18:00,475 [shard 0:strm] raft_topology - topology_state_load: the last committed CDC generation ID: (2024/05/13 03:17:42, 5c3fec76-10d7-11ef-73a6-bf82cef18219)
DEBUG 2024-05-13 03:18:00,475 [shard 0:strm] raft_topology - topology coordinator fiber got an event
INFO 2024-05-13 03:18:00,501 [shard 0:strm] raft_topology - coordinator fiber found a node to work on id=73c9728d-a75e-46ca-99b0-9e16b09b722d state=none
INFO 2024-05-13 03:18:00,501 [shard 0:strm] raft_topology - executing global topology command wait_for_ip, excluded nodes: {73c9728d-a75e-46ca-99b0-9e16b09b722d}
DEBUG 2024-05-13 03:18:00,501 [shard 0:strm] raft_topology - send wait_for_ip command with term 1 and index 20 to 9b9f337a-54c2-41f1-8101-6915fb07d72f/127.108.75.3
DEBUG 2024-05-13 03:18:00,501 [shard 0:strm] raft_topology - send wait_for_ip command with term 1 and index 20 to 1e7165c0-3d65-4437-b759-919fa7096d69/127.108.75.4
DEBUG 2024-05-13 03:18:00,501 [shard 0:strm] raft_topology - send wait_for_ip command with term 1 and index 20 to 0fa67dfe-5d43-4114-9686-bd9595c9363a/127.108.75.1
DEBUG 2024-05-13 03:18:00,501 [shard 0:strm] raft_topology - topology cmd rpc wait_for_ip is called
DEBUG 2024-05-13 03:18:00,501 [shard 0:strm] raft_topology - send wait_for_ip command with term 1 and index 20 to 04b853a5-a989-4233-9e4b-fc55b48bf69b/127.108.75.2
DEBUG 2024-05-13 03:18:00,514 [shard 0:strm] raft_topology - Got raft_topology_cmd::wait_for_ip, new nodes [[73c9728d-a75e-46ca-99b0-9e16b09b722d]]
WARN 2024-05-13 03:18:00,514 [shard 0:strm] raft_topology - (rate limiting dropped 166 similar messages) cannot map 73c9728d-a75e-46ca-99b0-9e16b09b722d to ip, retrying.
INFO 2024-05-13 03:18:00,585 [shard 0:comp] compaction - [Compact system.topology 66fa0890-10d7-11ef-86a4-c6c672e9bd55] Compacted 2 sstables to [/home/vagrant/PycharmProjects/scylladb/testlog/debug/scylla-1/data/system/topology-5be1feb3929e3df98da8d771c36129a0/me-3gg3_0960_2gb0g21pmqq2ru9rh1-big-Data.db:level=0]. 21kB to 13kB (~64% of original) in 224ms = 96kB/s. ~256 total partitions merged to 1.
WARN 2024-05-13 03:18:01,518 [shard 0:strm] raft_topology - (rate limiting dropped 148 similar messages) cannot map 73c9728d-a75e-46ca-99b0-9e16b09b722d to ip, retrying.
WARN 2024-05-13 03:18:02,524 [shard 0:strm] raft_topology - (rate limiting dropped 170 similar messages) cannot map 73c9728d-a75e-46ca-99b0-9e16b09b722d to ip, retrying.
WARN 2024-05-13 03:18:03,529 [shard 0:strm] raft_topology - (rate limiting dropped 178 similar messages) cannot map 73c9728d-a75e-46ca-99b0-9e16b09b722d to ip, retrying.
WARN 2024-05-13 03:18:04,530 [shard 0:strm] raft_topology - (rate limiting dropped 181 similar messages) cannot map 73c9728d-a75e-46ca-99b0-9e16b09b722d to ip, retrying.
WARN 2024-05-13 03:18:05,530 [shard 0:strm] raft_topology - (rate limiting dropped 174 similar messages) cannot map 73c9728d-a75e-46ca-99b0-9e16b09b722d to ip, retrying.
WARN 2024-05-13 03:18:06,533 [shard 0:strm] raft_topology - (rate limiting dropped 178 similar messages) cannot map 73c9728d-a75e-46ca-99b0-9e16b09b722d to ip, retrying.
WARN 2024-05-13 03:18:07,535 [shard 0:strm] raft_topology - (rate limiting dropped 183 similar messages) cannot map 73c9728d-a75e-46ca-99b0-9e16b09b722d to ip, retrying.
WARN 2024-05-13 03:18:08,535 [shard 0:strm] raft_topology - (rate limiting dropped 183 similar messages) cannot map 73c9728d-a75e-46ca-99b0-9e16b09b722d to ip, retrying.
WARN 2024-05-13 03:18:09,540 [shard 0:strm] raft_topology - (rate limiting dropped 180 similar messages) cannot map 73c9728d-a75e-46ca-99b0-9e16b09b722d to ip, retrying.
WARN 2024-05-13 03:18:10,544 [shard 0:strm] raft_topology - (rate limiting dropped 184 similar messages) cannot map 73c9728d-a75e-46ca-99b0-9e16b09b722d to ip, retrying.
WARN 2024-05-13 03:18:11,547 [shard 0:strm] raft_topology - (rate limiting dropped 185 similar messages) cannot map 73c9728d-a75e-46ca-99b0-9e16b09b722d to ip, retrying.
INFO 2024-05-13 03:18:11,553 [shard 0:strm] raft_topology - raft topology: Refreshing table load stats for DC datacenter1 that has 4 endpoints
DEBUG 2024-05-13 03:18:11,616 [shard 0:strm] raft_topology - raft topology: Refreshed table load stats for DC datacenter1, table=633ed550-10d7-11ef-b7dc-281846dedb04, RF=3, size_in_bytes=0, split_ready_seq_number=-9223372036854775808
DEBUG 2024-05-13 03:18:11,616 [shard 0:strm] raft_topology - raft topology: Refreshed table load stats for all DC(s).
WARN 2024-05-13 03:18:12,551 [shard 0:strm] raft_topology - (rate limiting dropped 175 similar messages) cannot map 73c9728d-a75e-46ca-99b0-9e16b09b722d to ip, retrying.
WARN 2024-05-13 03:18:13,554 [shard 0:strm] raft_topology - (rate limiting dropped 184 similar messages) cannot map 73c9728d-a75e-46ca-99b0-9e16b09b722d to ip, retrying.
WARN 2024-05-13 03:18:14,559 [shard 0:strm] raft_topology - (rate limiting dropped 182 similar messages) cannot map 73c9728d-a75e-46ca-99b0-9e16b09b722d to ip, retrying.
WARN 2024-05-13 03:18:15,561 [shard 0:strm] raft_topology - (rate limiting dropped 179 similar messages) cannot map 73c9728d-a75e-46ca-99b0-9e16b09b722d to ip, retrying.
WARN 2024-05-13 03:18:16,564 [shard 0:strm] raft_topology - (rate limiting dropped 188 similar messages) cannot map 73c9728d-a75e-46ca-99b0-9e16b09b722d to ip, retrying.
WARN 2024-05-13 03:18:17,565 [shard 0:strm] raft_topology - (rate limiting dropped 170 similar messages) cannot map 73c9728d-a75e-46ca-99b0-9e16b09b722d to ip, retrying.
WARN 2024-05-13 03:18:18,567 [shard 0:strm] raft_topology - (rate limiting dropped 162 similar messages) cannot map 73c9728d-a75e-46ca-99b0-9e16b09b722d to ip, retrying.
WARN 2024-05-13 03:18:19,569 [shard 0:strm] raft_topology - (rate limiting dropped 172 similar messages) cannot map 73c9728d-a75e-46ca-99b0-9e16b09b722d to ip, retrying.
WARN 2024-05-13 03:18:20,573 [shard 0:strm] raft_topology - (rate limiting dropped 183 similar messages) cannot map 73c9728d-a75e-46ca-99b0-9e16b09b722d to ip, retrying.
WARN 2024-05-13 03:18:21,575 [shard 0:strm] raft_topology - (rate limiting dropped 180 similar messages) cannot map 73c9728d-a75e-46ca-99b0-9e16b09b722d to ip, retrying.
WARN 2024-05-13 03:18:22,580 [shard 0:strm] raft_topology - (rate limiting dropped 174 similar messages) cannot map 73c9728d-a75e-46ca-99b0-9e16b09b722d to ip, retrying.
WARN 2024-05-13 03:18:23,583 [shard 0:strm] raft_topology - (rate limiting dropped 183 similar messages) cannot map 73c9728d-a75e-46ca-99b0-9e16b09b722d to ip, retrying.
WARN 2024-05-13 03:18:24,587 [shard 0:strm] raft_topology - (rate limiting dropped 180 similar messages) cannot map 73c9728d-a75e-46ca-99b0-9e16b09b722d to ip, retrying.
WARN 2024-05-13 03:18:25,589 [shard 0:strm] raft_topology - (rate limiting dropped 179 similar messages) cannot map 73c9728d-a75e-46ca-99b0-9e16b09b722d to ip, retrying.
WARN 2024-05-13 03:18:26,590 [shard 0:strm] raft_topology - (rate limiting dropped 185 similar messages) cannot map 73c9728d-a75e-46ca-99b0-9e16b09b722d to ip, retrying.
WARN 2024-05-13 03:18:27,591 [shard 0:strm] raft_topology - (rate limiting dropped 180 similar messages) cannot map 73c9728d-a75e-46ca-99b0-9e16b09b722d to ip, retrying.
WARN 2024-05-13 03:18:28,595 [shard 0:strm] raft_topology - (rate limiting dropped 181 similar messages) cannot map 73c9728d-a75e-46ca-99b0-9e16b09b722d to ip, retrying.
WARN 2024-05-13 03:18:29,599 [shard 0:strm] raft_topology - (rate limiting dropped 182 similar messages) cannot map 73c9728d-a75e-46ca-99b0-9e16b09b722d to ip, retrying.
ERROR 2024-05-13 03:18:30,546 [shard 0:strm] raft_topology - raft_topology_cmd failed with: service::wait_for_ip_timeout (failed to obtain an IP for 73c9728d-a75e-46ca-99b0-9e16b09b722d in 30s)
WARN 2024-05-13 03:18:30,832 [shard 0:strm] raft_topology - raft_topology_cmd::command::wait_for_ip failed, error std::runtime_error (raft topology: exec_global_command(wait_for_ip) failed with std::runtime_error (failed status returned from 9b9f337a-54c2-41f1-8101-6915fb07d72f/127.108.75.3))
INFO 2024-05-13 03:18:30,836 [shard 0:strm] raft_topology - updating topology state: bootstrap: node rejected
DEBUG 2024-05-13 03:18:30,847 [shard 0:strm] raft_topology - reload raft topology state
DEBUG 2024-05-13 03:18:30,879 [shard 0:strm] raft_topology - topology_state_load: the last committed CDC generation ID: (2024/05/13 03:17:42, 5c3fec76-10d7-11ef-73a6-bf82cef18219)
INFO 2024-05-13 03:18:30,881 [shard 0:strm] raft_topology - rejected node moved to left state 73c9728d-a75e-46ca-99b0-9e16b09b722d
WARN 2024-05-13 03:18:30,881 [shard 0:strm] raft_topology - attempt to send rejection response to 73c9728d-a75e-46ca-99b0-9e16b09b722d failed. The node may hang. It's safe to shut it down manually now. Error: no ip address mapping for 73c9728d-a75e-46ca-99b0-9e16b09b722d
...
Have the same problem for add_new_node, decommission_node, and remove_node cluster events (same error injections) in my environment because these cluster events take more than 30s.
To reproduce these specific failures you need to checkout the PR and change CLUSTER_EVENTS and ERROR_INJECTIONS tuples (in test/topology_experimental_raft/cluster_events.py and test/topology_experimental_raft/error_injections.py files correspondingly) to run just required combination.
It's a failure of a synthetic test implemented as a part of "Randomized Failure Injection for Raft Based Topology" test effort: #16223
The idea of the test is to have a cluster, where one node is stressed with injections and failures and the rest of the cluster is used to make progress of the Raft state machine.
There is a no-op cluster event (30s sleep) and for
stop_after_sending_join_node_request
andstop_after_bootstrapping_initial_raft_configuration
error injections the test is failing because node which SIGSTOPed is not joined the cluster or not failed to bootstrap after receiving SIGCONT:scylla-5.log:
Logs of other nodes flooded by
cannot map 73c9728d-a75e-46ca-99b0-9e16b09b722d to ip, retrying
messages:scylla-1.log
Have the same problem for
add_new_node
,decommission_node
, andremove_node
cluster events (same error injections) in my environment because these cluster events take more than 30s.To reproduce these specific failures you need to checkout the PR and change
CLUSTER_EVENTS
andERROR_INJECTIONS
tuples (intest/topology_experimental_raft/cluster_events.py
andtest/topology_experimental_raft/error_injections.py
files correspondingly) to run just required combination.testlog.tar.gz
The text was updated successfully, but these errors were encountered: