Raft topology error injections: failed to add a node to a cluster if another bootstrapping node is stuck #18640

Open
enaydanov opened this issue May 13, 2024 · 0 comments

This is a failure of a synthetic test implemented as part of the "Randomized Failure Injection for Raft Based Topology" test effort: #16223

The idea of the test is to have a cluster in which one node is stressed with error injections and failures, while the rest of the cluster is used to make progress in the Raft state machine.

The add_new_node cluster event failed to add a node to the cluster when used with the following error injections:

  • stop_after_starting_auth_service
  • stop_after_setting_mode_to_normal
  • stop_before_becoming_raft_voter
  • stop_after_updating_cdc_generation
  • stop_before_streaming
  • stop_after_streaming

scylla-6.log (a new node the test is trying to add while node-5 is SIGSTOPed):

...
INFO  2024-05-13 08:08:58,689 [shard 0:strm] raft_group0 - setup_group0: joining group 0...
INFO  2024-05-13 08:08:58,690 [shard 0:strm] raft_group0 - server bb19bc99-4102-43ee-a4a0-c6cc41b6b1bc found no local group 0. Discovering...
INFO  2024-05-13 08:08:58,693 [shard 0:strm] raft_group0 - server bb19bc99-4102-43ee-a4a0-c6cc41b6b1bc found group 0 with group id e442c260-10ff-11ef-ad2b-a8475d9bb3ee, leader 40bc108d-f62e-4209-8d86-7d6485bf3028
INFO  2024-05-13 08:08:58,693 [shard 0:strm] raft_topology - join: sending the join request to 127.193.106.2
INFO  2024-05-13 08:08:58,900 [shard 0:strm] raft_topology - join: request to join placed, waiting for the response from the topology coordinator
INFO  2024-05-13 08:08:58,914 [shard 0:strm] raft_group0 - Server bb19bc99-4102-43ee-a4a0-c6cc41b6b1bc is starting group 0 with id e442c260-10ff-11ef-ad2b-a8475d9bb3ee
DEBUG 2024-05-13 08:08:58,924 [shard 0:strm] raft_topology - reload raft topology state
INFO  2024-05-13 08:08:58,939 [shard 0:strm] raft_group0 - Detected snapshot with index=0, id=d3238a33-19dc-4749-bf24-6c4348fe7c61, triggering new snapshot
WARN  2024-05-13 08:08:58,939 [shard 0:strm] raft_group0 - Could not create new snapshot, there are no entries applied
INFO  2024-05-13 08:09:00,002 [shard 0: gms] gossip - InetAddress 40bc108d-f62e-4209-8d86-7d6485bf3028/127.193.106.2 is now UP, status = NORMAL
INFO  2024-05-13 08:09:00,005 [shard 0: gms] gossip - InetAddress 07803579-905d-44ed-b762-db7c5c172b03/127.193.106.3 is now UP, status = NORMAL
INFO  2024-05-13 08:09:00,006 [shard 0: gms] gossip - InetAddress 580d775b-3e0b-4dcb-ac8f-0e8eb918ce2e/127.193.106.4 is now UP, status = NORMAL
INFO  2024-05-13 08:09:00,008 [shard 0: gms] gossip - InetAddress b04bc736-88ec-4dcc-b839-a51f9de76b57/127.193.106.1 is now UP, status = NORMAL
WARN  2024-05-13 08:09:14,998 [shard 0: gms] gossip - Fail to send EchoMessage to 127.193.106.5: seastar::rpc::timeout_error (rpc call timed out)

After the last message, node-6 just does nothing.
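To see where the joining node stops making progress, the log can be scanned for the last message emitted by each logger. The following is a minimal sketch, assuming the line format matches the excerpt above; the regex, the `last_per_logger` helper, and the `SAMPLE` constant are illustrative and not part of the test suite.

```python
import re
from datetime import datetime

# Assumed line layout, derived from the excerpt above:
# LEVEL  YYYY-MM-DD HH:MM:SS,mmm [shard N:group] logger - message
LINE_RE = re.compile(
    r"^(?P<level>[A-Z]+)\s+"
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"\[shard\s+(?P<shard>\d+):\s*(?P<group>\w+)\] "
    r"(?P<logger>\S+) - (?P<msg>.*)$"
)

def last_per_logger(lines):
    """Return {logger: (timestamp, message)} for the last line of each logger."""
    last = {}
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m is None:
            continue  # skip "..." and any line that does not match
        ts = datetime.strptime(m["ts"], "%Y-%m-%d %H:%M:%S,%f")
        last[m["logger"]] = (ts, m["msg"])
    return last

# Three lines copied from the excerpt above.
SAMPLE = """\
INFO  2024-05-13 08:08:58,900 [shard 0:strm] raft_topology - join: request to join placed, waiting for the response from the topology coordinator
DEBUG 2024-05-13 08:08:58,924 [shard 0:strm] raft_topology - reload raft topology state
WARN  2024-05-13 08:09:14,998 [shard 0: gms] gossip - Fail to send EchoMessage to 127.193.106.5: seastar::rpc::timeout_error (rpc call timed out)
"""

last = last_per_logger(SAMPLE.splitlines())
# Seconds between the last raft_topology message and the last gossip message.
gap = (last["gossip"][0] - last["raft_topology"][0]).total_seconds()
print(f"raft_topology silent for {gap:.3f}s before the last gossip message")
```

Running the same helper over the full scylla-6.log shows which subsystem logged last, which helps narrow down what the node is waiting on.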

To reproduce these specific failures, check out the PR and change the CLUSTER_EVENTS and ERROR_INJECTIONS tuples (in test/topology_experimental_raft/cluster_events.py and test/topology_experimental_raft/error_injections.py, respectively) so that only the required combination runs.
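As a rough illustration of that narrowing, the two tuples could be reduced as sketched below. This is an assumption about their shape: in the PR the tuples may hold callables or other objects rather than plain strings, and the `COMBINATIONS` name is hypothetical, used here only to show which pairs would remain.

```python
from itertools import product

# Assumption: the real tuples in the PR may contain callables or other
# objects; plain strings are used here for illustration only.

# test/topology_experimental_raft/cluster_events.py
CLUSTER_EVENTS = ("add_new_node",)

# test/topology_experimental_raft/error_injections.py
ERROR_INJECTIONS = (
    "stop_after_starting_auth_service",
    "stop_after_setting_mode_to_normal",
    "stop_before_becoming_raft_voter",
    "stop_after_updating_cdc_generation",
    "stop_before_streaming",
    "stop_after_streaming",
)

# Hypothetical helper name: the (event, injection) pairs left to run.
COMBINATIONS = tuple(product(CLUSTER_EVENTS, ERROR_INJECTIONS))

for event, injection in COMBINATIONS:
    print(f"{event} + {injection}")
```

To isolate a single failure, keep only one entry in ERROR_INJECTIONS.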

add_node.tar.gz
