
test_topology_streaming_failure is flaky depending on where the topology coordinator runs #18614

Open

tgrabiec opened this issue May 10, 2024 · 1 comment

Labels: symptom/ci stability (Issues that failed in ScyllaDB CI - tests and framework)

https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/8667/testReport/junit/(root)/non-boost%20tests/Tests___Unit_Tests___topology_test_topology_failure_recovery_dev_1/

=================================== FAILURES ===================================
_______________________ test_topology_streaming_failure ________________________

request = <FixtureRequest for <Function test_topology_streaming_failure>>
manager = <test.pylib.manager_client.ManagerClient object at 0x7f0422e2c710>

    @pytest.mark.asyncio
    @skip_mode('release', 'error injections are not supported in release mode')
    async def test_topology_streaming_failure(request, manager: ManagerClient):
        """Fail streaming while doing a topology operation"""
        # decommission failure
        servers = await manager.running_servers()
        logs = [await manager.server_open_log(srv.server_id) for srv in servers]
        marks = [await log.mark() for log in logs]
        await manager.api.enable_injection(servers[2].ip_addr, 'stream_ranges_fail', one_shot=True)
        await manager.decommission_node(servers[2].server_id, expected_error="Decommission failed. See earlier errors")
        servers = await manager.running_servers()
        assert len(servers) == 3
        matches = [await log.grep("raft_topology - rollback.*after decommissioning failure, moving transition state to rollback to normal",
                   from_mark=mark) for log, mark in zip(logs, marks)]
        assert sum(len(x) for x in matches) == 1
        # bootstrap failure
        marks = [await log.mark() for log in logs]
        servers = await manager.running_servers()
        s = await manager.server_add(start=False, config={
            'error_injections_at_startup': ['stream_ranges_fail']
        })
        await manager.server_start(s.server_id, expected_error="Bootstrap failed. See earlier errors")
        servers = await manager.running_servers()
        assert s not in servers
        matches = [await log.grep("raft_topology - rollback.*after bootstrapping failure, moving transition state to left token ring",
                   from_mark=mark) for log, mark in zip(logs, marks)]
        assert sum(len(x) for x in matches) == 1
        # bootstrap failure in raft barrier
        marks = [await log.mark() for log in logs]
        servers = await manager.running_servers()
        s = await manager.server_add(start=False)
        await manager.api.enable_injection(servers[1].ip_addr, 'raft_topology_barrier_fail', one_shot=True)
>       await manager.server_start(s.server_id, expected_error="Bootstrap failed. See earlier errors")

The test fails because it expects bootstrap to fail due to the arming of raft_topology_barrier_fail on servers[1]. However, if servers[1] is the topology coordinator, it will never execute the barrier command, because the coordinator excludes itself from the global command, probably here:

        guard = co_await exec_global_command(std::move(guard),
                raft_topology_cmd{raft_topology_cmd::command::barrier},
                {_raft.id()},
                drop_guard_and_retake::no);
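The exclusion behavior can be modeled in a few lines of Python (a simplified sketch, not ScyllaDB code): a global command is delivered to every node except those in the excluded set, so an error injection armed on an excluded node never fires.

```python
# Simplified model (not ScyllaDB code) of exec_global_command's
# excluded-nodes behavior: the coordinator excludes itself from the
# barrier, so an injection armed on it never triggers.

def exec_global_command(nodes, excluded, armed_injections):
    """Deliver a barrier to every node not in `excluded`.
    Returns the set of nodes whose armed injection actually fired."""
    fired = set()
    for node in nodes:
        if node in excluded:
            continue  # the coordinator skips itself
        if node in armed_injections:
            fired.add(node)
    return fired

nodes = {"s0", "s1", "s2"}
coordinator = "s1"
# The test arms raft_topology_barrier_fail on servers[1]; if servers[1]
# happens to be the coordinator, the injection is never reached:
fired = exec_global_command(nodes, excluded={coordinator},
                            armed_injections={coordinator})
assert fired == set()  # injection never fires, so bootstrap succeeds
```

This is exactly the flaky case: the expected error "Bootstrap failed. See earlier errors" never materializes, and server_start times out waiting for it.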

servers[1] log: https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/8667/artifact/testlog/x86_64/dev/scylla-808.log

We can see:

INFO  2024-05-09 17:35:44,643 [shard 0:strm] raft_topology - executing global topology command barrier, excluded nodes: {3a5c178f-20c3-4fce-bf56-4b0cf2c183c0}
DEBUG 2024-05-09 17:35:44,643 [shard 0:strm] raft_topology - send barrier command with term 1 and index 25 to ae87b773-97a6-4b15-a93c-4b540b541267/127.136.35.46
DEBUG 2024-05-09 17:35:44,643 [shard 0:strm] raft_topology - send barrier command with term 1 and index 25 to 4e47fd61-56a2-4030-9e95-83a807d781e2/127.136.35.33
INFO  2024-05-09 17:35:44,650 [shard 0:strm] raft_topology - updating topology state: committed new CDC generation, ID: (2024/05/09 14:35:44, 6b2046a0-0e11-11ef-2257-1d37ae5a1c05)
DEBUG 2024-05-09 17:35:44,652 [shard 0:strm] raft_topology - reload raft topology state
INFO  2024-05-09 17:35:44,661 [shard 0:strm] cdc - Started using generation (2024/05/09 14:35:44, 6b2046a0-0e11-11ef-2257-1d37ae5a1c05).

3a5c178f-20c3-4fce-bf56-4b0cf2c183c0 is the host id of servers[1].

So this is a test problem: the test assumes that servers[1] is not the topology coordinator.
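One possible fix is to arm the injection on a server that is known not to be the coordinator. A minimal sketch of the selection logic (the `get_topology_coordinator` helper in the usage comment is hypothetical, not an existing ManagerClient API):

```python
# Sketch of a possible test fix: pick an injection target that is not
# the topology coordinator, so the barrier command is guaranteed to
# reach the node with the armed injection.

def pick_injection_target(server_ids, coordinator_id):
    """Return the first server id that is not the coordinator."""
    for sid in server_ids:
        if sid != coordinator_id:
            return sid
    raise ValueError("no non-coordinator server available")

# Hypothetical usage in the test (get_topology_coordinator is assumed):
#   coord = await get_topology_coordinator(manager)
#   target = pick_injection_target([s.server_id for s in servers], coord)
#   await manager.api.enable_injection(ip_of(target),
#                                      'raft_topology_barrier_fail',
#                                      one_shot=True)
```

With this, the barrier command is always sent to the injected node and the expected bootstrap failure occurs regardless of where the coordinator runs.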
