
Serialize repair with tablet migration #18641

Open · wants to merge 6 commits into master
Conversation

@tgrabiec (Contributor) commented May 13, 2024

We want to make repair mutually exclusive with tablet migration, to avoid
races between repair reads and writes on one side and replica movement on
the other. Repair is not prepared to handle topology transitions happening
mid-way.

One reason why it's not safe is that repair may successfully write to
a leaving replica after the streaming phase and consider all replicas
repaired, when in fact they are not: the new replica would not be
repaired.

Other kinds of races could result in repair failures. If repair writes
to a leaving replica which was already cleaned up, such writes will
fail, causing repair to fail.

Exclusion works by holding an effective_replication_map_ptr to a version
in which the table's tablets are not in transition. That prevents later
transitions from starting, because the topology coordinator's barrier
waits for that erm to be released before moving to a stage later than
allow_write_both_read_old, i.e. before any requests start using the new
topology. Also, if transitions are already running, repair waits for
them to finish.

A blocked tablet migration (e.g. due to a down node) will now block repair,
whereas before it would fail it. Once the admin resolves the cause of the
blocked migration, repair will continue.

Fixes #17658.
Fixes #18561.
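The retry loop described above can be sketched as a minimal single-threaded model. All names here (`fake_topology`, `acquire_stable_erm`, the `busy_rounds` counter) are illustrative assumptions, not Scylla's real API — the actual code is a seastar coroutine inside repair_service:

```cpp
#include <cassert>
#include <memory>

struct effective_replication_map {
    bool has_tablet_transitions; // true while any tablet of the table migrates
};

struct fake_topology {
    int busy_rounds; // transitions finish after this many wait cycles
    std::shared_ptr<effective_replication_map> get_erm() const {
        return std::make_shared<effective_replication_map>(
            effective_replication_map{busy_rounds > 0});
    }
    void await_quiesced() { if (busy_rounds > 0) { --busy_rounds; } }
};

// Keep retrying until we obtain an erm with no tablet transitions for the
// table. Holding the returned pointer then pins that version for the whole
// repair, which is what blocks later migrations from starting.
std::shared_ptr<effective_replication_map> acquire_stable_erm(fake_topology& t) {
    while (true) {
        auto erm = t.get_erm();
        if (!erm->has_tablet_transitions) {
            return erm;
        }
        erm = nullptr;      // release before waiting, as the PR does
        t.await_quiesced(); // wait for ongoing transitions to finish
    }
}
```

Note that the erm must be released before waiting; otherwise repair would itself block the quiescing it is waiting for.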

@denesb (Contributor) commented May 13, 2024

@tgrabiec doesn't this fix #17658?

@tgrabiec (Author):
@tgrabiec doesn't this fix #17658?

Yes, updated description.

rlogger.info("repair[{}] Table {}.{} has tablet transitions, waiting for topology to quiesce", rid.uuid(), keyspace_name, table_name);
erm = nullptr;
co_await _ss.local().await_topology_quiesced();
rlogger.info("repair[{}] Topology quiesced", rid.uuid());
Reviewer (Member):

should await_topology_quiesced() return group0_guard? maybe not for this usage, but for others that might yield and want to read group0 state.

@tgrabiec (Author):

There's no need; the caller can obtain the guard itself. If the topology is not quiesced at that point, the caller can retry. That's what await_topology_quiesced() would have to do internally anyway in order to return the guard.

Reviewer (Contributor):

After the while (true) loop exits, what does it guarantee exactly?

Will holding a copy of the erm prevent any future topology operations / tablet migration? If so, how does it prevent the operations under the hood?

@tgrabiec (Author):

It guarantees that erm doesn't have tablets in transition.

Holding on to that erm prevents transitions from starting, because the topology coordinator waits for old erms to be released when it executes the global token metadata barrier.
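The barrier behaviour described here can be modelled with `std::shared_ptr` reference counts. The types below (`erm_version`, `fake_coordinator`, `try_pass_barrier`) are illustrative assumptions, not Scylla's real erm tracking machinery:

```cpp
#include <cassert>
#include <memory>

struct erm_version { int version; };

struct fake_coordinator {
    std::shared_ptr<erm_version> current =
        std::make_shared<erm_version>(erm_version{1});

    // The barrier succeeds only when the coordinator is the sole holder
    // of the current version, i.e. no reader still pins the old erm.
    bool try_pass_barrier() {
        if (current.use_count() > 1) {
            return false; // e.g. a running repair pins the old version
        }
        current = std::make_shared<erm_version>(erm_version{current->version + 1});
        return true;
    }
};
```

Usage in this model: a repair copies `current` at its start and resets that copy when done; until then, `try_pass_barrier()` keeps returning false and no new transition can start.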

Reviewer (Contributor):

It guarantees that erm doesn't have tablets in transition.

Only for the table in question which the erm belongs to, or for all tables?

Or even stronger, that no tablets (from any tables) in transitions and no topology changes cluster wide. Since await_topology_quiesced waits for topology_state_machine to be not busy, it means the cluster is not in any of the states (tablet_draining, tablet_migration etc.).

future<> storage_service::await_topology_quiesced() {
    auto holder = _async_gate.hold();

    if (this_shard_id() != 0) {
        // group0 is only set on shard 0.
        co_await container().invoke_on(0, [&] (auto& ss) {
            return ss.await_topology_quiesced();
        });
        co_return;
    }

    group0_guard guard = co_await _group0->client().start_operation(&_group0_as, raft_timeout{});

    while (_topology_state_machine._topology.is_busy()) {
        rtlogger.debug("await_topology_quiesced(): topology is busy");
        release_guard(std::move(guard));
        co_await _topology_state_machine.event.wait();
        guard = co_await _group0->client().start_operation(&_group0_as, raft_timeout{});
    }
}

bool topology::is_busy() const {
    return tstate.has_value();
}

struct topology {
    enum class transition_state: uint16_t {
        join_group0,
        commit_cdc_generation,
        tablet_draining,
        write_both_read_old,
        write_both_read_new,
        tablet_migration,
        tablet_split_finalization,
        left_token_ring,
        rollback_to_normal,
    };

    std::optional<transition_state> tstate;

};

Holding on to that erm prevents transitions from starting, because the topology coordinator waits for old erms to be released when it executes the global token metadata barrier.

Including any tablet migration and adding/removing node as well?

@tgrabiec (Author):

Only for the table in question which the erm belongs to, or for all tables?

Only for the table, because that's what we check for.

@tgrabiec (Author):

Including any tablet migration and adding/removing node as well?

Yes.

@tgrabiec (Author):

Since await_topology_quiesced waits for topology_state_machine to be not busy, it means the cluster is not in any of the states (tablet_draining, tablet_migration etc.).

Yes, it waits for !busy, but that can change again before we obtain the erm, so it alone is not a guarantee. What is guaranteed is what we check about the erm's state.

@@ -3178,7 +3178,8 @@ class row_level_repair_gossip_helper : public gms::i_endpoint_state_change_subsc
}
};

repair_service::repair_service(distributed<gms::gossiper>& gossiper,
repair_service::repair_service(distributed<service::storage_service>& ss,
Reviewer (Contributor):

We already pass repair_service to storage_service. This makes a circular dependency:

    storage_service(abort_source& as, distributed<replica::database>& db,
...
        sharded<repair_service>& repair,
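The dependency break this thread converges on (and which V2 implements) can be sketched as follows. The types and members here are illustrative, not the real signatures: rather than handing storage_service to repair_service — which would close a cycle, since storage_service already receives repair_service — the shared piece, topology_state_machine, is extracted and passed to repair_service directly:

```cpp
#include <cassert>

struct topology_state_machine {
    bool busy = false;
    bool is_busy() const { return busy; }
};

struct repair_service {
    topology_state_machine& tsm; // narrow dependency, no edge back to storage_service
    explicit repair_service(topology_state_machine& t) : tsm(t) {}
    bool topology_quiesced() const { return !tsm.is_busy(); }
};

struct storage_service {
    topology_state_machine& tsm;
    repair_service& repair;      // the only cross-service edge remaining
    storage_service(topology_state_machine& t, repair_service& r)
        : tsm(t), repair(r) {}
};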

repair/repair.cc Outdated
}
rlogger.info("repair[{}] Table {}.{} has tablet transitions, waiting for topology to quiesce", rid.uuid(), keyspace_name, table_name);
erm = nullptr;
co_await _ss.local().await_topology_quiesced();
Reviewer (Contributor):

We need to find a way to avoid using _ss service inside repair_service.

@bhalevy added this to the 6.0 milestone May 23, 2024
@bhalevy added the backport/6.0 label and removed the backport/none (Backport is not required) label May 23, 2024
@mykaul (Contributor) commented May 27, 2024

DO NOT MERGE: Depends on #18026 (review last 4 commits).

I think this initial comment at least can be removed, as #18026 is merged already.

Tests will rely on this; they will run in shuffle mode and disable
balancing around sections that would otherwise be blocked indefinitely
by ongoing shuffling (such as repair).
…ed()

Will be used later in a place which doesn't have access to storage_service
but has access to topology_state_machine.

It's not necessary to start a group0 operation around the polling, because
the is_busy() state can be checked atomically, and if it's false the
topology is no longer busy.
…rvice

It will be propagated to repair_service to avoid cyclic dependency:

storage_service <-> repair_service
We want to exclude repair with tablet migrations to avoid races
between repair reads and writes with replica movement. Repair is not
prepared to handle topology transitions in the middle.

One reason why it's not safe is that repair may successfully write to
a leaving replica post streaming phase and consider all replicas to be
repaired, but in fact they are not, the new replica would not be
repaired.

Other kinds of races could result in repair failures. If repair writes
to a leaving replica which was already cleaned up, such writes will
fail, causing repair to fail.

Excluding works by keeping effective_replication_map_ptr in a version
which doesn't have table's tablets in transitions. That prevents later
transitions from starting because topology coordinator's barrier will
wait for that erm before moving to a stage later than
allow_write_both_read_old, so before any requests start using the new
topology. Also, if transitions are already running, repair waits for
them to finish.

Fixes scylladb#18641
@tgrabiec tgrabiec force-pushed the serialize-repair-with-tablet-migration branch from e794aac to 2fb59fb Compare May 28, 2024 07:45
@tgrabiec (Author) commented:

In V2:

  • avoid cyclic dependency between repair_service and storage_service by extracting topology_state_machine
  • rebased

@kbr-scylla (Contributor) commented:

Is this for general repair (vnode based too), or only tablet repair?

@@ -234,6 +234,12 @@ topology::upgrade_state_type upgrade_state_from_string(const sstring& s) {
on_internal_error(tsmlogger, format("cannot map name {} to upgrade_state", s));
}

future<> topology_state_machine::await_quiesced() {
    while (_topology.is_busy()) {
        co_await event.wait();
    }
}
Reviewer (Contributor):

Why doesn't this use an abort_source?

(I see it's a pre-existing issue, but still)

@@ -399,6 +399,10 @@ public:
return _transitions;
Reviewer (Contributor):

Re: the commit message repair: Exclude tablet migrations with tablet repair —

it contains Fixes #18641,

which refers to the PR itself, not the issue.

@kbr-scylla (Contributor) commented:

@asias you had comments. Do you approve now?

7 participants