
Serialize repair with tablet migration #18641

Open · wants to merge 6 commits into master
Conversation

@tgrabiec (Contributor) commented May 13, 2024

We want to make repair mutually exclusive with tablet migration, to avoid
races between repair reads and writes on one side and replica movement on
the other. Repair is not prepared to handle topology transitions happening
mid-way.

One reason why it's not safe is that repair may successfully write to
a leaving replica after the streaming phase and consider all replicas
repaired, when in fact they are not: the new replica would not be
repaired.

Other kinds of races could result in repair failures. If repair writes
to a leaving replica which was already cleaned up, such writes will
fail, causing repair to fail.

Exclusion works by holding an effective_replication_map_ptr to a version
in which the table's tablets are not in transition. That prevents later
transitions from starting, because the topology coordinator's barrier
waits for that erm to be released before moving to a stage later than
allow_write_both_read_old, i.e. before any requests start using the new
topology. Also, if transitions are already running, repair waits for
them to finish.

A blocked tablet migration (e.g. due to a down node) will now block repair,
whereas before it would fail it. Once the admin resolves the cause of the
blocked migration, repair will continue.

Fixes #17658.
Fixes #18561.
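The retry loop described above can be sketched as a minimal single-threaded model. All names here (`fake_topology`, `acquire_stable_erm`, the `busy_rounds` counter) are illustrative assumptions, not Scylla's real API — the actual code is a seastar coroutine inside repair_service:

```cpp
#include <cassert>
#include <memory>

struct effective_replication_map {
    bool has_tablet_transitions; // true while any tablet of the table migrates
};

struct fake_topology {
    int busy_rounds; // transitions finish after this many wait cycles
    std::shared_ptr<effective_replication_map> get_erm() const {
        return std::make_shared<effective_replication_map>(
            effective_replication_map{busy_rounds > 0});
    }
    void await_quiesced() { if (busy_rounds > 0) { --busy_rounds; } }
};

// Keep retrying until we obtain an erm with no tablet transitions for the
// table. Holding the returned pointer then pins that version for the whole
// repair, which is what blocks later migrations from starting.
std::shared_ptr<effective_replication_map> acquire_stable_erm(fake_topology& t) {
    while (true) {
        auto erm = t.get_erm();
        if (!erm->has_tablet_transitions) {
            return erm;
        }
        erm = nullptr;      // release before waiting, as the PR does
        t.await_quiesced(); // wait for ongoing transitions to finish
    }
}
```

Note that the erm must be released before waiting; otherwise repair would itself block the quiescing it is waiting for.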

@denesb (Contributor) commented May 13, 2024

@tgrabiec doesn't this fix #17658?

@tgrabiec (Author):
@tgrabiec doesn't this fix #17658?

Yes, updated description.

rlogger.info("repair[{}] Table {}.{} has tablet transitions, waiting for topology to quiesce", rid.uuid(), keyspace_name, table_name);
erm = nullptr;
co_await _ss.local().await_topology_quiesced();
rlogger.info("repair[{}] Topology quiesced", rid.uuid());
Reviewer (Member):

should await_topology_quiesced() return group0_guard? maybe not for this usage, but for others that might yield and want to read group0 state.

@tgrabiec (Author):

There's no need; the caller can obtain the guard itself. If the topology is not quiesced at that point, the caller can retry. That's what await_topology_quiesced() would have to do internally anyway in order to return the guard.

Reviewer (Contributor):

After the while (true) loop exits, what does it guarantee exactly?

Will holding a copy of the erm prevent any future topology operations / tablet migration? If so, how does it prevent the operations under the hood?

@tgrabiec (Author):

It guarantees that erm doesn't have tablets in transition.

Holding on to that erm prevents transitions from starting, because the topology coordinator waits for old erms to be released when it executes the global token metadata barrier.
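The barrier behaviour described here can be modelled with `std::shared_ptr` reference counts. The types below (`erm_version`, `fake_coordinator`, `try_pass_barrier`) are illustrative assumptions, not Scylla's real erm tracking machinery:

```cpp
#include <cassert>
#include <memory>

struct erm_version { int version; };

struct fake_coordinator {
    std::shared_ptr<erm_version> current =
        std::make_shared<erm_version>(erm_version{1});

    // The barrier succeeds only when the coordinator is the sole holder
    // of the current version, i.e. no reader still pins the old erm.
    bool try_pass_barrier() {
        if (current.use_count() > 1) {
            return false; // e.g. a running repair pins the old version
        }
        current = std::make_shared<erm_version>(erm_version{current->version + 1});
        return true;
    }
};
```

Usage in this model: a repair copies `current` at its start and resets that copy when done; until then, `try_pass_barrier()` keeps returning false and no new transition can start.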

Reviewer (Contributor):

It guarantees that erm doesn't have tablets in transition.

Only for the table in question which the erm belongs to, or for all tables?

Or even stronger, that no tablets (from any tables) in transitions and no topology changes cluster wide. Since await_topology_quiesced waits for topology_state_machine to be not busy, it means the cluster is not in any of the states (tablet_draining, tablet_migration etc.).

future<> storage_service::await_topology_quiesced() {
    auto holder = _async_gate.hold();

    if (this_shard_id() != 0) {
        // group0 is only set on shard 0.
        co_await container().invoke_on(0, [&] (auto& ss) {
            return ss.await_topology_quiesced();
        });
        co_return;
    }

    group0_guard guard = co_await _group0->client().start_operation(&_group0_as, raft_timeout{});

    while (_topology_state_machine._topology.is_busy()) {
        rtlogger.debug("await_topology_quiesced(): topology is busy");
        release_guard(std::move(guard));
        co_await _topology_state_machine.event.wait();
        guard = co_await _group0->client().start_operation(&_group0_as, raft_timeout{});
    }
}

bool topology::is_busy() const {
    return tstate.has_value();
}

struct topology {
    enum class transition_state: uint16_t {
        join_group0,
        commit_cdc_generation,
        tablet_draining,
        write_both_read_old,
        write_both_read_new,
        tablet_migration,
        tablet_split_finalization,
        left_token_ring,
        rollback_to_normal,
    };

    std::optional<transition_state> tstate;

};

Holding on to that erm prevents transitions from starting, because the topology coordinator waits for old erms to be released when it executes the global token metadata barrier.

Including any tablet migration and adding/removing node as well?

@tgrabiec (Author):

Only for the table in question which the erm belongs to, or for all tables?

Only for the table, because that's what we check for.

@tgrabiec (Author):

Including any tablet migration and adding/removing node as well?

Yes.

@tgrabiec (Author):

Since await_topology_quiesced waits for topology_state_machine to be not busy, it means the cluster is not in any of the states (tablet_draining, tablet_migration etc.).

Yes, it waits for !busy, but that can change again before we obtain the erm, so it alone is not a guarantee. What is guaranteed is what we check about the erm's state.

@@ -3178,7 +3178,8 @@ class row_level_repair_gossip_helper : public gms::i_endpoint_state_change_subsc
}
};

repair_service::repair_service(distributed<gms::gossiper>& gossiper,
repair_service::repair_service(distributed<service::storage_service>& ss,
Reviewer (Contributor):

We already pass repair_service to storage_service. This makes a circular dependency:

    storage_service(abort_source& as, distributed<replica::database>& db,
...
        sharded<repair_service>& repair,
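The dependency break this thread converges on (and which V2 implements) can be sketched as follows. The types and members here are illustrative, not the real signatures: rather than handing storage_service to repair_service — which would close a cycle, since storage_service already receives repair_service — the shared piece, topology_state_machine, is extracted and passed to repair_service directly:

```cpp
#include <cassert>

struct topology_state_machine {
    bool busy = false;
    bool is_busy() const { return busy; }
};

struct repair_service {
    topology_state_machine& tsm; // narrow dependency, no edge back to storage_service
    explicit repair_service(topology_state_machine& t) : tsm(t) {}
    bool topology_quiesced() const { return !tsm.is_busy(); }
};

struct storage_service {
    topology_state_machine& tsm;
    repair_service& repair;      // the only cross-service edge remaining
    storage_service(topology_state_machine& t, repair_service& r)
        : tsm(t), repair(r) {}
};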

repair/repair.cc Outdated
}
rlogger.info("repair[{}] Table {}.{} has tablet transitions, waiting for topology to quiesce", rid.uuid(), keyspace_name, table_name);
erm = nullptr;
co_await _ss.local().await_topology_quiesced();
Reviewer (Contributor):

We need to find a way to avoid using _ss service inside repair_service.

@bhalevy added this to the 6.0 milestone May 23, 2024
@bhalevy added the backport/6.0 label and removed the backport/none (Backport is not required) label May 23, 2024
@mykaul (Contributor) commented May 27, 2024

DO NOT MERGE: Depends on #18026 (review last 4 commits).

I think this initial comment at least can be removed, as #18026 is merged already.

Tests will rely on this; they will run in shuffle mode and disable
balancing around sections that would otherwise be blocked indefinitely
by ongoing shuffling (such as repair).
…ed()

Will be used later in a place which doesn't have access to storage_service
but has access to topology_state_machine.

It's not necessary to start a group0 operation around the polling, because
the is_busy() state can be checked atomically, and if it's false the
topology is no longer busy.
…rvice

It will be propagated to repair_service to avoid cyclic dependency:

storage_service <-> repair_service
We want to exclude repair with tablet migrations to avoid races
between repair reads and writes with replica movement. Repair is not
prepared to handle topology transitions in the middle.

One reason why it's not safe is that repair may successfully write to
a leaving replica post streaming phase and consider all replicas to be
repaired, but in fact they are not, the new replica would not be
repaired.

Other kinds of races could result in repair failures. If repair writes
to a leaving replica which was already cleaned up, such writes will
fail, causing repair to fail.

Excluding works by keeping effective_replication_map_ptr in a version
which doesn't have table's tablets in transitions. That prevents later
transitions from starting because topology coordinator's barrier will
wait for that erm before moving to a stage later than
allow_write_both_read_old, so before any requests start using the new
topology. Also, if transitions are already running, repair waits for
them to finish.

Fixes scylladb#18641
@tgrabiec tgrabiec force-pushed the serialize-repair-with-tablet-migration branch from e794aac to 2fb59fb Compare May 28, 2024 07:45
@tgrabiec (Author) commented:

In V2:

  • avoid cyclic dependency between repair_service and storage_service by extracting topology_state_machine
  • rebased

@kbr-scylla (Contributor) commented:

Is this for general repair (vnode based too), or only tablet repair?

@@ -234,6 +234,12 @@ topology::upgrade_state_type upgrade_state_from_string(const sstring& s) {
on_internal_error(tsmlogger, format("cannot map name {} to upgrade_state", s));
}

future<> topology_state_machine::await_quiesced() {
    while (_topology.is_busy()) {
        co_await event.wait();
    }
}
Reviewer (Contributor):

Why doesn't this use an abort_source?

(I see it's a pre-existing issue, but still)

@@ -399,6 +399,10 @@ public:
return _transitions;
Reviewer (Contributor):

Re: the commit message repair: Exclude tablet migrations with tablet repair —

it contains Fixes #18641,

which refers to the PR itself, not the issue.

@kbr-scylla (Contributor) commented:

@asias you had comments. Do you approve now?

7 participants