Serialize repair with tablet migration #18641
base: master
Conversation
rlogger.info("repair[{}] Table {}.{} has tablet transitions, waiting for topology to quiesce", rid.uuid(), keyspace_name, table_name);
erm = nullptr;
co_await _ss.local().await_topology_quiesced();
rlogger.info("repair[{}] Topology quiesced", rid.uuid());
should await_topology_quiesced() return group0_guard? maybe not for this usage, but for others that might yield and want to read group0 state.
There's no need; the caller can also obtain the guard. If the topology is not quiesced by then, it can retry. That's what await_topology_quiesced() would have to do internally anyway in order to return the guard.
After the while (true) loop exits, what does it guarantee exactly?
Will holding a copy of the erm prevent any future topology operations / tablet migrations? If so, how does it prevent those operations under the hood?
It guarantees that the erm doesn't have tablets in transition.
Holding on to that erm prevents transitions from starting, because the topology coordinator waits for old erms to be released when it executes the global token metadata barrier.
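The "waits for old erms to be released" behaviour can be illustrated with a small, self-contained sketch. All names below are hypothetical stand-ins, not the real Scylla types: the erm snapshot is modeled as a reference-counted object, and the barrier only proceeds once nobody besides the owner holds it.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in for the effective replication map (erm).
struct erm {};

// Simplified model of the global token metadata barrier: the topology
// coordinator may advance only when no one besides the owner still
// holds a reference to the old erm (modeled here via use_count()).
bool barrier_can_advance(const std::shared_ptr<erm>& old_erm) {
    return old_erm.use_count() == 1;
}
```

This mirrors the mechanism described above only in shape: a repair holding a copy of the erm keeps the reference count above one, so the barrier (and hence any new transition) cannot proceed until repair releases it.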
It guarantees that erm doesn't have tablets in transition.
Only for the table in question which the erm belongs to, or for all tables?
Or even stronger: that no tablets (from any table) are in transition and there are no topology changes cluster-wide. Since await_topology_quiesced waits for the topology_state_machine to be not busy, it means the cluster is not in any of the states (tablet_draining, tablet_migration, etc.).
future<> storage_service::await_topology_quiesced() {
    auto holder = _async_gate.hold();
    if (this_shard_id() != 0) {
        // group0 is only set on shard 0.
        co_await container().invoke_on(0, [&] (auto& ss) {
            return ss.await_topology_quiesced();
        });
        co_return;
    }
    group0_guard guard = co_await _group0->client().start_operation(&_group0_as, raft_timeout{});
    while (_topology_state_machine._topology.is_busy()) {
        rtlogger.debug("await_topology_quiesced(): topology is busy");
        release_guard(std::move(guard));
        co_await _topology_state_machine.event.wait();
        guard = co_await _group0->client().start_operation(&_group0_as, raft_timeout{});
    }
}
bool topology::is_busy() const {
    return tstate.has_value();
}

struct topology {
    enum class transition_state: uint16_t {
        join_group0,
        commit_cdc_generation,
        tablet_draining,
        write_both_read_old,
        write_both_read_new,
        tablet_migration,
        tablet_split_finalization,
        left_token_ring,
        rollback_to_normal,
    };
    std::optional<transition_state> tstate;
};
Holding to that erm prevents transitions from starting, because topology coordinator waits for old erms to be released when it executes global token metadata barrier.
Including any tablet migration and adding/removing node as well?
Only for the table in question which the erm belongs to, or for all tables?
Only for the table, because that's what we check for.
Including any tablet migration and adding/removing node as well?
Yes.
Since await_topology_quiesced waits for topology_state_machine to be not busy, it means the cluster is not in any of the states (tablet_draining, tablet_migration etc.).
Yes, it waits for !busy, but that can change again before we obtain the erm, so it's not guaranteed. What is guaranteed is what we check about the erm's state.
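The distinction being made here (the guarantee lives in the held erm snapshot, not in the transient !busy flag) can be sketched as follows. The types are simplified stand-ins; the real code uses seastar coroutines and await_topology_quiesced() instead of the placeholder re-fetch.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Hypothetical, simplified erm snapshot: an empty vector means no
// tablet transitions for this table.
struct erm_snapshot {
    std::vector<int> tablets_in_transition;
    bool has_transitions() const { return !tablets_in_transition.empty(); }
};

struct table_state {
    std::shared_ptr<erm_snapshot> current;
};

// Keep re-fetching the erm until we see a version with no transitions.
// The property we rely on afterwards belongs to the held snapshot; the
// global busy flag may flip again at any time.
std::shared_ptr<erm_snapshot> acquire_quiesced_erm(table_state& t) {
    while (true) {
        auto erm = t.current;
        if (!erm->has_transitions()) {
            return erm; // holding this snapshot blocks new transitions
        }
        // Stand-in for `co_await await_topology_quiesced()` followed by
        // a re-fetch of the erm; here we simulate the quiesce directly.
        t.current = std::make_shared<erm_snapshot>();
    }
}
```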
repair/row_level.cc
Outdated
@@ -3178,7 +3178,8 @@ class row_level_repair_gossip_helper : public gms::i_endpoint_state_change_subsc
    }
};

repair_service::repair_service(distributed<gms::gossiper>& gossiper,
repair_service::repair_service(distributed<service::storage_service>& ss,
We already pass repair_service to storage_service. This creates a circular dependency.
storage_service(abort_source& as, distributed<replica::database>& db,
...
sharded<repair_service>& repair,
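A minimal sketch of one way to break such a cycle, along the lines the PR later takes (propagating the topology_state_machine rather than the whole storage_service). All types below are simplified, hypothetical stand-ins:

```cpp
#include <cassert>

// Simplified stand-in for topology_state_machine: the only piece of
// storage_service state that repair actually needs here.
struct topology_state_machine_stub {
    bool busy = false;
    bool is_busy() const { return busy; }
};

// repair_service depends only on the narrow stub, so there is no
// repair_service <-> storage_service back-edge.
struct repair_service_stub {
    topology_state_machine_stub& tsm;
    explicit repair_service_stub(topology_state_machine_stub& t) : tsm(t) {}
    bool topology_quiesced() const { return !tsm.is_busy(); }
};
```

The design point is that the dependency arrow now goes from repair to a small shared state machine, not to the service that already owns repair.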
repair/repair.cc
Outdated
    }
    rlogger.info("repair[{}] Table {}.{} has tablet transitions, waiting for topology to quiesce", rid.uuid(), keyspace_name, table_name);
    erm = nullptr;
    co_await _ss.local().await_topology_quiesced();
We need to find a way to avoid using _ss service inside repair_service.
Tests will rely on that: they will run in shuffle mode and disable balancing around sections which would otherwise be blocked indefinitely by ongoing shuffling (like repair).
…ed() Will be used later in a place which doesn't have access to storage_service but has access to topology_state_machine. It's not necessary to start a group0 operation around polling, because the busy() state can be checked atomically, and if it's false it means the topology is no longer busy.
…rvice It will be propagated to repair_service to avoid cyclic dependency: storage_service <-> repair_service
We want to exclude repair with tablet migrations to avoid races between repair reads and writes with replica movement. Repair is not prepared to handle topology transitions in the middle.

One reason why it's not safe is that repair may successfully write to a leaving replica post streaming phase and consider all replicas to be repaired, but in fact they are not: the new replica would not be repaired.

Other kinds of races could result in repair failures. If repair writes to a leaving replica which was already cleaned up, such writes will fail, causing repair to fail.

Excluding works by keeping an effective_replication_map_ptr in a version which doesn't have the table's tablets in transition. That prevents later transitions from starting, because the topology coordinator's barrier will wait for that erm before moving to a stage later than allow_write_both_read_old, so before any requests start using the new topology. Also, if transitions are already running, repair waits for them to finish.

Fixes scylladb#18641
e794aac to 2fb59fb
In V2:
Is this for general repair (vnode based too), or only tablet repair?
@@ -234,6 +234,12 @@ topology::upgrade_state_type upgrade_state_from_string(const sstring& s) {
    on_internal_error(tsmlogger, format("cannot map name {} to upgrade_state", s));
}

future<> topology_state_machine::await_quiesced() {
    while (_topology.is_busy()) {
        co_await event.wait();
Why doesn't this use an abort_source?
(I see it's a pre-existing issue, but still)
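For illustration, an abortable variant of such a wait could look roughly like the sketch below, using standard-library primitives. This is only a shape under stated assumptions, not the actual implementation: the real code would use seastar's abort_source and condition variables.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <mutex>

// Hypothetical abortable "await quiesced" wait: blocks until either
// the topology stops being busy or an abort is requested.
struct abortable_quiesce_waiter {
    std::mutex m;
    std::condition_variable cv;
    bool busy = true;
    std::atomic<bool> aborted{false};

    // Returns true if the topology quiesced, false if aborted.
    bool await_quiesced() {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [&] { return !busy || aborted.load(); });
        return !aborted.load();
    }

    void set_quiesced() {
        { std::lock_guard<std::mutex> g(m); busy = false; }
        cv.notify_all();
    }

    void abort() {
        aborted = true;
        cv.notify_all();
    }
};
```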
@@ -399,6 +399,10 @@ public:
    return _transitions;
re the commit message "repair: Exclude tablet migrations with tablet repair":
it contains "Fixes #18641",
which refers to the PR itself, not the issue.
@asias you had comments. Do you approve now?
We want to exclude repair with tablet migrations to avoid races
between repair reads and writes with replica movement. Repair is not
prepared to handle topology transitions in the middle.
One reason why it's not safe is that repair may successfully write to
a leaving replica post streaming phase and consider all replicas to be
repaired, but in fact they are not, the new replica would not be
repaired.
Other kinds of races could result in repair failures. If repair writes
to a leaving replica which was already cleaned up, such writes will
fail, causing repair to fail.
Excluding works by keeping effective_replication_map_ptr in a version
which doesn't have table's tablets in transitions. That prevents later
transitions from starting because topology coordinator's barrier will
wait for that erm before moving to a stage later than
allow_write_both_read_old, so before any requests start using the new
topology. Also, if transitions are already running, repair waits for
them to finish.
A blocked tablet migration (e.g. due to down node) will block repair,
whereas before it would fail. Once admin resolves the cause of blocked migration,
repair will continue.
Fixes #17658.
Fixes #18561.