mv: gossip the same backlog if a different backlog was sent in a response #18663
base: master
Conversation
This patch only concerns the last commit. It is based on #18646 to avoid later conflicts.
🔴 CI State: FAILURE ❌ - Build Details:
Rebased on updated #18646 and fixed the compilation error.
🔴 CI State: FAILURE ❌ - Build Failure. Build Details:
Force-pushed from 6c91864 to 33f4a75
🟢 CI State: SUCCESS ✅ - Build Details:
db/view/node_view_update_backlog.hh (outdated)

```cpp
@@ -37,16 +39,20 @@ class node_update_backlog {
    std::chrono::milliseconds _interval;
    std::atomic<clock::time_point> _last_update;
    std::atomic<update_backlog> _max;
    std::vector<std::atomic<need_publishing>> _need_publishing;
```
These `need_publishing` flags will be placed very close together. Even though each one will be touched 99% of the time by a single shard, they will most likely land in a single cache line, so performance will be affected by false sharing.

It would be much better to put this into `per_shard_backlog`, right after the `backlog` field. The alignment of `backlog` will prevent the false sharing issues.
Moreover, I'm not sure these have to be atomics. You could change `add_fetch_for_gossip` to perform `invoke_on_all` to gather the values on all shards. It only runs once a second, so it's fine.
Added
service/storage_proxy.cc (outdated)

```cpp
std::optional<db::view::update_backlog> storage_proxy::get_view_update_backlog_for_gossip(std::optional<db::view::update_backlog> last_backlog) const {
    return _max_view_update_backlog.add_fetch_for_gossip(this_shard_id(), get_db().local().get_view_update_backlog(), std::move(last_backlog));
}
```
The gossiper runs on shard 0, so the shard parameter is unneeded. It would make more sense to check the shard and call on_internal_error if it is not 0.
Added
db/view/view.cc (outdated)

```cpp
    }
    need_publishing = np.exchange(need_publishing::no, std::memory_order_relaxed);
}
auto max = add_fetch(shard, backlog);
```
I know that the previous code used to do this, but do we really have to? Wouldn't it be sufficient for the gossiper fiber to only gather data and not update the local shard's data?
Makes sense
service/storage_proxy.hh (outdated)

```cpp
db::view::update_backlog get_view_update_backlog() const;
db::view::update_backlog get_view_update_backlog_for_response() const;

std::optional<db::view::update_backlog> get_view_update_backlog_for_gossip(std::optional<db::view::update_backlog> last_backlog) const;
```
I don't like these names because they don't really tell much about what these functions do. It's not obvious what the requirements of "response" or "gossip" are.
I changed the names a bit
db/view/view.cc (outdated)

```cpp
if (!need_publishing && last_backlog && *last_backlog == max) {
    return std::nullopt;
} else {
    return max;
}
```
Do we have to check against the last backlog at all? If the value has changed, then one of the `need_publishing` flags must have been updated, no?

Your description is very information-dense and thus difficult to parse. In such cases, an example usually helps in understanding the bug. The example that you already provided in the description of the issue was, in my opinion, quite clear, so you could write an example here based on it.
To get the view update backlog in a response, we need to update it on the shard which contained the altered data. The write is redirected to the correct shard in storage_proxy::mutate_locally, where database::apply is invoked on the correct shard. However, we don't want to also invoke storage_proxy::get_view_update_backlog on that shard there, because that would cost us an additional allocation (a .then() following the db.apply()) on the common write path. More precisely, as discovered using perf-simple-query, for a single operation we go from 59 allocs, 15 tasks and ~52450 instructions to 60 allocs, 16 tasks and ~52950 instructions.

Instead, we can add the backlog update inside the database::apply call. However, the call shouldn't be directly a part of the apply() method, as we don't want to refer to the storage_proxy there. It also shouldn't be a part of the table::push_view_replica_updates() method, for the same reason. As a result, it needs to be returned from view_update_generator::generate_and_propagate_view_updates(), and passed as a return value through the call stack. This patch prepares all methods in the stack by changing their return value types. For now, the returned values are empty - the actual value to return will be added in a following patch.
When performing a write, we should update the view update backlog on the node and shard where the mutation is actually applied, and potentially also on the coordinator of this write, with the updated value from the replica. Currently, instead of updating the backlog on the shard that applies the mutation, we do it on the coordinator shard of this write, and as a result, the backlog on the correct shard is not updated. This patch enables updating the backlog on the correct shard. To achieve that, the update is performed just after the views are pushed, in view_update_generator::generate_and_propagate_view_updates on the correct shard, and later propagated as a return value to be finally used for updating the max view backlog of the node.
This patch adds a test reproducing issue scylladb#18542. The test performs writes on a table with a materialized view and checks that the view backlog increases. To get the current view update backlog, a new metric "view_update_backlog" is added to the `storage_proxy` metrics. It differs from the `database` metric with the same name: it takes the backlog from max_view_update_backlog, which keeps the view update backlogs from all shards and may be a bit outdated, instead of reading it directly from the view_update_semaphore that the backlog is based on.
I've mostly copied the example, hopefully that's good enough. I also rebased to fix the merge conflicts. At the same time, I added a commit to this PR for better readability, so the patch is now the last 2 commits, not only the last one.
🔴 CI State: FAILURE ❌ - Build Details:
Currently, we only update the backlogs in node_update_backlog at the same time as we're fetching them. This is done using storage_proxy's get_view_update_backlog method, which is confusing because it's a getter with side effects. Additionally, we don't always want to update the backlog when we're reading it (as in gossip, which runs only on shard 0), and we don't always want to read it when we're updating it (when we're not handling any writes but the backlog drops because background work finished). This patch divides node_view_backlog::add_fetch, as well as storage_proxy::get_view_update_backlog, into two methods each: one for updating and one for reading the backlog. This patch only replaces the places where we currently use the view backlog getter; more situations where we should get/update the backlog should be considered in a following patch.
Currently, there are 2 ways of sharing a backlog with other nodes: through a gossip mechanism, and with responses to replica writes. In gossip, we check each second if the backlog changed, and if it did, we update other nodes with it. However, if the backlog for this node changed on another node with a write response, the gossiped backlog is currently not updated, so if after the response the backlog goes back to the value from the previous gossip round, it will not get sent and the other node will stay with an outdated backlog. This patch changes this by notifying the gossip that a different backlog was sent in a response, causing it to send an unchanged backlog to other nodes in the following gossip round.

Fixes: scylladb#18461
🔴 CI State: FAILURE ❌ - Build Failure. Build Details:
The description looks better, thanks.
I reviewed the last commit. I left some comments. Please fix the nits, the compilation error and, if possible, rebase on master so that this commit does not require #18646.
Otherwise, looks good.
```cpp
});

auto max = fetch();
if (!np && last_backlog && *last_backlog == max) {
```
I asked about this before, but the comment was resolved. Isn't it sufficient to just check `if (!np)`?

Previously, we used to compare the last backlog vs the current one in order to decide whether it needs publishing. Now, we have the `need_publishing` flag. Is the `last_backlog && *last_backlog == max` check still necessary?
The whole if..else could be written as a ternary operator: `return np ? max : std::nullopt;`. I didn't try to compile it; there is a chance that you'll get a type error because both branches don't have the same type, but it would be nice to try.
> I asked about this before, but the comment was resolved. Isn't it sufficient to just check `if (!np)`? Previously, we used to compare the last backlog vs the current one in order to decide whether it needs publishing. Now, we have the `need_publishing` flag. Is the `last_backlog && *last_backlog == max` check still necessary?
After #18804 is merged and the `need_publishing` flag is also updated in that context, the `last_backlog && *last_backlog == max` check won't be needed anymore. Without #18804, the `need_publishing` flag is not set when the backlog drops, so we have no other way of detecting that than comparing with the last sent backlog.
```cpp
if (_backlogs[shard].need_publishing) {
    _backlogs[shard].need_publishing = need_publishing::no;
    return need_publishing::yes;
}
return need_publishing::no;
```
Nit: you can use std::exchange to make the body of the lambda nice and short:

```cpp
return std::exchange(_backlogs[shard].need_publishing, need_publishing::no);
```
```cpp
@@ -235,6 +235,8 @@ public:
    // Get information about this node's view update backlog. It combines information from all local shards.
    db::view::update_backlog get_view_update_backlog();

    future<std::optional<db::view::update_backlog>> get_view_update_backlog_if_changed(std::optional<db::view::update_backlog> last_backlog);
```
Nit: you require that this function must be called on shard 0 only. Please put this information in a comment near the signature, so that future users are not surprised.
```cpp
    return a || b;
});

auto max = fetch();
```
This will be prone to races. Consider the following scenario:

1. `_backlog` is updated on shard 1, the `need_publishing` flag is set,
2. The `_last_update` is recalculated, the next one will happen 10ms from now or later,
3. 5ms pass,
4. `fetch_if_changed` runs and clears the `need_publishing` flag on shard 1, but uses `fetch` to read the old value,
5. There are no MV writes anymore and we didn't publish the updated backlog on shard 1.

You should collect the backlogs in the `submit_to` lambda and calculate the maximum in the reducer lambda (you should still collect the np flag in addition to this).
Currently, there are 2 ways of sharing a backlog with other nodes: through a gossip mechanism, and with responses to replica writes. In gossip, we check each second if the backlog changed, and if it did, we update other nodes with it. However, if the backlog for this node changed on another node with a write response, the gossiped backlog is currently not updated, so if after the response the backlog goes back to the value from the previous gossip round, it will not get sent and the other node will stay with an outdated backlog. This can be observed in the following scenario:

1. view_update_backlog_broker (the backlog gossiper) on node N performs an iteration of its backlog update loop, sees no change (the backlog has been empty since the start), and schedules the next iteration after 1s
2. A write handled on N increases its backlog, and the increased backlog is sent to the coordinator in the write response
3. The view updates finish and N's backlog drops back to empty
4. The next iteration of view_update_backlog_broker runs on N; the backlog is empty, as it was in step 1, so no change is seen and no update is sent due to the check

After this scenario happens, the coordinator keeps information about an increased view update backlog on N even though it's actually already empty.

This patch fixes this by notifying the gossip that a different backlog was sent in a response, causing it to send an unchanged backlog to other nodes in the following gossip round.

Fixes: #18461
Similarly to #18646, without admission control (#18334), this patch doesn't affect much, so I'm marking it as backport/none
Tests: manual. Currently this patch only affects the length of the MV flow control delay, which is not reliable enough to base a test on. A proper test will be added when MV admission control is added, so we'll be able to base the test on rejected requests.