
View update backlog may not be gossiped if it's already been updated with a response #18461

Open
wmitros opened this issue Apr 29, 2024 · 7 comments · May be fixed by #18663
Comments

@wmitros (Contributor) commented Apr 29, 2024

The view update backlog is propagated in two ways:

  1. With responses to requests:
```cpp
void storage_proxy::got_response(storage_proxy::response_id_type id, gms::inet_address from, std::optional<db::view::update_backlog> backlog) {
    auto it = _response_handlers.find(id);
    if (it != _response_handlers.end()) {
        tracing::trace(it->second->get_trace_state(), "Got a response from /{}", from);
        if (it->second->response(from)) {
            remove_response_handler_entry(std::move(it)); // last one, remove entry. Will cancel expiration timer too.
        } else {
            it->second->check_for_early_completion();
        }
    }
    maybe_update_view_backlog_of(std::move(from), std::move(backlog));
}

void storage_proxy::got_failure_response(storage_proxy::response_id_type id, gms::inet_address from, size_t count, std::optional<db::view::update_backlog> backlog, error err, std::optional<sstring> msg) {
    auto it = _response_handlers.find(id);
    if (it != _response_handlers.end()) {
        tracing::trace(it->second->get_trace_state(), "Got {} failures from /{}", count, from);
        if (it->second->failure_response(from, count, err, std::move(msg))) {
            remove_response_handler_entry(std::move(it));
        } else {
            it->second->check_for_early_completion();
        }
    }
    maybe_update_view_backlog_of(std::move(from), std::move(backlog));
}

void storage_proxy::maybe_update_view_backlog_of(gms::inet_address replica, std::optional<db::view::update_backlog> backlog) {
    if (backlog) {
        auto now = std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::system_clock::now().time_since_epoch()).count();
        _view_update_backlogs[replica] = {std::move(*backlog), now};
    }
}

db::view::update_backlog storage_proxy::get_view_update_backlog() const {
    return _max_view_update_backlog.add_fetch(this_shard_id(), get_db().local().get_view_update_backlog());
}
```
  2. Every gms::gossiper::INTERVAL (1s) through view_update_backlog_broker using gossip:
```cpp
future<> view_update_backlog_broker::start() {
    _gossiper.register_(shared_from_this());
    if (this_shard_id() == 0) {
        // Gossiper runs only on shard 0, and there's no API to add multiple, per-shard application states.
        // Also, right now we aggregate all backlogs, since the coordinator doesn't keep per-replica shard backlogs.
        _started = seastar::async([this] {
            std::optional<db::view::update_backlog> backlog_published;
            while (!_as.abort_requested()) {
                auto backlog = _sp.local().get_view_update_backlog();
                if (backlog_published && *backlog_published == backlog) {
                    sleep_abortable(gms::gossiper::INTERVAL, _as).get();
                    continue;
                }
                auto now = api::timestamp_type(std::chrono::duration_cast<std::chrono::milliseconds>(
                        std::chrono::system_clock::now().time_since_epoch()).count());
                //FIXME: discarded future.
                (void)_gossiper.add_local_application_state(
                        gms::application_state::VIEW_BACKLOG,
                        gms::versioned_value(seastar::format("{}:{}:{}", backlog.current, backlog.max, now)));
                backlog_published = backlog;
                sleep_abortable(gms::gossiper::INTERVAL, _as).get();
            }
        }).handle_exception_type([] (const seastar::sleep_aborted& ignored) { });
    }
    return make_ready_future<>();
}
```

Consider the following scenario:

  1. Cluster starts, all nodes gossip their empty view update backlog to one another
  2. On node N, view_update_backlog_broker performs an iteration of its backlog update loop, sees no change (the backlog has been empty since the start), and schedules the next iteration after 1s
  3. Within the next 1s, a coordinator (a node other than N) sends a write to N that causes a remote view update (which we do not wait for). As a result, node N replies immediately with an increased view update backlog, which the coordinator then records.
  4. Still within that 1s, node N finishes the view update in the background, dropping its view update backlog back to 0.
  5. In the following iterations of view_update_backlog_broker on N, the backlog is empty, so no change is seen and no update is sent

After this scenario, the coordinator keeps information indicating an increased view update backlog on N even though the backlog is actually empty.

This can be fixed either by also updating the last-sent value (backlog_published in view_update_backlog_broker) when handling responses, or by sending backlog updates even when no change was noticed.
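For concreteness, the second option is just the broker loop above with the backlog_published bookkeeping removed, so an update is pushed every interval regardless of change. A minimal sketch of the changed loop body, using only identifiers from view_update_backlog_broker::start() above (an illustration, not a reviewed patch):

```cpp
// Sketch: "always publish" variant -- backlog_published and its comparison are
// dropped, so the application state is refreshed every gms::gossiper::INTERVAL.
while (!_as.abort_requested()) {
    auto backlog = _sp.local().get_view_update_backlog();
    auto now = api::timestamp_type(std::chrono::duration_cast<std::chrono::milliseconds>(
            std::chrono::system_clock::now().time_since_epoch()).count());
    //FIXME: discarded future.
    (void)_gossiper.add_local_application_state(
            gms::application_state::VIEW_BACKLOG,
            gms::versioned_value(seastar::format("{}:{}:{}", backlog.current, backlog.max, now)));
    sleep_abortable(gms::gossiper::INTERVAL, _as).get();
}
```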

@piodul self-assigned this Apr 29, 2024
@wmitros (Contributor, Author) commented Apr 29, 2024

@piodul are you familiar with how expensive gossiping is? If we go with the simpler solution here, we'll be calling add_local_application_state every second, which may be very expensive if there are periods when we're otherwise not gossiping at all, or essentially free if we're gossiping every second anyway.

There's also a problem with the second solution: the backlog in a response is only propagated to one node, so if we also update the last backlog sent in gossip with backlogs sent in responses, we may think we have already propagated the backlog when it has actually reached only a single node.

Perhaps we should go with a third approach: when sending replies, only note that the backlog has changed since the last gossip round, and still keep the last gossiped backlog in view_update_backlog_broker. This should avoid the second issue, and gossip will keep sending updates each second only as long as we're performing requests with view updates.
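A minimal sketch of this third approach, assuming a hypothetical need_publishing flag and a hypothetical hook on the reply path (neither name exists in the tree):

```cpp
// Hypothetical sketch: the replica remembers whether a backlog it attached to
// a reply differs from the value last published via gossip. The broker then
// publishes when either the backlog changed or the flag is set.
bool need_publishing = false;                        // illustrative broker member
std::optional<db::view::update_backlog> backlog_published;

// Called (hypothetically) wherever a backlog is attached to a write response.
void note_backlog_sent_in_response(const db::view::update_backlog& sent) {
    if (!backlog_published || !(sent == *backlog_published)) {
        need_publishing = true;                      // force a gossip update in the next round
    }
}

// In the broker loop, the skip condition from start() above becomes:
//   if (!need_publishing && backlog_published && *backlog_published == backlog) { sleep; continue; }
// and after add_local_application_state(...):
//   backlog_published = backlog; need_publishing = false;
```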

@piodul (Contributor) commented Apr 29, 2024

@piodul are you familiar with how expensive gossiping is?

In general, AFAIK we try to avoid gossiping data unnecessarily. It probably depends mostly on the size of the data.

The most reasonable solution for me would be to have something like this:

  • Have a flag bool need_publishing;
  • When the local view update backlog changes, set need_publishing = true
  • Run a fiber which, every second:
    • If need_publishing then update the application state in gossip and set need_publishing = false

This is of course a naive model because it assumes only one shard. In reality, calculating the backlog is done with atomics (see node_update_backlog::add_fetch), so this becomes more complicated.

Perhaps we could have a per-shard, non-atomic need_publishing variable; the fiber would use invoke_on_all and do the check I mentioned on each local shard and would update the application state if the flag was true on any of the shards. This approach would avoid all the concerns related to the ordering of atomics (each shard's backlog is only written by that shard, so we properly serialize with any potential updates of the shard-local backlog).
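A hedged Seastar-style sketch of this per-shard variant; the shard-local flag and the clear_need_publishing accessor are illustrative assumptions, not existing storage_proxy members, and the loop body would replace the skip logic in view_update_backlog_broker::start() above:

```cpp
// Hypothetical sketch: each shard owns a plain (non-atomic) bool written only
// by that shard; the shard-0 fiber gathers the flags and publishes if any was
// set. Runs inside the broker's seastar::async context, so .get() is fine.
while (!_as.abort_requested()) {
    bool publish = _sp.map_reduce0(
            // read-and-reset the shard-local flag; serialized with that shard's own writes
            [] (storage_proxy& sp) { return sp.clear_need_publishing(); },
            false,
            std::logical_or<bool>()).get();
    if (publish) {
        auto backlog = _sp.local().get_view_update_backlog();
        // ... format and add_local_application_state(VIEW_BACKLOG, ...) as above ...
    }
    sleep_abortable(gms::gossiper::INTERVAL, _as).get();
}
```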

@wmitros (Contributor, Author) commented May 2, 2024

After discussing this with @piodul, @kostja and @gleb-cloudius, there are a few things worth noting. There are gossip services similar to the view_update_backlog_broker we're using, particularly the load gossiper and the cache-hit-rate gossiper. The load gossiper broadcasts the load every 60s, and the cache-hit-rate gossiper sends its updates every 2s. The gossiped values are used mainly when the node is starting; later they are always somewhat obsolete anyway.
The main difference with the view_update_backlog_broker is that it may be completely unused, in contrast to load and cache-hit-rate, which are useful in practically every workload. It may also be used only periodically, and between those periods, updates from gossip may be needed.
With that in mind, we found a few approaches we can try here:

  1. The approach mentioned by @piodul in #18461 (comment) above, which has the benefit of relatively low complexity and performance cost
  2. Simply send the view update backlog in each iteration of the gossiping loop; this would have the biggest performance cost but the lowest complexity
  3. Time out view update backlog values received in responses after some period; in this case we would assume that if we didn't get an update from gossip, the backlog dropped to 0 (or to the last gossiped value). This approach would be relatively simple and inexpensive, but would allow a larger temporary discrepancy (see the sketch after this list)
  4. Implement another way of propagating view update backlog sizes. Currently the propagation works well as long as there are frequent updates from the same coordinator to the same node; the values propagated using gossip are quite outdated in comparison.
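For approach 3, a hedged coordinator-side sketch: maybe_update_view_backlog_of above already stores a timestamp next to each backlog, so a reader could treat entries older than some window as expired. The function name get_backlog_of, the member names backlog and last_updated_ms, and the 10-second window are illustrative assumptions, not existing code:

```cpp
// Hypothetical sketch of approach 3: backlogs learned from responses expire,
// after which the coordinator falls back to "no backlog" (or, alternatively,
// to the last value received via gossip).
static constexpr auto response_backlog_ttl = std::chrono::seconds(10);  // assumed window

db::view::update_backlog storage_proxy::get_backlog_of(gms::inet_address replica) const {
    auto it = _view_update_backlogs.find(replica);
    if (it == _view_update_backlogs.end()) {
        return db::view::update_backlog::no_backlog();
    }
    auto now = std::chrono::duration_cast<std::chrono::milliseconds>(
            std::chrono::system_clock::now().time_since_epoch()).count();
    auto ttl_ms = std::chrono::duration_cast<std::chrono::milliseconds>(response_backlog_ttl).count();
    if (now - it->second.last_updated_ms > ttl_ms) {
        return db::view::update_backlog::no_backlog();  // stale: assume it has drained
    }
    return it->second.backlog;
}
```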

@avikivity (Member) commented

If it's gossiped every 60 seconds, is an optimization worthwhile?

@kostja (Contributor) commented May 2, 2024

@avikivity the view update backlog is gossiped every second.

@nyh (Contributor) commented May 2, 2024

@wmitros how does this issue relate to #18462? It seems your original problem statement refers to the case where a zero backlog estimate is not gossiped, so some non-zero estimate sent in some previous request gets kept forever. If this is the problem, then this is exactly issue #18462; no need for both issues.

@wmitros (Contributor, Author) commented May 2, 2024

@wmitros how does this issue relate to #18462? It seems your original problem statement refers to the case where a zero backlog estimate is not gossiped, so some non-zero estimate sent in some previous request gets kept forever. If this is the problem, then this is exactly issue #18462; no need for both issues.

These issues have similar symptoms, but they are separate. #18462 refers only to receiving "empty" backlogs from gossip, while this issue is about sending repeated backlogs (which are most likely to be 0 as well, but don't have to be).

wmitros added a commit to wmitros/scylla that referenced this issue May 14, 2024
…onse

Currently, there are two ways of sharing a backlog with other nodes: through
the gossip mechanism, and with responses to replica writes. In gossip, we
check every second whether the backlog changed, and if it did, we update
other nodes with it. However, if another node's record of this node's backlog
changed through a write response, the gossiped backlog is currently not
updated, so if the backlog then returns to the value from the previous gossip
round, it will not get sent and the other node will be left with an outdated
backlog.
This patch changes this by notifying the gossip broker that a different
backlog was sent in a response, causing it to send an otherwise unchanged
backlog to other nodes in the following gossip round.

Fixes: scylladb#18461