
View update backlog may not be gossiped if it's already been updated with a response #18461

Open
wmitros opened this issue Apr 29, 2024 · 7 comments · May be fixed by #18663
Comments

@wmitros (Contributor) commented Apr 29, 2024

The view update backlog is propagated in two ways:

  1. With responses to requests:
```cpp
void storage_proxy::got_response(storage_proxy::response_id_type id, gms::inet_address from, std::optional<db::view::update_backlog> backlog) {
    auto it = _response_handlers.find(id);
    if (it != _response_handlers.end()) {
        tracing::trace(it->second->get_trace_state(), "Got a response from /{}", from);
        if (it->second->response(from)) {
            remove_response_handler_entry(std::move(it)); // last one, remove entry. Will cancel expiration timer too.
        } else {
            it->second->check_for_early_completion();
        }
    }
    maybe_update_view_backlog_of(std::move(from), std::move(backlog));
}

void storage_proxy::got_failure_response(storage_proxy::response_id_type id, gms::inet_address from, size_t count, std::optional<db::view::update_backlog> backlog, error err, std::optional<sstring> msg) {
    auto it = _response_handlers.find(id);
    if (it != _response_handlers.end()) {
        tracing::trace(it->second->get_trace_state(), "Got {} failures from /{}", count, from);
        if (it->second->failure_response(from, count, err, std::move(msg))) {
            remove_response_handler_entry(std::move(it));
        } else {
            it->second->check_for_early_completion();
        }
    }
    maybe_update_view_backlog_of(std::move(from), std::move(backlog));
}

void storage_proxy::maybe_update_view_backlog_of(gms::inet_address replica, std::optional<db::view::update_backlog> backlog) {
    if (backlog) {
        auto now = std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::system_clock::now().time_since_epoch()).count();
        _view_update_backlogs[replica] = {std::move(*backlog), now};
    }
}

db::view::update_backlog storage_proxy::get_view_update_backlog() const {
    return _max_view_update_backlog.add_fetch(this_shard_id(), get_db().local().get_view_update_backlog());
}
```
  2. Every gms::gossiper::INTERVAL (1s) through view_update_backlog_broker using gossip:
```cpp
future<> view_update_backlog_broker::start() {
    _gossiper.register_(shared_from_this());
    if (this_shard_id() == 0) {
        // Gossiper runs only on shard 0, and there's no API to add multiple, per-shard application states.
        // Also, right now we aggregate all backlogs, since the coordinator doesn't keep per-replica shard backlogs.
        _started = seastar::async([this] {
            std::optional<db::view::update_backlog> backlog_published;
            while (!_as.abort_requested()) {
                auto backlog = _sp.local().get_view_update_backlog();
                if (backlog_published && *backlog_published == backlog) {
                    sleep_abortable(gms::gossiper::INTERVAL, _as).get();
                    continue;
                }
                auto now = api::timestamp_type(std::chrono::duration_cast<std::chrono::milliseconds>(
                        std::chrono::system_clock::now().time_since_epoch()).count());
                //FIXME: discarded future.
                (void)_gossiper.add_local_application_state(
                        gms::application_state::VIEW_BACKLOG,
                        gms::versioned_value(seastar::format("{}:{}:{}", backlog.current, backlog.max, now)));
                backlog_published = backlog;
                sleep_abortable(gms::gossiper::INTERVAL, _as).get();
            }
        }).handle_exception_type([] (const seastar::sleep_aborted& ignored) { });
    }
    return make_ready_future<>();
}
```

Consider the following scenario:

  1. Cluster starts, all nodes gossip their empty view update backlog to one another
  2. On node N, view_update_backlog_broker performs an iteration of its backlog update loop, sees no change (the backlog has been empty since the start), and schedules the next iteration after 1s
  3. Within the next 1s, a coordinator (a node other than N) sends a write to N that causes a remote view update (which we do not wait for). As a result, node N replies immediately with an increased view update backlog, which the coordinator then records.
  4. Still within that 1s, node N finishes the view update in the background, dropping its view update backlog back to 0.
  5. In the following iterations of view_update_backlog_broker on N, the backlog is empty, so no change is seen and no update is sent

After this scenario, the coordinator keeps information indicating an increased view update backlog on N even though the backlog is actually empty.

This can be fixed either by also updating the last-sent value (backlog_published in view_update_backlog_broker) when handling responses, or by sending backlog updates even when no change was noticed.
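For concreteness, the second option is just the broker loop above with the backlog_published bookkeeping removed, so an update is pushed every interval regardless of change. A minimal sketch of the changed loop body, using only identifiers from view_update_backlog_broker::start() above (an illustration, not a reviewed patch):

```cpp
// Sketch: "always publish" variant -- backlog_published and its comparison are
// dropped, so the application state is refreshed every gms::gossiper::INTERVAL.
while (!_as.abort_requested()) {
    auto backlog = _sp.local().get_view_update_backlog();
    auto now = api::timestamp_type(std::chrono::duration_cast<std::chrono::milliseconds>(
            std::chrono::system_clock::now().time_since_epoch()).count());
    //FIXME: discarded future.
    (void)_gossiper.add_local_application_state(
            gms::application_state::VIEW_BACKLOG,
            gms::versioned_value(seastar::format("{}:{}:{}", backlog.current, backlog.max, now)));
    sleep_abortable(gms::gossiper::INTERVAL, _as).get();
}
```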

@piodul self-assigned this Apr 29, 2024
@wmitros (Contributor, Author) commented Apr 29, 2024

@piodul are you familiar with how expensive gossiping is? If we go with the simpler solution here, we'll be calling add_local_application_state every second, which may be very expensive if there are periods when we're otherwise not gossiping at all, or essentially free if we're gossiping every second anyway.

There's also a problem with the second solution: the backlog in a response is only propagated to one node, so if we also update the last backlog sent in gossip with backlogs sent in responses, we may think we have already propagated the backlog when it has actually reached only a single node.

Perhaps we should go with a third approach: when sending replies, only note that the backlog has changed since the last gossip round, and still keep the last gossiped backlog in view_update_backlog_broker. This should avoid the second issue, and gossip will keep sending updates each second only as long as we're performing requests with view updates.
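A minimal sketch of this third approach, assuming a hypothetical need_publishing flag and a hypothetical hook on the reply path (neither name exists in the tree):

```cpp
// Hypothetical sketch: the replica remembers whether a backlog it attached to
// a reply differs from the value last published via gossip. The broker then
// publishes when either the backlog changed or the flag is set.
bool need_publishing = false;                        // illustrative broker member
std::optional<db::view::update_backlog> backlog_published;

// Called (hypothetically) wherever a backlog is attached to a write response.
void note_backlog_sent_in_response(const db::view::update_backlog& sent) {
    if (!backlog_published || !(sent == *backlog_published)) {
        need_publishing = true;                      // force a gossip update in the next round
    }
}

// In the broker loop, the skip condition from start() above becomes:
//   if (!need_publishing && backlog_published && *backlog_published == backlog) { sleep; continue; }
// and after add_local_application_state(...):
//   backlog_published = backlog; need_publishing = false;
```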

@piodul (Contributor) commented Apr 29, 2024

@piodul are you familiar with how expensive gossiping is?

In general, AFAIK we try to avoid gossiping data unnecessarily. It probably depends mostly on the size of the data.

The most reasonable solution for me would be to have something like this:

  • Have a flag bool need_publishing;
  • When the local view update backlog changes, set need_publishing = true
  • Run a fiber which, every second:
    • If need_publishing then update the application state in gossip and set need_publishing = false

This is of course a naive model because it assumes only one shard. In reality, calculating the backlog is done with atomics (see node_update_backlog::add_fetch), so this becomes more complicated.

Perhaps we could have a per-shard, non-atomic need_publishing variable; the fiber would use invoke_on_all and do the check I mentioned on each local shard and would update the application state if the flag was true on any of the shards. This approach would avoid all the concerns related to the ordering of atomics (each shard's backlog is only written by that shard, so we properly serialize with any potential updates of the shard-local backlog).
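A hedged Seastar-style sketch of this per-shard variant; the shard-local flag and the clear_need_publishing accessor are illustrative assumptions, not existing storage_proxy members, and the loop body would replace the skip logic in view_update_backlog_broker::start() above:

```cpp
// Hypothetical sketch: each shard owns a plain (non-atomic) bool written only
// by that shard; the shard-0 fiber gathers the flags and publishes if any was
// set. Runs inside the broker's seastar::async context, so .get() is fine.
while (!_as.abort_requested()) {
    bool publish = _sp.map_reduce0(
            // read-and-reset the shard-local flag; serialized with that shard's own writes
            [] (storage_proxy& sp) { return sp.clear_need_publishing(); },
            false,
            std::logical_or<bool>()).get();
    if (publish) {
        auto backlog = _sp.local().get_view_update_backlog();
        // ... format and add_local_application_state(VIEW_BACKLOG, ...) as above ...
    }
    sleep_abortable(gms::gossiper::INTERVAL, _as).get();
}
```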

@wmitros (Contributor, Author) commented May 2, 2024

After discussing this with @piodul, @kostja and @gleb-cloudius, there are a few things worth noting. There are gossip services similar to the view_update_backlog_broker we're using, particularly the load gossiper and the cache-hit-rate gossiper. The load gossiper broadcasts the load every 60s, and the cache-hit-rate gossiper sends its updates every 2s. The gossiped values are used mainly when the node is starting; later they are always somewhat obsolete anyway.
The main difference with the view_update_backlog_broker is that it may be completely unused, in contrast to load and cache-hit-rate, which are useful in practically every workload. It may also be used only periodically, and between those periods, updates from gossip may be needed.
With that in mind, we found a few approaches we can try here:

  1. The approach mentioned by @piodul in #18461 (comment) above, which has the benefit of relatively low complexity and performance cost
  2. Simply send the view update backlog in each iteration of the gossiping loop; this would have the biggest performance cost but the lowest complexity
  3. Time out view update backlog values received in responses after some period; in this case we would assume that if we didn't get an update from gossip, the backlog dropped to 0 (or to the last gossiped value). This approach would be relatively simple and inexpensive, but would allow a larger temporary discrepancy (see the sketch after this list)
  4. Implement another way of propagating view update backlog sizes. Currently the propagation works well as long as there are frequent updates from the same coordinator to the same node; the values propagated using gossip are quite outdated in comparison.
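For approach 3, a hedged coordinator-side sketch: maybe_update_view_backlog_of above already stores a timestamp next to each backlog, so a reader could treat entries older than some window as expired. The function name get_backlog_of, the member names backlog and last_updated_ms, and the 10-second window are illustrative assumptions, not existing code:

```cpp
// Hypothetical sketch of approach 3: backlogs learned from responses expire,
// after which the coordinator falls back to "no backlog" (or, alternatively,
// to the last value received via gossip).
static constexpr auto response_backlog_ttl = std::chrono::seconds(10);  // assumed window

db::view::update_backlog storage_proxy::get_backlog_of(gms::inet_address replica) const {
    auto it = _view_update_backlogs.find(replica);
    if (it == _view_update_backlogs.end()) {
        return db::view::update_backlog::no_backlog();
    }
    auto now = std::chrono::duration_cast<std::chrono::milliseconds>(
            std::chrono::system_clock::now().time_since_epoch()).count();
    auto ttl_ms = std::chrono::duration_cast<std::chrono::milliseconds>(response_backlog_ttl).count();
    if (now - it->second.last_updated_ms > ttl_ms) {
        return db::view::update_backlog::no_backlog();  // stale: assume it has drained
    }
    return it->second.backlog;
}
```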

@avikivity (Member) commented

If it's gossiped every 60 seconds, is an optimization worthwhile?

@kostja (Contributor) commented May 2, 2024

@avikivity the view update backlog is gossiped every second.

@nyh (Contributor) commented May 2, 2024

@wmitros how does this issue relate to #18462? It seems your original problem statement refers to the case where a zero backlog estimate is not gossiped, so some non-zero estimate sent in some previous request gets kept forever. If this is the problem, then this is exactly issue #18462; no need for both issues.

@wmitros (Contributor, Author) commented May 2, 2024

@wmitros how does this issue relate to #18462? It seems your original problem statement refers to the case where a zero backlog estimate is not gossiped, so some non-zero estimate sent in some previous request gets kept forever. If this is the problem, then this is exactly issue #18462; no need for both issues.

These issues have similar symptoms, but they are separate. #18462 refers only to receiving "empty" backlogs from gossip, while this issue is about sending repeated backlogs (which are most likely to be 0 as well, but don't have to be).

wmitros added a commit to wmitros/scylla that referenced this issue May 14, 2024
…onse

Currently, there are two ways of sharing a backlog with other nodes: through
the gossip mechanism, and with responses to replica writes. In gossip, we
check every second whether the backlog changed, and if it did, we update
other nodes with it. However, if another node's record of this node's backlog
changed through a write response, the gossiped backlog is currently not
updated, so if the backlog then returns to the value from the previous gossip
round, it will not get sent and the other node will be left with an outdated
backlog.
This patch changes this by notifying the gossip broker that a different
backlog was sent in a response, causing it to send an otherwise unchanged
backlog to other nodes in the following gossip round.

Fixes: scylladb#18461