
storage_proxy: update view update backlog on correct shard when writing #18646

Open
wmitros wants to merge 3 commits into master from update-correct-shard-backlog

Conversation

Contributor

@wmitros wmitros commented May 13, 2024

When a replica applies a write on a table which has a materialized view,
it generates view updates. These updates take up memory, which is tracked
by database::_view_update_concurrency_sem, separately on each shard.
The ratio of units taken from the semaphore to the semaphore's limit
is the shard's view update backlog. Based on these backlogs, we want
to estimate how busy a node is with its view update work. We do that
by taking the max backlog across all shards.
To avoid excessive cross-shard operations, the node's (max) backlog isn't
recalculated every time we need it, but at most once per 10ms (the _interval), with an optimization: the backlog of the calculating shard is always immediately up-to-date, since no cross-shard operation is needed for it:

 update_backlog node_update_backlog::add_fetch(unsigned shard, update_backlog backlog) { 
     _backlogs[shard].backlog.store(backlog, std::memory_order_relaxed); 
     auto now = clock::now(); 
     if (now >= _last_update.load(std::memory_order_relaxed) + _interval) { 
         _last_update.store(now, std::memory_order_relaxed); 
         auto new_max = boost::accumulate( 
                 _backlogs, 
                 update_backlog::no_backlog(), 
                 [] (const update_backlog& lhs, const per_shard_backlog& rhs) { 
                     return std::max(lhs, rhs.load()); 
                 }); 
         _max.store(new_max, std::memory_order_relaxed); 
         return new_max; 
     } 
     return std::max(backlog, _max.load(std::memory_order_relaxed)); 
 } 

For the same reason, even when we do recalculate the node's backlog,
we don't read from the _view_update_concurrency_sem. Instead, for
each shard we also store an update_backlog atomic which we use for
the calculation:

 struct per_shard_backlog { 
     // Multiply by 2 to defeat the prefetcher 
     alignas(seastar::cache_line_size * 2) std::atomic<update_backlog> backlog = update_backlog::no_backlog(); 
  
     update_backlog load() const { 
         return backlog.load(std::memory_order_relaxed); 
     } 
 }; 
 std::vector<per_shard_backlog> _backlogs; 

Due to this distinction, the update_backlog atomic needs to be updated
separately whenever the _view_update_concurrency_sem changes.
This is done by calling storage_proxy::get_view_update_backlog, which reads the _view_update_concurrency_sem of the current shard (in database::get_view_update_backlog)
and then calls node_update_backlog::add_fetch, where the freshly read backlog
is stored in the atomic:

 db::view::update_backlog storage_proxy::get_view_update_backlog() const { 
     return _max_view_update_backlog.add_fetch(this_shard_id(), get_db().local().get_view_update_backlog()); 
 } 
 update_backlog node_update_backlog::add_fetch(unsigned shard, update_backlog backlog) { 
     _backlogs[shard].backlog.store(backlog, std::memory_order_relaxed); 

For this way of calculating the node's view update backlog to work,
the atomics need to be updated correctly whenever the corresponding
shards' semaphores change.

The main event where the view update backlog changes is an incoming write
request. That's why, when handling the request and preparing a response,
we update the backlog by calling storage_proxy::get_view_update_backlog (also
because we want to read the backlog and send it in the response):

backlog update after local view updates (storage_proxy::send_to_live_endpoints in mutate_begin):

 auto lmutate = [handler_ptr, response_id, this, my_address, timeout] () mutable { 
     return handler_ptr->apply_locally(timeout, handler_ptr->get_trace_state()) 
             .then([response_id, this, my_address, h = std::move(handler_ptr), p = shared_from_this()] { 
         // make mutation alive until it is processed locally, otherwise it 
         // may disappear if write timeouts before this future is ready 
         got_response(response_id, my_address, get_view_update_backlog()); 
     }); 
 }; 

backlog update after remote view updates (storage_proxy::remote::handle_write):

 auto f = co_await coroutine::as_future(send_mutation_done(netw::messaging_service::msg_addr{reply_to, shard}, trace_state_ptr, 
         shard, response_id, p->get_view_update_backlog())); 

Now assume that on a certain node a write request is received on shard A,
but it updates a row owned by shard B (A != B). As a result, shard B will generate view
updates and consume units from its _view_update_concurrency_sem, but will
not update its atomic in _backlogs yet. Because both shards in the example
are on the same node, shard A will perform a local write by calling lmutate shown
above. In the lmutate call, apply_locally initiates the actual write on
shard B, and storage_proxy::get_view_update_backlog is then called back
on shard A. At no point does the backlog atomic on shard B get updated, even
though its real backlog increased due to the view updates generated there.
Currently, what we calculate there doesn't matter much - it's only used for the
MV flow control delays, so in this scenario we may merely overload
a replica, causing failed replica writes that are later retried as hints. However,
once we add MV admission control, the calculated backlog will make the difference
between an accepted and a rejected request.
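
To make this concrete, here is a minimal, self-contained model of the bookkeeping described above. It is not ScyllaDB code - the types, names and numbers are invented for illustration - but it shows how the published maximum stays stale when the cache refresh runs on shard A while shard B's semaphore units were consumed:

 // Toy model for illustration only - not ScyllaDB code.
 #include <algorithm>
 #include <atomic>
 #include <cstdio>
 #include <vector>
 
 struct shard_state {
     long sem_units_used = 0;              // source of truth (the semaphore)
     std::atomic<long> cached_backlog{0};  // what responses and gossip see
 };
 
 std::vector<shard_state> shards(2);
 
 // Models the effect of storage_proxy::get_view_update_backlog(): refresh the
 // calling shard's cached value, then take the maximum over all cached values.
 long get_view_update_backlog(unsigned calling_shard) {
     shards[calling_shard].cached_backlog.store(
             shards[calling_shard].sem_units_used, std::memory_order_relaxed);
     long node_max = 0;
     for (auto& s : shards) {
         node_max = std::max(node_max, s.cached_backlog.load(std::memory_order_relaxed));
     }
     return node_max;
 }
 
 int main() {
     shards[1].sem_units_used = 4096;  // shard B (1) generated view updates
     // The bug: the refresh runs on shard A (0), so shard B's cached value is
     // never stored and the reported backlog stays at 0.
     std::printf("reported backlog: %ld\n", get_view_update_backlog(0));
 }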

To fix this correctness issue, we need to update the view update backlog
(call storage_proxy::get_view_update_backlog) on the shard whose semaphore
units have been consumed. We could still read the backlog on the shard that
received the initial request (also using storage_proxy::get_view_update_backlog);
however, if we read the backlog on a different shard than the one which generated
the view updates, it will most likely be outdated (by up to 10ms), so we likely
won't see the backlog change caused by the write request. If we want to see that
change, we need to return the updated value from the shard that generated the view
updates and use it on the shard handling the request.

This patch ensures that when semaphore units are consumed on some shard, its
atomic backlog is updated, and that when a backlog increases as a result of some
request, the increased backlog is returned in the response to that request. This
is achieved by calling storage_proxy::get_view_update_backlog on the shard
where the view update semaphore units are consumed and then passing the
received value up the call stack until the response is sent. The exact place
where the get_view_update_backlog call is added was chosen for performance
reasons: we don't want to create a new task on the common write path just for
this, so the call is embedded deep inside
view_update_generator::generate_and_propagate_view_updates. That way it is only
made when we actually generate view updates, and it runs as part of an existing
coroutine rather than as a new continuation.
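
Using the same toy model (again with invented names, not the real ScyllaDB functions), the shape of the fix is: refresh the cached value on the shard whose semaphore units were consumed and hand that fresh value back to the caller, so the response carries the increased backlog:

 // Toy model of the fix, for illustration only - not ScyllaDB code.
 #include <atomic>
 #include <cstdio>
 #include <vector>
 
 struct shard_state {
     long sem_units_used = 0;
     std::atomic<long> cached_backlog{0};
 };
 
 std::vector<shard_state> shards(2);
 
 // Runs "on" the data shard: apply the write, consume semaphore units, refresh
 // that shard's cached backlog, and return the fresh value to the caller.
 long apply_and_report(unsigned data_shard, long units) {
     shards[data_shard].sem_units_used += units;
     long fresh = shards[data_shard].sem_units_used;
     shards[data_shard].cached_backlog.store(fresh, std::memory_order_relaxed);
     return fresh;  // propagated up the call stack instead of being re-read on shard A
 }
 
 int main() {
     long backlog_for_response = apply_and_report(1, 4096);  // shard B applies the write
     std::printf("backlog returned in the response: %ld\n", backlog_for_response);
 }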

Perf simple query in write mode gives the following results:
Before:
median 506446.36 tps ( 59.3 allocs/op, 16.0 logallocs/op, 15.0 tasks/op, 52680 insns/op, 0 errors)
median absolute deviation: 5352.38
maximum: 521346.01
minimum: 500055.56
After:
median 523666.95 tps ( 59.3 allocs/op, 16.0 logallocs/op, 15.0 tasks/op, 52457 insns/op, 0 errors)
median absolute deviation: 7856.04
maximum: 547143.16
minimum: 451149.62

Fixes: #18542

Without admission control (#18334), this patch doesn't have much practical effect, so I'm marking it as backport/none

@wmitros
Contributor Author

wmitros commented May 13, 2024

I created this as a draft for now, as I still need to run a performance test for it (this affects the common write path). I'm also not entirely content with how verbose the fix became, so if you have an idea on some approach other than changing all the mutate_locally variants, please share

@wmitros wmitros force-pushed the update-correct-shard-backlog branch 2 times, most recently from 15c5afc to 13b3f41 on May 13, 2024 12:29
@wmitros
Contributor Author

wmitros commented May 14, 2024

I added the missing perf_simple_query results so I'm changing this to a PR.

@wmitros wmitros marked this pull request as ready for review May 14, 2024 07:46
@wmitros wmitros added the backport/none (Backport is not required) label May 14, 2024
@scylladb-promoter
Contributor

🔴 CI State: FAILURE

❌ - Build

Build Failure:

[2024-05-14T08:56:56.054Z] FAILED: build/dev/service/paxos/paxos_state.hh.o 
[2024-05-14T08:56:56.054Z] clang++ -MD -MT build/dev/service/paxos/paxos_state.hh.o -MF build/dev/service/paxos/paxos_state.hh.o.d -std=c++20 -I/jenkins/workspace/scylla-master/scylla-ci/scylla/seastar/include -I/jenkins/workspace/scylla-master/scylla-ci/scylla/build/dev/seastar/gen/include -U_FORTIFY_SOURCE -Werror=unused-result -fstack-clash-protection -DSEASTAR_API_LEVEL=7 -DSEASTAR_BUILD_SHARED_LIBS -DSEASTAR_SSTRING -DSEASTAR_ENABLE_ALLOC_FAILURE_INJECTION -DSEASTAR_LOGGER_COMPILE_TIME_FMT -DSEASTAR_SCHEDULING_GROUPS_COUNT=16 -DSEASTAR_LOGGER_TYPE_STDOUT -DSEASTAR_TYPE_ERASE_MORE -DBOOST_NO_CXX98_FUNCTION_BASE -DFMT_SHARED -I/usr/include/p11-kit-1 -std=gnu++20  -ffile-prefix-map=/jenkins/workspace/scylla-master/scylla-ci/scylla=. -march=westmere -DBOOST_ALL_DYN_LINK    -fvisibility=hidden -isystem abseil -Wall -Werror -Wextra -Wimplicit-fallthrough -Wno-mismatched-tags -Wno-c++11-narrowing -Wno-overloaded-virtual -Wno-unused-parameter -Wno-unsupported-friend -Wno-missing-field-initializers -Wno-deprecated-copy -Wno-psabi -Wno-enum-constexpr-conversion -Wno-error=deprecated-declarations -DXXH_PRIVATE_API -DSEASTAR_TESTING_MAIN -ffile-prefix-map=/jenkins/workspace/scylla-master/scylla-ci/scylla=. -march=westmere -DDEVEL -DSEASTAR_ENABLE_ALLOC_FAILURE_INJECTION -DSCYLLA_ENABLE_ERROR_INJECTION -DSCYLLA_ENABLE_PREEMPTION_SOURCE -O2 -DSCYLLA_BUILD_MODE=dev -iquote. -iquote build/dev/gen  --include service/paxos/paxos_state.hh -c -o build/dev/service/paxos/paxos_state.hh.o build/dev/gen/empty.cc
[2024-05-14T08:56:56.055Z] In file included from <built-in>:1:
[2024-05-14T08:56:56.055Z] ./service/paxos/paxos_state.hh:122:43: error: no member named update_backlog in namespace db::view
[2024-05-14T08:56:56.055Z]     static future<std::optional<db::view::update_backlog>> learn(storage_proxy& sp, db::system_keyspace& sys_ks, schema_ptr schema, proposal decision, clock_type::time_point timeout, tracing::trace_state_ptr tr_state);
[2024-05-14T08:56:56.055Z]                                 ~~~~~~~~~~^
[2024-05-14T08:56:56.055Z] 1 error generated.

Build Details:

  • Duration: 1 hr 10 min
  • Builder: i-0d5f2cdfa65ecfc23 (m5d.12xlarge)

@wmitros wmitros force-pushed the update-correct-shard-backlog branch from 13b3f41 to cc83a47 on May 15, 2024 09:55
@wmitros
Contributor Author

wmitros commented May 15, 2024

To avoid an additional allocation, I moved the view update backlog getter (which updates the backlog) from an extra then() to the database::apply coroutine. As a result, perf-simple-query shows no degradation now (actually, I'm consistently getting ~200 insns/op fewer than without the patch)
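
For readers less familiar with the trade-off discussed here, below is a minimal Seastar sketch of the structural difference between a trailing then() and a read placed at the end of an existing coroutine. do_apply() and read_backlog() are stand-ins, not ScyllaDB's database::apply() or storage_proxy::get_view_update_backlog():

 // Sketch assuming a Seastar build environment; stand-in functions only.
 #include <seastar/core/app-template.hh>
 #include <seastar/core/coroutine.hh>
 #include <seastar/core/future.hh>
 #include <iostream>
 
 static int read_backlog() { return 42; }  // stand-in for get_view_update_backlog()
 static seastar::future<> do_apply() { return seastar::make_ready_future<>(); }
 
 // Variant 1: a trailing .then() schedules one more continuation just to read
 // the backlog after the apply finishes.
 seastar::future<int> apply_then_read() {
     return do_apply().then([] { return read_backlog(); });
 }
 
 // Variant 2: the read is the last statement of the same coroutine, so no extra
 // continuation is created for it.
 seastar::future<int> apply_in_coroutine() {
     co_await do_apply();
     co_return read_backlog();
 }
 
 int main(int argc, char** argv) {
     seastar::app_template app;
     return app.run(argc, argv, [] () -> seastar::future<> {
         std::cout << co_await apply_then_read() << " "
                   << co_await apply_in_coroutine() << "\n";
         co_return;
     });
 }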

@scylladb-promoter
Contributor

🔴 CI State: FAILURE

❌ - Build

Build Failure:

[2024-05-15T12:24:03.871Z] FAILED: build/dev/service/paxos/paxos_state.hh.o 
[2024-05-15T12:24:03.871Z] clang++ -MD -MT build/dev/service/paxos/paxos_state.hh.o -MF build/dev/service/paxos/paxos_state.hh.o.d -std=c++20 -I/jenkins/workspace/scylla-master/scylla-ci/scylla/seastar/include -I/jenkins/workspace/scylla-master/scylla-ci/scylla/build/dev/seastar/gen/include -U_FORTIFY_SOURCE -Werror=unused-result -fstack-clash-protection -DSEASTAR_API_LEVEL=7 -DSEASTAR_BUILD_SHARED_LIBS -DSEASTAR_SSTRING -DSEASTAR_ENABLE_ALLOC_FAILURE_INJECTION -DSEASTAR_LOGGER_COMPILE_TIME_FMT -DSEASTAR_SCHEDULING_GROUPS_COUNT=16 -DSEASTAR_LOGGER_TYPE_STDOUT -DSEASTAR_TYPE_ERASE_MORE -DBOOST_NO_CXX98_FUNCTION_BASE -DFMT_SHARED -I/usr/include/p11-kit-1 -std=gnu++20  -ffile-prefix-map=/jenkins/workspace/scylla-master/scylla-ci/scylla=. -march=westmere -DBOOST_ALL_DYN_LINK    -fvisibility=hidden -isystem abseil -Wall -Werror -Wextra -Wimplicit-fallthrough -Wno-mismatched-tags -Wno-c++11-narrowing -Wno-overloaded-virtual -Wno-unused-parameter -Wno-unsupported-friend -Wno-missing-field-initializers -Wno-deprecated-copy -Wno-psabi -Wno-enum-constexpr-conversion -Wno-error=deprecated-declarations -DXXH_PRIVATE_API -DSEASTAR_TESTING_MAIN -ffile-prefix-map=/jenkins/workspace/scylla-master/scylla-ci/scylla=. -march=westmere -DDEVEL -DSEASTAR_ENABLE_ALLOC_FAILURE_INJECTION -DSCYLLA_ENABLE_ERROR_INJECTION -DSCYLLA_ENABLE_PREEMPTION_SOURCE -O2 -DSCYLLA_BUILD_MODE=dev -iquote. -iquote build/dev/gen  --include service/paxos/paxos_state.hh -c -o build/dev/service/paxos/paxos_state.hh.o build/dev/gen/empty.cc
[2024-05-15T12:24:03.873Z] In file included from <built-in>:1:
[2024-05-15T12:24:03.873Z] ./service/paxos/paxos_state.hh:122:43: error: no member named update_backlog in namespace db::view
[2024-05-15T12:24:03.873Z]     static future<std::optional<db::view::update_backlog>> learn(storage_proxy& sp, db::system_keyspace& sys_ks, schema_ptr schema, proposal decision, clock_type::time_point timeout, tracing::trace_state_ptr tr_state);
[2024-05-15T12:24:03.874Z]                                 ~~~~~~~~~~^
[2024-05-15T12:24:03.874Z] 1 error generated.

Build Details:

  • Duration: 2 hr 27 min
  • Builder: i-06c1e5c5a955cecc7 (m5ad.8xlarge)

@wmitros wmitros force-pushed the update-correct-shard-backlog branch from cc83a47 to 4df3039 on May 15, 2024 12:42
@@ -2008,6 +2010,7 @@ future<> database::do_apply(schema_ptr s, const frozen_mutation& m, tracing::tra
co_await coroutine::return_exception_ptr(std::move(ex));
}
lock = lock_f.get();
backlog = _view_update_generator->get_proxy().get_view_update_backlog();
Contributor

That's a horrible layering violation :( Can view_update_generator::push_view_replica_updates() call _proxy.get_view_update_backlog() on its own and return it back to database (and then up the chain) at least?

Contributor

I would ask for the almost opposite thing, which I think will remove 99% of this patch:

Instead of having all this code return the "view update backlog" and changing many functions to pass this backlog up the chain, can't we have the code that needs it call _view_update_generator->get_proxy().get_view_update_backlog() to get it?

Doing whatever you want to do with the backlog can be the last thing you do on the right shard, anyway it accesses a global (shard-local) variable. Why do we need to propagate it all through the call stack?

Contributor

can't we have the code that needs it call _view_update_generator->get_proxy().get_view_update_backlog() to get it?

This is what the code does today, and this patch fixes it like this:

@@ -4070,10 +4075,10 @@ void storage_proxy::send_to_live_endpoints(storage_proxy::response_id_type respo
     // lambda for applying mutation locally
     auto lmutate = [handler_ptr, response_id, this, my_address, timeout] () mutable {
         return handler_ptr->apply_locally(timeout, handler_ptr->get_trace_state())
-                .then([response_id, this, my_address, h = std::move(handler_ptr), p = shared_from_this()] {
+                .then([response_id, this, my_address, h = std::move(handler_ptr), p = shared_from_this()] (std::optional<db::view::update_backlog> b) {
             // make mutation alive until it is processed locally, otherwise it
             // may disappear if write timeouts before this future is ready
-            got_response(response_id, my_address, get_view_update_backlog());
+            got_response(response_id, my_address, b);
         });
     };

so your proposal is to leave things as is :)

Contributor

... so your proposal is to leave things as is :)

And I second that suggestion, by the way

Contributor Author

That's a horrible layering violation :( Can view_update_generator::push_view_replica_updates() call _proxy.get_view_update_backlog() on its own and return it back to database (and then up the chain) at least?

push_view_replica_updates() is actually a method of table, so I suppose we don't want to refer to the storage_proxy there either, so I moved it to view_update_generator::generate_and_propagate_view_updates instead

I would ask for the almost opposite thing, which I think will remove 99% of this patch:

Instead of having all this code return the "view update backlog" and changing many functions to pass this backlog up the chain, can't we have the code that needs it call _view_update_generator->get_proxy().get_view_update_backlog() to get it?

Instead of this entire chain, we could have the code that needs it call _db.invoke_on(..., get_view_update_backlog()), but I expect we don't want to add that to the common write path due to performance considerations
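
For reference, the rejected alternative would look roughly like the sketch below; fake_db is a stand-in for the per-shard database service, and only the seastar::sharded / invoke_on API is real. The explicit cross-shard hop (and the task it schedules) on every local write is exactly what we want to keep off the hot path:

 // Sketch assuming a Seastar build environment; fake_db is a stand-in type.
 #include <seastar/core/app-template.hh>
 #include <seastar/core/coroutine.hh>
 #include <seastar/core/future.hh>
 #include <seastar/core/sharded.hh>
 #include <seastar/core/smp.hh>
 #include <iostream>
 
 struct fake_db {
     long backlog = 0;
     long get_view_update_backlog() const { return backlog; }
     seastar::future<> stop() { return seastar::make_ready_future<>(); }
 };
 
 int main(int argc, char** argv) {
     seastar::app_template app;
     return app.run(argc, argv, [] () -> seastar::future<> {
         seastar::sharded<fake_db> db;
         co_await db.start();
         unsigned data_shard = 1 % seastar::smp::count;
         // The explicit hop: every write would pay for a cross-shard message
         // just to refresh/read the other shard's backlog.
         long b = co_await db.invoke_on(data_shard, [] (fake_db& d) {
             return d.get_view_update_backlog();
         });
         std::cout << "backlog on data shard " << data_shard << ": " << b << "\n";
         co_await db.stop();
         co_return;
     });
 }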

Contributor

Since storage_proxy::get_view_update_backlog() is using atomics to fetch-add its numbers, it can be safely used from any shard. So instead of propagating the result of this function all over the code, we (probably) can teach the proxy to call its get_view_update_backlog() with an int shard argument; this won't cost us an extra submit-to-shard

Contributor Author

storage_proxy::get_view_update_backlog() is using atomics to store and read the cached backlogs, but the value to be stored in the atomics comes from database::get_view_update_backlog which reads the semaphore of the shard it's called on, so to use that we would need the invoke_on
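
In other words, the cached atomics can be read from any shard, but refreshing them requires reading shard-local, non-atomic state. A hypothetical shape of that shard-local getter, based only on the description in this PR (the member names are illustrative, not the real ScyllaDB ones):

 // Hypothetical shape, for illustration only - member names are invented.
 #include <cstddef>
 #include <cstdio>
 
 struct update_backlog {
     std::size_t current;
     std::size_t max;
 };
 
 struct database {
     // Shard-local state: a semaphore is not an atomic, so it is only safe to
     // read on the shard that owns it - hence the invoke_on would be needed.
     std::size_t _view_update_sem_limit = 1 << 20;
     std::size_t _view_update_sem_units_available = 1 << 20;
 
     update_backlog get_view_update_backlog() const {
         return {_view_update_sem_limit - _view_update_sem_units_available,
                 _view_update_sem_limit};
     }
 };
 
 int main() {
     database db;
     db._view_update_sem_units_available -= 4096;  // view updates consumed units
     auto b = db.get_view_update_backlog();
     std::printf("backlog: %zu / %zu\n", b.current, b.max);
 }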

Contributor

Since storage_proxy::get_view_update_backlog() is using atomics to fetch-add its numbers, it can be safely used from any shard. So instead of propagating the result of this function all over the code, we (probably) can teach the proxy to call its get_view_update_backlog() with an int shard argument; this won't cost us an extra submit-to-shard

@xemul Each shard's atomic is currently written only by its shard and occasionally (every 10ms) read by other shards. Consider a non-shard-aware workload - won't this have a significant performance impact if multiple shards are allowed to write to the same atomic on every write operation?

@@ -2077,21 +2081,21 @@ void database::update_write_metrics_for_timed_out_write() {
     ++_stats->total_writes_timedout;
 }
 
-future<> database::apply(schema_ptr s, const frozen_mutation& m, tracing::trace_state_ptr tr_state, db::commitlog::force_sync sync, db::timeout_clock::time_point timeout, db::per_partition_rate_limit::info rate_limit_info) {
+future<std::optional<db::view::update_backlog>> database::apply(schema_ptr s, const frozen_mutation& m, tracing::trace_state_ptr tr_state, db::commitlog::force_sync sync, db::timeout_clock::time_point timeout, db::per_partition_rate_limit::info rate_limit_info) {
Contributor

nit: 99% of this patch is changing the returned future of database::apply() (and its callers) to std::optional<db::view::update_backlog>, and only 1% is the fix itself (in storage_proxy::send_to_live_endpoints() below). It would be great if this patch were split into two -- first with the return-type change, second with the essence of the fix itself

Contributor Author

I split the patch in the rebase

@scylladb-promoter
Contributor

🟢 CI State: SUCCESS

✅ - Build
✅ - Unit Tests Custom
The following new/updated tests ran 100 times for each mode:
🔹 boost/batchlog_manager_test
🔹 boost/database_test
🔹 boost/repair_test
🔹 topology_custom/test_mv_backlog
✅ - Container Test
✅ - dtest with topology changes
✅ - dtest
✅ - Unit Tests

Build Details:

  • Duration: 8 hr 39 min
  • Builder: i-0fba5819eb748e143 (m5ad.12xlarge)

@nyh
Contributor

nyh commented May 16, 2024

I'm still trying to wrap my head around this PR (and the issue it tries to fix), without success. You used the phrase "coordinator shard" multiple times. But a coordinator doesn't have a "view update backlog" - only a replica does. A replica shard has its view update backlog - the number of view updates that it sent and that haven't completed yet. And as you noted, an entire replica node has a "maximum view update backlog" which is the maximum backlog of all its shards.

That's it - a coordinator doesn't maintain a view update backlog for itself. It only collects estimates of view update backlogs for replica nodes (not shards!) so it can calculate its delay. Unless I'm completely confused, there isn't any sense in which "coordinator shards" need to update their backlogs, which you say you are fixing.


return utils::get_local_injector().inject("paxos_timeout_after_save_decision", timeout, [&sys_ks, &decision, schema, timeout, backlog] {
    return sys_ks.save_paxos_decision(*schema, decision, timeout).then([backlog] {
        return make_ready_future<std::optional<db::view::update_backlog>>(std::move(backlog));
    });
Contributor

It looks like here you duplicated code which already existed below, but why, and how is it related to this patch? If this is an incidental fix of a bug (?) please put it in a separate patch in the same PR and explain it separately in a commit message.

Contributor Author

I duplicated the code here because, as a result of this change, the f = f.then(...) now changes the type of f. I used another approach in the rebase

@@ -119,7 +123,7 @@ public:
     static future<bool> accept(storage_proxy& sp, db::system_keyspace& sys_ks, tracing::trace_state_ptr tr_state, schema_ptr schema, dht::token token, const proposal& proposal,
             clock_type::time_point timeout);
     // Replica RPC endpoint for Paxos "learn".
-    static future<> learn(storage_proxy& sp, db::system_keyspace& sys_ks, schema_ptr schema, proposal decision, clock_type::time_point timeout, tracing::trace_state_ptr tr_state);
+    static future<std::optional<db::view::update_backlog>> learn(storage_proxy& sp, db::system_keyspace& sys_ks, schema_ptr schema, proposal decision, clock_type::time_point timeout, tracing::trace_state_ptr tr_state);
Contributor

I hope we can find a way to fix whatever this PR is fixing without adding the concept of "view update backlog" to a gazillion other things like replica::update() and paxos_state. MV is complex enough without intertwining it also with LWT if we don't have to :-(

Contributor Author

As long as we can have MVs on tables on which we can perform LWT queries I'm not sure we can avoid this. We still want to have MV flow control and (future) admission control even if we're only doing LWTs

});
_metrics.add_group(storage_proxy_stats::REPLICA_STATS_CATEGORY, {
sm::make_current_bytes("view_update_backlog", [this] { return _max_view_update_backlog.fetch_shard(this_shard_id()).current; },
sm::description("View update backlog cached size")),
Contributor

Why did you remove the current_throttled_writes statistics? Did you remove it by mistake?
And what's the difference between the two metrics you added here? Maybe part of the same mistake?

Also, please expand on the description "View update backlog cached size". I have no idea what this means, and neither will users.

Contributor Author

I replaced the current_throttled_writes with an extra repetition of the new metric by mistake, sorry.
I added this metric mostly to enable testing this change, but I improved the description slightly in the rebase anyway

await cql.run_async(f"INSERT INTO ks.tab (key, c, v) VALUES ({0}, {i}, '{v*'a'}')")
# The view update backlog should increase on the node generating view updates
view_backlog = get_metric(local_node.address, 'scylla_storage_proxy_replica_view_update_backlog')
assert view_backlog > v
Contributor

Please note that delay_before_remote_view_update only delays by 500ms, not by infinity. I guess it will be enough to notice the backlog 99.9% of the time, but I hope we won't have a crazy-overloaded-super-slow test machine where the 500ms delay passes before we manage to call get_metric() :-(

Contributor Author

I wonder if we can somehow check how feasible this scenario is, though won't many tests fail on such a machine anyway due to timeouts?

# This test reproduces issue #18542
# In the test, we create a table and perform a write to it a couple of times
# Each time, we check that a view update backlog on some shard increased
# due to the write.
Contributor

I'm afraid I don't fully understand this test. First of all, the name of the test is "*_on_correct_shard" but the test doesn't seem to check anything specific to shards, correct or not - just the node's total. The test's comment makes more sense than the test's name (it mentions "some shard", not the correct shard).
Second, can you say a few words (here in the discussion or in the commit message) on how this test failed before this PR? What actually happened? Was the view update backlog seen by this test zero? Why?

Contributor Author

I doubt I can explain this better as a response to a comment than what I wrote in the issue but hopefully by answering questions I can provide the missing insight

Second, can you say a few words (here in the discussion or in the commit message) on how this test failed before this PR? What actually happened? Was the view update backlog seen by this test zero? Why?

Right, the backlog seen by the test was zero. This happened because of the following:
0) all nodes started with empty backlogs

  1. the first node got the write request and became the coordinator of the request
  2. the coordinator sent a replica write request to the second node
  3. the second node received the replica write request on a shard different from the shard that contained the data that needed to be modified (when I was referring to the "coordinator shard" in the description, I meant the shard which received the write request on the replica)
  4. the second node (on the coordinator shard) invoked the correct shard to apply the write
  5. the correct shard applied the write, generating view updates and consuming units from the view_update_concurrency_semaphore (even though the view update backlog is based on this semaphore, this did not increase the view update backlog propagated in responses and gossip)
  6. the coordinator shard on the second node finished the invocation on the shard with the updated data and then called storage_proxy::get_view_update_backlog, using its result for the write response
  7. the coordinator node received the response and updated its saved max backlog for the second node

The problem is that on the shard whose backlog in fact increased, we never updated the backlog used for responses and gossip.

There can be a few sources of confusion here. Two different backlogs are involved per shard: (i.) the backlog derived directly from the view_update_concurrency_semaphore, and (ii.) the backlog cached in the per-shard atomic and propagated in responses and gossip.

  • the i. backlog is always up-to-date (it's the source of truth). The ii. backlog needs to be updated explicitly, and it is the one we use when selecting a max backlog to send in a response or in gossip.

  • the ii. backlog is updated using the backlog getter:

    db::view::update_backlog storage_proxy::get_view_update_backlog() const {
        return _max_view_update_backlog.add_fetch(this_shard_id(), get_db().local().get_view_update_backlog());
    }

    (db/view/view.cc, lines 2612 to 2613 at 2d91422:)

    update_backlog node_update_backlog::add_fetch(unsigned shard, update_backlog backlog) {
        _backlogs[shard].backlog.store(backlog, std::memory_order_relaxed);

As a result, if we never "get" the view update backlog on the shard whose i. backlog increased, the ii. backlog for that shard will never increase, and other nodes will never get the information about a potential overload.

@wmitros wmitros force-pushed the update-correct-shard-backlog branch from 4df3039 to 2e03afa on May 16, 2024 23:07
@wmitros
Contributor Author

wmitros commented May 16, 2024

I'm still trying to wrap my head around this PR (and the issue it tries to fix), without success. You used the phrase "coordinator shard" multiple times.

I tried to explain what I meant by "coordinator shard" in #18646 (comment): it's actually not a shard of the coordinator node, but the shard which handles the write on the actual replica (though I heard it's usually the same shard on which the coordinator node received the request).

But a coordinator doesn't have a "view update backlog" - only a replica does.
A replica shard has its view update backlog - the number of view updates that it sent and that haven't completed yet. And as you noted, an entire replica node has a "maximum view update backlog" which is the maximum backlog of all its shards.

That's it - a coordinator doesn't maintain a view update backlog for itself. It only collects estimates of view update backlogs for replica nodes (not shards!) so it can calculate its delay. Unless I'm completely confused, there isn't any sense in which "coordinator shards" need to update their backlogs, which you say you are fixing.

I mostly repeated the issue description there, but maybe #18646 (comment) helps a bit with the confusion

By the way I'm also not a big fan of how this was implemented but I'm not really seeing a better way of achieving the effect of getting the result of a get_view_update_backlog call to the right place. I also considered adding an output parameter instead of changing the return types of all these methods, or creating a duplicate set of methods only for the places which need to get the backlog (for write responses), but both seemed even worse.

The "cleanest" method would be adding _db.invoke_on(..., get_view_update_backlog()) just in lmutate and receive_mutation_handler or mutate_locally, but it would be expensive for the common write path, which I don't think we want to slow down even slightly

@wmitros wmitros force-pushed the update-correct-shard-backlog branch from 2e03afa to ad3574f on May 16, 2024 23:11
@scylladb-promoter
Contributor

🔴 CI State: FAILURE

❌ - Build

Build Failure:

[2024-05-17T03:25:06.525Z] FAILED: build/debug/replica/database.o 
[2024-05-17T03:25:06.526Z] clang++ -MD -MT build/debug/replica/database.o -MF build/debug/replica/database.o.d -std=c++20 -I/jenkins/workspace/scylla-master/scylla-ci/scylla/seastar/include -I/jenkins/workspace/scylla-master/scylla-ci/scylla/build/debug/seastar/gen/include -U_FORTIFY_SOURCE -Werror=unused-result -fstack-clash-protection -fsanitize=address -fsanitize=undefined -fno-sanitize=vptr -DSEASTAR_API_LEVEL=7 -DSEASTAR_BUILD_SHARED_LIBS -DSEASTAR_SSTRING -DSEASTAR_LOGGER_COMPILE_TIME_FMT -DSEASTAR_SCHEDULING_GROUPS_COUNT=16 -DSEASTAR_DEBUG -DSEASTAR_DEFAULT_ALLOCATOR -DSEASTAR_SHUFFLE_TASK_QUEUE -DSEASTAR_DEBUG_SHARED_PTR -DSEASTAR_DEBUG_PROMISE -DSEASTAR_LOGGER_TYPE_STDOUT -DSEASTAR_TYPE_ERASE_MORE -DBOOST_NO_CXX98_FUNCTION_BASE -DFMT_SHARED -I/usr/include/p11-kit-1 -ffile-prefix-map=/jenkins/workspace/scylla-master/scylla-ci/scylla=. -march=westmere -DDEBUG -DSANITIZE -DDEBUG_LSA_SANITIZER -DSCYLLA_ENABLE_ERROR_INJECTION -Og -DSCYLLA_BUILD_MODE=debug -g -gz -iquote. -iquote build/debug/gen -std=gnu++20  -ffile-prefix-map=/jenkins/workspace/scylla-master/scylla-ci/scylla=. -march=westmere -DBOOST_ALL_DYN_LINK    -fvisibility=hidden -isystem abseil -Wall -Werror -Wextra -Wimplicit-fallthrough -Wno-mismatched-tags -Wno-c++11-narrowing -Wno-overloaded-virtual -Wno-unused-parameter -Wno-unsupported-friend -Wno-missing-field-initializers -Wno-deprecated-copy -Wno-psabi -Wno-enum-constexpr-conversion -Wno-error=deprecated-declarations -DXXH_PRIVATE_API -DSEASTAR_TESTING_MAIN  -c -o build/debug/replica/database.o replica/database.cc
[2024-05-17T03:25:06.529Z] In file included from replica/database.cc:52:
[2024-05-17T03:25:06.529Z] ./service/storage_proxy.hh:488:30: error: class member cannot be redeclared
[2024-05-17T03:25:06.529Z]     db::view::update_backlog get_view_update_backlog() const;
[2024-05-17T03:25:06.529Z]                              ^
[2024-05-17T03:25:06.529Z] ./service/storage_proxy.hh:234:30: note: previous declaration is here
[2024-05-17T03:25:06.529Z]     db::view::update_backlog get_view_update_backlog() const;
[2024-05-17T03:25:06.529Z]                              ^
[2024-05-17T03:25:06.529Z] 1 error generated.

Build Details:

  • Duration: 4 hr 16 min
  • Builder: spider7.cloudius-systems.com

@wmitros wmitros force-pushed the update-correct-shard-backlog branch 2 times, most recently from 48cb974 to a471fc9 on May 17, 2024 08:41
db/view/view.cc Outdated
@@ -1845,6 +1845,10 @@ future<> view_update_generator::mutate_MV(
});
}

update_backlog view_update_generator::get_view_update_backlog() {
Contributor

there's db/view/view_update_generator.cc for view_update_generator methods

Contributor Author

I saw it but it looks like we didn't move all of its methods there (view_update_generator::mutate_MV is still here). I assumed we didn't want to include storage_proxy.hh in view_update_generator.cc and that's why mutate_MV was left, but after the last rebase I see storage_proxy.hh is already in view_update_generator.cc anyway, so I don't need to put it in db/view/view.cc after all

@scylladb-promoter
Contributor

🔴 CI State: FAILURE

✅ - Build
❌ - Unit Tests Custom
The following new/updated tests ran 100 times for each mode:
🔹 boost/batchlog_manager_test
🔹 boost/database_test
🔹 boost/repair_test
🔹 topology_custom/test_mv_backlog

Failed Tests (2/650):

Build Details:

  • Duration: 7 hr 37 min
  • Builder: i-058b7639f8fae7d95 (r5ad.8xlarge)

@wmitros wmitros force-pushed the update-correct-shard-backlog branch from a471fc9 to a9bd6d3 on May 20, 2024 10:55
@wmitros
Contributor Author

wmitros commented May 20, 2024

🔴 CI State: FAILURE

✅ - Build ❌ - Unit Tests Custom The following new/updated tests ran 100 times for each mode: 🔹 boost/batchlog_manager_test 🔹 boost/database_test 🔹 boost/repair_test 🔹 topology_custom/test_mv_backlog

Failed Tests (2/650):

It might have been something specific to the CI (the test failed on the last, 100th repetition), but it might also have been caused by #18646 (comment), so I added a longer delay to the test.

I updated the PR description with a lot more context. @xemul @nyh please take a look and see if it makes the issue clearer.

@scylladb-promoter
Contributor

🟢 CI State: SUCCESS

✅ - Build
✅ - Unit Tests Custom
The following new/updated tests ran 100 times for each mode:
🔹 boost/batchlog_manager_test
🔹 boost/database_test
🔹 boost/repair_test
🔹 topology_custom/test_mv_backlog
✅ - Container Test
✅ - dtest
✅ - dtest with topology changes
✅ - Unit Tests

Build Details:

  • Duration: 5 hr 13 min
  • Builder: i-0ffbff6ddb0d95fb8 (i4i.8xlarge)

@piodul
Contributor

piodul commented May 21, 2024

@wmitros the links to the code you added in your updated description will not be rendered as code snippets in the merge commit. Please replace them with actual snippets of code.

To get the view update backlog in a response, we need to update
it on the shard which contained the altered data. The write
is redirected to the correct shard in storage_proxy::mutate_locally,
where database::apply is invoked on the correct shard. However,
we don't want to also invoke the storage_proxy::get_view_update_backlog
on the correct shard there, because that would cost us an additional
allocation (a .then() following the db.apply()) on the common write
path. More precisely, as measured with perf-simple-query, a single
operation goes from 59 allocs, 15 tasks and ~52450 instructions to
60 allocs, 16 tasks and ~52950 instructions.

Instead, we can add the backlog update inside the database::apply call.
However, the call shouldn't be directly a part of the apply() method, as
we don't want to refer to the storage_proxy there. It also shouldn't
be a part of the table::push_view_replica_updates() method for the
same reason. As a result, it needs to be returned from
view_update_generator::generate_and_propagate_view_updates(), and
passed as a return value through the call stack.

This patch prepares all methods in the stack by changing their
return value types. For now, the returned values are empty - the
actual value to return will be added in a following patch.
@wmitros wmitros force-pushed the update-correct-shard-backlog branch from a9bd6d3 to ca72c20 on May 21, 2024 14:40
When performing a write, we should update the view update backlog
on the node and shard where the mutation is actually applied, and
potentially also on the coordinator of this write, with the updated
value from the replica.
Currently, instead of updating the backlog on the shard that applies
the mutation, we do it on the coordinator shard of this write, and
as a result, the backlog on the correct shard is not updated.
This patch enables updating the backlog on the correct shard. To
achieve that, the update is performed just after the views are pushed,
in view_update_generator::generate_and_propagate_view_updates on the
correct shard, and later propagated as a return value to be finally
used for updating the max view backlog of the node.
This patch adds a test for reproducing issue scylladb#18542
The test performs writes on a table with a materialized view and
checks that the view backlog increases. To get the current view
update backlog, a new metric "view_update_backlog" is added to
the `storage_proxy` metrics. The metric differs from the `database`
metric of the same name in that it takes the backlog from
max_view_update_backlog, which keeps the (possibly slightly outdated)
view update backlogs of all shards, instead of reading the
view_update_semaphore that the backlog is directly based on.
@wmitros wmitros force-pushed the update-correct-shard-backlog branch from ca72c20 to 9ea788c on May 21, 2024 14:53
@wmitros
Contributor Author

wmitros commented May 21, 2024

@wmitros the links to the code you added in your updated description will not be rendered as code snippets in the merge commit. Please replace them with actual snippets of code.

I've updated the description and fixed the merge conflicts. After the rebase, perf-simple-query shows the same (minor) difference


Successfully merging this pull request may close these issues.

The gossiped view update backlog is updated on the wrong shard
5 participants