
mv: gossip the same backlog if a different backlog was sent in a response #18663

Open · wants to merge 5 commits into base: master
Conversation

@wmitros (Contributor) commented May 14, 2024

Currently, there are two ways of sharing a backlog with other nodes: through
the gossip mechanism, and with responses to replica writes. In gossip, we
check every second whether the backlog changed, and if it did, we update the
other nodes with it. However, if this node's backlog as seen by another node
changed through a write response, the gossiped backlog is currently not
updated, so if the backlog then goes back to the value from the previous
gossip round, it will not be sent and the other node will be left with an
outdated backlog. This can be observed in the following scenario:

  1. Cluster starts, and all nodes gossip their empty view update backlogs to one another
  2. On node N, view_update_backlog_broker (the backlog gossiper) performs an iteration of its backlog update loop, sees no change (the backlog has been empty since the start), and schedules the next iteration after 1s
  3. Within the next 1s, a coordinator (a node other than N) sends a write to N that causes a remote view update (which we do not wait for). As a result, node N replies immediately with an increased view update backlog, which the coordinator records.
  4. Still within that 1s, node N finishes the view update in the background, dropping its view update backlog back to 0.
  5. In the next and following iterations of view_update_backlog_broker on N, the backlog is empty, just as it was in step 2, so no change is detected and no update is sent, due to the check shown below:
// inside the view_update_backlog_broker (backlog gossiper) publish loop:
auto backlog = _sp.local().get_view_update_backlog();
if (backlog_published && *backlog_published == backlog) {
    sleep_abortable(gms::gossiper::INTERVAL, _as).get();
    continue;
}

After this scenario happens, the coordinator keeps information about an increased view update backlog on N even though it is actually already empty.

This patch fixes the issue by notifying the gossiper that a different backlog
was sent in a response, causing it to send the (otherwise unchanged) backlog
to other nodes in the following gossip round.
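
For illustration, a rough sketch of how the gossiper-side check changes (using the get_view_update_backlog_if_changed API introduced further down in this PR; the surrounding loop and comments are simplified, not the actual diff):

```cpp
// Sketch only. The gossiper fiber now asks the proxy whether anything needs
// publishing: either the backlog differs from the last gossiped value, or a
// different backlog was handed out in a write response since the last round.
auto backlog = _sp.local().get_view_update_backlog_if_changed(backlog_published).get();
if (!backlog) {
    // nothing changed and nothing was flagged for publication
    sleep_abortable(gms::gossiper::INTERVAL, _as).get();
    continue;
}
// ...gossip *backlog and remember it as backlog_published...
```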

Fixes: #18461

As with #18646, this patch doesn't change much without admission control (#18334), so I'm marking it as backport/none.

Tests: manual. Currently this patch only affects the length of the MV flow-control delay, which is not reliable enough to base a test on. A proper test will be added when MV admission control is added, so that the test can be based on rejected requests.

@wmitros (Contributor, Author) commented May 14, 2024

Only the last commit belongs to this patch. It is based on #18646 to avoid later conflicts.

@wmitros added the backport/none (Backport is not required) label on May 14, 2024
@scylladb-promoter (Contributor)

🔴 CI State: FAILURE

❌ - Build

Build Details:

  • Duration: 40 min
  • Builder: i-032a01f421ac12cb1 (r5ad.8xlarge)

@wmitros (Contributor, Author) commented May 15, 2024

Rebased on the updated #18646 and fixed the compilation error.

@scylladb-promoter (Contributor)

🔴 CI State: FAILURE

❌ - Build

Build Failure:

[2024-05-15T10:43:13.261Z] FAILED: build/debug/replica/database.o 
[2024-05-15T10:43:13.261Z] clang++ -MD -MT build/debug/replica/database.o -MF build/debug/replica/database.o.d -std=c++20 -I/jenkins/workspace/scylla-master/scylla-ci/scylla/seastar/include -I/jenkins/workspace/scylla-master/scylla-ci/scylla/build/debug/seastar/gen/include -U_FORTIFY_SOURCE -Werror=unused-result -fstack-clash-protection -fsanitize=address -fsanitize=undefined -fno-sanitize=vptr -DSEASTAR_API_LEVEL=7 -DSEASTAR_BUILD_SHARED_LIBS -DSEASTAR_SSTRING -DSEASTAR_LOGGER_COMPILE_TIME_FMT -DSEASTAR_SCHEDULING_GROUPS_COUNT=16 -DSEASTAR_DEBUG -DSEASTAR_DEFAULT_ALLOCATOR -DSEASTAR_SHUFFLE_TASK_QUEUE -DSEASTAR_DEBUG_SHARED_PTR -DSEASTAR_DEBUG_PROMISE -DSEASTAR_LOGGER_TYPE_STDOUT -DSEASTAR_TYPE_ERASE_MORE -DBOOST_NO_CXX98_FUNCTION_BASE -DFMT_SHARED -I/usr/include/p11-kit-1 -ffile-prefix-map=/jenkins/workspace/scylla-master/scylla-ci/scylla=. -march=westmere -DDEBUG -DSANITIZE -DDEBUG_LSA_SANITIZER -DSCYLLA_ENABLE_ERROR_INJECTION -Og -DSCYLLA_BUILD_MODE=debug -g -gz -iquote. -iquote build/debug/gen -std=gnu++20  -ffile-prefix-map=/jenkins/workspace/scylla-master/scylla-ci/scylla=. -march=westmere -DBOOST_ALL_DYN_LINK    -fvisibility=hidden -isystem abseil -Wall -Werror -Wextra -Wimplicit-fallthrough -Wno-mismatched-tags -Wno-c++11-narrowing -Wno-overloaded-virtual -Wno-unused-parameter -Wno-unsupported-friend -Wno-missing-field-initializers -Wno-deprecated-copy -Wno-psabi -Wno-enum-constexpr-conversion -Wno-error=deprecated-declarations -DXXH_PRIVATE_API -DSEASTAR_TESTING_MAIN  -c -o build/debug/replica/database.o replica/database.cc
[2024-05-15T10:43:13.263Z] replica/database.cc:2026:55: error: get_view_update_backlog_for_response is a private member of service::storage_proxy
[2024-05-15T10:43:13.264Z]         backlog = _view_update_generator->get_proxy().get_view_update_backlog_for_response();
[2024-05-15T10:43:13.264Z]                                                       ^
[2024-05-15T10:43:13.264Z] ./service/storage_proxy.hh:442:30: note: declared private here
[2024-05-15T10:43:13.264Z]     db::view::update_backlog get_view_update_backlog_for_response() const;
[2024-05-15T10:43:13.264Z]                              ^
[2024-05-15T10:43:13.264Z] 1 error generated.

Build Details:

  • Duration: 34 min
  • Builder: i-01f580a432a2fd8cf (m5ad.8xlarge)

@wmitros force-pushed the update-gossip-mv branch 2 times, most recently from 6c91864 to 33f4a75 on May 15, 2024, 12:43
@scylladb-promoter (Contributor)

🟢 CI State: SUCCESS

✅ - Build
✅ - Unit Tests Custom
The following new/updated tests ran 100 times for each mode:
🔹 boost/batchlog_manager_test
🔹 boost/database_test
🔹 boost/repair_test
🔹 topology_custom/test_mv_backlog
✅ - Container Test
✅ - dtest
✅ - dtest with topology changes
✅ - Unit Tests

Build Details:

  • Duration: 5 hr 4 min
  • Builder: spider8.cloudius-systems.com

@@ -37,16 +39,20 @@ class node_update_backlog {
    std::chrono::milliseconds _interval;
    std::atomic<clock::time_point> _last_update;
    std::atomic<update_backlog> _max;
    std::vector<std::atomic<need_publishing>> _need_publishing;
Contributor

These need_publishing flags will be placed very close together. Even though each one will be touched by a single shard 99% of the time, they will most likely all land in a single cache line, so performance will suffer from false sharing.

It would be much better to put this into the per_shard_backlog, right after the backlog field. The alignment of the backlog field will prevent the false-sharing issue.
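
For illustration, a minimal sketch of the suggested layout (per_shard_backlog refers to the existing per-shard struct; the stand-in types and the flag's field name here are illustrative):

```cpp
#include <atomic>
#include <cstddef>
#include <seastar/core/cacheline.hh>

// Stand-in types for the sketch; the real ones live in db/view.
struct update_backlog { size_t current = 0; size_t max = 0; };
enum class need_publishing : bool { no = false, yes = true };

// Illustrative layout: each shard's backlog and its need_publishing flag share
// one cache line that no other shard writes to, so setting the flag on one
// shard does not invalidate the line holding another shard's flag.
struct alignas(seastar::cache_line_size) per_shard_backlog {
    std::atomic<update_backlog> backlog;
    std::atomic<need_publishing> need_publishing_flag{need_publishing::no};
};
```

Keeping the flag next to the value it describes also means a single visit to the owning shard can read both consistently.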

Contributor

Moreover, I'm not sure these have to be atomics. You could change add_fetch_for_gossip to perform an invoke_on_all to gather the values from all shards. It only runs once a second, so that's fine.

Contributor Author

Added

Comment on lines 2454 to 2456
std::optional<db::view::update_backlog> storage_proxy::get_view_update_backlog_for_gossip(std::optional<db::view::update_backlog> last_backlog) const {
    return _max_view_update_backlog.add_fetch_for_gossip(this_shard_id(), get_db().local().get_view_update_backlog(), std::move(last_backlog));
}
Contributor

The gossiper runs on shard 0, so the shard parameter is unneeded. It would make more sense to check the shard and call on_internal_error if it is not 0.
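
For illustration, a sketch of the suggested guard applied to the quoted function (the slogger logger name is an assumption; the rest follows the snippet above):

```cpp
std::optional<db::view::update_backlog> storage_proxy::get_view_update_backlog_for_gossip(std::optional<db::view::update_backlog> last_backlog) const {
    if (this_shard_id() != 0) {
        // Gossip runs on shard 0 only; fail loudly on misuse. Logger name assumed.
        on_internal_error(slogger, "get_view_update_backlog_for_gossip called off shard 0");
    }
    return _max_view_update_backlog.add_fetch_for_gossip(this_shard_id(), get_db().local().get_view_update_backlog(), std::move(last_backlog));
}
```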

Contributor Author

Added

db/view/view.cc Outdated
}
need_publishing = np.exchange(need_publishing::no, std::memory_order_relaxed);
}
auto max = add_fetch(shard, backlog);
Contributor

I know that the previous code used to do this, but do we really have to? Wouldn't it be sufficient for the gossiper fiber to only gather data and not update the local shard's data?

Contributor Author

Makes sense

Comment on lines 483 to 485
db::view::update_backlog get_view_update_backlog() const;
db::view::update_backlog get_view_update_backlog_for_response() const;

std::optional<db::view::update_backlog> get_view_update_backlog_for_gossip(std::optional<db::view::update_backlog> last_backlog) const;
Contributor

I don't like these names because they don't really say much about what those functions do. It's not obvious what the requirements of "response" or "gossip" are.

Contributor Author

I changed the names a bit

db/view/view.cc Outdated
Comment on lines 2643 to 2671
    if (!need_publishing && last_backlog && *last_backlog == max) {
        return std::nullopt;
    } else {
        return max;
    }
}
Contributor

Do we have to check against the last backlog at all? If the value has changed, then one of the need_publishing flags must have been updated, no?

@piodul (Contributor) commented May 21, 2024

However if the backlog for this node changed on another
node with a write response, the gossiped backlog is currently not updated,
so if after the response the backlog goes back to the value from the previous
gossip round, it will not get sent and the other node will stay with an
outdated backlog.

Your description is very information-dense and thus difficult to parse. In such cases, an example usually helps in understanding the bug. The example that you already provided in the description of the issue was, in my opinion, quite clear, so you could write an example here based on that.

To get the view update backlog in a response, we need to update
it on the shard which contained the altered data. The write
is redirected to the correct shard in storage_proxy::mutate_locally,
where database::apply is invoked on the correct shard. However,
we don't want to also invoke storage_proxy::get_view_update_backlog
on that shard there, because that would cost us an additional
allocation (a .then() following the db.apply()) on the common write
path. More precisely, as measured with perf-simple-query, a single
operation would go from 59 allocs, 15 tasks and ~52450
instructions to 60 allocs, 16 tasks and ~52950 instructions.

Instead, we can add the backlog update inside the database::apply call.
However, the call shouldn't directly be a part of the apply() method,
as we don't want to refer to the storage_proxy there. It also shouldn't
be a part of the table::push_view_replica_updates() method for the
same reason. As a result, it needs to be returned from
view_update_generator::generate_and_propagate_view_updates(), and
passed as a return value through the call stack.

This patch prepares all methods in the stack by changing their
return value types. For now, the returned values are empty - the
actual value to return will be added in a following patch.

When performing a write, we should update the view update backlog
on the node and shard where the mutation is actually applied, and
potentially also on the coordinator of this write, with the updated
value from the replica.
Currently, instead of updating the backlog on the shard that applies
the mutation, we do it on the coordinator shard of this write, and
as a result, the backlog on the correct shard is not updated.
This patch enables updating the backlog on the correct shard. To
achieve that, the update is performed just after the views are pushed,
in view_update_generator::generate_and_propagate_view_updates on the
correct shard, and later propagated as a return value to be finally
used for updating the max view backlog of the node.

This patch adds a test reproducing issue scylladb#18542.
The test performs writes on a table with a materialized view and
checks that the view backlog increases. To get the current view
update backlog, a new metric "view_update_backlog" is added to
the `storage_proxy` metrics. It differs from the `database` metric
of the same name in that it takes the backlog from
max_view_update_backlog, which holds the (possibly slightly stale)
view update backlogs of all shards, instead of reading the backlog
directly from the view_update_semaphore that it is based on.
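
For illustration, a rough sketch of how such a gauge could be registered with the seastar metrics API (only the metric group and name come from the commit message; the member function name, the _metrics group object, and the update_backlog::current accessor are assumptions):

```cpp
#include <seastar/core/metrics.hh>

namespace sm = seastar::metrics;

// Hypothetical registration sketch -- not the actual patch.
void storage_proxy::register_view_update_backlog_metric() {
    _metrics.add_group("storage_proxy", {
        sm::make_gauge("view_update_backlog",
            // Assumes update_backlog exposes its current size as `current`.
            [this] { return get_view_update_backlog().current; },
            sm::description("View update backlog as tracked by max_view_update_backlog "
                            "(aggregated over shards, possibly slightly stale)")),
    });
}
```
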
@wmitros (Contributor, Author) commented May 21, 2024

Your description is very information-dense and thus difficult to parse. In such cases, an example usually helps in understanding the bug. The example that you already provided in the description of the issue was, in my opinion, quite clear, so you could write an example here based on that.

I've mostly copied the example; hopefully that's good enough.

Also, I rebased to fix the merge conflicts. At the same time, I added a commit to this PR for better readability, so the patch now consists of the last 2 commits, not only the last one.

@scylladb-promoter (Contributor)

🔴 CI State: FAILURE

❌ - Build

Build Details:

  • Duration: 4 hr 36 min
  • Builder: i-0656e49f3aa4a8738 (m5d.8xlarge)

Currently, we only update the backlogs in node_update_backlog at the
same time as we fetch them. This is done using storage_proxy's
get_view_update_backlog method, which is confusing because it's a
getter with side effects. Additionally, we don't always want to update
the backlog when we're reading it (as in gossip, which runs only on
shard 0), and we don't always want to read it when we're updating it
(when we're not handling any writes but the backlog drops because
background work finished).

This patch splits both node_update_backlog::add_fetch and
storage_proxy::get_view_update_backlog into two methods each: one for
updating and one for reading the backlog. This patch only replaces the
places where the view backlog getter is currently used; more situations
where we should get/update the backlog should be considered in a
following patch.
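
A minimal, illustrative sketch of the getter/updater split described in this commit (simplified stand-alone types and hypothetical method names; the real class also rate-limits recomputation and, after this PR, tracks need_publishing):

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <vector>

class node_update_backlog_sketch {
    std::vector<std::atomic<size_t>> _backlogs;  // one entry per shard
public:
    explicit node_update_backlog_sketch(size_t shards) : _backlogs(shards) {}

    // Called on the write path: records this shard's backlog, nothing else.
    void update(unsigned shard, size_t backlog) {
        _backlogs[shard].store(backlog, std::memory_order_relaxed);
    }

    // Called by readers (e.g. the gossiper on shard 0): returns the node-wide
    // maximum without modifying any per-shard state.
    size_t fetch() const {
        size_t max = 0;
        for (const auto& b : _backlogs) {
            max = std::max(max, b.load(std::memory_order_relaxed));
        }
        return max;
    }
};
```
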
mv: gossip the same backlog if a different backlog was sent in a response

Currently, there are two ways of sharing a backlog with other nodes:
through the gossip mechanism, and with responses to replica writes.
In gossip, we check every second whether the backlog changed, and if
it did, we update the other nodes with it. However, if this node's
backlog as seen by another node changed through a write response, the
gossiped backlog is currently not updated, so if the backlog then goes
back to the value from the previous gossip round, it will not be sent
and the other node will be left with an outdated backlog.

This patch changes this by notifying the gossiper that a different
backlog was sent in a response, causing it to send the (otherwise
unchanged) backlog to other nodes in the following gossip round.

Fixes: scylladb#18461
@scylladb-promoter (Contributor)

🔴 CI State: FAILURE

❌ - Build

Build Failure:

[2024-05-22T08:02:37.215Z] FAILED: build/dev/db/view/node_view_update_backlog.hh.o 
[2024-05-22T08:02:37.215Z] clang++ -MD -MT build/dev/db/view/node_view_update_backlog.hh.o -MF build/dev/db/view/node_view_update_backlog.hh.o.d -std=c++20 -I/jenkins/workspace/scylla-master/scylla-ci/scylla/seastar/include -I/jenkins/workspace/scylla-master/scylla-ci/scylla/build/dev/seastar/gen/include -U_FORTIFY_SOURCE -Werror=unused-result -fstack-clash-protection -DSEASTAR_API_LEVEL=7 -DSEASTAR_BUILD_SHARED_LIBS -DSEASTAR_SSTRING -DSEASTAR_ENABLE_ALLOC_FAILURE_INJECTION -DSEASTAR_LOGGER_COMPILE_TIME_FMT -DSEASTAR_SCHEDULING_GROUPS_COUNT=16 -DSEASTAR_LOGGER_TYPE_STDOUT -DSEASTAR_TYPE_ERASE_MORE -DBOOST_NO_CXX98_FUNCTION_BASE -DFMT_SHARED -I/usr/include/p11-kit-1 -std=gnu++20  -ffile-prefix-map=/jenkins/workspace/scylla-master/scylla-ci/scylla=. -march=westmere -DBOOST_ALL_DYN_LINK    -fvisibility=hidden -isystem abseil -Wall -Werror -Wextra -Wimplicit-fallthrough -Wno-mismatched-tags -Wno-c++11-narrowing -Wno-overloaded-virtual -Wno-unused-parameter -Wno-unsupported-friend -Wno-missing-field-initializers -Wno-deprecated-copy -Wno-psabi -Wno-enum-constexpr-conversion -Wno-error=deprecated-declarations -DXXH_PRIVATE_API -DSEASTAR_TESTING_MAIN -ffile-prefix-map=/jenkins/workspace/scylla-master/scylla-ci/scylla=. -march=westmere -DDEVEL -DSEASTAR_ENABLE_ALLOC_FAILURE_INJECTION -DSCYLLA_ENABLE_ERROR_INJECTION -DSCYLLA_ENABLE_PREEMPTION_SOURCE -O2 -DSCYLLA_BUILD_MODE=dev -iquote. -iquote build/dev/gen  --include db/view/node_view_update_backlog.hh -c -o build/dev/db/view/node_view_update_backlog.hh.o build/dev/gen/empty.cc
[2024-05-22T08:02:37.216Z] In file included from <built-in>:1:
[2024-05-22T08:02:37.216Z] ./db/view/node_view_update_backlog.hh:55:5: error: no template named future; did you mean seastar::future?
[2024-05-22T08:02:37.217Z]     future<std::optional<update_backlog>> fetch_if_changed(std::optional<db::view::update_backlog> last_backlog);
[2024-05-22T08:02:37.217Z]     ^~~~~~
[2024-05-22T08:02:37.217Z]     seastar::future
[2024-05-22T08:02:37.217Z] /jenkins/workspace/scylla-master/scylla-ci/scylla/seastar/include/seastar/core/future.hh:1242:21: note: seastar::future declared here
[2024-05-22T08:02:37.217Z] class [[nodiscard]] future : private internal::future_base {
[2024-05-22T08:02:37.217Z]                     ^
[2024-05-22T08:02:37.217Z] 1 error generated.

Build Details:

  • Duration: 9 hr 46 min
  • Builder: i-011fe52b636329a1d (m5d.12xlarge)

@piodul (Contributor) left a comment

The description looks better, thanks.

I reviewed the last commit. I left some comments. Please fix the nits, the compilation error and, if possible, rebase on master so that this commit does not require #18646.

Otherwise, looks good.

});

auto max = fetch();
if (!np && last_backlog && *last_backlog == max) {
Contributor

I asked about this before, but the comment was resolved.

Isn't it sufficient to just check:

if (!np) {

?

Previously, we used to compare the last backlog vs the current one in order to decide whether it needs publishing. Now, we have the need_publishing flag. Is the last_backlog && *last_backlog == max still necessary?

Contributor

The whole if..else could be written as a ternary operator:

return np ? max : std::nullopt;

I didn't try to compile it; there's a chance you'll get a type error because the two branches don't have the same type, but it would be nice to try.

Contributor Author

I asked about this before, but the comment was resolved.

Isn't it sufficient to just check:

if (!np) {

?

Previously, we used to compare the last backlog vs the current one in order to decide whether it needs publishing. Now, we have the need_publishing flag. Is the last_backlog && *last_backlog == max still necessary?

After #18804 is merged and need_publishing is also updated in that context, the last_backlog && *last_backlog == max check won't be needed anymore. Without #18804, the need_publishing flag is not set when the backlog drops, so we have no way of detecting that other than comparing with the last sent backlog.

Comment on lines +2653 to +2657
if (_backlogs[shard].need_publishing) {
    _backlogs[shard].need_publishing = need_publishing::no;
    return need_publishing::yes;
}
return need_publishing::no;
Contributor

Nit: you can use std::exchange to make the body of the lambda nice and short:

return std::exchange(_backlogs[shard].need_publishing, need_publishing::no);

@@ -235,6 +235,8 @@ public:
    // Get information about this node's view update backlog. It combines information from all local shards.
    db::view::update_backlog get_view_update_backlog();

    future<std::optional<db::view::update_backlog>> get_view_update_backlog_if_changed(std::optional<db::view::update_backlog> last_backlog);
Contributor

Nit: you require that this function be called on shard 0 only. Please put this information in a comment near the signature so that future users are not surprised.

return a || b;
});

auto max = fetch();
Contributor

This will be prone to races. Consider the following scenario:

  • _backlog is updated on shard 1 and the need_publishing flag is set,
  • _last_update is recalculated; the next recalculation will happen 10ms from now or later,
  • 5ms pass,
  • fetch_if_changed runs and clears the need_publishing flag on shard 1, but uses fetch to read the old value,
  • There are no MV writes anymore, and we never published the updated backlog from shard 1.

You should collect the backlogs in the submit_to lambda and compute the maximum in the reducer lambda (while still collecting the np flag in addition to this).
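
For illustration, a hedged sketch of the suggested gathering (not the actual patch): both the backlog and the flag are read inside the same submit_to, so the flag is never cleared without its shard's current backlog being taken into account. It assumes the Scylla context discussed in this thread: update_backlog with operator< / operator== and a no_backlog() factory, per-shard _backlogs entries with plain (non-atomic) backlog and need_publishing members only touched by their owning shard, and the fetch_if_changed signature from the build log above; the flag is shown as a plain bool here.

```cpp
#include <utility>
#include <seastar/core/coroutine.hh>
#include <seastar/core/map_reduce.hh>
#include <seastar/core/smp.hh>

// Per-shard sample gathered atomically with respect to that shard's fiber.
struct shard_sample {
    db::view::update_backlog backlog;
    bool need_publishing;
};

seastar::future<std::optional<db::view::update_backlog>>
node_update_backlog::fetch_if_changed(std::optional<db::view::update_backlog> last_backlog) {
    auto sample = co_await seastar::map_reduce(seastar::smp::all_cpus(),
        [this] (unsigned shard) {
            return seastar::smp::submit_to(shard, [this, shard] {
                // Clear the flag and read the backlog on the owning shard, together.
                bool np = std::exchange(_backlogs[shard].need_publishing, false);
                return shard_sample{_backlogs[shard].backlog, np};
            });
        },
        shard_sample{db::view::update_backlog::no_backlog(), false},
        [] (shard_sample a, shard_sample b) {
            return shard_sample{std::max(a.backlog, b.backlog),
                                a.need_publishing || b.need_publishing};
        });
    if (!sample.need_publishing && last_backlog && *last_backlog == sample.backlog) {
        co_return std::nullopt; // nothing changed since the last gossip round
    }
    co_return sample.backlog;
}
```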

Successfully merging this pull request may close these issues.

View update backlog may not be gossiped if it's already been updated with a response