
mv: gossip the same backlog if a different backlog was sent in a response #18663

Open · wants to merge 5 commits into base: master
Conversation

@wmitros (Contributor) commented May 14, 2024

Currently, there are two ways of sharing a backlog with other nodes: through
the gossip mechanism, and with responses to replica writes. In gossip, we
check every second whether the backlog changed, and if it did, we update the
other nodes with it. However, if this node's backlog as seen by another node
changed through a write response, the gossiped backlog is currently not
updated, so if the backlog then goes back to the value from the previous
gossip round, it will not be sent and the other node will be left with an
outdated backlog. This can be observed in the following scenario:

  1. Cluster starts, and all nodes gossip their empty view update backlogs to one another
  2. On node N, view_update_backlog_broker (the backlog gossiper) performs an iteration of its backlog update loop, sees no change (the backlog has been empty since the start), and schedules the next iteration after 1s
  3. Within the next 1s, a coordinator (a node other than N) sends a write to N that causes a remote view update (which we do not wait for). As a result, node N replies immediately with an increased view update backlog, which the coordinator records.
  4. Still within that 1s, node N finishes the view update in the background, dropping its view update backlog back to 0.
  5. In the next and following iterations of view_update_backlog_broker on N, the backlog is empty, just as it was in step 2, so no change is detected and no update is sent, due to the check shown below:
// inside the view_update_backlog_broker (backlog gossiper) publish loop:
auto backlog = _sp.local().get_view_update_backlog();
if (backlog_published && *backlog_published == backlog) {
    sleep_abortable(gms::gossiper::INTERVAL, _as).get();
    continue;
}

After this scenario happens, the coordinator keeps information about an increased view update backlog on N even though it is actually already empty.

This patch fixes the issue by notifying the gossiper that a different backlog
was sent in a response, causing it to send the (otherwise unchanged) backlog
to other nodes in the following gossip round.
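
For illustration, a rough sketch of how the gossiper-side check changes (using the get_view_update_backlog_if_changed API introduced further down in this PR; the surrounding loop and comments are simplified, not the actual diff):

```cpp
// Sketch only. The gossiper fiber now asks the proxy whether anything needs
// publishing: either the backlog differs from the last gossiped value, or a
// different backlog was handed out in a write response since the last round.
auto backlog = _sp.local().get_view_update_backlog_if_changed(backlog_published).get();
if (!backlog) {
    // nothing changed and nothing was flagged for publication
    sleep_abortable(gms::gossiper::INTERVAL, _as).get();
    continue;
}
// ...gossip *backlog and remember it as backlog_published...
```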

Fixes: #18461

As with #18646, this patch doesn't change much without admission control (#18334), so I'm marking it as backport/none.

Tests: manual. Currently this patch only affects the length of the MV flow-control delay, which is not reliable enough to base a test on. A proper test will be added when MV admission control is added, so that the test can be based on rejected requests.

@wmitros (Contributor, Author) commented May 14, 2024

Only the last commit belongs to this patch. It is based on #18646 to avoid later conflicts.

@wmitros added the backport/none (Backport is not required) label on May 14, 2024
@scylladb-promoter (Contributor)

🔴 CI State: FAILURE

❌ - Build

Build Details:

  • Duration: 40 min
  • Builder: i-032a01f421ac12cb1 (r5ad.8xlarge)

@wmitros (Contributor, Author) commented May 15, 2024

Rebased on the updated #18646 and fixed the compilation error.

@scylladb-promoter (Contributor)

🔴 CI State: FAILURE

❌ - Build

Build Failure:

[2024-05-15T10:43:13.261Z] FAILED: build/debug/replica/database.o 
[2024-05-15T10:43:13.261Z] clang++ -MD -MT build/debug/replica/database.o -MF build/debug/replica/database.o.d -std=c++20 -I/jenkins/workspace/scylla-master/scylla-ci/scylla/seastar/include -I/jenkins/workspace/scylla-master/scylla-ci/scylla/build/debug/seastar/gen/include -U_FORTIFY_SOURCE -Werror=unused-result -fstack-clash-protection -fsanitize=address -fsanitize=undefined -fno-sanitize=vptr -DSEASTAR_API_LEVEL=7 -DSEASTAR_BUILD_SHARED_LIBS -DSEASTAR_SSTRING -DSEASTAR_LOGGER_COMPILE_TIME_FMT -DSEASTAR_SCHEDULING_GROUPS_COUNT=16 -DSEASTAR_DEBUG -DSEASTAR_DEFAULT_ALLOCATOR -DSEASTAR_SHUFFLE_TASK_QUEUE -DSEASTAR_DEBUG_SHARED_PTR -DSEASTAR_DEBUG_PROMISE -DSEASTAR_LOGGER_TYPE_STDOUT -DSEASTAR_TYPE_ERASE_MORE -DBOOST_NO_CXX98_FUNCTION_BASE -DFMT_SHARED -I/usr/include/p11-kit-1 -ffile-prefix-map=/jenkins/workspace/scylla-master/scylla-ci/scylla=. -march=westmere -DDEBUG -DSANITIZE -DDEBUG_LSA_SANITIZER -DSCYLLA_ENABLE_ERROR_INJECTION -Og -DSCYLLA_BUILD_MODE=debug -g -gz -iquote. -iquote build/debug/gen -std=gnu++20  -ffile-prefix-map=/jenkins/workspace/scylla-master/scylla-ci/scylla=. -march=westmere -DBOOST_ALL_DYN_LINK    -fvisibility=hidden -isystem abseil -Wall -Werror -Wextra -Wimplicit-fallthrough -Wno-mismatched-tags -Wno-c++11-narrowing -Wno-overloaded-virtual -Wno-unused-parameter -Wno-unsupported-friend -Wno-missing-field-initializers -Wno-deprecated-copy -Wno-psabi -Wno-enum-constexpr-conversion -Wno-error=deprecated-declarations -DXXH_PRIVATE_API -DSEASTAR_TESTING_MAIN  -c -o build/debug/replica/database.o replica/database.cc
[2024-05-15T10:43:13.263Z] replica/database.cc:2026:55: error: get_view_update_backlog_for_response is a private member of service::storage_proxy
[2024-05-15T10:43:13.264Z]         backlog = _view_update_generator->get_proxy().get_view_update_backlog_for_response();
[2024-05-15T10:43:13.264Z]                                                       ^
[2024-05-15T10:43:13.264Z] ./service/storage_proxy.hh:442:30: note: declared private here
[2024-05-15T10:43:13.264Z]     db::view::update_backlog get_view_update_backlog_for_response() const;
[2024-05-15T10:43:13.264Z]                              ^
[2024-05-15T10:43:13.264Z] 1 error generated.

Build Details:

  • Duration: 34 min
  • Builder: i-01f580a432a2fd8cf (m5ad.8xlarge)

@wmitros force-pushed the update-gossip-mv branch 2 times, most recently from 6c91864 to 33f4a75 on May 15, 2024, 12:43
@scylladb-promoter (Contributor)

🟢 CI State: SUCCESS

✅ - Build
✅ - Unit Tests Custom
The following new/updated tests ran 100 times for each mode:
🔹 boost/batchlog_manager_test
🔹 boost/database_test
🔹 boost/repair_test
🔹 topology_custom/test_mv_backlog
✅ - Container Test
✅ - dtest
✅ - dtest with topology changes
✅ - Unit Tests

Build Details:

  • Duration: 5 hr 4 min
  • Builder: spider8.cloudius-systems.com

@@ -37,16 +39,20 @@ class node_update_backlog {
    std::chrono::milliseconds _interval;
    std::atomic<clock::time_point> _last_update;
    std::atomic<update_backlog> _max;
    std::vector<std::atomic<need_publishing>> _need_publishing;
Contributor

These need_publishing flags will be placed very close together. Even though each one will be touched by a single shard 99% of the time, they will most likely all land in a single cache line, so performance will suffer from false sharing.

It would be much better to put this into the per_shard_backlog, right after the backlog field. The alignment of the backlog field will prevent the false-sharing issue.
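
For illustration, a minimal sketch of the suggested layout (per_shard_backlog refers to the existing per-shard struct; the stand-in types and the flag's field name here are illustrative):

```cpp
#include <atomic>
#include <cstddef>
#include <seastar/core/cacheline.hh>

// Stand-in types for the sketch; the real ones live in db/view.
struct update_backlog { size_t current = 0; size_t max = 0; };
enum class need_publishing : bool { no = false, yes = true };

// Illustrative layout: each shard's backlog and its need_publishing flag share
// one cache line that no other shard writes to, so setting the flag on one
// shard does not invalidate the line holding another shard's flag.
struct alignas(seastar::cache_line_size) per_shard_backlog {
    std::atomic<update_backlog> backlog;
    std::atomic<need_publishing> need_publishing_flag{need_publishing::no};
};
```

Keeping the flag next to the value it describes also means a single visit to the owning shard can read both consistently.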

Contributor

Moreover, I'm not sure these have to be atomics. You could change add_fetch_for_gossip to perform an invoke_on_all to gather the values from all shards. It only runs once a second, so that's fine.

Contributor Author

Added

Comment on lines 2454 to 2456
std::optional<db::view::update_backlog> storage_proxy::get_view_update_backlog_for_gossip(std::optional<db::view::update_backlog> last_backlog) const {
    return _max_view_update_backlog.add_fetch_for_gossip(this_shard_id(), get_db().local().get_view_update_backlog(), std::move(last_backlog));
}
Contributor

The gossiper runs on shard 0, so the shard parameter is unneeded. It would make more sense to check the shard and call on_internal_error if it is not 0.
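
For illustration, a sketch of the suggested guard applied to the quoted function (the slogger logger name is an assumption; the rest follows the snippet above):

```cpp
std::optional<db::view::update_backlog> storage_proxy::get_view_update_backlog_for_gossip(std::optional<db::view::update_backlog> last_backlog) const {
    if (this_shard_id() != 0) {
        // Gossip runs on shard 0 only; fail loudly on misuse. Logger name assumed.
        on_internal_error(slogger, "get_view_update_backlog_for_gossip called off shard 0");
    }
    return _max_view_update_backlog.add_fetch_for_gossip(this_shard_id(), get_db().local().get_view_update_backlog(), std::move(last_backlog));
}
```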

Contributor Author

Added

db/view/view.cc Outdated
}
need_publishing = np.exchange(need_publishing::no, std::memory_order_relaxed);
}
auto max = add_fetch(shard, backlog);
Contributor

I know that the previous code used to do this, but do we really have to? Wouldn't it be sufficient for the gossiper fiber to only gather data and not update the local shard's data?

Contributor Author

Makes sense

Comment on lines 483 to 485
db::view::update_backlog get_view_update_backlog() const;
db::view::update_backlog get_view_update_backlog_for_response() const;

std::optional<db::view::update_backlog> get_view_update_backlog_for_gossip(std::optional<db::view::update_backlog> last_backlog) const;
Contributor

I don't like these names because they don't really say much about what those functions do. It's not obvious what the requirements of "response" or "gossip" are.

Contributor Author

I changed the names a bit

db/view/view.cc Outdated
Comment on lines 2643 to 2671
    if (!need_publishing && last_backlog && *last_backlog == max) {
        return std::nullopt;
    } else {
        return max;
    }
}
Contributor

Do we have to check against the last backlog at all? If the value has changed, then one of the need_publishing flags must have been updated, no?

@piodul (Contributor) commented May 21, 2024

However if the backlog for this node changed on another
node with a write response, the gossiped backlog is currently not updated,
so if after the response the backlog goes back to the value from the previous
gossip round, it will not get sent and the other node will stay with an
outdated backlog.

Your description is very information-dense and thus difficult to parse. In such cases, an example usually helps in understanding the bug. The example that you already provided in the description of the issue was, in my opinion, quite clear, so you could write an example here based on that.

To get the view update backlog in a response, we need to update
it on the shard which contained the altered data. The write
is redirected to the correct shard in storage_proxy::mutate_locally,
where database::apply is invoked on the correct shard. However,
we don't want to also invoke storage_proxy::get_view_update_backlog
on that shard there, because that would cost us an additional
allocation (a .then() following the db.apply()) on the common write
path. More precisely, as measured with perf-simple-query, a single
operation would go from 59 allocs, 15 tasks and ~52450
instructions to 60 allocs, 16 tasks and ~52950 instructions.

Instead, we can add the backlog update inside the database::apply call.
However, the call shouldn't directly be a part of the apply() method,
as we don't want to refer to the storage_proxy there. It also shouldn't
be a part of the table::push_view_replica_updates() method for the
same reason. As a result, it needs to be returned from
view_update_generator::generate_and_propagate_view_updates(), and
passed as a return value through the call stack.

This patch prepares all methods in the stack by changing their
return value types. For now, the returned values are empty - the
actual value to return will be added in a following patch.

When performing a write, we should update the view update backlog
on the node and shard where the mutation is actually applied, and
potentially also on the coordinator of this write, with the updated
value from the replica.
Currently, instead of updating the backlog on the shard that applies
the mutation, we do it on the coordinator shard of this write, and
as a result, the backlog on the correct shard is not updated.
This patch enables updating the backlog on the correct shard. To
achieve that, the update is performed just after the views are pushed,
in view_update_generator::generate_and_propagate_view_updates on the
correct shard, and later propagated as a return value to be finally
used for updating the max view backlog of the node.

This patch adds a test reproducing issue scylladb#18542.
The test performs writes on a table with a materialized view and
checks that the view backlog increases. To get the current view
update backlog, a new metric "view_update_backlog" is added to
the `storage_proxy` metrics. It differs from the `database` metric
of the same name in that it takes the backlog from
max_view_update_backlog, which holds the (possibly slightly stale)
view update backlogs of all shards, instead of reading the backlog
directly from the view_update_semaphore that it is based on.
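
For illustration, a rough sketch of how such a gauge could be registered with the seastar metrics API (only the metric group and name come from the commit message; the member function name, the _metrics group object, and the update_backlog::current accessor are assumptions):

```cpp
#include <seastar/core/metrics.hh>

namespace sm = seastar::metrics;

// Hypothetical registration sketch -- not the actual patch.
void storage_proxy::register_view_update_backlog_metric() {
    _metrics.add_group("storage_proxy", {
        sm::make_gauge("view_update_backlog",
            // Assumes update_backlog exposes its current size as `current`.
            [this] { return get_view_update_backlog().current; },
            sm::description("View update backlog as tracked by max_view_update_backlog "
                            "(aggregated over shards, possibly slightly stale)")),
    });
}
```
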
@wmitros (Contributor, Author) commented May 21, 2024

Your description is very information-dense and thus difficult to parse. In such cases, an example usually helps in understanding the bug. The example that you already provided in the description of the issue was, in my opinion, quite clear, so you could write an example here based on that.

I've mostly copied the example; hopefully that's good enough.

Also, I rebased to fix the merge conflicts. At the same time, I added a commit to this PR for better readability, so the patch now consists of the last 2 commits, not only the last one.

@scylladb-promoter (Contributor)

🔴 CI State: FAILURE

❌ - Build

Build Details:

  • Duration: 4 hr 36 min
  • Builder: i-0656e49f3aa4a8738 (m5d.8xlarge)

Currently, we only update the backlogs in node_update_backlog at the
same time as we fetch them. This is done using storage_proxy's
get_view_update_backlog method, which is confusing because it's a
getter with side effects. Additionally, we don't always want to update
the backlog when we're reading it (as in gossip, which runs only on
shard 0), and we don't always want to read it when we're updating it
(when we're not handling any writes but the backlog drops because
background work finished).

This patch splits both node_update_backlog::add_fetch and
storage_proxy::get_view_update_backlog into two methods each: one for
updating and one for reading the backlog. This patch only replaces the
places where the view backlog getter is currently used; more situations
where we should get/update the backlog should be considered in a
following patch.
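
A minimal, illustrative sketch of the getter/updater split described in this commit (simplified stand-alone types and hypothetical method names; the real class also rate-limits recomputation and, after this PR, tracks need_publishing):

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <vector>

class node_update_backlog_sketch {
    std::vector<std::atomic<size_t>> _backlogs;  // one entry per shard
public:
    explicit node_update_backlog_sketch(size_t shards) : _backlogs(shards) {}

    // Called on the write path: records this shard's backlog, nothing else.
    void update(unsigned shard, size_t backlog) {
        _backlogs[shard].store(backlog, std::memory_order_relaxed);
    }

    // Called by readers (e.g. the gossiper on shard 0): returns the node-wide
    // maximum without modifying any per-shard state.
    size_t fetch() const {
        size_t max = 0;
        for (const auto& b : _backlogs) {
            max = std::max(max, b.load(std::memory_order_relaxed));
        }
        return max;
    }
};
```
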
mv: gossip the same backlog if a different backlog was sent in a response

Currently, there are two ways of sharing a backlog with other nodes:
through the gossip mechanism, and with responses to replica writes.
In gossip, we check every second whether the backlog changed, and if
it did, we update the other nodes with it. However, if this node's
backlog as seen by another node changed through a write response, the
gossiped backlog is currently not updated, so if the backlog then goes
back to the value from the previous gossip round, it will not be sent
and the other node will be left with an outdated backlog.

This patch changes this by notifying the gossiper that a different
backlog was sent in a response, causing it to send the (otherwise
unchanged) backlog to other nodes in the following gossip round.

Fixes: scylladb#18461
@scylladb-promoter (Contributor)

🔴 CI State: FAILURE

❌ - Build

Build Failure:

[2024-05-22T08:02:37.215Z] FAILED: build/dev/db/view/node_view_update_backlog.hh.o 
[2024-05-22T08:02:37.215Z] clang++ -MD -MT build/dev/db/view/node_view_update_backlog.hh.o -MF build/dev/db/view/node_view_update_backlog.hh.o.d -std=c++20 -I/jenkins/workspace/scylla-master/scylla-ci/scylla/seastar/include -I/jenkins/workspace/scylla-master/scylla-ci/scylla/build/dev/seastar/gen/include -U_FORTIFY_SOURCE -Werror=unused-result -fstack-clash-protection -DSEASTAR_API_LEVEL=7 -DSEASTAR_BUILD_SHARED_LIBS -DSEASTAR_SSTRING -DSEASTAR_ENABLE_ALLOC_FAILURE_INJECTION -DSEASTAR_LOGGER_COMPILE_TIME_FMT -DSEASTAR_SCHEDULING_GROUPS_COUNT=16 -DSEASTAR_LOGGER_TYPE_STDOUT -DSEASTAR_TYPE_ERASE_MORE -DBOOST_NO_CXX98_FUNCTION_BASE -DFMT_SHARED -I/usr/include/p11-kit-1 -std=gnu++20  -ffile-prefix-map=/jenkins/workspace/scylla-master/scylla-ci/scylla=. -march=westmere -DBOOST_ALL_DYN_LINK    -fvisibility=hidden -isystem abseil -Wall -Werror -Wextra -Wimplicit-fallthrough -Wno-mismatched-tags -Wno-c++11-narrowing -Wno-overloaded-virtual -Wno-unused-parameter -Wno-unsupported-friend -Wno-missing-field-initializers -Wno-deprecated-copy -Wno-psabi -Wno-enum-constexpr-conversion -Wno-error=deprecated-declarations -DXXH_PRIVATE_API -DSEASTAR_TESTING_MAIN -ffile-prefix-map=/jenkins/workspace/scylla-master/scylla-ci/scylla=. -march=westmere -DDEVEL -DSEASTAR_ENABLE_ALLOC_FAILURE_INJECTION -DSCYLLA_ENABLE_ERROR_INJECTION -DSCYLLA_ENABLE_PREEMPTION_SOURCE -O2 -DSCYLLA_BUILD_MODE=dev -iquote. -iquote build/dev/gen  --include db/view/node_view_update_backlog.hh -c -o build/dev/db/view/node_view_update_backlog.hh.o build/dev/gen/empty.cc
[2024-05-22T08:02:37.216Z] In file included from <built-in>:1:
[2024-05-22T08:02:37.216Z] ./db/view/node_view_update_backlog.hh:55:5: error: no template named future; did you mean seastar::future?
[2024-05-22T08:02:37.217Z]     future<std::optional<update_backlog>> fetch_if_changed(std::optional<db::view::update_backlog> last_backlog);
[2024-05-22T08:02:37.217Z]     ^~~~~~
[2024-05-22T08:02:37.217Z]     seastar::future
[2024-05-22T08:02:37.217Z] /jenkins/workspace/scylla-master/scylla-ci/scylla/seastar/include/seastar/core/future.hh:1242:21: note: seastar::future declared here
[2024-05-22T08:02:37.217Z] class [[nodiscard]] future : private internal::future_base {
[2024-05-22T08:02:37.217Z]                     ^
[2024-05-22T08:02:37.217Z] 1 error generated.

Build Details:

  • Duration: 9 hr 46 min
  • Builder: i-011fe52b636329a1d (m5d.12xlarge)

@piodul (Contributor) left a comment

The description looks better, thanks.

I reviewed the last commit. I left some comments. Please fix the nits, the compilation error and, if possible, rebase on master so that this commit does not require #18646.

Otherwise, looks good.

});

auto max = fetch();
if (!np && last_backlog && *last_backlog == max) {
Contributor

I asked about this before, but the comment was resolved.

Isn't it sufficient to just check:

if (!np) {

?

Previously, we used to compare the last backlog vs the current one in order to decide whether it needs publishing. Now, we have the need_publishing flag. Is the last_backlog && *last_backlog == max still necessary?

Contributor

The whole if..else could be written as a ternary operator:

return np ? max : std::nullopt;

I didn't try to compile it; there's a chance you'll get a type error because the two branches don't have the same type, but it would be nice to try.

Contributor Author

I asked about this before, but the comment was resolved.

Isn't it sufficient to just check:

if (!np) {

?

Previously, we used to compare the last backlog vs the current one in order to decide whether it needs publishing. Now, we have the need_publishing flag. Is the last_backlog && *last_backlog == max still necessary?

After #18804 is merged and need_publishing is also updated in that context, the last_backlog && *last_backlog == max check won't be needed anymore. Without #18804, the need_publishing flag is not set when the backlog drops, so we have no way of detecting that other than comparing with the last sent backlog.

Comment on lines +2653 to +2657
if (_backlogs[shard].need_publishing) {
    _backlogs[shard].need_publishing = need_publishing::no;
    return need_publishing::yes;
}
return need_publishing::no;
Contributor

Nit: you can use std::exchange to make the body of the lambda nice and short:

return std::exchange(_backlogs[shard].need_publishing, need_publishing::no);

@@ -235,6 +235,8 @@ public:
    // Get information about this node's view update backlog. It combines information from all local shards.
    db::view::update_backlog get_view_update_backlog();

    future<std::optional<db::view::update_backlog>> get_view_update_backlog_if_changed(std::optional<db::view::update_backlog> last_backlog);
Contributor

Nit: you require that this function be called on shard 0 only. Please put this information in a comment near the signature so that future users are not surprised.

return a || b;
});

auto max = fetch();
Contributor

This will be prone to races. Consider the following scenario:

  • _backlog is updated on shard 1 and the need_publishing flag is set,
  • _last_update is recalculated; the next recalculation will happen 10ms from now or later,
  • 5ms pass,
  • fetch_if_changed runs and clears the need_publishing flag on shard 1, but uses fetch to read the old value,
  • There are no MV writes anymore, and we never published the updated backlog from shard 1.

You should collect the backlogs in the submit_to lambda and compute the maximum in the reducer lambda (while still collecting the np flag in addition to this).
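
For illustration, a hedged sketch of the suggested gathering (not the actual patch): both the backlog and the flag are read inside the same submit_to, so the flag is never cleared without its shard's current backlog being taken into account. It assumes the Scylla context discussed in this thread: update_backlog with operator< / operator== and a no_backlog() factory, per-shard _backlogs entries with plain (non-atomic) backlog and need_publishing members only touched by their owning shard, and the fetch_if_changed signature from the build log above; the flag is shown as a plain bool here.

```cpp
#include <utility>
#include <seastar/core/coroutine.hh>
#include <seastar/core/map_reduce.hh>
#include <seastar/core/smp.hh>

// Per-shard sample gathered atomically with respect to that shard's fiber.
struct shard_sample {
    db::view::update_backlog backlog;
    bool need_publishing;
};

seastar::future<std::optional<db::view::update_backlog>>
node_update_backlog::fetch_if_changed(std::optional<db::view::update_backlog> last_backlog) {
    auto sample = co_await seastar::map_reduce(seastar::smp::all_cpus(),
        [this] (unsigned shard) {
            return seastar::smp::submit_to(shard, [this, shard] {
                // Clear the flag and read the backlog on the owning shard, together.
                bool np = std::exchange(_backlogs[shard].need_publishing, false);
                return shard_sample{_backlogs[shard].backlog, np};
            });
        },
        shard_sample{db::view::update_backlog::no_backlog(), false},
        [] (shard_sample a, shard_sample b) {
            return shard_sample{std::max(a.backlog, b.backlog),
                                a.need_publishing || b.need_publishing};
        });
    if (!sample.need_publishing && last_backlog && *last_backlog == sample.backlog) {
        co_return std::nullopt; // nothing changed since the last gossip round
    }
    co_return sample.backlog;
}
```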

Successfully merging this pull request may close these issues.

View update backlog may not be gossiped if it's already been updated with a response