
test/tablets: Check that after RF change data is replicated properly #18644

@xemul commented May 13, 2024

There's a test that checks the contents of system.tablets to verify that, after changing the keyspace replication factor via ALTER KEYSPACE, the tablet map is updated properly. This patch extends that test to also validate that the mutations themselves are replicated according to the desired replication factor.

refs: #16723

There's a test that checks the contents of system.tablets to verify that
after changing the keyspace replication factor via ALTER KEYSPACE the
tablet map is updated properly. This patch extends that test to also
validate that the mutations themselves are replicated according to the
desired replication factor.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
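
For illustration, here is a minimal sketch of the kind of check the patch adds. This is not the PR's actual test code: the keyspace/table names and the system.tablets column names ("keyspace_name", "replicas") are assumptions.

```python
# Hypothetical sketch (not the PR's test code). Assumes a 3-node cluster,
# a tablets-enabled keyspace "ks" with a table ks.test holding pk=0, and
# that system.tablets exposes "keyspace_name" and "replicas" columns.
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

session = Cluster(["127.0.0.1"]).connect()

# Raise the replication factor of the keyspace.
session.execute("ALTER KEYSPACE ks WITH replication = "
                "{'class': 'NetworkTopologyStrategy', 'replication_factor': 3}")

# 1. The tablet map should now list three replicas per tablet.
rows = session.execute("SELECT replicas FROM system.tablets "
                       "WHERE keyspace_name = 'ks' ALLOW FILTERING")
for row in rows:
    assert len(row.replicas) == 3, f"expected 3 replicas, got {len(row.replicas)}"

# 2. The mutations themselves should reach every replica: a CL=ALL read
#    needs a response from all three replicas to succeed.
stmt = SimpleStatement("SELECT v FROM ks.test WHERE pk = 0",
                       consistency_level=ConsistencyLevel.ALL)
assert session.execute(stmt).one() is not None
```
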
@xemul added the backport/none (Backport is not required) label May 13, 2024
@scylladb-promoter

🔴 CI State: FAILURE

✅ - Build
❌ - Unit Tests Custom
The following new/updated tests ran 100 times for each mode:
🔹 topology_custom/test_tablets

Failed Tests (2/9300):

* [test_node_failure_during_tablet_migration[write_both_read_new-destination]](https://jenkins.scylladb.com//job/scylla-master/job/scylla-ci/8753/testReport/junit/%28root%29/test_tablets_migration/test_node_failure_during_tablet_migration_write_both_read_new_destination_) [🔍](https://github.com/scylladb/scylladb/issues?q=is:issue+is:open+test_node_failure_during_tablet_migration%5Bwrite_both_read_new-destination%5D)

* [topology_custom.test_tablets_migration.debug.84](https://jenkins.scylladb.com//job/scylla-master/job/scylla-ci/8753/testReport/junit/%28root%29/non-boost%20tests/topology_custom_test_tablets_migration_debug_84) [🔍](https://github.com/scylladb/scylladb/issues?q=is:issue+is:open+topology_custom.test_tablets_migration.debug.84)

Build Details:

  • Duration: 16 hr
  • Builder: i-039f7597e9d4e9df3 (m5ad.12xlarge)

@xemul commented May 14, 2024

> 🔴 CI State: FAILURE
>
> ✅ - Build
> ❌ - Unit Tests Custom
> The following new/updated tests ran 100 times for each mode:
> 🔹 topology_custom/test_tablets
>
> Failed Tests (2/9300):
>
> * [test_node_failure_during_tablet_migration[write_both_read_new-destination]](https://jenkins.scylladb.com//job/scylla-master/job/scylla-ci/8753/testReport/junit/%28root%29/test_tablets_migration/test_node_failure_during_tablet_migration_write_both_read_new_destination_) [🔍](https://github.com/scylladb/scylladb/issues?q=is:issue+is:open+test_node_failure_during_tablet_migration%5Bwrite_both_read_new-destination%5D)
>
> * [topology_custom.test_tablets_migration.debug.84](https://jenkins.scylladb.com//job/scylla-master/job/scylla-ci/8753/testReport/junit/%28root%29/non-boost%20tests/topology_custom_test_tablets_migration_debug_84) [🔍](https://github.com/scylladb/scylladb/issues?q=is:issue+is:open+topology_custom.test_tablets_migration.debug.84)
>
> Build Details:
>
> * Duration: 16 hr
> * Builder: i-039f7597e9d4e9df3 (m5ad.12xlarge)

The PR touches the test case test_tablets::test_tablet_rf_change, while the failing test is test_tablets::test_node_failure_during_tablet_migration, so the failure is not caused by this PR.

Next, the failure is:

```
[2024-05-14T02:41:55.099Z] E               test.pylib.rest_client.HTTPError: HTTP error 500, uri: http+unix://api/cluster/addserver, params: None, json: {'start': True, 'config': {'enable_user_defined_functions': False, 'experimental_features': ['tablets']}}, body:
[2024-05-14T02:41:55.100Z] E               failed to start the node, server_id 5089, IP 127.234.197.10, workdir scylla-5089, host_id <missing>, cql [not connected]
[2024-05-14T02:41:55.100Z] E               Check the log files:
[2024-05-14T02:41:55.100Z] E               /scylladir/testlog/x86_64/test.py.debug-release-dev.log
[2024-05-14T02:41:55.100Z] E               /scylladir/testlog/x86_64/debug/scylla-5089.log
```

It couldn't add a new node. Here's why (from scylla-5089.log):

```
INFO  2024-05-13 22:09:48,863 [shard 0:main] init - starting API server
INFO  2024-05-13 22:09:48,869 [shard 0:main] init - starting prometheus API server
INFO  2024-05-13 22:09:48,874 [shard 0:main] init - creating snitch
INFO  2024-05-13 22:09:48,875 [shard 0:main] init - starting tokens manager
INFO  2024-05-13 22:09:48,877 [shard 0:main] init - starting effective_replication_map factory
INFO  2024-05-13 22:09:48,877 [shard 0:main] init - starting migration manager notifier
INFO  2024-05-13 22:09:48,878 [shard 0:main] init - starting per-shard database core
INFO  2024-05-13 22:09:48,879 [shard 0:main] init - creating and verifying directories
INFO  2024-05-13 22:09:48,974 [shard 0:main] init - starting compaction_manager
INFO  2024-05-13 22:09:48,974 [shard 0:main] task_manager - Registered module compaction
INFO  2024-05-13 22:09:48,981 [shard 1:main] task_manager - Registered module compaction
INFO  2024-05-13 22:09:48,986 [shard 0:main] compaction_manager - Set unlimited compaction bandwidth
INFO  2024-05-13 22:09:48,988 [shard 0:main] init - starting database
INFO  2024-05-13 22:09:49,052 [shard 0:main] seastar - updated: blocked-reactor-notify-ms=25
INFO  2024-05-13 22:09:49,052 [shard 1:main] seastar - updated: blocked-reactor-notify-ms=25
INFO  2024-05-13 22:09:49,053 [shard 0:main] init - starting storage proxy
AddressSanitizer:DEADLYSIGNAL
=================================================================
==441091==ERROR: AddressSanitizer: SEGV on unknown address (pc 0x7fb0c72189d6 bp 0x7ffd126ae340 sp 0x7ffd126ae2f0 T0)
Reactor stalled for 33 ms on shard 1. Backtrace: 0xd6ad90a 0x48b2a14 0x48b1f1c 0x46ecb92 0x46e770f 0x46e71b7 0x46e7a78 0x46edfaa 0x3dbaf 0xd676896 0x43eaef0 0x442c58b 0x442c076 0x442bbca 0x443d1d3 0x443cbbf 0x443c6ea 0x443c462 0x43d8f92 0x43bff14 0x43bf062 0x43c2057 0x43b2df9 0x110de194 0xd7e4a1e 0x485be5b 0x474b960 0x481df3b 0x481d95f 0x481d8af 0x481d3e3 0x481cfc7 0x481fe0f 0x481d10b 0x474d97d 0x474d705 0x482d815 0x482c3c3 0x4830463 0x470f65e 0x4717b80 0x471bba2 0x481b015 0x48191b0 0x48190a0 0x481880c 0x44be108 0x8c946 0x11296f
==441091==The signal is caused by a READ memory access.
==441091==Hint: this fault was caused by a dereference of a high value address (see register values below).  Disassemble the provided pc to learn which register was used.
AddressSanitizer:DEADLYSIGNAL
AddressSanitizer: nested bug in the same thread, aborting.
```

@xemul commented May 17, 2024

CI takes 40 hours, and spot instances don't survive that long. Converting this to a draft until #18704 lands; it's pointless to wait for it.

@yaronkaikov commented May 21, 2024

> CI takes 40 hours, and spot instances don't survive that long. Converting this to a draft until #18704 lands; it's pointless to wait for it.

@xemul 66ce5f9708e6ab494ccfa57e9abe06e9e991a464 was promoted to master; you can run the CI now.

@xemul marked this pull request as ready for review May 21, 2024 18:03
@xemul commented May 21, 2024

> > CI takes 40 hours, and spot instances don't survive that long. Converting this to a draft until #18704 lands; it's pointless to wait for it.
>
> @xemul 66ce5f9708e6ab494ccfa57e9abe06e9e991a464 was promoted to master; you can run the CI now.

Re-kicked the CI job. Let's see how it goes.

@scylladb-promoter

🟢 CI State: SUCCESS

✅ - Build
✅ - Unit Tests Custom
The following new/updated tests ran 100 times for each mode:
🔹 topology_custom/test_tablets::*
✅ - Container Test

Build Details:

  • Duration: 1 hr 52 min
  • Builder: spider1.cloudius-systems.com

denesb pushed a commit that referenced this pull request May 22, 2024
There's a test that checks the contents of system.tablets to verify that
after changing the keyspace replication factor via ALTER KEYSPACE the
tablet map is updated properly. This patch extends that test to also
validate that the mutations themselves are replicated according to the
desired replication factor.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #18644
@scylladb-promoter

🟢 CI State: SUCCESS

✅ - Build
✅ - Unit Tests Custom
The following new/updated tests ran 100 times for each mode:
🔹 topology_custom/test_tablets
✅ - Container Test

Build Details:

  • Duration: 1 hr 47 min
  • Builder: spider2.cloudius-systems.com
