
test/tablets: Check that after RF change data is replicated properly #18644

@xemul commented May 13, 2024

There's a test that checks the contents of system.tablets to verify that, after changing the keyspace replication factor via ALTER KEYSPACE, the tablet map is updated properly. This patch extends that test to also validate that the mutations themselves are replicated according to the desired replication factor.

refs: #16723

There's a test that checks the contents of system.tablets to verify that
after changing the keyspace replication factor via ALTER KEYSPACE the
tablet map is updated properly. This patch extends that test to also
validate that the mutations themselves are replicated according to the
desired replication factor.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
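
For illustration, here is a minimal sketch of the kind of check the patch adds. This is not the PR's actual test code: the keyspace/table names and the system.tablets column names ("keyspace_name", "replicas") are assumptions.

```python
# Hypothetical sketch (not the PR's test code). Assumes a 3-node cluster,
# a tablets-enabled keyspace "ks" with a table ks.test holding pk=0, and
# that system.tablets exposes "keyspace_name" and "replicas" columns.
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

session = Cluster(["127.0.0.1"]).connect()

# Raise the replication factor of the keyspace.
session.execute("ALTER KEYSPACE ks WITH replication = "
                "{'class': 'NetworkTopologyStrategy', 'replication_factor': 3}")

# 1. The tablet map should now list three replicas per tablet.
rows = session.execute("SELECT replicas FROM system.tablets "
                       "WHERE keyspace_name = 'ks' ALLOW FILTERING")
for row in rows:
    assert len(row.replicas) == 3, f"expected 3 replicas, got {len(row.replicas)}"

# 2. The mutations themselves should reach every replica: a CL=ALL read
#    needs a response from all three replicas to succeed.
stmt = SimpleStatement("SELECT v FROM ks.test WHERE pk = 0",
                       consistency_level=ConsistencyLevel.ALL)
assert session.execute(stmt).one() is not None
```
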
@xemul added the backport/none (Backport is not required) label May 13, 2024
@scylladb-promoter

🔴 CI State: FAILURE

✅ - Build
❌ - Unit Tests Custom
The following new/updated tests ran 100 times for each mode:
🔹 topology_custom/test_tablets

Failed Tests (2/9300):

* [test_node_failure_during_tablet_migration[write_both_read_new-destination]](https://jenkins.scylladb.com//job/scylla-master/job/scylla-ci/8753/testReport/junit/%28root%29/test_tablets_migration/test_node_failure_during_tablet_migration_write_both_read_new_destination_) [🔍](https://github.com/scylladb/scylladb/issues?q=is:issue+is:open+test_node_failure_during_tablet_migration%5Bwrite_both_read_new-destination%5D)

* [topology_custom.test_tablets_migration.debug.84](https://jenkins.scylladb.com//job/scylla-master/job/scylla-ci/8753/testReport/junit/%28root%29/non-boost%20tests/topology_custom_test_tablets_migration_debug_84) [🔍](https://github.com/scylladb/scylladb/issues?q=is:issue+is:open+topology_custom.test_tablets_migration.debug.84)

Build Details:

  • Duration: 16 hr
  • Builder: i-039f7597e9d4e9df3 (m5ad.12xlarge)

@xemul commented May 14, 2024

> 🔴 CI State: FAILURE
>
> ✅ - Build
> ❌ - Unit Tests Custom
> The following new/updated tests ran 100 times for each mode:
> 🔹 topology_custom/test_tablets
>
> Failed Tests (2/9300):
>
> * [test_node_failure_during_tablet_migration[write_both_read_new-destination]](https://jenkins.scylladb.com//job/scylla-master/job/scylla-ci/8753/testReport/junit/%28root%29/test_tablets_migration/test_node_failure_during_tablet_migration_write_both_read_new_destination_) [🔍](https://github.com/scylladb/scylladb/issues?q=is:issue+is:open+test_node_failure_during_tablet_migration%5Bwrite_both_read_new-destination%5D)
>
> * [topology_custom.test_tablets_migration.debug.84](https://jenkins.scylladb.com//job/scylla-master/job/scylla-ci/8753/testReport/junit/%28root%29/non-boost%20tests/topology_custom_test_tablets_migration_debug_84) [🔍](https://github.com/scylladb/scylladb/issues?q=is:issue+is:open+topology_custom.test_tablets_migration.debug.84)
>
> Build Details:
>
> * Duration: 16 hr
> * Builder: i-039f7597e9d4e9df3 (m5ad.12xlarge)

The PR touches the test case test_tablets::test_tablet_rf_change, while the failing test is test_tablets::test_node_failure_during_tablet_migration, so the failure is not caused by this PR.

Next, the failure is:

```
[2024-05-14T02:41:55.099Z] E               test.pylib.rest_client.HTTPError: HTTP error 500, uri: http+unix://api/cluster/addserver, params: None, json: {'start': True, 'config': {'enable_user_defined_functions': False, 'experimental_features': ['tablets']}}, body:
[2024-05-14T02:41:55.100Z] E               failed to start the node, server_id 5089, IP 127.234.197.10, workdir scylla-5089, host_id <missing>, cql [not connected]
[2024-05-14T02:41:55.100Z] E               Check the log files:
[2024-05-14T02:41:55.100Z] E               /scylladir/testlog/x86_64/test.py.debug-release-dev.log
[2024-05-14T02:41:55.100Z] E               /scylladir/testlog/x86_64/debug/scylla-5089.log
```

It couldn't add a new node. Here's why (from scylla-5089.log):

```
INFO  2024-05-13 22:09:48,863 [shard 0:main] init - starting API server
INFO  2024-05-13 22:09:48,869 [shard 0:main] init - starting prometheus API server
INFO  2024-05-13 22:09:48,874 [shard 0:main] init - creating snitch
INFO  2024-05-13 22:09:48,875 [shard 0:main] init - starting tokens manager
INFO  2024-05-13 22:09:48,877 [shard 0:main] init - starting effective_replication_map factory
INFO  2024-05-13 22:09:48,877 [shard 0:main] init - starting migration manager notifier
INFO  2024-05-13 22:09:48,878 [shard 0:main] init - starting per-shard database core
INFO  2024-05-13 22:09:48,879 [shard 0:main] init - creating and verifying directories
INFO  2024-05-13 22:09:48,974 [shard 0:main] init - starting compaction_manager
INFO  2024-05-13 22:09:48,974 [shard 0:main] task_manager - Registered module compaction
INFO  2024-05-13 22:09:48,981 [shard 1:main] task_manager - Registered module compaction
INFO  2024-05-13 22:09:48,986 [shard 0:main] compaction_manager - Set unlimited compaction bandwidth
INFO  2024-05-13 22:09:48,988 [shard 0:main] init - starting database
INFO  2024-05-13 22:09:49,052 [shard 0:main] seastar - updated: blocked-reactor-notify-ms=25
INFO  2024-05-13 22:09:49,052 [shard 1:main] seastar - updated: blocked-reactor-notify-ms=25
INFO  2024-05-13 22:09:49,053 [shard 0:main] init - starting storage proxy
AddressSanitizer:DEADLYSIGNAL
=================================================================
==441091==ERROR: AddressSanitizer: SEGV on unknown address (pc 0x7fb0c72189d6 bp 0x7ffd126ae340 sp 0x7ffd126ae2f0 T0)
Reactor stalled for 33 ms on shard 1. Backtrace: 0xd6ad90a 0x48b2a14 0x48b1f1c 0x46ecb92 0x46e770f 0x46e71b7 0x46e7a78 0x46edfaa 0x3dbaf 0xd676896 0x43eaef0 0x442c58b 0x442c076 0x442bbca 0x443d1d3 0x443cbbf 0x443c6ea 0x443c462 0x43d8f92 0x43bff14 0x43bf062 0x43c2057 0x43b2df9 0x110de194 0xd7e4a1e 0x485be5b 0x474b960 0x481df3b 0x481d95f 0x481d8af 0x481d3e3 0x481cfc7 0x481fe0f 0x481d10b 0x474d97d 0x474d705 0x482d815 0x482c3c3 0x4830463 0x470f65e 0x4717b80 0x471bba2 0x481b015 0x48191b0 0x48190a0 0x481880c 0x44be108 0x8c946 0x11296f
==441091==The signal is caused by a READ memory access.
==441091==Hint: this fault was caused by a dereference of a high value address (see register values below).  Disassemble the provided pc to learn which register was used.
AddressSanitizer:DEADLYSIGNAL
AddressSanitizer: nested bug in the same thread, aborting.
```

@xemul commented May 17, 2024

CI takes 40 hours, and spot instances don't survive that long. Converting this to a draft until #18704 lands; it's pointless to wait for it.

@yaronkaikov commented May 21, 2024

> CI takes 40 hours, and spot instances don't survive that long. Converting this to a draft until #18704 lands; it's pointless to wait for it.

@xemul 66ce5f9708e6ab494ccfa57e9abe06e9e991a464 was promoted to master; you can run the CI now.

@xemul marked this pull request as ready for review May 21, 2024 18:03
@xemul commented May 21, 2024

> > CI takes 40 hours, and spot instances don't survive that long. Converting this to a draft until #18704 lands; it's pointless to wait for it.
>
> @xemul 66ce5f9708e6ab494ccfa57e9abe06e9e991a464 was promoted to master; you can run the CI now.

Re-kicked the CI job. Let's see how it goes.

@scylladb-promoter

🟢 CI State: SUCCESS

✅ - Build
✅ - Unit Tests Custom
The following new/updated tests ran 100 times for each mode:
🔹 topology_custom/test_tablets::*
✅ - Container Test

Build Details:

  • Duration: 1 hr 52 min
  • Builder: spider1.cloudius-systems.com

denesb pushed a commit that referenced this pull request May 22, 2024
There's a test that checks the contents of system.tablets to verify that
after changing the keyspace replication factor via ALTER KEYSPACE the
tablet map is updated properly. This patch extends that test to also
validate that the mutations themselves are replicated according to the
desired replication factor.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #18644
@scylladb-promoter

🟢 CI State: SUCCESS

✅ - Build
✅ - Unit Tests Custom
The following new/updated tests ran 100 times for each mode:
🔹 topology_custom/test_tablets
✅ - Container Test

Build Details:

  • Duration: 1 hr 47 min
  • Builder: spider2.cloudius-systems.com
