[DocDB] Raft can persist a corrupted message on a follower, if the packet gets corrupted on the network, leading to crashes on the tserver. #22344

Open
1 task done
shamanthchandra-yb opened this issue May 10, 2024 · 3 comments
Assignees
Labels
area/docdb YugabyteDB core features kind/bug This issue is a bug priority/high High Priority

Comments


shamanthchandra-yb commented May 10, 2024

Jira Link: DB-11251

Description

During one of the CDC runs for PG Parity testing, we observed a corruption FATAL on 2024.1.0.0-b123:

F20240509 20:44:08 ../../src/yb/tablet/tablet.cc:1517] T 6b08eefae2b84fa189350a8c956a904f P d132e7cd9cf549f3b6f9df427e383c04: Failed to write a batch with 0 operations into RocksDB: Corruption (yb/dockv/doc_kv_util.cc:135): Error when decoding hashed components of a document key: while consuming primitive values from 5366326436323032612D363962342D343362372D386532632D6462643530633632663538353A31353838393A346532696563655A3F696C30673F6766306A3F6B3365313F323032686331696534686B683566336A66646630616A6D3168616867316B616C346632613331336964326A6635696C693667346B67656731626B6E3269626968326C626D356733623432346A65336B67366A6D6A3621324F4D513632322380013C328BE6C5BE804A: Encoded string is not terminated with \0x00\0x00
    @     0x56030e766427  google::LogMessage::SendToLog()
    @     0x56030e76735d  google::LogMessage::Flush()
    @     0x56030e7679a9  google::LogMessageFatal::~LogMessageFatal()
    @     0x56030fbbb933  yb::tablet::Tablet::WriteToRocksDB()
    @     0x56030fbb7905  yb::tablet::Tablet::ApplyIntents()
    @     0x56030fbb84d2  yb::tablet::Tablet::ApplyIntents()
    @     0x56030fc70b71  yb::tablet::TransactionParticipant::Impl::ProcessReplicated()
    @     0x56030fb9204c  yb::tablet::UpdateTxnOperation::DoReplicated()
    @     0x56030fb8579e  yb::tablet::Operation::Replicated()
    @     0x56030fb87b4f  yb::tablet::OperationDriver::ReplicationFinished()
    @     0x56030ec3ca2b  yb::consensus::ConsensusRound::NotifyReplicationFinished()
    @     0x56030ec8b38f  yb::consensus::ReplicaState::ApplyPendingOperationsUnlocked()
    @     0x56030ec8a6f9  yb::consensus::ReplicaState::AdvanceCommittedOpIdUnlocked()
    @     0x56030ec74004  yb::consensus::RaftConsensus::UpdateReplica()
    @     0x56030ec53f83  yb::consensus::RaftConsensus::Update()
    @     0x56030fed9f9e  yb::tserver::ConsensusServiceImpl::UpdateConsensus()
    @     0x56030ece1fee  std::__1::__function::__func<>::operator()()
    @     0x56030ece2c1f  yb::consensus::ConsensusServiceIf::Handle()
    @     0x56030fadf649  yb::rpc::ServicePoolImpl::Handle()
    @     0x56030f9fc05f  yb::rpc::InboundCall::InboundCallTask::Run()
    @     0x56030faeee43  yb::rpc::(anonymous namespace)::Worker::Execute()
    @     0x560310306c13  yb::Thread::SuperviseThread()
    @     0x7f73305551ca  start_thread
    @     0x7f73307a6e73  __GI___clone
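
For anyone triaging, the hex payload in the FATAL above can be decoded to see what the key bytes look like. The following is a minimal sketch, not YugabyteDB code: `printable` and `has_terminator` are hypothetical helpers, and `has_terminator` is only a simplified stand-in for the real decoder's check (it looks for a `0x00 0x00` pair anywhere in the payload). Reading the leading `S` as a value-type marker is an assumption.

```python
# Hex payload copied from the FATAL message in this issue, split into chunks
# only to keep the lines short.
HEX_DUMP = (
    "5366326436323032612D363962342D343362372D386532632D6462643530633632663538353A31353838393A"
    "346532696563655A3F696C30673F6766306A3F6B3365313F323032686331696534686B683566336A6664663061"
    "6A6D3168616867316B616C346632613331336964326A6635696C693667346B67656731626B6E3269626968326C"
    "626D356733623432346A65336B67366A6D6A3621324F4D513632322380013C328BE6C5BE804A"
)


def printable(data: bytes) -> str:
    """Render bytes as ASCII, escaping non-printable bytes as \\xNN (hypothetical helper)."""
    return "".join(chr(b) if 32 <= b < 127 else f"\\x{b:02x}" for b in data)


def has_terminator(data: bytes) -> bool:
    """Simplified stand-in for the decoder's check: is there a 0x00 0x00 pair at all?"""
    return b"\x00\x00" in data


payload = bytes.fromhex(HEX_DUMP)
print(printable(payload[:44]))  # leading, human-readable part of the key
print("terminated:", has_terminator(payload))
```

The first 44 bytes decode to readable ASCII, and the payload contains no `0x00 0x00` pair, which is consistent with the "Encoded string is not terminated" error. This only shows what the bytes are; it does not say where they were corrupted.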

Here is what the test case does:

Create 10 databases.
Create 1 table in each database, for which we are interested in CDC streaming.
Iteratively load data into each table and validate that it is streamed.
In parallel, server-side nemeses run. In this run they were: stop/start nodes and restart the master process.
In parallel, dummy tables are also created and dropped at random.

Issue Type

kind/bug

Warning: Please confirm that this issue does not contain any sensitive information

  • I confirm this issue does not contain any sensitive information.
@shamanthchandra-yb shamanthchandra-yb added area/docdb YugabyteDB core features priority/high High Priority status/awaiting-triage Issue awaiting triage labels May 10, 2024
@yugabyte-ci yugabyte-ci added the kind/bug This issue is a bug label May 10, 2024
@rthallamko3 rthallamko3 removed the status/awaiting-triage Issue awaiting triage label May 10, 2024
@rthallamko3
Contributor

The very first error [1] indicates that Raft replication ran into issues; the subsequent FATALs in yb::tablet::Tablet::WriteToRocksDB() reported in #22344 are cascading failures. I checked the other nodes, and the error appears only on N2, so this looks like packet corruption of some sort on node N2.

[1]

Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
F0509 20:39:43.549770 33092 tablet_peer.cc:1378] Invalid value of enum consensus::OperationType (full enum type: yb::consensus::OperationType, expression: replicate_msg->op_type()): 0.
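
To illustrate the kind of check that would stop such a message before it is persisted, here is a minimal sketch of a follower-side validation step. It is not the YugabyteDB implementation: `ReplicateMsg`, `make_msg`, `is_valid`, and the `crc32` field are all hypothetical, and `UNKNOWN_OP = 0` merely stands in for the invalid `op_type` value 0 reported in the log line above.

```python
import zlib
from dataclasses import dataclass

UNKNOWN_OP = 0  # hypothetical constant mirroring the invalid op_type value 0 in the log


@dataclass
class ReplicateMsg:
    """Hypothetical, simplified stand-in for a replicated Raft message."""
    op_type: int
    payload: bytes
    crc32: int  # checksum computed by the sender over op_type + payload


def make_msg(op_type: int, payload: bytes) -> ReplicateMsg:
    """Sender side: attach a CRC32 over the fields that must survive the network."""
    return ReplicateMsg(op_type, payload, zlib.crc32(bytes([op_type]) + payload))


def is_valid(msg: ReplicateMsg) -> bool:
    """Follower side: reject the message before appending it to the local log."""
    if msg.op_type == UNKNOWN_OP:
        return False
    return zlib.crc32(bytes([msg.op_type]) + msg.payload) == msg.crc32


good = make_msg(1, b"row data")
corrupted_type = ReplicateMsg(0, good.payload, good.crc32)
corrupted_payload = ReplicateMsg(1, b"row dat\x00", good.crc32)
print(is_valid(good), is_valid(corrupted_type), is_valid(corrupted_payload))
```

With a check like this, a follower could reject the update and have the leader resend it, instead of persisting the bad entry and hitting a FATAL later when it is applied.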

@rthallamko3 rthallamko3 changed the title [DocDB] Failed to write a batch with 0 operations into RocksDB: Corruption (yb/dockv/doc_kv_util.cc:135): Error when decoding hashed components of a document key [DocDB] Raft can persist a corrupted message on a follower, if the packet gets corrupted on the network, leading to crashes on the tserver. May 21, 2024
@rthallamko3
Contributor

@shamanthchandra-yb, can we check whether this reproduces on clusters with TLS enabled?

@shamanthchandra-yb
Author

@rthallamko3 Many variants of this test case are currently being run as part of CDC PG Parity testing. This was a one-off run, two weeks ago, and the only one where this issue occurred. The chance of hitting it seems minuscule, even without TLS, so I don't think a passing run with TLS would give us sufficient data to confirm the theory. Please let me know if you think running with TLS would still be helpful.
