non-latest WAL log corruption after EBS snapshot restore in AWS #12467
Could you please provide the setting you have for
TiKV uses kPointInTimeRecovery and kAbsoluteConsistency, so it has WAL recycling disabled, i.e. recycle_log_file_num=0. As for WriteOptions, I only see 3 options ever being set.
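For context, a rough sketch of those settings as they would look through the public RocksDB C++ API. This is an assumption for illustration, not TiKV's actual code, and which three WriteOptions TiKV sets is not stated above:

```cpp
#include "rocksdb/db.h"
#include "rocksdb/options.h"

// Minimal sketch (assumed, not TiKV's actual code) of the settings described
// above, expressed through the public RocksDB C++ API.
rocksdb::Options MakeOptions() {
  rocksdb::Options options;
  // WAL recycling disabled, as the comment above says TiKV's recovery modes
  // require.
  options.recycle_log_file_num = 0;
  // TiKV reportedly uses kPointInTimeRecovery (and kAbsoluteConsistency).
  options.wal_recovery_mode = rocksdb::WALRecoveryMode::kPointInTimeRecovery;
  // Enabled later in this thread to help debug the corruption.
  options.track_and_verify_wals_in_manifest = true;
  return options;
}

// Per-write options: WriteOptions::sync controls whether the WAL is fsynced
// as part of the write. Which three WriteOptions TiKV sets is not listed here.
rocksdb::WriteOptions MakeWriteOptions(bool sync) {
  rocksdb::WriteOptions wo;
  wo.sync = sync;
  return wo;
}
```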
Thanks for the info. Does TiKV also use an external syncer thread, that is, one that flushes/syncs the WAL on its own schedule? Did your LOG file contain any occurrences of "WAL Sync error"? Do you still have 15557239.log? I can't tell from the error message, but it would be interesting to know whether this error is around the end of the log file.
Great question. I think @tonyxuqqi mentioned TiKV does trigger FlushWAL manually, but I can't find a code reference. @tonyxuqqi, can you confirm?
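For illustration only, a hypothetical sketch of what such an external syncer thread could look like. It assumes DBOptions::manual_wal_flush plus a background thread calling DB::FlushWAL(true), which is exactly the detail being asked about above, so treat it as a guess rather than TiKV's actual mechanism:

```cpp
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

#include "rocksdb/db.h"

// Hypothetical sketch of an "external syncer thread": the DB is opened with
// DBOptions::manual_wal_flush = true and a background thread periodically
// calls DB::FlushWAL(/*sync=*/true). Whether TiKV does exactly this is the
// open question above.
void WalSyncerLoop(rocksdb::DB* db, std::atomic<bool>* stop) {
  while (!stop->load()) {
    rocksdb::Status s = db->FlushWAL(/*sync=*/true);
    if (!s.ok()) {
      // A failure here is roughly what would surface as "WAL Sync error"
      // messages in the LOG file.
      std::fprintf(stderr, "WAL flush/sync failed: %s\n",
                   s.ToString().c_str());
    }
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
  }
}
```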
Yeah, we had checked earlier. Unfortunately there was no "sync error". That would have been a perfect smoking gun.
We don't have the log file right now, but we get tens of repros per week. How can you tell whether the corruption is in the middle or at the end? FWIW, what I saw was that there were ...
This is good info. I think 359 missing records should mean the corruption was near the tail; it might be the number of records between the last ... There is one bug fix I see that could be relevant: #10560. I think you said you hit this problem even before turning on WAL tracking? WAL tracking is a good feature, but important bug fixes are needed that probably are not present in 6.29. #10185 sounds like a possible fix for the symptom you described.
Yes, we enabled it to debug the corruption, so the corruption was there before. Both fixes seem very relevant. #10560 in particular seems to match @tonyxuqqi's hypothesis that we were discussing earlier this week. I can try patching that one to fix the corruption, and #10185 to fix the manifest tracking. We do indeed see hundreds of size mismatches per hour, which don't point to corrupt WALs.
@ajkr, can RocksDB start writing to a new WAL before previous WALs are fsynced?
Hi, yes, it can. When the next WAL sync is requested, though, we fsync in order from the oldest unsynced WAL up to the active WAL.
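For reference, a minimal sketch (assumed usage, not code from this thread) of the two usual ways such a WAL sync gets requested through the public API:

```cpp
#include "rocksdb/db.h"

// Minimal sketch of the two common ways a WAL sync is requested. Per the
// comment above, either one syncs from the oldest unsynced WAL up through the
// active WAL, not just the newest file.
rocksdb::Status SyncExamples(rocksdb::DB* db) {
  // 1) Sync as part of a write.
  rocksdb::WriteOptions wo;
  wo.sync = true;
  rocksdb::Status s = db->Put(wo, "key", "value");
  if (!s.ok()) return s;

  // 2) Sync explicitly, outside of any write.
  return db->SyncWAL();
}
```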
I think it doesn't mean that the latest WAL file is persisted only after the older WAL files; the filesystem can flush the blocks independently. Since the filesystem's flushing behavior may not guarantee sequential writing of the WAL files, is it safe to assume that the tail of a non-latest WAL file is always completely written in the case of an unclean shutdown?
Yes, you are right. As you described it, there is a valid scenario where point-in-time recovery encounters "Corruption: checksum mismatch" and does not proceed to recover the most recent log. However, this scenario should not be enough to trigger "Corruption: SST file is ahead of WALs in CF lock". There is another fsync procedure to prevent that: we fsync the WALs whose data was flushed before recording the SST file in the MANIFEST.
Thanks, that makes sense.
Hi Andrew, I have put some comments on PR #10560.
Replied there - feel free to follow up there if I missed something |
@ajkr, we already have #10185 in our TiKV branch. It appears that PingCAP cherry-picked a few changes after forking from the 6.29 release. I have attached a screenshot of the cherry-picked changes for your reference.
+1 to what @mittalrishabh said. We are cherry-picking the fsync fix #10560 and will update here if it fixes the WAL corruption. As for the WAL size mismatch with the manifest: as @mittalrishabh said, that fix is already included, and we still see LOTS of mismatches (every rocksdb instance shows about 1 mismatch per hour). So we are wondering whether it's possible there is another bug in WAL size tracking. @ajkr: is there anything we can provide to help you debug that?
Do you have an example error message? I am not sure if I'm reading something backwards or looking in the wrong place. The code I am looking at is this: lines 198 to 205 in 1856734.
For that check at least, an actual WAL size bigger than the manifest size should be tolerated.
@ajkr: our size mismatch detection is not based on the check you referenced above. Rather, we have a script that runs every 2 minutes on every storage node (running rocksdb); it checks the current WALs (except the latest one) and compares the actual file size with the file size in the manifest. So a question: in general, we just want to know whether a size mismatch points to something nefarious (like another bug) at all.
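For concreteness, a hypothetical sketch of that kind of periodic check. The helper names and, in particular, where the MANIFEST-recorded sizes come from (assumed to be parsed from a manifest dump elsewhere) are assumptions, since the actual script is not shown in this thread:

```cpp
#include <cstdint>
#include <filesystem>
#include <iostream>
#include <map>
#include <string>

// Hypothetical sketch of the periodic size check described above. The map of
// WAL sizes recorded in the MANIFEST is assumed to be produced elsewhere
// (e.g. by parsing a manifest dump); that part is not shown.
void CheckWalSizes(const std::string& wal_dir,
                   const std::map<std::string, std::uint64_t>& manifest_sizes) {
  for (const auto& [file_name, recorded_size] : manifest_sizes) {
    const std::filesystem::path p = std::filesystem::path(wal_dir) / file_name;
    if (!std::filesystem::exists(p)) {
      continue;  // the WAL may already have been deleted after a flush
    }
    const std::uint64_t actual_size = std::filesystem::file_size(p);
    if (actual_size != recorded_size) {
      std::cerr << file_name << ": actual size " << actual_size
                << " != manifest size " << recorded_size << "\n";
    }
  }
}
```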
The short answer is: no, what you're checking is not guaranteed. Longer answer: the synced size recorded in the MANIFEST is just a lower bound on the latest synced offset. The actual latest synced offset (and thus file size) may be larger. It was originally built like this because we also recorded the synced size for the active WAL, which is continually growing. Currently ...
Just confirming: even for inactive (what we called non-latest in the OP) WALs, size matching is not guaranteed? Would it 'eventually' match after your fix in https://github.com/facebook/rocksdb/pull/10560/files#diff-d9341fbe2a5d4089b93b22c5ed7f666bc311b378c26d0786f4b50c290e460187R1541? It seems so, since we don't stop tracking WALs until the presync size and flushed size are equal. I understand they might NOT match for a short duration, between finishsync (line 1547) and addWAL (line 1539) in the next call to markWALSynced, which might be a few seconds.
What do you mean by special casing for inactive WALs? Writing both preSyncSize and postSyncSize seems reasonable for all WALs.
Yes I agree the MANIFEST synced size should eventually match the actual synced size (with that fix).
Currently this feature records pre-sync sizes upon sync completing for inactive WALs only. Say a WAL is synced N times in its lifetime. It is active the first (N-M) times a sync completes, and inactive for the final M sync completions (I think M must be either 1 or 2...). This feature records the final M syncs to the MANIFEST, which feels odd. My intuition is that the feature should either (a) record all N syncs to the MANIFEST, or (b) record only the final 1 sync to the MANIFEST. (b) is what I meant would special case the feature for inactive WALs, but I hope we do (a) instead. |
@ajkr, we are trying to reproduce this issue in our setup. For faster reproduction, instead of doing the full recovery, we are only doing a WAL dump and relying on the "Corruption: checksum mismatch" message to see if it is reproducible. Based on the missing-fsync bug #10560, it seems that it could lead to missing records too. What would the error message be if RocksDB detects missing records?
There isn't a specific error message for missing records because we don't have a way to detect that in general. (Extra detail about how we're thinking about doing it: we could start requiring consecutive WALs to have consecutive sequence numbers. To prevent ...) Lacking that, do you have a way to check for holes in the recovered data? Verifying that in our test tooling is actually how I found that bug in the first place - https://rocksdb.org/blog/2022/10/05/lost-buffered-write-recovery.html. I think applications storing consensus logs in RocksDB may be able to detect a hole using their log sequence numbers.
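For illustration, a hypothetical application-level hole check along the lines suggested above; the assumption that the application can list its recovered log indices in increasing order is mine, and nothing here is TiKV's actual layout:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical application-level hole check, assuming the application stores
// consensus-log entries under monotonically increasing indices (as suggested
// above) and can list the indices it recovered, in order.
bool HasHole(const std::vector<std::uint64_t>& recovered_log_indices) {
  for (std::size_t i = 1; i < recovered_log_indices.size(); ++i) {
    if (recovered_log_indices[i] != recovered_log_indices[i - 1] + 1) {
      return true;  // a gap in the middle suggests a lost write
    }
  }
  return false;
}
```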
Got it. In TiDB, writes from multiple Raft logs are asynchronously written to RocksDB. I don't think we can use them to detect holes.
There are some ways to catch recovery not having gone far enough. By "far enough", I mean recovery should at least reach the last write that completed before the last successful WAL sync started. At the RocksDB level, the main detection would be the "SST file is ahead of WALs" error message. At the application level, this scenario could probably be detected more easily than a hole. I recall in Cockroach it caused some sort of Raft log panic.
Going back to the original issue: I'm still wondering whether the corruption we see is indeed fixed by #10560. It's a bit hard for us to confirm, since we can only repro in production, and that happens rarely (2% of disk snapshots across all storage nodes had a single corrupt WAL). I have a few questions:
A couple of other questions:
OK, I revisited #10560. I think it cannot be triggered by a workload that simply writes with varying values for ... #10560's race condition required writes to happen between (2) and (3), causing buggy logic in (3) to incorrectly treat an inactive WAL as fully synced when it actually wasn't. That doesn't mean the workload needs an explicit ... Based on those criteria, other vulnerable APIs I found were ... By the way, are you able to share more about what options you use? From the OPTIONS file I'd be interested in ...
Version: RocksDB 6.29
Expected behavior
Using kPointInTimeRecovery, after EBS restore, the non-latest WAL log file should not be corrupted.
Actual behavior
The non-latest WAL file is corrupted after an EBS restore.
We also noticed that after enabling track_and_verify_wals_in_manifest, some WAL files' sizes in the manifest do not match the actual ones; the actual size is bigger than the one in the manifest.
"[2024/03/08 14:14:16.534 +00:00][4][WARN] [db/db_impl/db_impl_open.cc:818] /var/lib/wal/rocksdb/15557239.log: dropping 32768 bytes; Corruption: checksum mismatch
[2024/03/08 14:14:16.534 +00:00][4][WARN] [db/db_impl/db_impl_open.cc:818] /var/lib/wal/rocksdb/15557239.log: dropping 64430 bytes; Corruption: error in middle of record
[2024/03/08 14:14:16.534 +00:00][4][INFO] [db/db_impl/db_impl_open.cc:1140] Point in time recovered to log #15557239 seq #296119395244
[2024/03/08 14:14:16.534 +00:00][4][INFO] [db/db_impl/db_impl_open.cc:890] Recovering log #15557242 mode 2
[2024/03/08 14:14:16.535 +00:00][4][WARN] [db/db_impl/db_impl_open.cc:896] /var/lib/wal/rocksdb/15557242.log: dropping 36031876 bytes
[2024/03/08 14:14:16.535 +00:00][4][INFO] [db/db_impl/db_impl_open.cc:890] Recovering log #15557245 mode 2
[2024/03/08 14:14:16.536 +00:00][4][WARN] [db/db_impl/db_impl_open.cc:896] /var/lib/wal/rocksdb/15557245.log: dropping 35957673 bytes
"
Steps to reproduce the behavior
Run a heavy write workload, take an EBS snapshot, and then restore it on another machine. Open RocksDB.
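For concreteness, a minimal sketch of the kind of write workload meant here; the specific keys, value sizes, and sync frequency are assumptions, not the actual TiKV workload:

```cpp
#include <cstdint>
#include <string>

#include "rocksdb/db.h"

// Minimal sketch of a heavy write workload. The key/value shapes and the
// 1-in-1000 synced write are illustrative only. The EBS snapshot and restore
// happen outside this process while it is running.
void HeavyWriteLoop(rocksdb::DB* db) {
  rocksdb::WriteOptions unsynced;  // default: no fsync per write
  rocksdb::WriteOptions synced;
  synced.sync = true;
  for (std::uint64_t i = 0;; ++i) {
    const std::string key = "key-" + std::to_string(i);
    const std::string value(1024, 'x');
    rocksdb::Status s =
        db->Put((i % 1000 == 0) ? synced : unsynced, key, value);
    if (!s.ok()) {
      break;
    }
  }
}
```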