Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mvlog: add gaps to the segment reader #18570

Merged
merged 2 commits into from
May 31, 2024

Conversation

andrwng
Copy link
Contributor

@andrwng andrwng commented May 17, 2024

Adds a gap set to the readable segment so that readers can iterate as
of a specific version.

The full set of gaps in the segment is long-lived, so this is owned by
the readable segment. Segment readers make of copy of this set (expected
to be small) when constructed, before iterating.

Longer term, multiple segments may have gaps added to them at a time,
but this commit focuses only on introducing the per-segment gaps.

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v24.1.x
  • v23.3.x
  • v23.2.x

Release Notes

  • none

src/v/storage/mvlog/version_id.h Outdated Show resolved Hide resolved
src/v/storage/mvlog/versioned_gaps.h Outdated Show resolved Hide resolved
size_t num_readers() const { return num_readers_; }
const versioned_gap_list& gaps() const { return gaps_; }
versioned_gap_list* mutable_gaps() { return &gaps_; }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice to have the mutable prefix. not sure yet how this is used but keeping in the spirit of mvcc/copy-on-write i wonder if we can arrange for gap modification to merely create a new object that has the modification in it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Chatted with Noah and summarizing here:

  • The current implementation is merge-on-read, which allows readers to asynchronously determine what gaps exist at construction time, since the version id uniquely identifies the point in time within a versioned gaps list.
  • This necessarily exposes version ids as public interface within the mvlog, which can be nice because it’s straightforward to reason about gaps as they are added in time, but ultimately doesn't seem that necessary. I originally thought version ids would need to be more prominent, but I'm realizing that this probably isn't the case.
  • If instead, we maintained an interval_set<size_t> in the readable_segment of the current set of gaps (common case is that this is empty), and made segment_readers make a copy of this set and the current file size, we could still get the mvcc properties: segment readers would each still have a point-in-time view of what is readable in the given segment.

I'm leaning towards doing this now because it removes the need to track version ids at all, which seems like a win. Longer term if the copies end up being too expensive, such an implementation could always be amended with a persistent data structure

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I liked version ids because they allowed to get a "consistent" view of the log.

Here is a scenario where it would be useful:

  1. Fetch request arrives to redpanda
  2. We validate min/max offset, leader epoch, all good, we forward the request to the storage layer to read offset [5, 10]
  3. Concurrently: Log is being truncated, new entries are being appended
  4. What data will the fetch request get back? Can you reason about the correctness?

If at step 2 we would have captured a log versioned id (true mvcc), then you wouldn't have to think about potential problems introduced by step 3. It would also make it much easier to reason about correctness.

I believe this problem exists today in Redpanda. We don't lock the log before the step 2, do we?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Likely this wasn't part of original requirements but it might be good time to think about it. PS. build failures.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We chatted briefly in standup, but putting the context back here.

In this scenario, at step 2, we would synchronously create all relevant segment readers, which would make copies of the relevant gaps (expected to be small). This will eventually also entail looking up via index where in the first segment we should start reading the first segment from.
This set of readers would be a consistent snapshot of the log, and in that sense, the current log has the relevant consistency properties of MVCC, but doesn't come with a logical timestamp.

You brought up some other good points:

  • The previous impl was nice because a version id was like an explicit key to getting the point-in-time metadata. For instance, even if there were asynchronous work being done on the metadata after acquiring the id, the id could be used to query any sources of metadata to get back the point-in-time set of segments, point in time position in a file, point in time timestamps, etc. In the previous impl, this was true for file gaps, but not for other kinds of metadata, with the expectation that metadata is queried synchronously and snapshots (log readers) are created synchronously. Without making version ids a core piece of all metadata (at least segment_set and segment_index) I think we do need to rely on some other kind of synchronization (in this case, synchronously snapshotting metadata).
  • It would be nice to expose some kind of logical timestamp even above the storage layer, and generally that it would be great if both local storage and cloud storage had this concept of a logical timestamp to see consistent metadata. I agree this would be a great goal to drive towards, though I haven't thought explicitly about expanding MVCC across cloud storage. That said, one property that I'm trying to create with this impl is for log readers to be created synchronously, to at least provide a consistent snapshot that can be reasoned about after checking other metadata in the same scheduling point.

Our discussion was cut short, but I'm happy to discuss further.

}

// Test applying random gaps and reading from different positions in the file.
TEST_F(SegmentReaderTest, TestRandomReadsWithGaps) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good stuff

Copy link
Contributor

@nvartolomei nvartolomei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to track the versions make_reader was called with and vassert if at any point anything tries to add a gap below that version? Could help avoid bugs.

src/v/storage/mvlog/versioned_gaps.cc Outdated Show resolved Hide resolved
src/v/storage/mvlog/readable_segment.h Outdated Show resolved Hide resolved
src/v/storage/mvlog/segment_reader.cc Outdated Show resolved Hide resolved
file_gaps.erase(gap_it);
continue;
}
read_intervals.emplace_back(cur_iter_pos, length);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hm, shouldn't this be

auto remaining_length = max_pos - cur_iter_pos;
if (remaining_length > 0) {
  read_intervals.emplace_back(cur_iter_pos, remaining_length);
}

🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, this was buggy. Fixed and added a test!

@andrwng andrwng force-pushed the mvlog-segment-reader-gaps branch 2 times, most recently from f535e2b to cce8c5c Compare May 24, 2024 18:14
@andrwng andrwng force-pushed the mvlog-segment-reader-gaps branch from cce8c5c to a856319 Compare May 24, 2024 18:15
@andrwng andrwng force-pushed the mvlog-segment-reader-gaps branch 2 times, most recently from c666e36 to f407f18 Compare May 24, 2024 18:39
Adds a gap set to the readable segment so that readers can iterate as
of a specific version.

The full set of gaps in the segment is long-lived, so this is owned by
the readable segment. Segment readers make of copy of this set (expected
to be small) when constructed, before iterating.

Longer term, multiple segments may have gaps added to them at a time,
but this commit focuses only on introducing the per-segment gaps.
@andrwng andrwng merged commit 658cd7b into redpanda-data:dev May 31, 2024
18 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants