
kvstorage: speed up scan for range descriptors #123959

Merged
merged 1 commit into cockroachdb:master on May 17, 2024

Conversation

RaduBerinde
Member

On startup we currently scan through all /Local/Range/ keys in order
to find the range descriptors. In some cases, there can be a lot of
/Local/Range/Transaction records so we end up reading through a lot
of data.

This change switches to using SeekGE:

  • when we encounter a key that has a suffix before rdsc like
    /Local/Range/foo/prbe, we seek directly to the corresponding
    descriptor key /Local/Range/foo/rdsc;
  • when we encounter a key that has a suffix after rdsc, we
    seek beyond all keys for this range, e.g. /Local/Range/fop/.

Note that inside Pebble, seeks to a key that is very close to the
current position are optimized to try to iterate through the next few
keys before doing a real seek. So this change should not be very
detrimental even when there aren't a lot of keys to skip.
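The two seek rules above can be sketched over a sorted key space. This is a toy model: the string key encoding, the helper names, and the linear-search "iterator" stand in for the actual storage.MVCCIterator code in pkg/kv/kvserver/kvstorage/init.go.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

const rdsc = "rdsc"

// seekGE returns the index of the first key >= target; a stand-in for an
// iterator SeekGE over a sorted key space.
func seekGE(keys []string, target string) int {
	return sort.SearchStrings(keys, target)
}

// findDescriptors applies the two seek rules from the PR description:
// a suffix before rdsc seeks to the descriptor key, a suffix after rdsc
// seeks past all keys of the current range.
func findDescriptors(keys []string) []string {
	var descs []string
	for i := seekGE(keys, "/Local/Range/"); i < len(keys); {
		k := keys[i]
		parts := strings.Split(strings.TrimPrefix(k, "/Local/Range/"), "/")
		start, suffix := parts[0], parts[1]
		switch {
		case suffix == rdsc:
			descs = append(descs, k)
			i++ // the real code decodes the descriptor here
		case suffix < rdsc:
			// Seek directly to the descriptor key for this range.
			i = seekGE(keys, "/Local/Range/"+start+"/"+rdsc)
		default:
			// Suffix sorts after rdsc: seek past all keys for this range.
			// '0' is the byte after '/', so this sorts after every
			// "/Local/Range/<start>/..." key.
			i = seekGE(keys, "/Local/Range/"+start+"0")
		}
	}
	return descs
}

func main() {
	keys := []string{
		"/Local/Range/a/prbe",
		"/Local/Range/a/rdsc",
		"/Local/Range/a/txn-1",
		"/Local/Range/a/txn-2",
		"/Local/Range/b/rdsc",
	}
	fmt.Println(findDescriptors(keys)) // prints [/Local/Range/a/rdsc /Local/Range/b/rdsc]
}
```

In this model, the transaction records for range `a` are skipped with a single seek instead of being stepped through one by one.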

Fixes: #109740


@RaduBerinde RaduBerinde force-pushed the iter-range-desc branch 4 times, most recently from 4828266 to 58777c3 on May 14, 2024 02:17
@RaduBerinde RaduBerinde marked this pull request as ready for review May 14, 2024 02:17
@RaduBerinde RaduBerinde requested a review from a team as a code owner May 14, 2024 02:17
Member Author

@RaduBerinde RaduBerinde left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/kv/kvserver/kvstorage/init.go line 329 at r1 (raw file):

				return errors.AssertionFailedf("range key has intent with no timestamp")
			}
			// Seek to the latest value below the intent timestamp.

This code is well exercised by loqrecovery.TestRetrieveApplyStatus, which I have run with --stress

Collaborator

@andrewbaptist andrewbaptist left a comment


:lgtm:

A few nits, but otherwise changes look good.

I'm not familiar enough with the code and testing in this area, so having a storage review as well would be good.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @jbowens and @RaduBerinde)


-- commits line 17 at r1:
is -> are


pkg/kv/kvserver/kvstorage/init.go line 281 at r1 (raw file):

		}

		const reportPeriod = 10 * time.Second

nit: This would be more idiomatic if you used log.EveryN and then called log.ShouldLog.


pkg/kv/kvserver/kvstorage/init.go line 316 at r1 (raw file):

		}

		if key.Timestamp.IsEmpty() {

I'm confused how we get here. Didn't we just check the suffixCmp above, so we know this is a LocalRangeDescriptorSuffix? Or does this check mean it's a "RangeDescriptor intent"? I'm a little unclear on why we need to iterate in Inconsistent mode at all here, so maybe a comment justifying that would help.


pkg/kv/kvserver/kvstorage/init.go line 342 at r1 (raw file):

			// versions of this key.
			tombstoneCount++
			iter.NextKey()

Couldn't we call iter.SeekGE(storage.MVCCKey{Key: nextRangeDescKey}) instead of NextKey?

This probably doesn't normally matter in the scheme of things but might be important if we had a large table delete and had not run an MVCCGC yet.


pkg/kv/kvserver/kvstorage/init.go line 357 at r1 (raw file):

				key.Key, desc.StartKey)
		}
		if !desc.IsInitialized() {

I see this check in the previous commit as well, but what is the rationale for failing on an uninitialized replica? I don't fully understand the lifecycle here: all replicas pass through this state, but maybe the reason is that we never persist them to disk? This is mostly for my understanding of persisted RangeDescriptors, but consider adding a comment explaining why this is an assertion error, or making the message cleaner.


pkg/kv/kvserver/kvstorage/init.go line 377 at r1 (raw file):

}

func IterateRangeDescriptorsFromDiskOld(

It would be cleaner to delete the old code rather than leaving it here.

Member Author

@RaduBerinde RaduBerinde left a comment


TFTR!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andrewbaptist, @bdarnell, and @jbowens)


-- commits line 17 at r1:

Previously, andrewbaptist (Andrew Baptist) wrote…

is -> are

Done.


pkg/kv/kvserver/kvstorage/init.go line 281 at r1 (raw file):

Previously, andrewbaptist (Andrew Baptist) wrote…

nit: This would be more idiomatic if you used log.EveryN and then called log.ShouldLog.

With log.EveryN we would have an undesirable first message at the very start.


pkg/kv/kvserver/kvstorage/init.go line 316 at r1 (raw file):
We check above that the key starts with /Local/Range/start-key/rdsc/. Normally we would have one or more MVCC values like /Local/Range/start-key/rdsc/some-timestamp. For a transaction that was in progress when the node was shut down, we will have an unresolved intent, which is just the "bare" key /Local/Range/start-key/rdsc/ without a timestamp.

I don't know all the details around why we do this in inconsistent mode (but note that the handling would be more complicated otherwise). @bdarnell mentioned the following:

There is one special rule for rdsc that makes it a little easier: the transaction record for any modification to a rdsc record must be located on that range, and the intent is resolved synchronously with the commit/abort of the transaction.


pkg/kv/kvserver/kvstorage/init.go line 342 at r1 (raw file):

Previously, andrewbaptist (Andrew Baptist) wrote…

Couldn't we call
iter.SeekGE(storage.MVCCKey{Key: nextRangeDescKey}) instead of NextKey.

This probably doesn't normally matter in the scheme of things but might be important if we had a large table delete and had not run an MVCCGC yet.

We don't have a nextRangeDescKey here.


pkg/kv/kvserver/kvstorage/init.go line 357 at r1 (raw file):

Previously, andrewbaptist (Andrew Baptist) wrote…

I see this check in the previous commit as well, but what is the rationale for failing on an uninitialized replica? I don't fully understand the lifecycle here: all replicas pass through this state, but maybe the reason is that we never persist them to disk? This is mostly for my understanding of persisted RangeDescriptors, but consider adding a comment explaining why this is an assertion error, or making the message cleaner.

This is the descriptor we just unmarshalled. We should never be writing out an uninitialized descriptor (hence the assertion error).


pkg/kv/kvserver/kvstorage/init.go line 377 at r1 (raw file):

Previously, andrewbaptist (Andrew Baptist) wrote…

It would be cleaner to delete the old code rather than leaving it here.

Done, I left it accidentally.

@RaduBerinde RaduBerinde requested a review from bdarnell May 14, 2024 17:34
@RaduBerinde RaduBerinde force-pushed the iter-range-desc branch 3 times, most recently from 484afa0 to 06a268d on May 15, 2024 03:23
Member

@bdarnell bdarnell left a comment


Reviewed 1 of 7 files at r1, 6 of 6 files at r3, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andrewbaptist, @jbowens, @nvanbenschoten, and @RaduBerinde)


pkg/kv/kvserver/kvstorage/init.go line 316 at r1 (raw file):

but note that the handling would be more complicated otherwise

Not just more complicated, but it would no longer be possible to resolve locally. We'd have to look up the transaction record, which may be on another node, and until then we couldn't be sure whether we should be using the RangeDescriptor value from the intent or the last committed value. This ambiguity can have cascading effects: if a split or merge was left uncommitted, it needs to be resolved before we can use the neighboring range as well. We can't bring a cluster up in the face of this ambiguity (I can't remember whether this is definitely impossible or just something we'd rather not deal with), so we jump through hoops to ensure that intents on range descriptors can be resolved locally, and we iterate inconsistently here.


pkg/kv/kvserver/kvstorage/init.go line 301 at r3 (raw file):

				iter.SeekGE(storage.MVCCKey{Key: keys.RangeDescriptorKey(keys.MustAddr(startKey))})
			} else {
				// This case should be rare in practice: we have a key that isn't

Can it ever happen? Should we log a warning (or panic?) if we see one?

Member Author

@RaduBerinde RaduBerinde left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andrewbaptist, @bdarnell, @jbowens, and @nvanbenschoten)


pkg/kv/kvserver/kvstorage/init.go line 301 at r3 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Can it ever happen? Should we log a warning (or panic?) if we see one?

I am not sure exactly how things happen when we move a range to another node. We have to update the descriptor and range-delete these keys; do these operations happen together atomically?

If we think it shouldn't happen, I can add a warning and panic in crdb_test mode.

Member

@bdarnell bdarnell left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andrewbaptist, @jbowens, @nvanbenschoten, and @RaduBerinde)


pkg/kv/kvserver/kvstorage/init.go line 301 at r3 (raw file):

Previously, RaduBerinde wrote…

I am not sure exactly how things happen when we move a range to another node. We have to update the descriptor and range-delete these keys, do these operations happen together atomically?

If we think it shouldn't happen, I can add a warning and panic in crdb_test mode.

I would think it would all be atomic but I can't say for sure. If it's not atomic I don't think there's anything else that would asynchronously clean it up so we'd have some kind of leak. I'd be comfortable with at least putting in something like a panic in debug/test mode.

Member Author

@RaduBerinde RaduBerinde left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andrewbaptist, @bdarnell, @jbowens, and @nvanbenschoten)


pkg/kv/kvserver/kvstorage/init.go line 301 at r3 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

I would think it would all be atomic but I can't say for sure. If it's not atomic I don't think there's anything else that would asynchronously clean it up so we'd have some kind of leak. I'd be comfortable with at least putting in something like a panic in debug/test mode.

Done.

On startup we currently scan through all `/Local/Range/` keys in order
to find the range descriptors. This involves going through all
`/Local/Range/Transaction` records and tombstones so we end up reading
through a lot of data.

This change switches to using `SeekGE`:
 - when we encounter a key that has a suffix before `rdsc` like
   `/Local/Range/foo/prbe`, we seek directly to the corresponding
   descriptor key `/Local/Range/foo/rdsc`;
 - after we decode a descriptor, we seek straight to the next possible
   range descriptor `/Local/Range/EndKey/rdsc` using the `EndKey` in
   the descriptor.

Note that inside Pebble, seeks to keys that are very close to the
current position are optimized to try to iterate through the next few
keys before doing a real seek. So this change should not be
detrimental even when there aren't a lot of keys to skip.

I experimented on a store captured after running kv0 with batch size 2
against a single node. Before this change we step through 353 keys but
the iterator has to internally step through 160k tombstones:
```
range descriptor iteration done: 339 keys, 65 range descriptors (by suffix: map[qlpt:81 rdsc:65 txn-:193]);
  scan stats: stepped 353 times (169.743k internal); seeked 2 times (2 internal)...
```

After, we seek around the transaction records and end up doing just a
seek per range:
```
range descriptor iteration done: 65 range descriptors, 0 intents, 0 tombstones
+‹(interface (dir, seek, step): (fwd, 67, 0), (rev, 0, 0)), (internal (dir, seek, step): (fwd, 67, 0), (rev, 0, 0)),›
+‹(internal-stats: (block-bytes: (total 563KB, cached 26KB, read-time 755.128µs))...
```

Fixes: cockroachdb#109740
@RaduBerinde
Member Author

CI failure is #124197

bors r+

@craig craig bot merged commit 91fac21 into cockroachdb:master May 17, 2024
21 of 22 checks passed
@RaduBerinde RaduBerinde deleted the iter-range-desc branch May 17, 2024 15:21
@RaduBerinde
Member Author

How do folks feel about backporting this to 24.1.x (after it bakes on master for a while)?

Successfully merging this pull request may close these issues.

kvstorage: improve range descriptor iteration on startup