kvstorage: improve range descriptor iteration on startup #109740

Closed
RaduBerinde opened this issue Aug 30, 2023 · 2 comments · Fixed by #123959
Labels
A-storage Relating to our storage engine (Pebble) on-disk storage. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) O-support Originated from a customer P-3 Issues/test failures with no fix SLA T-storage Storage Team

Comments

@RaduBerinde
Member

RaduBerinde commented Aug 30, 2023

Currently IterateRangeDescriptorsFromDisk iterates through all range-local keys:

	// MVCCIterator over all range-local key-based data.
	start := keys.RangeDescriptorKey(roachpb.RKeyMin)
	end := keys.RangeDescriptorKey(roachpb.RKeyMax)

There are several types of range-local keys, identified by suffixes:

	// LocalRangeProbeSuffix is the suffix for keys for probing.
	LocalRangeProbeSuffix = roachpb.RKey("prbe")
	// LocalQueueLastProcessedSuffix is the suffix for replica queue state keys.
	LocalQueueLastProcessedSuffix = roachpb.RKey("qlpt")
	// LocalRangeDescriptorSuffix is the suffix for keys storing
	// range descriptors. The value is a struct of type RangeDescriptor.
	LocalRangeDescriptorSuffix = roachpb.RKey("rdsc")
	// LocalTransactionSuffix specifies the key suffix for
	// transaction records. The additional detail is the transaction id.
	LocalTransactionSuffix = roachpb.RKey("txn-")

The iteration goes through all the keys, including the transaction records. This can be very slow on large stores.

A better strategy would be to do something like this in a loop:

  • seek to the first key
  • replace the suffix with the local range descriptor suffix (rdsc) and seek to that key
  • replace the suffix with \xff and seek to that key

This would require O(1) iterator operations per range.
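
A minimal sketch of that loop in Go, under stated assumptions: `rangeIter` is a simplified stand-in for the engine iterator (`storage.MVCCIterator`), and `splitKey` is a hypothetical helper that splits a range-local key into its range prefix and 4-byte suffix:

```
// Package kvscan sketches the seek-based descriptor scan proposed above.
package kvscan

import "bytes"

// rdscSuffix mirrors LocalRangeDescriptorSuffix.
var rdscSuffix = []byte("rdsc")

// rangeIter is a hypothetical subset of the engine iterator API.
type rangeIter interface {
	SeekGE(key []byte)
	Valid() bool
	Key() []byte
}

// iterateDescriptors visits every range descriptor key using a constant
// number of seeks per range instead of stepping over all range-local keys.
func iterateDescriptors(
	it rangeIter,
	start []byte,
	splitKey func(key []byte) (prefix, suffix []byte),
	visit func(descriptorKey []byte),
) {
	it.SeekGE(start)
	for it.Valid() {
		prefix, suffix := splitKey(it.Key())
		switch bytes.Compare(suffix, rdscSuffix) {
		case -1:
			// Suffix sorts before "rdsc" (e.g. "prbe", "qlpt"): seek
			// directly to this range's descriptor key.
			it.SeekGE(append(append([]byte(nil), prefix...), rdscSuffix...))
		case 0:
			// Found the descriptor; emit it, then skip the rest of this
			// range's keys (e.g. "txn-" records) by seeking past the
			// largest possible suffix.
			visit(it.Key())
			it.SeekGE(append(append([]byte(nil), prefix...), 0xff))
		case 1:
			// Suffix sorts after "rdsc": this range has no descriptor;
			// skip straight to the next range's keys.
			it.SeekGE(append(append([]byte(nil), prefix...), 0xff))
		}
	}
}
```

Each iteration issues at most one `SeekGE`, so a range with thousands of transaction records costs the same as one with none.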

As part of implementing this, we should look at refactoring storage.MVCCIterate. We could replace it with an object that exposes an iterator interface, letting the caller either seek or step key-by-key (with buffering).
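
One possible shape for such an object, as a hedged sketch (this interface is hypothetical, not an existing CockroachDB API):

```
// kvIter could replace the callback-based storage.MVCCIterate, letting
// the caller interleave key-by-key stepping with seeks while the
// implementation still buffers reads a batch at a time.
type kvIter interface {
	// Next returns the next key/value pair, served from an internal
	// buffer that is refilled in batches.
	Next() (key, value []byte, ok bool, err error)
	// SeekGE repositions the iterator at the first key >= target,
	// discarding any buffered pairs that sort before it.
	SeekGE(target []byte) error
	// Close releases the underlying engine iterator.
	Close() error
}
```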

Jira issue: CRDB-31074

@RaduBerinde RaduBerinde added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-storage Relating to our storage engine (Pebble) on-disk storage. T-storage Storage Team labels Aug 30, 2023
@RaduBerinde RaduBerinde self-assigned this Aug 30, 2023
@RaduBerinde RaduBerinde added this to Incoming in Storage via automation Aug 30, 2023
@RaduBerinde RaduBerinde added the O-support Originated from a customer label Sep 5, 2023
@RaduBerinde RaduBerinde moved this from Incoming to Next in Storage Sep 5, 2023
@RaduBerinde RaduBerinde moved this from Next to 23.2 - Encryption At Rest in Storage Sep 5, 2023
@RaduBerinde RaduBerinde moved this from 23.2 - Encryption At Rest to 23.2 - Tactical Wins in Storage Sep 5, 2023
@nicktrav nicktrav moved this from 23.2 - Tactical Wins to 24.1 candidates in Storage Oct 3, 2023
@RaduBerinde RaduBerinde added the P-3 Issues/test failures with no fix SLA label Jan 10, 2024
@RaduBerinde
Member Author

@jbowens @nicktrav we may want to prioritize this as it will help with rolling restarts

@jbowens
Collaborator

jbowens commented May 2, 2024

@andrewbaptist

@nicktrav nicktrav moved this from 24.2 candidates to Next in Storage May 2, 2024
@RaduBerinde RaduBerinde moved this from Next to In Progress (this milestone) in Storage May 10, 2024
RaduBerinde added a commit to RaduBerinde/cockroach that referenced this issue May 10, 2024
On startup we currently scan through all `/Local/Range/` keys in order
to find the range descriptors. In some cases, there can be a lot of
`/Local/Range/Transaction` records so we end up reading through a lot
of data.

This change switches to using `SeekGE`:
 - when we encounter a key that has a suffix before `rdsc` like
   `/Local/Range/foo/prbe`, we seek directly to the corresponding
   descriptor key `/Local/Range/foo/rdsc`;
 - when we encounter a key that has a suffix after `rdsc`, we
   seek beyond all keys for this range, e.g. `/Local/Range/fop/`.

Note that inside Pebble, seeks to keys that are very close to the
current position are optimized to try to iterate through the next few
keys before doing a real seek. So this change should not be very
detrimental even when there aren't a lot of keys to skip.

Fixes: cockroachdb#109740
RaduBerinde added a commit to RaduBerinde/cockroach that referenced this issue May 12, 2024
On startup we currently scan through all `/Local/Range/` keys in order
to find the range descriptors. This involves going through all
`/Local/Range/Transaction` records and tombstones so we end up reading
through a lot of data.

This change switches to using `SeekGE`:
 - when we encounter a key that has a suffix before `rdsc` like
   `/Local/Range/foo/prbe`, we seek directly to the corresponding
   descriptor key `/Local/Range/foo/rdsc`;
 - after we decode a descriptor, we seek straight to the next possible
   range descriptor `/Local/Range/EndKey/rdsc` using the `EndKey` in
   the descriptor.

Note that inside Pebble, seeks to keys that are very close to the
current position are optimized to try to iterate through the next few
keys before doing a real seek. So this change should not be
detrimental even when there aren't a lot of keys to skip.

I experimented on a store captured after running kv0 with batch size 2
against a single node. Before this change we step through 353 keys but
the iterator has to internally step through 160k tombstones:
```
range descriptor iteration done: 339 keys, 65 range descriptors (by suffix: map[qlpt:81 rdsc:65 txn-:193]);
  scan stats: stepped 353 times (169.743k internal); seeked 2 times (2 internal)...
```

After, we seek around the transaction records and end up doing just a
seek per range:
```
range descriptor iteration done: 65 range descriptors, 0 intents, 0 tombstones
+‹(interface (dir, seek, step): (fwd, 67, 0), (rev, 0, 0)), (internal (dir, seek, step): (fwd, 67, 0), (rev, 0, 0)),›
+‹(internal-stats: (block-bytes: (total 563KB, cached 26KB, read-time 755.128µs))...
```

Fixes: cockroachdb#109740
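
A hedged Go sketch of the loop this commit message describes; `descIter`, `splitKey`, `decodeEndKey`, and `descriptorKey` are assumed stand-ins for the engine iterator, key decoding, `RangeDescriptor` decoding, and `keys.RangeDescriptorKey`:

```
package descscan

import "bytes"

var rdsc = []byte("rdsc")

// descIter is a hypothetical subset of the engine iterator API.
type descIter interface {
	SeekGE(key []byte)
	Valid() bool
	Key() []byte
	Value() []byte
}

// scanDescriptors issues roughly one SeekGE per range: keys with a
// suffix before "rdsc" trigger a seek to the descriptor key, and each
// decoded descriptor triggers a seek to /Local/Range/<EndKey>/rdsc,
// skipping that range's transaction records entirely.
func scanDescriptors(
	it descIter,
	start []byte,
	splitKey func(key []byte) (prefix, suffix []byte),
	decodeEndKey func(value []byte) []byte,
	descriptorKey func(rangeStart []byte) []byte,
	visit func(value []byte),
) {
	it.SeekGE(start)
	for it.Valid() {
		prefix, suffix := splitKey(it.Key())
		switch {
		case bytes.Compare(suffix, rdsc) < 0:
			// e.g. /Local/Range/foo/prbe: seek to /Local/Range/foo/rdsc.
			it.SeekGE(append(append([]byte(nil), prefix...), rdsc...))
		case bytes.Equal(suffix, rdsc):
			endKey := decodeEndKey(it.Value())
			visit(it.Value())
			// Jump straight to the next possible descriptor key.
			it.SeekGE(descriptorKey(endKey))
		default:
			// Suffix after "rdsc" and no descriptor seen for this
			// prefix: skip past all of this range's keys.
			it.SeekGE(append(append([]byte(nil), prefix...), 0xff))
		}
	}
}
```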
craig bot pushed a commit that referenced this issue May 17, 2024
123959: kvstorage: speed up scan for range descriptors r=RaduBerinde a=RaduBerinde

On startup we currently scan through all `/Local/Range/` keys in order
to find the range descriptors. In some cases, there can be a lot of
`/Local/Range/Transaction` records so we end up reading through a lot
of data.

This change switches to using `SeekGE`:
 - when we encounter a key that has a suffix before `rdsc` like
   `/Local/Range/foo/prbe`, we seek directly to the corresponding
   descriptor key `/Local/Range/foo/rdsc`;
 - when we encounter a key that has a suffix after `rdsc`, we
   seek beyond all keys for this range, e.g. `/Local/Range/fop/`.

Note that inside Pebble, seeks to keys that are very close to the
current position are optimized to try to iterate through the next few
keys before doing a real seek. So this change should not be very
detrimental even when there aren't a lot of keys to skip.

Fixes: #109740

Co-authored-by: Radu Berinde <radu@cockroachlabs.com>
@craig craig bot closed this as completed in f9202a7 May 17, 2024
Storage automation moved this from In Progress (this milestone) to Done May 17, 2024
miyamo2 added a commit to miyamo2/cockroach that referenced this issue May 17, 2024
aadityasondhi pushed a commit to aadityasondhi/cockroach that referenced this issue May 17, 2024