Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql: refresh stats for multi-tenant system database conversions #123905

Merged
merged 1 commit into from May 15, 2024

Conversation

fqazi
Copy link
Collaborator

@fqazi fqazi commented May 9, 2024

Previously, when converting the system database to multiregion its possible for table statistics to contain the existing type of crdb_region as bytes. This could happen if automatic statistics collection happened concurrently with the conversion to a multi-region system database. The conversion had logic to clear table statistics, but it was still possible for statistics collection to happen in between. This could cause queries against RBR system tables to fail because, since the table_statistics type information no longer matches with the table descriptor after. We started seeing this for the system database inside TestMrSystemDatabase, once conversion was added for the system tenant. To address this, this patch first adds extra logic in the schema changer to force a refresh of stats on system tables, which will force a refresh of statistics after the schema change, in case a stats refresh occurs before the job completes. We also modify the TestMrSystemDatabase to intentionally generate stats before changing the system database under the system tenant to avoid the risk of hitting this issue. With these changes we expect the test to no longer flake and any real world occurrence to be less transient.

Fixes: #122790

Release note: None

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@fqazi fqazi force-pushed the TestMrSystemDatabaseStatsFlake branch from 496b45a to 7c040ec Compare May 9, 2024 18:57
@fqazi fqazi marked this pull request as ready for review May 9, 2024 19:49
@fqazi fqazi requested a review from a team as a code owner May 9, 2024 19:49
Copy link
Collaborator

@rafiss rafiss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @fqazi)


pkg/sql/schema_changer.go line 920 at r1 (raw file):

	}

	if sc.mutationID == descpb.InvalidMutationID {

just trying to understand the code better: why is sc.mutationID invalid in this case?


pkg/sql/schema_changer.go line 955 at r1 (raw file):

	defer func() {
		if err := waitToUpdateLeases(err == nil /* refreshStats */); err != nil && !errors.Is(err, catalog.ErrDescriptorNotFound) {

this is unrelated to your PR, but this looks like a bug. instead of err == nil should this be retErr == nil, and make retErr and named return parameter of this function?

Previously, when converting the system database to multiregion its
possible for table statistics to contain the existing type of
crdb_region as bytes. This could happen if automatic statistics
collection happened concurrently with the conversion to a multi-region
system database. The conversion had logic to clear table statistics, but
it was still possible for statistics collection to happen in between.
This could cause queries against RBR system tables to fail because,
since the table_statistics type information no longer matches with the table
descriptor after. We started seeing this for the system database inside
TestMrSystemDatabase, once conversion was added for the system tenant.
To address this, this patch first adds extra logic in the schema changer
to force a refresh of stats on system tables, which will force a refresh
of statistics after the schema change, in case a stats refresh occurs
before the job completes. We also modify the TestMrSystemDatabase to
intentionally generate stats before changing the system database under
the system tenant to avoid the risk of hitting this issue. With these
changes we expect the test to no longer flake and any real world
occurrence to be less transient.

Fixes: cockroachdb#122790

Release note: None
Copy link
Collaborator Author

@fqazi fqazi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @rafiss)


pkg/sql/schema_changer.go line 920 at r1 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

just trying to understand the code better: why is sc.mutationID invalid in this case?

So, the way we change column types for the region column is that we directly modify the descriptor, without any mutation. So, the job is created without any mutationID assigned. The only thing the job ends up doing is waiting for new version to be adopted in this case.


pkg/sql/schema_changer.go line 955 at r1 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

this is unrelated to your PR, but this looks like a bug. instead of err == nil should this be retErr == nil, and make retErr and named return parameter of this function?

Done.

Nice catch!

@fqazi fqazi requested a review from rafiss May 14, 2024 17:23
@fqazi fqazi force-pushed the TestMrSystemDatabaseStatsFlake branch from 7c040ec to dd4ea19 Compare May 14, 2024 17:23
@fqazi
Copy link
Collaborator Author

fqazi commented May 15, 2024

@rafiss TFTR!

bors r+

@craig craig bot merged commit 6300c3c into cockroachdb:master May 15, 2024
22 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ccl/multiregionccl: TestMrSystemDatabase failed
3 participants