Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Spark 3.5: Only traverse ancestors of current snapshot when building changelog scan #10252

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

manuzhang
Copy link
Contributor

This fixes #10247

@manuzhang manuzhang changed the title Spark 3.5: Skip rolled back snapshot when building changelog scan Spark 3.5: Only traverse ancestors of current snapshot when building changelog scan Apr 30, 2024
@manuzhang
Copy link
Contributor Author

@flyrain @aokolnychyi please help review

@manuzhang
Copy link
Contributor Author

Gentle ping @flyrain @aokolnychyi

Copy link
Contributor

@flyrain flyrain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix. @manuzhang. It looks good to me overall. Left minor comments.

changelogRecords(null, rightAfterSnap2));

assertEquals(
"Should have expected changed rows from snapshot 2 and 3",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should not include change rows from snapshot 2. The result is correct, but the message is bit misleading. How about something like this?

Should have expected changed rows from snapshot 3 only since snapshot 2 is in a different branch.

ImmutableList.of(
row(1, "a", "DELETE", 0, snap3.snapshotId()),
row(-2, "a", "INSERT", 0, snap3.snapshotId())),
changelogRecords(rightAfterSnap2, null));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add more cases by rollbacking to snapshot 2? We want to test it doesn't pick up the latest snapshot 3 when it is not in the main branch.

Comment on lines +596 to +606
if (current.timestampMillis() <= endTimestamp) {
snapshotId = current.snapshotId();
} else {
for (Snapshot ancestor : SnapshotUtil.currentAncestors(table)) {
if (ancestor.timestampMillis() <= endTimestamp) {
snapshotId = ancestor.snapshotId();
break;
}
}
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SnapshotUtil.currentAncestors(table) includes the current snapshot as well. We could simplify the logic a bit.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also I think we can move this method to the class SnapshotUtil. It could be useful for other scans as well. For example, I'm not sure if time travel query like this has the similar bug. We can double check on that, but it's a blocker for this PR.

SELECT * FROM prod.db.table TIMESTAMP AS OF '1986-10-26 01:21:00';

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Spark: CDC does not respect when the table is rolled back.
2 participants