Spark 3.5: Only traverse ancestors of current snapshot when building changelog scan #10252

manuzhang · 2024-04-30T04:08:38Z

This fixes #10247

manuzhang · 2024-05-01T14:56:48Z

@flyrain @aokolnychyi please help review

manuzhang · 2024-05-22T03:42:45Z

Gentle ping @flyrain @aokolnychyi

flyrain

Thanks for the fix, @manuzhang. It looks good to me overall. I left a few minor comments.

flyrain · 2024-05-26T19:49:14Z

...5/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestChangelogTable.java

+        changelogRecords(null, rightAfterSnap2));
+
+    assertEquals(
+        "Should have expected changed rows from snapshot 2 and 3",


It should not include changed rows from snapshot 2. The result is correct, but the message is a bit misleading. How about something like this?

Should have expected changed rows from snapshot 3 only since snapshot 2 is in a different branch.

flyrain · 2024-05-26T19:51:17Z

...5/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestChangelogTable.java

+        ImmutableList.of(
+            row(1, "a", "DELETE", 0, snap3.snapshotId()),
+            row(-2, "a", "INSERT", 0, snap3.snapshotId())),
+        changelogRecords(rightAfterSnap2, null));


Can we add more cases by rolling back to snapshot 2? We want to test that it doesn't pick up the latest snapshot 3 when that snapshot is not in the main branch.

flyrain · 2024-05-26T20:10:53Z

spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/SparkScanBuilder.java

+      if (current.timestampMillis() <= endTimestamp) {
+        snapshotId = current.snapshotId();
+      } else {
+        for (Snapshot ancestor : SnapshotUtil.currentAncestors(table)) {
+          if (ancestor.timestampMillis() <= endTimestamp) {
+            snapshotId = ancestor.snapshotId();
+            break;
+          }
+        }
+      }
+    }


SnapshotUtil.currentAncestors(table) includes the current snapshot as well. We could simplify the logic a bit.

Also, I think we can move this method to the class SnapshotUtil. It could be useful for other scans as well. For example, I'm not sure whether a time travel query like this has a similar bug. We can double-check that, but it's not a blocker for this PR.

SELECT * FROM prod.db.table TIMESTAMP AS OF '1986-10-26 01:21:00';
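
To make the suggested simplification concrete, here is a minimal, self-contained sketch. It does not use the Iceberg API: `Snap`, `currentAncestors`, and `latestAncestorAsOf` are hypothetical stand-ins, and the only behavior assumed from the real `SnapshotUtil.currentAncestors(table)` is what is noted above, namely that the iteration starts at the current snapshot and then walks its parents. Under that assumption the separate check on `current` becomes unnecessary and a single loop is enough.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class AncestorLookupSketch {
  // Hypothetical stand-in for a table snapshot: id, parent id (null for the root), commit time.
  public static final class Snap {
    final long id;
    final Long parentId;
    final long timestampMillis;

    public Snap(long id, Long parentId, long timestampMillis) {
      this.id = id;
      this.parentId = parentId;
      this.timestampMillis = timestampMillis;
    }
  }

  // Walks from the current snapshot back through its parents, current snapshot first.
  public static List<Snap> currentAncestors(Map<Long, Snap> snapshots, long currentId) {
    List<Snap> ancestors = new ArrayList<>();
    Snap snap = snapshots.get(currentId);
    while (snap != null) {
      ancestors.add(snap);
      snap = snap.parentId == null ? null : snapshots.get(snap.parentId);
    }
    return ancestors;
  }

  // Returns the id of the newest ancestor committed at or before endTimestamp, or null if none.
  // Because the current snapshot is the first element, no special case is needed for it.
  public static Long latestAncestorAsOf(
      Map<Long, Snap> snapshots, long currentId, long endTimestamp) {
    for (Snap ancestor : currentAncestors(snapshots, currentId)) {
      if (ancestor.timestampMillis <= endTimestamp) {
        return ancestor.id;
      }
    }
    return null;
  }

  public static void main(String[] args) {
    Map<Long, Snap> snapshots = new HashMap<>();
    snapshots.put(1L, new Snap(1L, null, 100L));
    snapshots.put(2L, new Snap(2L, 1L, 200L)); // rolled back, not an ancestor of the current snapshot
    snapshots.put(3L, new Snap(3L, 1L, 300L)); // current snapshot, committed after the rollback

    System.out.println(latestAncestorAsOf(snapshots, 3L, 250L)); // 1, snapshot 2 is skipped
    System.out.println(latestAncestorAsOf(snapshots, 3L, 350L)); // 3
    System.out.println(latestAncestorAsOf(snapshots, 3L, 50L)); // null
  }
}
```

The example data mirrors the scenario in the test: snapshot 2 is newer than snapshot 1 but sits outside the current lineage, so an end timestamp that falls right after snapshot 2 resolves to snapshot 1 rather than to the rolled-back snapshot.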

github-actions bot added spark core labels Apr 30, 2024

manuzhang force-pushed the fix-changelog-rollback branch from 13715b0 to 6ccd2d8 Compare April 30, 2024 05:11

Spark 3.5: Only traverse ancestors of current snapshot when building …

debc745

…changelog scan

manuzhang force-pushed the fix-changelog-rollback branch from 6ccd2d8 to debc745 Compare April 30, 2024 08:00

manuzhang changed the title ~~Spark 3.5: Skip rolled back snapshot when building changelog scan~~ Spark 3.5: Only traverse ancestors of current snapshot when building changelog scan Apr 30, 2024

flyrain reviewed May 26, 2024
