
Spark Action to Analyze table #10288

Open · wants to merge 2 commits into base: main

Conversation

karuppayya
Contributor

This change adds a Spark action to analyze tables.
As part of the analysis, the action generates an Apache DataSketches sketch for NDV (number of distinct values) stats and writes it as a Puffin file.
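For readers unfamiliar with the sketch the action writes: a theta sketch estimates NDV from a bounded set of hashed values rather than by counting exactly. Below is a minimal stdlib-only illustration of the underlying idea using a KMV (k-minimum-values) estimator; it is a hedged sketch of the technique, not the DataSketches implementation the PR actually uses.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.TreeSet;

/**
 * KMV (k-minimum-values) distinct-count estimator: hash each value to a
 * uniform point in [0, 1) and keep only the k smallest hashes. If the k-th
 * smallest hash is h, roughly k / h distinct values were seen.
 */
class KmvSketch {
  private final int k;
  private final TreeSet<Double> minHashes = new TreeSet<>();

  KmvSketch(int k) {
    this.k = k;
  }

  void update(String value) {
    try {
      byte[] d = MessageDigest.getInstance("MD5").digest(value.getBytes(StandardCharsets.UTF_8));
      long bits = 0;
      for (int i = 0; i < 8; i++) {
        bits = (bits << 8) | (d[i] & 0xffL);
      }
      double h = (bits >>> 11) * 0x1.0p-53; // uniform in [0, 1)
      minHashes.add(h);                     // TreeSet dedupes repeated values
      if (minHashes.size() > k) {
        minHashes.pollLast();               // keep only the k smallest hashes
      }
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
  }

  double estimate() {
    if (minHashes.size() < k) {
      return minHashes.size();              // fewer than k distinct: exact
    }
    return (k - 1) / minHashes.last();      // k-th smallest hash ~ k / NDV
  }
}
```

The key property for this PR: the sketch is a small, fixed-size summary of the column, so it can be serialized into a Puffin blob regardless of table size.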

@karuppayya
Contributor Author

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/** Computes the statistic of the given columns and stores it as Puffin files. */
Member


AnalyzeTableSparkAction is a generic name; I see that in the future we want to compute partition stats too, which may not be written as Puffin files.

Either we can change the naming to computeNDVSketches, or make it generic so that any kind of stats can be computed from this action.

Member


Thinking more on this, I think we should just call it computeNDVSketches and not mix it with partition stats.

Contributor Author


I tried to follow the model of RDBMSes and engines like Trino, which use ANALYZE TABLE <tblName> to collect all table-level stats.
With a procedure-per-stat model, the user has to invoke a procedure/action for every stat, and
whenever a new stat is added, the user needs to update their code to call the new procedure/action.

> not mix it with partition stats.

I think we could have partition stats as a separate action, since it is per partition, whereas this procedure can collect top-level table stats.

spark(), table, columnsToBeAnalyzed.toArray(new String[0]));
table
.updateStatistics()
.setStatistics(table.currentSnapshot().snapshotId(), statisticsFile)
Member


What if the table's current snapshot was modified concurrently by another client between lines 117 and 120?
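For context on this concern: Iceberg table commits are optimistic, so the usual remedy is to re-read the current metadata and rebuild the change when a concurrent writer wins. A minimal stdlib sketch of that retry pattern, using an AtomicReference as a hypothetical stand-in for table metadata (not the Iceberg commit API):

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.UnaryOperator;

/**
 * Optimistic-concurrency retry loop: read the current state, compute an
 * update against it, and retry from scratch if another writer committed
 * in between. This mirrors the read/validate/retry shape of an optimistic
 * table commit.
 */
class OptimisticCommit {
  static <T> T commit(AtomicReference<T> state, UnaryOperator<T> update) {
    while (true) {
      T base = state.get();        // snapshot the current metadata
      T next = update.apply(base); // build the change against that base
      if (state.compareAndSet(base, next)) {
        return next;               // no concurrent change: commit succeeds
      }
      // another client committed first: loop, re-read, and rebuild
    }
  }
}
```

Applied to the quoted hunk, the fix would be to resolve the snapshot id and attach the statistics file inside one such retried commit rather than reading the snapshot once up front.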


public static Iterator<Tuple2<String, ThetaSketchJavaSerializable>> computeNDVSketches(
SparkSession spark, String tableName, String... columns) {
String sql = String.format("select %s from %s", String.join(",", columns), tableName);
Member


I think we should also think about incremental updates, i.e. updating the sketches from the previous checkpoint. Querying the whole table may not be efficient.

Contributor Author


Yes, incremental updates need to be wired into the write paths.
This procedure could exist in parallel and collect stats for the whole table on demand.
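One property that makes the incremental path feasible: theta sketches are mergeable, so a sketch kept per checkpoint can be unioned with a sketch of only the newly written data instead of rescanning the table. A minimal stdlib illustration of that mergeability in KMV terms (the real action would use the DataSketches union operation):

```java
import java.util.TreeSet;

class SketchUnion {
  /**
   * Union two KMV-style sketches, each represented as the set of the k
   * smallest hashes it has seen: merge the sets and keep the k smallest
   * again. The result is exactly the sketch of the combined input, which
   * is why per-checkpoint sketches can be maintained incrementally and
   * merged on demand.
   */
  static TreeSet<Double> union(TreeSet<Double> a, TreeSet<Double> b, int k) {
    TreeSet<Double> merged = new TreeSet<>(a);
    merged.addAll(b);                 // duplicates collapse automatically
    while (merged.size() > k) {
      merged.pollLast();              // re-trim to the k smallest hashes
    }
    return merged;
  }
}
```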

assumeTrue(catalogName.equals("spark_catalog"));
sql(
"CREATE TABLE %s (id int, data string) USING iceberg TBLPROPERTIES"
+ "('format-version'='2')",
Member


The default format version is v2 now, so specifying it again is redundant.

String path = operations.metadataFileLocation(String.format("%s.stats", UUID.randomUUID()));
OutputFile outputFile = fileIO.newOutputFile(path);
try (PuffinWriter writer =
Puffin.write(outputFile).createdBy("Spark DistinctCountProcedure").build()) {
Member


I like this name instead of "analyze table procedure".

@ajantha-bhat
Member

There was an old PR for the same: #6582

@huaxingao
Contributor

> there was an old PR on the same: #6582

I don't have time to work on this, so @karuppayya will take over. Thanks a lot, @karuppayya, for continuing the work.
