SnapshotTableProcedure to migrate iceberg tables from one namespace to another #10262

Open
Gowthami03B opened this issue May 2, 2024 · 0 comments
Labels
improvement PR that improves existing functionality

Feature Request / Improvement

Hello

The current snapshot procedure (https://iceberg.apache.org/docs/nightly/spark-procedures/?h=spark_catalog#snapshot) appears to support only migrating external Hive tables to Iceberg tables.

But we have a use case where we want to migrate some of our tables from one namespace to another and later run 'alter schema' operations (which are metadata-only). The snapshot procedure would have worked perfectly for this, since it reuses the underlying data files while keeping the new table's metadata in a new location. The rest of the tables in the old namespace will have to be backfilled because we have major changes, but we would avoid a lot of effort and storage space (we are talking TBs here) if we could use the snapshot procedure for the others.
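For reference, a sketch of the documented usage of the procedure, which snapshots a session-catalog Hive table into an Iceberg table (table names and location here are hypothetical placeholders, not our actual tables):

```python
# Sketch of the documented snapshot call (hypothetical names/paths):
# it creates an Iceberg table that reuses the source table's data files.
snapshot_sql = """
    CALL my_catalog.system.snapshot(
        source_table => 'db.sample',
        table => 'db.sample_snap',
        location => 's3a://abc/snapshots/sample'
    )
"""
# spark.sql(snapshot_sql)  # requires a live Spark session with Iceberg configured
```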

spark_jdbc_config = {
    "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    "spark.sql.catalog.my_catalog": "org.apache.iceberg.spark.SparkCatalog",
    "spark.sql.catalog.my_catalog.catalog-impl": "org.apache.iceberg.jdbc.JdbcCatalog",
    "spark.sql.catalog.my_catalog.uri": "jdbc:comdb2://",
    "spark.sql.catalog.my_catalog.warehouse": "s3a://abc",
}

spark.sql(
    """
    CALL my_catalog.system.snapshot(
        source_table => 'ns1.src_dataset',
        table => 'ns2.src_dataset',
        location => 's3a://abc'
    )
    """
)
# Fails with:
# SparkConnectGrpcException: (org.apache.iceberg.exceptions.NoSuchTableException)
# Cannot not find source table 'datasets.equitynamr'

my_catalog here is the JDBC catalog that holds both namespaces (ns1, ns2) and all of our tables.

When I provide source_table as a fully qualified name (my_catalog.ns1.src_dataset), I get: IllegalArgumentException: Cannot snapshot a table that isn't in the session catalog (i.e. spark_catalog). Found source catalog: test.

I also tried explicitly registering a catalog entry for 'spark_catalog', and that resulted in: IllegalArgumentException: Cannot use non-v1 table 'ns1.src_datasets' as a source.
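For context, that spark_catalog attempt was configured roughly as follows. This is a hypothetical sketch based on Iceberg's SparkSessionCatalog, which wraps the Spark session catalog and delegates to a backing catalog implementation; the uri and warehouse values simply mirror the JDBC config above:

```python
# Hypothetical sketch: register the Spark session catalog as an Iceberg
# SparkSessionCatalog backed by the same JDBC catalog (values assumed).
session_catalog_config = {
    "spark.sql.catalog.spark_catalog": "org.apache.iceberg.spark.SparkSessionCatalog",
    "spark.sql.catalog.spark_catalog.catalog-impl": "org.apache.iceberg.jdbc.JdbcCatalog",
    "spark.sql.catalog.spark_catalog.uri": "jdbc:comdb2://",
    "spark.sql.catalog.spark_catalog.warehouse": "s3a://abc",
}
```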

Is there any workaround to achieve my use case? Does this seem like a valid request that can be accommodated?

I also explored the add_files procedure, but it currently accepts only a prefix of the S3 path to the source table's data-file location, not a list of file paths from the current snapshot's data files. It would be more helpful if it could add only the files that are part of the current snapshot.
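For comparison, a sketch of how add_files is invoked today (table names and path are hypothetical): the path-based source_table imports every file found under the prefix, rather than the current snapshot's file list.

```python
# Sketch of the add_files procedure as documented (hypothetical table/path):
# source_table is a path-based table covering a whole data-file prefix.
add_files_sql = """
    CALL my_catalog.system.add_files(
        table => 'ns2.src_dataset',
        source_table => '`parquet`.`s3a://abc/ns1/src_dataset/data`'
    )
"""
# spark.sql(add_files_sql)  # requires a live Spark session
```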

Query engine

spark
