SnapshotTableProcedure to migrate iceberg tables from one namespace to another #10262

Open
Gowthami03B opened this issue May 2, 2024 · 0 comments
Labels
improvement PR that improves existing functionality

Feature Request / Improvement

Hello

The current snapshot procedure (https://iceberg.apache.org/docs/nightly/spark-procedures/?h=spark_catalog#snapshot) appears to support only migrating external Hive tables to Iceberg tables.

But we have a use case where we want to migrate some of our tables from one namespace to another and later run 'alter schema' operations (which are metadata-only). The snapshot procedure would have worked perfectly for this, since it reuses the underlying data files while keeping the new table's metadata in a new location. The rest of the tables in the old namespace will have to be backfilled because we have major changes, but we would avoid a lot of effort and storage space (we are talking TBs here) if we could use the snapshot procedure for the others.
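For reference, a sketch of the documented usage of the procedure, which snapshots a session-catalog Hive table into an Iceberg table (table names and location here are hypothetical placeholders, not our actual tables):

```python
# Sketch of the documented snapshot call (hypothetical names/paths):
# it creates an Iceberg table that reuses the source table's data files.
snapshot_sql = """
    CALL my_catalog.system.snapshot(
        source_table => 'db.sample',
        table => 'db.sample_snap',
        location => 's3a://abc/snapshots/sample'
    )
"""
# spark.sql(snapshot_sql)  # requires a live Spark session with Iceberg configured
```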

spark_jdbc_config = {
    "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    "spark.sql.catalog.my_catalog": "org.apache.iceberg.spark.SparkCatalog",
    "spark.sql.catalog.my_catalog.catalog-impl": "org.apache.iceberg.jdbc.JdbcCatalog",
    "spark.sql.catalog.my_catalog.uri": "jdbc:comdb2://",
    "spark.sql.catalog.my_catalog.warehouse": "s3a://abc",
}

spark.sql(
    """
    CALL my_catalog.system.snapshot(
        source_table => 'ns1.src_dataset',
        table => 'ns2.src_dataset',
        location => 's3a://abc'
    )
    """
)
# Fails with:
# SparkConnectGrpcException: (org.apache.iceberg.exceptions.NoSuchTableException)
# Cannot not find source table 'datasets.equitynamr'

my_catalog here is the JDBC catalog that holds both namespaces (ns1, ns2) and all of our tables.

When I provide source_table as a fully qualified name (my_catalog.ns1.src_dataset), I get: IllegalArgumentException: Cannot snapshot a table that isn't in the session catalog (i.e. spark_catalog). Found source catalog: test.

I also tried explicitly registering a catalog entry for 'spark_catalog', and that resulted in: IllegalArgumentException: Cannot use non-v1 table 'ns1.src_datasets' as a source.
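For context, that spark_catalog attempt was configured roughly as follows. This is a hypothetical sketch based on Iceberg's SparkSessionCatalog, which wraps the Spark session catalog and delegates to a backing catalog implementation; the uri and warehouse values simply mirror the JDBC config above:

```python
# Hypothetical sketch: register the Spark session catalog as an Iceberg
# SparkSessionCatalog backed by the same JDBC catalog (values assumed).
session_catalog_config = {
    "spark.sql.catalog.spark_catalog": "org.apache.iceberg.spark.SparkSessionCatalog",
    "spark.sql.catalog.spark_catalog.catalog-impl": "org.apache.iceberg.jdbc.JdbcCatalog",
    "spark.sql.catalog.spark_catalog.uri": "jdbc:comdb2://",
    "spark.sql.catalog.spark_catalog.warehouse": "s3a://abc",
}
```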

Is there any workaround to achieve my use case? Does this seem like a valid request that can be accommodated?

I also explored the add_files procedure, but it currently accepts only a prefix of the S3 path to the source table's data-file location, not a list of file paths from the current snapshot's data files. It would be more helpful if it could add only the files that are part of the current snapshot.
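For comparison, a sketch of how add_files is invoked today (table names and path are hypothetical): the path-based source_table imports every file found under the prefix, rather than the current snapshot's file list.

```python
# Sketch of the add_files procedure as documented (hypothetical table/path):
# source_table is a path-based table covering a whole data-file prefix.
add_files_sql = """
    CALL my_catalog.system.add_files(
        table => 'ns2.src_dataset',
        source_table => '`parquet`.`s3a://abc/ns1/src_dataset/data`'
    )
"""
# spark.sql(add_files_sql)  # requires a live Spark session
```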

Query engine

spark
