
Using the Iceberg catalog in your file system #10326

Open
911432 opened this issue May 13, 2024 · 5 comments
Labels
improvement PR that improves existing functionality

Comments

@911432
Contributor

911432 commented May 13, 2024

Feature Request / Improvement

Just as we can now store our iceberg catalog in HDFS, we also want to store it in other file systems such as S3.
You can then quickly configure it as a container image, including query engines and storage.

Query engine

None

@911432 911432 added the improvement PR that improves existing functionality label May 13, 2024
@nastra
Contributor

nastra commented May 13, 2024

@911432 can you please elaborate on what the goal here is? Everything you described is already possible today.

@911432
Contributor Author

911432 commented May 13, 2024

I would like to run the query engine as a container image and store the Iceberg table and Iceberg catalog in a file system.
Let's take this Spark page as an example.
The configuration below is already valid.

spark.sql.catalog.hadoop_prod = org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.hadoop_prod.type = hadoop
spark.sql.catalog.hadoop_prod.warehouse = hdfs://nn:8020/warehouse/path

I wish I could use the configuration below as well.

spark.sql.catalog.s3 = org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.s3.type = s3
spark.sql.catalog.s3.warehouse = s3://nn:8020/warehouse/path
spark.sql.catalog.file = org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.file.type = file
spark.sql.catalog.file.warehouse = file://warehouse/path

I think this would make the spark-quickstart page simpler, and it would separate compute from storage more clearly.

@nastra
Contributor

nastra commented May 13, 2024

spark.sql.catalog.<catalogName>.type refers to different catalog implementation types and is not related to the naming of the catalog. It is basically just a shortcut for specifying the fully-qualified catalog implementation via spark.sql.catalog.<catalogName>.catalog-impl=org.apache.iceberg.hadoop.HadoopCatalog.

Available catalog types are:

hadoop -> org.apache.iceberg.hadoop.HadoopCatalog
hive -> org.apache.iceberg.hive.HiveCatalog
rest -> org.apache.iceberg.rest.RESTCatalog
glue -> org.apache.iceberg.aws.glue.GlueCatalog
nessie -> org.apache.iceberg.nessie.NessieCatalog
jdbc -> org.apache.iceberg.jdbc.JdbcCatalog
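
For example (a sketch only; the catalog name s3_prod and the bucket are hypothetical placeholders), a warehouse on S3 can already be configured with the existing hadoop type, provided S3 file system support (e.g. Hadoop's S3A connector) is on the classpath:

```properties
spark.sql.catalog.s3_prod = org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.s3_prod.type = hadoop
spark.sql.catalog.s3_prod.warehouse = s3a://my-bucket/warehouse/path

# Equivalent explicit form (use catalog-impl OR type, not both):
# spark.sql.catalog.s3_prod.catalog-impl = org.apache.iceberg.hadoop.HadoopCatalog
```

In other words, the scheme of the warehouse path is independent of the catalog type; no new s3 or file type is needed for this.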

@911432
Contributor Author

911432 commented May 13, 2024

I know spark.sql.catalog.hadoop_prod.uri doesn't seem to exist.
Similarly, for s3 and file, I hope spark.sql.catalog.<catalogName>.warehouse would be sufficient even without spark.sql.catalog.<catalogName>.uri.

@BsoBird

BsoBird commented May 27, 2024

Hi, I've done some work on fixing hadoop_catalog before.

In my experience, a filesystem-based catalog currently relies on the file system providing an atomic rename operation. Object stores often do not offer atomic operations, so to use a filesystem catalog with object storage you must add middleware that provides atomicity for file system operations.
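
Why atomic rename matters can be illustrated with a small sketch (hypothetical helper, not Iceberg code, assuming a POSIX file system): the commit only succeeds if this writer is the first to create the new metadata version, which a copy-then-delete "rename" on an object store cannot guarantee.

```python
import os
import tempfile


def commit_metadata(warehouse: str, version: int, payload: str) -> bool:
    """Write new metadata, then publish it atomically.

    On HDFS/POSIX, a fail-if-exists publish lets exactly one of several
    concurrent committers win. On S3, "rename" is copy+delete, so two
    writers can both believe they committed the same version.
    """
    final = os.path.join(warehouse, f"v{version}.metadata.json")
    fd, tmp = tempfile.mkstemp(dir=warehouse)
    with os.fdopen(fd, "w") as f:
        f.write(payload)
    try:
        # os.link fails with FileExistsError if `final` already exists,
        # emulating an atomic fail-if-exists rename on POSIX.
        os.link(tmp, final)
        return True
    except FileExistsError:
        return False  # another committer already published this version
    finally:
        os.unlink(tmp)
```

With two committers racing on the same version, only the first call returns True; the second sees the existing file and backs off instead of silently overwriting it.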

In addition, such middleware often provides multiple access protocols, such as HDFS/S3/POSIX. When you access the object store through such a proxy, hadoop_catalog is already sufficient.

Of course, this is just the status quo. I think there is a lot of work that needs to be done if you want to implement the basic functionality of catalog management on an object store that does not have atomic operations. We can discuss this further if you are interested.

But please keep in mind that this is not recommended in the current version.

@911432 @nastra

@911432 Also, I see that you have submitted some PRs for Apache Paimon, and I'm sure you'd like Paimon to have similar functionality, but unfortunately Paimon still has consistency issues with filesystem_catalog on S3. This is all because the object store does not provide atomic operations. If you are interested, you can try it.
