
Using the Iceberg catalog in your file system #10326

Open
911432 opened this issue May 13, 2024 · 5 comments
Labels
improvement PR that improves existing functionality

Comments

@911432
Contributor

911432 commented May 13, 2024

Feature Request / Improvement

Just as we can now store our iceberg catalog in HDFS, we also want to store it in other file systems such as S3.
You can then quickly configure it as a container image, including query engines and storage.

Query engine

None

@911432 911432 added the improvement PR that improves existing functionality label May 13, 2024
@nastra
Contributor

nastra commented May 13, 2024

@911432 can you please elaborate on what the goal here is? Everything you described is already possible today.

@911432
Contributor Author

911432 commented May 13, 2024

I would like to run the query engine as a container image and store the Iceberg table and Iceberg catalog in a file system.
Let's take this Spark page as an example.
The configuration below is already valid.

spark.sql.catalog.hadoop_prod = org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.hadoop_prod.type = hadoop
spark.sql.catalog.hadoop_prod.warehouse = hdfs://nn:8020/warehouse/path

I wish I could use the configuration below as well.

spark.sql.catalog.s3 = org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.s3.type = s3
spark.sql.catalog.s3.warehouse = s3://nn:8020/warehouse/path
spark.sql.catalog.file = org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.file.type = file
spark.sql.catalog.file.warehouse = file://warehouse/path

I think this would make the spark-quickstart page simpler, and it would separate compute from storage more clearly.

@nastra
Contributor

nastra commented May 13, 2024

spark.sql.catalog.<catalogName>.type refers to different catalog implementation types and is not related to the naming of the catalog. It is basically just a shortcut for specifying the fully-qualified catalog implementation via spark.sql.catalog.<catalogName>.catalog-impl=org.apache.iceberg.hadoop.HadoopCatalog.

Available catalog types are:

hadoop -> org.apache.iceberg.hadoop.HadoopCatalog
hive -> org.apache.iceberg.hive.HiveCatalog
rest -> org.apache.iceberg.rest.RESTCatalog
glue -> org.apache.iceberg.aws.glue.GlueCatalog
nessie -> org.apache.iceberg.nessie.NessieCatalog
jdbc -> org.apache.iceberg.jdbc.JdbcCatalog
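
For example (a sketch only; the catalog name s3_prod and the bucket are hypothetical placeholders), a warehouse on S3 can already be configured with the existing hadoop type, provided S3 file system support (e.g. Hadoop's S3A connector) is on the classpath:

```properties
spark.sql.catalog.s3_prod = org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.s3_prod.type = hadoop
spark.sql.catalog.s3_prod.warehouse = s3a://my-bucket/warehouse/path

# Equivalent explicit form (use catalog-impl OR type, not both):
# spark.sql.catalog.s3_prod.catalog-impl = org.apache.iceberg.hadoop.HadoopCatalog
```

In other words, the scheme of the warehouse path is independent of the catalog type; no new s3 or file type is needed for this.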

@911432
Contributor Author

911432 commented May 13, 2024

I know spark.sql.catalog.hadoop_prod.uri doesn't seem to exist.
Similarly, for s3 and file, I hope spark.sql.catalog.<catalogName>.warehouse would be sufficient even without spark.sql.catalog.<catalogName>.uri.

@BsoBird

BsoBird commented May 27, 2024

Hi, I've done some work on fixing hadoop_catalog before.

In my experience, a filesystem-based catalog currently relies on the file system providing an atomic rename operation. Object stores often do not offer atomic operations, so to use a filesystem catalog with object storage you must add middleware that provides atomicity for file system operations.
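
Why atomic rename matters can be illustrated with a small sketch (hypothetical helper, not Iceberg code, assuming a POSIX file system): the commit only succeeds if this writer is the first to create the new metadata version, which a copy-then-delete "rename" on an object store cannot guarantee.

```python
import os
import tempfile


def commit_metadata(warehouse: str, version: int, payload: str) -> bool:
    """Write new metadata, then publish it atomically.

    On HDFS/POSIX, a fail-if-exists publish lets exactly one of several
    concurrent committers win. On S3, "rename" is copy+delete, so two
    writers can both believe they committed the same version.
    """
    final = os.path.join(warehouse, f"v{version}.metadata.json")
    fd, tmp = tempfile.mkstemp(dir=warehouse)
    with os.fdopen(fd, "w") as f:
        f.write(payload)
    try:
        # os.link fails with FileExistsError if `final` already exists,
        # emulating an atomic fail-if-exists rename on POSIX.
        os.link(tmp, final)
        return True
    except FileExistsError:
        return False  # another committer already published this version
    finally:
        os.unlink(tmp)
```

With two committers racing on the same version, only the first call returns True; the second sees the existing file and backs off instead of silently overwriting it.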

In addition, such middleware often provides multiple access protocols, such as HDFS/S3/POSIX. When you access the object store through such a proxy, hadoop_catalog is already sufficient.

Of course, this is just the status quo. I think there is a lot of work that needs to be done if you want to implement the basic functionality of catalog management on an object store that does not have atomic operations. We can discuss this further if you are interested.

But please keep in mind that this is not recommended in the current version.

@911432 @nastra

@911432 Also, I see that you have submitted some PRs for Apache Paimon, and I'm sure you'd like Paimon to have similar functionality, but unfortunately Paimon still has consistency issues with filesystem_catalog on S3. This is all because the object store does not provide atomic operations. If you are interested, you can try it.
