Geospatial Support #10260

Open
1 of 6 tasks
szehon-ho opened this issue May 2, 2024 · 2 comments
Labels
proposal Iceberg Improvement Proposal (spec/major changes/etc)

Comments

@szehon-ho
Collaborator

szehon-ho commented May 2, 2024

Proposed Change

(This is an abridged version of the proposal document)

Big data open source projects have long been used to store and analyze geospatial data, and a flourishing ecosystem has evolved around them. Examples include GeoParquet for Parquet, Sedona for Spark, GeoMesa for HBase and Cassandra, and native support (completed or in development) in Hive and Trino. Given the central position of the Apache Iceberg table format in the stack, it would be great for Iceberg to support geospatial data natively as well.

There have been implementations of geospatial support in Iceberg (Geolake and Havasu) that show promising results. Unfortunately, because Iceberg lacks extension points, these have taken the form of forks of the project. It would be great to leverage the efforts and findings of these projects when adding native support to Iceberg.

This will add the following to the Iceberg project:

  • Geospatial types (e.g., point, linestring, polygon)
  • Geospatial expressions (st_covers, st_covered_by, st_intersects)
  • Geospatial partition transforms (XZ2)
  • Geospatial sort (hilbert)
  • Spark integration support

This will allow the following use cases:

  • Create a table with geospatial type
    CREATE TABLE geom_table (geom GEOMETRY);
  • Insert geospatial data
    INSERT INTO geom_table VALUES ('POINT(1 2)'), ('LINESTRING(1 2, 3 4)')
  • Query using geospatial predicates:
    SELECT * FROM geom_table WHERE ST_COVERS(geom, ST_POINT(0.5, 0.5))
  • Define a geospatial partition transform to allow partition filtering for geospatial queries
    ALTER TABLE geom_table ADD PARTITION FIELD (xz2(geom))
  • Rewrite using a geospatial sort order to allow file and row-group filtering for geospatial queries
    CALL rewrite_data_files(table => 'geom_table', sort_order => 'hilbert(geom)')
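
To illustrate why a Hilbert sort order helps file filtering, here is a minimal Python sketch. It is not the Iceberg implementation: it assumes points already snapped to an integer grid, and the names `hilbert_index`, `write_files`, and `files_to_scan` are hypothetical helpers introduced only for this example. Points are sorted by their position along a Hilbert curve, split into fixed-size "files", and a query window is then checked against each file's bounding box.

```python
from typing import List, Tuple

Point = Tuple[int, int]
BBox = Tuple[int, int, int, int]  # (min_x, min_y, max_x, max_y)


def hilbert_index(order: int, x: int, y: int) -> int:
    """Distance of grid cell (x, y) along a Hilbert curve over a 2^order x 2^order grid."""
    n = 1 << order
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve has the standard orientation.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s >>= 1
    return d


def bbox(points: List[Point]) -> BBox:
    """Bounding box of a non-empty list of points (stands in for per-file column bounds)."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (min(xs), min(ys), max(xs), max(ys))


def intersects(a: BBox, b: BBox) -> bool:
    """True if two axis-aligned boxes overlap (touching edges count)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def write_files(points: List[Point], order: int, rows_per_file: int) -> List[List[Point]]:
    """Sort points by Hilbert index, then split them into fixed-size 'files'."""
    ordered = sorted(points, key=lambda p: hilbert_index(order, p[0], p[1]))
    return [ordered[i:i + rows_per_file] for i in range(0, len(ordered), rows_per_file)]


def files_to_scan(files: List[List[Point]], query: BBox) -> List[List[Point]]:
    """Keep only files whose bounding box overlaps the query window."""
    return [f for f in files if intersects(bbox(f), query)]


# Usage: an 8x8 grid of points written as 4 files of 16 rows each.
points = [(x, y) for x in range(8) for y in range(8)]
files = write_files(points, order=3, rows_per_file=16)
selected = files_to_scan(files, query=(0, 0, 1, 1))
print(len(selected), "of", len(files), "files need to be scanned")
```

Because consecutive Hilbert indices map to neighboring cells, each file covers a compact region, so its bounding box is tight and a small query window overlaps few files. The same idea applies to row groups within a file.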

Proposal document

https://docs.google.com/document/d/1iVFbrRNEzZl8tDcZC81GFt01QJkLJsI9E2NBOt21IRI

Specifications

  • Table
  • View
  • REST
  • Puffin
  • Encryption
  • Other
@szehon-ho szehon-ho added the proposal Iceberg Improvement Proposal (spec/major changes/etc) label May 2, 2024
@szehon-ho
Collaborator Author

szehon-ho commented May 2, 2024

Note: special thanks to @jiayuasu and @Kontinuation from Wherobots for invaluable domain-specific advice and POC support from the Havasu Iceberg fork and Geolake, and also to @badbye and other members of Geolake for their support.

Also thanks @aokolnychyi and @hsiang-c for reviewing locally.

@jiayuasu
Member

jiayuasu commented May 2, 2024

Looking forward to the feedback from the Iceberg community!
