Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[iceberg] Tag columns with "partition key" in DESCRIBE and SHOW COLUMNS output #22675

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

imjalpreet
Copy link
Member

@imjalpreet imjalpreet commented May 6, 2024

Description

After the change:
Screenshot 2024-05-06 at 2 02 50 PM

Motivation and Context

Fixes #22638

Contributor checklist

  • Please make sure your submission complies with our development, formatting, commit message, and attribution guidelines.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.

Release Notes

== NO RELEASE NOTE ==

@imjalpreet imjalpreet requested review from hantangwangd and a team as code owners May 6, 2024 08:34
@imjalpreet imjalpreet requested a review from presto-oss May 6, 2024 08:34
Copy link
Member

@hantangwangd hantangwangd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix, a little question for discussing. Do you think we should also show the non-identity partition transform? For example when I create an Iceberg table as follows:

create table test_table(a int, b varchar, c timestamp) with (partitioning = ARRAY['a', 'truncate(b, 2)', 'year(c)']);

Then the show create table would show all the partition transforms:

presto:default> show create table test_table;
                             Create Table                              
-----------------------------------------------------------------------
 CREATE TABLE iceberg.default.test_table (                                    
    "a" integer,                                                       
    "b" varchar,                                                       
    "c" timestamp                                                      
 )                                                                     
 WITH (                                                                
    delete_mode = 'merge-on-read',                                     
    format = 'PARQUET',                                                
    format_version = '2',                                              
    location = 'file:/Users/wangd/work/data/iceberg/data/default/test_table', 
    partitioning = ARRAY['a','truncate(b, 2)','year(c)']               
 )          

Maybe it's better to keep desc be consistent with show create table, what's your opinion?

Copy link
Contributor

@elharo elharo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

needs tests

@imjalpreet
Copy link
Member Author

Do you think we should also show the non-identity partition transform?

@hantangwangd I think that's a valid ask. What would you suggest we should mention in the Extra info in this case? Should we mention the transformation being used for the hidden partitioning for the respective columns?

@hantangwangd
Copy link
Member

@imjalpreet For non-identity partition column, I think show the transform information would be enough, maybe something like partition by '<transform.toString()>'. So that it would be shown as follows:

presto:default> desc test_table;
 Column |   Type    |           Extra            | Comment 
--------+-----------+----------------------------+---------
 a      | integer   | partition key              |         
 b      | varchar   | partition by 'truncate[2]' |         
 c      | timestamp | partition by 'year'        |         
(3 rows)

presto:default> show columns in test_table;
 Column |   Type    |           Extra            | Comment 
--------+-----------+----------------------------+---------
 a      | integer   | partition key              |         
 b      | varchar   | partition by 'truncate[2]' |         
 c      | timestamp | partition by 'year'        |         
(3 rows)

Is that ok? Or do you have a better idea?

@imjalpreet
Copy link
Member Author

@hantangwangd let's say for a date/timestamp column we have two hidden partition transforms year and month. What would be the best way to display that? Should we write partition by year, month or is there a better way to communicate that there are two hidden partition transforms on this column?

@hantangwangd
Copy link
Member

@imjalpreet Thanks for providing this great question, so that we can discuss and handle it. Yes, Iceberg allows create multiple transforms on a column. After a careful check in the spec and the implementation, I found the follow details:

  1. Multiple transforms on the same column has a explicit sequence, and Iceberg use the same sequence to organize the partitioned data files.
  2. Year/Month/Day/Hour belong to the same kind of transform, so that they can not be created on the same column.
  3. The column types that could be applied with Year/Month/Day/Hour is disjoint with the columns types that could be applied with Truncate.

So we can get some conclusions:

  1. Transforms after Identity transform is somewhat meaningless, as they would always get the same value, so maybe we should disallow it explicitly?
  2. The most complex definition is something like ARRAY['truncate(a, 2)', 'bucket(a, 16)', 'a'] or ARRAY['year(b)', 'bucket(b, 16)', 'b'], including three different kind of transforms.

So do you think the following shown examples is reasonable?

partition key			//This is the most common scenario
partition by year
partition by hour
partition by month, identity
partition by truncate(2)
partition by truncate(2), bucket(16)
partition by bucket(16)
partition by bucket(16), month
partition by truncate(2), bucket(16), identity
partition by year, bucket(16), identity
......

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Iceberg tables do not show partition key in DESCRIBE
3 participants