[iceberg] Tag columns with "partition key" in DESCRIBE and SHOW COLUMNS output #22675

imjalpreet · 2024-05-06T08:34:30Z

Description

After the change:

Motivation and Context

Contributor checklist

Please make sure your submission complies with our development, formatting, commit message, and attribution guidelines.
PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
Documented new properties (with their default values), SQL syntax, functions, or other functionality.
If release notes are required, they follow the release notes guidelines.
Adequate tests were added if applicable.
CI passed.

Release Notes

== NO RELEASE NOTE ==

hantangwangd

Thanks for the fix. One small question for discussion: do you think we should also show non-identity partition transforms? For example, suppose I create an Iceberg table as follows:

create table test_table(a int, b varchar, c timestamp) with (partitioning = ARRAY['a', 'truncate(b, 2)', 'year(c)']);

Then show create table shows all the partition transforms:

presto:default> show create table test_table;
                             Create Table                              
-----------------------------------------------------------------------
 CREATE TABLE iceberg.default.test_table (                                    
    "a" integer,                                                       
    "b" varchar,                                                       
    "c" timestamp                                                      
 )                                                                     
 WITH (                                                                
    delete_mode = 'merge-on-read',                                     
    format = 'PARQUET',                                                
    format_version = '2',                                              
    location = 'file:/Users/wangd/work/data/iceberg/data/default/test_table', 
    partitioning = ARRAY['a','truncate(b, 2)','year(c)']               
 )

Maybe it's better to keep desc consistent with show create table. What's your opinion?

elharo

needs tests

imjalpreet · 2024-05-07T07:39:34Z

Do you think we should also show the non-identity partition transform?

@hantangwangd I think that's a valid ask. What would you suggest we show in the Extra column in this case? Should we mention the transform used for hidden partitioning on each such column?

hantangwangd · 2024-05-07T09:47:36Z

@imjalpreet For a non-identity partition column, I think showing the transform information would be enough, maybe something like partition by '<transform.toString()>'. It would then be shown as follows:

presto:default> desc test_table;
 Column |   Type    |           Extra            | Comment 
--------+-----------+----------------------------+---------
 a      | integer   | partition key              |         
 b      | varchar   | partition by 'truncate[2]' |         
 c      | timestamp | partition by 'year'        |         
(3 rows)

presto:default> show columns in test_table;
 Column |   Type    |           Extra            | Comment 
--------+-----------+----------------------------+---------
 a      | integer   | partition key              |         
 b      | varchar   | partition by 'truncate[2]' |         
 c      | timestamp | partition by 'year'        |         
(3 rows)

Is that ok? Or do you have a better idea?

imjalpreet · 2024-05-07T10:36:00Z

@hantangwangd Let's say a date/timestamp column has two hidden partition transforms, year and month. What would be the best way to display that? Should we write partition by year, month, or is there a better way to communicate that there are two hidden partition transforms on this column?

hantangwangd · 2024-05-07T12:03:38Z

@imjalpreet Thanks for raising this question so that we can discuss and handle it. Yes, Iceberg allows creating multiple transforms on a column. After a careful check of the spec and the implementation, I found the following details:

Multiple transforms on the same column have an explicit order, and Iceberg uses the same order to organize the partitioned data files.
Year/Month/Day/Hour belong to the same kind of transform, so they cannot be created on the same column.
The column types that Year/Month/Day/Hour can be applied to are disjoint from the column types that Truncate can be applied to.

So we can draw some conclusions:

Transforms after an Identity transform are somewhat meaningless, as they would always produce the same value within a partition, so maybe we should disallow them explicitly?
The most complex definition is something like ARRAY['truncate(a, 2)', 'bucket(a, 16)', 'a'] or ARRAY['year(b)', 'bucket(b, 16)', 'b'], which includes three different kinds of transforms.
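As a rough illustration of the first conclusion, a check for transforms that follow an identity transform on the same column might look like the sketch below. Everything here is an assumption made for illustration: the class `RedundantTransformCheck`, its method, and the representation of a partition spec as ordered `{column, transform}` string pairs are hypothetical and are not taken from Presto or Iceberg.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical check, not part of Presto or Iceberg.
final class RedundantTransformCheck
{
    private RedundantTransformCheck() {}

    // Each spec entry is {column, transform}, listed in partition-spec order.
    // Returns a description of every entry that comes after an identity
    // transform on the same column.
    static List<String> findTransformsAfterIdentity(List<String[]> spec)
    {
        Set<String> identityColumns = new HashSet<>();
        List<String> redundant = new ArrayList<>();
        for (String[] entry : spec) {
            String column = entry[0];
            String transform = entry[1];
            if (identityColumns.contains(column)) {
                // Identity already fixes the column value within a partition,
                // so this later transform cannot split the data any further.
                redundant.add(transform + " on " + column);
            }
            else if (transform.equals("identity")) {
                identityColumns.add(column);
            }
        }
        return redundant;
    }

    public static void main(String[] args)
    {
        List<String[]> spec = Arrays.asList(
                new String[] {"a", "identity"},
                new String[] {"a", "bucket(16)"},
                new String[] {"b", "year"},
                new String[] {"b", "identity"});
        // Only the bucket transform on "a" follows an identity transform.
        System.out.println(findTransformsAfterIdentity(spec));
    }
}
```

Whether such entries should be rejected at table creation or only reported is left open here, as in the question above.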

So do you think the following examples are reasonable?

partition key			//This is the most common scenario
partition by year
partition by hour
partition by month, identity
partition by truncate(2)
partition by truncate(2), bucket(16)
partition by bucket(16)
partition by bucket(16), month
partition by truncate(2), bucket(16), identity
partition by year, bucket(16), identity
......
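The display rules in the examples above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in this PR: the class `PartitionExtraInfo` and its nested `PartitionField` type are hypothetical, and transforms are represented by plain display strings such as `truncate(2)` instead of Iceberg's own transform objects.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Presto or Iceberg.
final class PartitionExtraInfo
{
    // One partition-spec entry: the source column and a display name for its transform.
    static final class PartitionField
    {
        final String column;
        final String transform;

        PartitionField(String column, String transform)
        {
            this.column = column;
            this.transform = transform;
        }
    }

    private PartitionExtraInfo() {}

    // Groups transforms by source column, keeping partition-spec order,
    // and renders the Extra text for each partitioned column.
    static Map<String, String> extraInfoByColumn(List<PartitionField> spec)
    {
        Map<String, List<String>> transforms = new LinkedHashMap<>();
        for (PartitionField field : spec) {
            transforms.computeIfAbsent(field.column, key -> new ArrayList<>()).add(field.transform);
        }
        Map<String, String> extra = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> entry : transforms.entrySet()) {
            extra.put(entry.getKey(), format(entry.getValue()));
        }
        return extra;
    }

    // A lone identity transform is shown as "partition key";
    // anything else lists the transforms in spec order.
    static String format(List<String> transformNames)
    {
        if (transformNames.size() == 1 && transformNames.get(0).equals("identity")) {
            return "partition key";
        }
        return "partition by " + String.join(", ", transformNames);
    }

    public static void main(String[] args)
    {
        List<PartitionField> spec = Arrays.asList(
                new PartitionField("a", "identity"),
                new PartitionField("b", "truncate(2)"),
                new PartitionField("b", "bucket(16)"),
                new PartitionField("c", "year"));
        extraInfoByColumn(spec).forEach((column, text) -> System.out.println(column + ": " + text));
    }
}
```

Keeping the transforms in spec order matters because, as noted above, Iceberg uses that order to organize the partitioned data files.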

Commit dc267b2: [iceberg] Tag columns with "partition key" in DESCRIBE and SHOW COLUMNS output

imjalpreet requested review from hantangwangd and a team as code owners May 6, 2024 08:34

imjalpreet requested a review from presto-oss May 6, 2024 08:34

hantangwangd reviewed May 6, 2024

elharo reviewed May 6, 2024
