Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AWS: Retain Glue Catalog column comment after updating Iceberg table #10276

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

lawofcycles
Copy link
Contributor

@lawofcycles lawofcycles commented May 6, 2024

What this PR does

This PR addresses the issue where column comments in the Glue catalog are not retained when updating an existing Iceberg table schema. It fixes the problem by modifying the IcebergToGlueConverter.setTableInputInformation method to preserve existing column comments from the Glue table when creating a new TableInput for an update operation.

When an existing Glue table is provided, the modified method retrieves the comments from its columns and applies them to the corresponding columns in the new TableInput, but only if the Iceberg table's column does not already have a comment.

This change ensures that any user-defined column comments in the Glue catalog are carried over during table updates, preventing unintentional loss of documentation.

Testing

Added a new unit test method testSetTableInputInformationWithExistingTable to verify that column comments from an existing Glue table are correctly preserved when creating a new TableInput.
I'll also add Integration Tests same as #10199.

Fixes #10220

@github-actions github-actions bot added the AWS label May 6, 2024
@lawofcycles
Copy link
Contributor Author

I built the jars and manually tested on Amazon EMR as a Spark runtime.

@lawofcycles lawofcycles changed the title AWS: Updating Glue catalog table removes column descriptions AWS: Retain Glue Catalog column comment after updating Iceberg table May 12, 2024
@lawofcycles
Copy link
Contributor Author

lawofcycles commented May 13, 2024

Hi @geruh would you review this PR?
I noticed you are reviewing #10199, so I'd like to ask for this similar issue.

List<Column> columns = toColumns(metadata);
if (existingTable != null) {
List<Column> existingColumns = existingTable.storageDescriptor().columns();
Map<String, Column> existingColumnMap =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about making this a map of column name to comment here. Since we aren't using any additional information from the existing comments

@@ -501,6 +501,97 @@ public void testColumnCommentsAndParameters() {
assertThat(actualColumns).isEqualTo(expectedColumns);
}

@Test
public void testGlueTableColumnCommentsPreserved() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about the case where we drop a column from a table. I believe today we keep those columns and descriptions. Should we add a test for that?

Copy link
Contributor

@geruh geruh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, @lawofcycles! I've taken an initial pass. Thanks for the solid work!

@lawofcycles
Copy link
Contributor Author

@geruh Thank you for your review!
I have pushed the changes addressing your review comments.
Could you please take another look when you have a chance?

Comment on lines 258 to 259
Optional.ofNullable(properties.get(GLUE_DESCRIPTION_KEY))
.ifPresent(tableInputBuilder::description);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @lawofcycles thank you for your patch!

Now this method has the existing table information, so I think it would be better to move the function to keep the table description https://github.com/aajisaka/iceberg/blob/58c61241bd61de7a4062dbf664691f05eeb2ea53/aws/src/main/java/org/apache/iceberg/aws/glue/GlueTableOperations.java#L319-L321 into this method.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

AWS: Updating Glue catalog table removes column descriptions
3 participants