Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[HUDI-7713] Enforce ordering of fields during schema reconciliation #11154

Merged
merged 17 commits into from
Jun 5, 2024

Conversation

the-other-tim-brown
Copy link
Contributor

Change Logs

  • Adds ability to get consistent ordering of fields during the schema reconciliation steps

Impact

  • Consistent ordering of fields within a schema can help users reduce potential issues with HMS or other metastores.

Risk level (write none, low medium or high below)

None, just reorders fields

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none".

  • The config description must be updated if new configs are added or the default value of the configs are changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
    ticket number here and follow the instruction to make
    changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@the-other-tim-brown the-other-tim-brown changed the title Enforce ordering of fields during schema reconciliation [HUDI-7713] Enforce ordering of fields during schema reconciliation May 5, 2024
Schema expected = createRecord("reorderNestedFields",
createPrimitiveField("field1", Schema.Type.INT),
createPrimitiveField("field2", Schema.Type.INT),
createArrayField("field3", createRecord("reorderNestedFields.field3",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jonvex can you confirm this is the expected naming after reconcile is run?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, that doesn't look right

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What should it look like?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it should be "nestedRecord" I think

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, can you please change the nested record name to reorderNestedFields.field3 in start and end? That way we isolate what we are testing

@github-actions github-actions bot added the size:M PR with lines of changes in (100, 300] label May 5, 2024
@jonvex jonvex self-assigned this May 6, 2024
@github-actions github-actions bot added size:L PR with lines of changes in (300, 1000] and removed size:M PR with lines of changes in (100, 300] labels May 20, 2024
@the-other-tim-brown the-other-tim-brown marked this pull request as ready for review May 21, 2024 02:04
InternalSchema targetInternalSchema = convert(targetSchema);
// Use existing fieldIds for consistent field ordering between commits when shouldReorderColumns is true
InternalSchema sourceInternalSchema = convert(sourceSchema, shouldReorderColumns ? targetInternalSchema.getNameToPosition() : Collections.emptyMap());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why only source schema? wny not reorder target schema too?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The target schema is the source for the ordering. In this code, the target schema is the existing table and the source is the incoming dataset

val canonicalizedSourceSchema = if (shouldCanonicalizeSchema) {
canonicalizeSchema(sourceSchema, latestTableSchema, opts)
canonicalizeSchema(sourceSchema, latestTableSchema, opts, !shouldReconcileSchema)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why shouldn't we reorder columns when reconcile schema is true? Can you please add a note in the comment regarding this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jonvex advised to do this. I think it is because reconcile is schema on read?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reconcile is not necessarily dependent on schema on read. I think the reason might have been to not conflict schema reconciliation rules incase that is enabled. @jonvex to clarify. Whatever be the reason, let's add a comment for reference.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reconcile has been deprecated, so we shouldn't modify it's behavior

import static org.apache.hudi.internal.schema.utils.InternalSchemaUtils.createFullName;

/**
* Schema visitor to produce name -> id map for internalSchema.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Schema visitor to produce name -> id map for internalSchema.
* Schema visitor to produce name -> position map for internalSchema, where position indicates position of the field in the schema.

@@ -67,6 +68,10 @@ public Map<String, Integer> buildNameToId(Type type) {
return visit(type, new NameToIDVisitor());
}

Map<String, Integer> buildNameToPosition(Type type) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

High level question: Do we use the InternalSchemaBuilder even when schema on read is disabled?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The internal schema seems to provide some nice utilities but I was not familiar with it before this change

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We use internal schema even when schema on read is disabled. We use it to add null for missing columns, promote incoming batch if it can be promoted to the table schema, and also to fix the ordering of unions

Copy link
Member

@codope codope left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.
@jonvex can you also review?

@@ -169,20 +169,35 @@ class TestBasicSchemaEvolution extends HoodieSparkClientTestBase with ScalaAsser
// 2. Write 2d batch with another schema (added column `age`)
//

val secondSchema = StructType(
val secondInputSchema = StructType(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't the behavior be the same when reconcile is enabled?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but reconcile is not always enabled

Copy link
Contributor

@jonvex jonvex left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added comments

Copy link
Contributor

@jonvex jonvex left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a few minor comments

return buildTypeFromAvroSchema(schema, existingFieldNameToPositionMapping);
}


Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove empty line

AtomicInteger nextId = new AtomicInteger(1);
return visitAvroSchemaToBuildType(schema, visited, true, nextId);
Deque<String> visited = new LinkedList<>();
AtomicInteger nextId = new AtomicInteger(0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we go from 1->0? Is this because we remove

 if (firstVisitRoot) {
          nextAssignId = 0;
        }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought this was a bug since you typically start with 0 when coding

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also yes, I think it was confusing in general that nextAssignId is set to 0 yet this code is saying it should start at 1

Schema expected = createRecord("reorderNestedFields",
createPrimitiveField("field1", Schema.Type.INT),
createPrimitiveField("field2", Schema.Type.INT),
createArrayField("field3", createRecord("reorderNestedFields.field3",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, can you please change the nested record name to reorderNestedFields.field3 in start and end? That way we isolate what we are testing

@hudi-bot
Copy link

hudi-bot commented Jun 2, 2024

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@jonvex jonvex self-requested a review June 5, 2024 00:21
Copy link
Contributor

@jonvex jonvex left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@jonvex jonvex merged commit d964895 into apache:master Jun 5, 2024
46 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
size:L PR with lines of changes in (300, 1000]
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants