[HUDI-7713] Enforce ordering of fields during schema reconciliation #11154

the-other-tim-brown · 2024-05-05T18:31:07Z

Change Logs

Adds the ability to enforce a consistent ordering of fields during the schema reconciliation steps.

Impact

Consistent ordering of fields within a schema can help users reduce potential issues with HMS or other metastores.

Risk level (write none, low, medium or high below)

None, just reorders fields

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none".

The config description must be updated if new configs are added or the default value of the configs are changed
Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the instruction to make changes to the website.

Contributor's checklist

Read through contributor's guide
Change Logs and Impact were stated clearly
Adequate tests were added if applicable
CI passed

the-other-tim-brown · 2024-05-05T18:33:07Z

...-spark-datasource/hudi-spark-common/src/test/java/org/apache/hudi/TestHoodieSchemaUtils.java

+    Schema expected = createRecord("reorderNestedFields",
+        createPrimitiveField("field1", Schema.Type.INT),
+        createPrimitiveField("field2", Schema.Type.INT),
+        createArrayField("field3", createRecord("reorderNestedFields.field3",


@jonvex can you confirm this is the expected naming after reconcile is run?

no, that doesn't look right

What should it look like?

it should be "nestedRecord" I think

https://github.com/apache/hudi/blob/master/hudi-common/src/main/java/org/apache/hudi/internal/schema/convert/AvroInternalSchemaConverter.java#L341 - this logic was not created/updated by me. Do you want me to change it as part of this PR?

ok, can you please change the nested record name to reorderNestedFields.field3 in start and end? That way we isolate what we are testing

codope · 2024-05-21T16:31:25Z

hudi-common/src/main/java/org/apache/hudi/internal/schema/utils/AvroSchemaEvolutionUtils.java

    InternalSchema targetInternalSchema = convert(targetSchema);
+    // Use existing fieldIds for consistent field ordering between commits when shouldReorderColumns is true
+    InternalSchema sourceInternalSchema = convert(sourceSchema, shouldReorderColumns ? targetInternalSchema.getNameToPosition() : Collections.emptyMap());


why only source schema? why not reorder target schema too?

The target schema is the source for the ordering. In this code, the target schema is the existing table and the source is the incoming dataset
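
To illustrate the reply above, here is a minimal sketch of ordering an incoming (source) field list by the positions of the existing table (target) schema. It is not the Hudi implementation: `FieldOrderSketch`, `nameToPosition`, and `reorder` are hypothetical names, schemas are reduced to flat lists of field names, and placing fields that are new to the table at the end is an assumption made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FieldOrderSketch {

  /** Maps each field name of the existing table (target) schema to its position. */
  static Map<String, Integer> nameToPosition(List<String> tableFields) {
    Map<String, Integer> positions = new LinkedHashMap<>();
    for (int i = 0; i < tableFields.size(); i++) {
      positions.put(tableFields.get(i), i);
    }
    return positions;
  }

  /**
   * Orders incoming (source) fields by their position in the table.
   * Assumption for this sketch: fields unknown to the table go last.
   */
  static List<String> reorder(List<String> incomingFields, Map<String, Integer> tablePositions) {
    List<String> result = new ArrayList<>(incomingFields);
    // List.sort is stable, so new fields keep their incoming relative order
    result.sort(Comparator.comparingInt(
        (String name) -> tablePositions.getOrDefault(name, Integer.MAX_VALUE)));
    return result;
  }

  public static void main(String[] args) {
    List<String> table = Arrays.asList("id", "name", "ts");
    List<String> incoming = Arrays.asList("ts", "age", "id", "name");
    // prints [id, name, ts, age]
    System.out.println(reorder(incoming, nameToPosition(table)));
  }
}
```

Only the incoming side is reordered here because the table's field order is the reference, which matches the explanation given in the reply.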

codope · 2024-05-21T16:34:30Z

hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSchemaUtils.scala

        val canonicalizedSourceSchema = if (shouldCanonicalizeSchema) {
-          canonicalizeSchema(sourceSchema, latestTableSchema, opts)
+          canonicalizeSchema(sourceSchema, latestTableSchema, opts, !shouldReconcileSchema)


why shouldn't we reorder columns when reconcile schema is true? Can you please add a note in the comment regarding this?

@jonvex advised to do this. I think it is because reconcile is schema on read?

Reconcile is not necessarily dependent on schema on read. I think the reason might have been to avoid conflicting with the schema reconciliation rules in case that is enabled. @jonvex to clarify. Whatever the reason, let's add a comment for reference.

reconcile has been deprecated, so we shouldn't modify its behavior

codope · 2024-05-21T16:39:20Z

hudi-common/src/main/java/org/apache/hudi/internal/schema/visitor/NameToPositionVisitor.java

+import static org.apache.hudi.internal.schema.utils.InternalSchemaUtils.createFullName;
+
+/**
+ * Schema visitor to produce name -> id map for internalSchema.


Suggested change

- * Schema visitor to produce name -> id map for internalSchema.
+ * Schema visitor to produce name -> position map for internalSchema, where position indicates position of the field in the schema.
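
For readers unfamiliar with the distinction in the suggested wording, the following is a rough sketch of what a name -> position map could look like for a nested schema. It does not use Hudi's visitor API: `NameToPositionSketch` and its `Field` class are hypothetical stand-ins, and the dot-separated full names are an assumption loosely modeled on the `createFullName` import shown in the diff.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class NameToPositionSketch {

  /** Hypothetical stand-in for a schema field; not a Hudi type. */
  static final class Field {
    final String name;
    final List<Field> children;

    Field(String name, Field... children) {
      this.name = name;
      this.children = Arrays.asList(children);
    }
  }

  /** Records each field's position among its siblings, keyed by dot-separated full name. */
  static Map<String, Integer> buildNameToPosition(List<Field> fields) {
    Map<String, Integer> result = new LinkedHashMap<>();
    collect("", fields, result);
    return result;
  }

  private static void collect(String prefix, List<Field> fields, Map<String, Integer> out) {
    for (int position = 0; position < fields.size(); position++) {
      Field field = fields.get(position);
      String fullName = prefix.isEmpty() ? field.name : prefix + "." + field.name;
      out.put(fullName, position);
      // Nested fields restart at position 0 within their parent
      collect(fullName, field.children, out);
    }
  }

  public static void main(String[] args) {
    List<Field> schema = Arrays.asList(
        new Field("field1"),
        new Field("field3", new Field("nested2"), new Field("nested1")));
    // prints {field1=0, field3=1, field3.nested2=0, field3.nested1=1}
    System.out.println(buildNameToPosition(schema));
  }
}
```

Unlike a field id, which would normally be unique across the whole schema, a position in this sketch is only meaningful relative to the field's siblings.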

codope · 2024-05-21T16:41:53Z

hudi-common/src/main/java/org/apache/hudi/internal/schema/InternalSchemaBuilder.java

@@ -67,6 +68,10 @@ public Map<String, Integer> buildNameToId(Type type) {
    return visit(type, new NameToIDVisitor());
  }

+  Map<String, Integer> buildNameToPosition(Type type) {


High level question: Do we use the InternalSchemaBuilder even when schema on read is disabled?

The internal schema seems to provide some nice utilities but I was not familiar with it before this change

We use internal schema even when schema on read is disabled. We use it to add null for missing columns, promote incoming batch if it can be promoted to the table schema, and also to fix the ordering of unions

codope

LGTM.
@jonvex can you also review?

jonvex · 2024-05-23T17:38:11Z

...tasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestBasicSchemaEvolution.scala

@@ -169,20 +169,35 @@ class TestBasicSchemaEvolution extends HoodieSparkClientTestBase with ScalaAsser
    // 2. Write 2d batch with another schema (added column `age`)
    //

-    val secondSchema = StructType(
+    val secondInputSchema = StructType(


shouldn't the behavior be the same when reconcile is enabled?

Yes, but reconcile is not always enabled

jonvex

added comments

jonvex

a few minor comments

jonvex · 2024-05-31T16:27:10Z

...ommon/src/main/java/org/apache/hudi/internal/schema/convert/AvroInternalSchemaConverter.java

+    return buildTypeFromAvroSchema(schema, existingFieldNameToPositionMapping);
+  }
+
+


remove empty line

jonvex · 2024-05-31T16:27:58Z

...ommon/src/main/java/org/apache/hudi/internal/schema/convert/AvroInternalSchemaConverter.java

-    AtomicInteger nextId = new AtomicInteger(1);
-    return visitAvroSchemaToBuildType(schema, visited, true, nextId);
+    Deque<String> visited = new LinkedList<>();
+    AtomicInteger nextId = new AtomicInteger(0);


why do we go from 1->0? Is this because we remove

if (firstVisitRoot) { nextAssignId = 0; }

I thought this was a bug since you typically start with 0 when coding

Also yes, I think it was confusing in general that nextAssignId is set to 0 yet this code is saying it should start at 1
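
As a small illustration of the counter question discussed above, this sketch assigns sequential ids from an `AtomicInteger` seeded with a chosen start value. It is an assumption-laden example, not the converter's code: `FieldIdSketch` and `assignIds` are hypothetical, and whether the real converter calls `getAndIncrement` is not confirmed by this thread.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

public class FieldIdSketch {

  /** Assigns sequential ids to field names in visiting order, starting at startId. */
  static Map<String, Integer> assignIds(List<String> fieldNames, int startId) {
    AtomicInteger nextId = new AtomicInteger(startId);
    Map<String, Integer> ids = new LinkedHashMap<>();
    for (String name : fieldNames) {
      // getAndIncrement returns the current value, then advances the counter
      ids.put(name, nextId.getAndIncrement());
    }
    return ids;
  }

  public static void main(String[] args) {
    // prints {field1=0, field2=1, field3=2}
    System.out.println(assignIds(Arrays.asList("field1", "field2", "field3"), 0));
  }
}
```

With a single start value passed in, there is no separate branch that resets the counter on the first visit, which is the source of confusion raised in the comment.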

jonvex · 2024-05-31T16:32:08Z

...-spark-datasource/hudi-spark-common/src/test/java/org/apache/hudi/TestHoodieSchemaUtils.java

+    Schema expected = createRecord("reorderNestedFields",
+        createPrimitiveField("field1", Schema.Type.INT),
+        createPrimitiveField("field2", Schema.Type.INT),
+        createArrayField("field3", createRecord("reorderNestedFields.field3",


ok, can you please change the nested record name to reorderNestedFields.field3 in start and end? That way we isolate what we are testing

hudi-bot · 2024-06-02T19:06:39Z

CI report:

12038db UNKNOWN
66bec1a Azure: SUCCESS

Bot commands

@hudi-bot supports the following commands:

@hudi-bot run azure re-run the last Azure build

jonvex

LGTM

Enforce ordering of fields during schema reconciliation

ccd7978

the-other-tim-brown changed the title ~~Enforce ordering of fields during schema reconciliation~~ [HUDI-7713] Enforce ordering of fields during schema reconciliation May 5, 2024

the-other-tim-brown commented May 5, 2024

github-actions bot added the size:M PR with lines of changes in (100, 300] label May 5, 2024

the-other-tim-brown added 5 commits May 5, 2024 17:57

update ordering of expected columns

c5bfe41

update tests to match new expectations

235093b

update test field ordering

0ca3823

set input batch schema

12038db

reorder row fields as well

b475756

jonvex self-assigned this May 6, 2024

the-other-tim-brown added 6 commits May 6, 2024 09:41

only reorder fields if reconcile is false

5f0a670

fix schema evoution test expectations around field ordering

fbb9dd5

Merge remote-tracking branch 'origin/master' into preserve-field-orde…

a2f928c

…ring

minor updates

a6fab78

allow for different ordering modes

2c40dd9

use position instead of existin field ID

d481ef5

github-actions bot added size:L PR with lines of changes in (300, 1000] and removed size:M PR with lines of changes in (100, 300] labels May 20, 2024

the-other-tim-brown marked this pull request as ready for review May 21, 2024 02:04

codope reviewed May 21, 2024

pr feedback, add test, fix visitor

00b4e2d

codope approved these changes May 22, 2024

jonvex reviewed May 23, 2024

jonvex requested changes May 23, 2024

the-other-tim-brown added 3 commits May 29, 2024 12:47

Merge remote-tracking branch 'origin/master' into preserve-field-orde…

88b2a8b

…ring

cleanup test

11bf799

fix assertion

0d8349e

jonvex requested changes May 31, 2024

pr feedback

66bec1a

jonvex self-requested a review June 5, 2024 00:21

jonvex approved these changes Jun 5, 2024

jonvex merged commit d964895 into apache:master Jun 5, 2024
46 checks passed
