
[Spark] Allow type widening for all supported type changes with Spark 4.0 #3024

Open
johanl-db wants to merge 9 commits into base: master

Conversation

@johanl-db (Collaborator) commented May 2, 2024

This PR adds shims to ungate the remaining type changes that only work with Spark 4.0 / master. Spark 4.0 contains the Parquet reader changes required to read the data back after these type changes have been applied.

Description

Extend the list of supported type changes for type widening to include changes that can be supported with Spark 4.0:

  • (byte, short, int) -> long
  • float -> double
  • date -> timestampNTZ
  • (byte, short, int) -> double
  • decimal -> decimal (with increased precision/scale that doesn't cause precision loss)
  • (byte, short, int, long) -> decimal

Shims are added to support these changes when compiling against Spark 4.0/master and to only allow byte -> short -> int when compiling against Spark 3.5 (see the sketch below).
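
To make the shape of this concrete, here is a minimal sketch of the Spark 4.0/master side of such a shim. The object and method names match the TypeWideningShims.isTypeChangeSupported(fromType, toType) call quoted further down in the review; the parameter types and the individual match arms are illustrative only (they simply mirror the list above, with the decimal rules simplified) and are not the PR's actual implementation.

// Sketch of a Spark 4.0/master shim; the Spark 3.5 counterpart would only
// accept byte -> short -> int. Illustrative, not the code from this PR.
import org.apache.spark.sql.types._

object TypeWideningShims {

  // Returns whether `fromType` can be widened to `toType`.
  def isTypeChangeSupported(fromType: DataType, toType: DataType): Boolean =
    (fromType, toType) match {
      // (byte, short, int) -> wider integral type
      case (ByteType, ShortType | IntegerType | LongType) => true
      case (ShortType, IntegerType | LongType) => true
      case (IntegerType, LongType) => true
      // float -> double, date -> timestamp without time zone
      case (FloatType, DoubleType) => true
      case (DateType, TimestampNTZType) => true
      // (byte, short, int) -> double
      case (ByteType | ShortType | IntegerType, DoubleType) => true
      // decimal -> decimal without precision loss: the integral and fractional
      // parts must both stay at least as wide as before
      case (from: DecimalType, to: DecimalType) =>
        to.scale >= from.scale &&
          to.precision - to.scale >= from.precision - from.scale
      // (byte, short, int, long) -> decimal; simplified here, the real check
      // must also ensure the decimal can hold every value of the integral type
      case (ByteType | ShortType | IntegerType | LongType, _: DecimalType) => true
      case _ => false
    }
}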

How was this patch tested?

Added test cases for the new type changes in the existing type widening test suites. The list of supported / unsupported changes covered in tests differs between Spark 3.5 and Spark 4.0; shims are provided to handle this as well.

Does this PR introduce any user-facing changes?

Yes: allow using the listed type changes with type widening, either via ALTER TABLE CHANGE COLUMN TYPE or during schema evolution in MERGE and INSERT.
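
For illustration, here is a hedged end-to-end sketch of the ALTER TABLE path. The table name, column name, and session setup are made up, it assumes type widening is gated behind the delta.enableTypeWidening table property, and it assumes a build against Spark 4.0/master so that int -> bigint is allowed.

import org.apache.spark.sql.SparkSession

// Hypothetical example; assumes a local Delta build containing this PR.
val spark = SparkSession.builder()
  .appName("type-widening-example")
  .master("local[*]")
  .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
  .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
  .getOrCreate()

// Type widening is opt-in on the table (assumed property name).
spark.sql(
  """CREATE TABLE demo (id INT) USING delta
    |TBLPROPERTIES ('delta.enableTypeWidening' = 'true')""".stripMargin)

// Explicit widening via ALTER TABLE ... CHANGE COLUMN ... TYPE: int -> long.
spark.sql("ALTER TABLE demo CHANGE COLUMN id TYPE BIGINT")

// The same widening can also happen implicitly during schema evolution
// in MERGE and INSERT, as described above.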

@johanl-db added this to the 4.0.0 milestone May 2, 2024
@johanl-db self-assigned this May 2, 2024
@KamilKandzia

Will there be an option in the future to change the column type of a table from int to string without overwriting the entire table? Unless such an option is already available (but I don't remember one).

@johanl-db (Collaborator, Author)

Will there be an option in the future to change the column type of a table from int to string without overwriting the entire table? Unless such an option is already available (but I don't remember one).

There is currently no plan to support type changes other than the ones mentioned in the PR description.

Converting values when reading from a table that had one of these widening type changes applied can easily be done directly in the Parquet reader. Other type changes are harder for one of two reasons:

  • They can lead to overflow or loss of precision, for example long -> int or long -> float.
  • The conversion is ambiguous in Parquet. For float -> string: how many significant digits should be displayed? For decimal -> string: should the value be padded with zeros to match the precision/scale of the column? Even for int -> string, we could ask whether the raw bytes of the initial value should be returned as a string or the value should be formatted as UTF-8. The snippet after this list illustrates a few of these pitfalls.
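
A few self-contained examples of these pitfalls, in plain Scala (the values are made up for illustration):

// Overflow: a long that does not fit in an int silently wraps around.
val big: Long = 3000000000L
println(big.toInt)                      // prints -1294967296, not 3000000000

// Precision loss: not every long survives a round trip through float.
val precise: Long = 123456789012345678L
println(precise.toFloat.toLong)         // prints a nearby value, not the original

// Ambiguity for string conversions: which rendering is "the" right one?
println(0.1f.toString)                  // "0.1" -- how many significant digits to emit?
println(BigDecimal("1.50").toString)    // "1.50" -- keep or drop the trailing zero?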

@johanl-db changed the title from [WIP][Spark] Allow type widening for all supported type changes to [Spark][4.0] Allow type widening for all supported type changes on May 21, 2024
@johanl-db changed the title from [Spark][4.0] Allow type widening for all supported type changes to [Spark] Allow type widening for all supported type changes on May 22, 2024
@johanl-db changed the title from [Spark] Allow type widening for all supported type changes to [Spark] Allow type widening for all supported type changes with Spark 4.0 on May 22, 2024
@sabir-akhadov (Contributor) left a comment

looks good, left some comments


/**
* Type widening only supports a limited set of type changes with Spark 3.5 due to the parquet
* readers lacking the corresponding conversions that were added in Spark 4.0.
Contributor

Not sure about the mechanics, but shouldn't this go in a scala-spark-4.0 directory instead of master? What happens when 4.0 is cut/released?

Collaborator Author

Spark 4.0 isn't cut yet, so that's not possible; the build system currently only knows master and latest (3.5). I imagine that once Spark 4.0 is cut, the scala-spark-master folder will be copied over to scala-spark-4.0.

case (ByteType | ShortType, IntegerType) => true
case _ => false
}
TypeWideningShims.isTypeChangeSupported(fromType, toType)
Contributor

For my education, how is the TypeWideningShims object visible here?

Collaborator Author

The build script accepts an argument sparkVersion that toggles between two different build targets, each pulling its own set of shim files:
https://github.com/delta-io/delta/blob/master/build.sbt#L163

TypeWideningShims is declared in the same package as TypeWidening, so it is visible without an explicit import (a simplified sketch of the build toggle follows below).
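
For context, here is a simplified sbt-style sketch of this kind of toggle. It is not the actual delta build definition linked above; the property name and shim directory paths are illustrative.

// build.sbt style sketch: pick a shim source directory based on the Spark
// version the build targets, so only one TypeWideningShims implementation
// is compiled into the spark module.
val targetSparkVersion = sys.props.getOrElse("sparkVersion", "latest")

lazy val spark = (project in file("spark"))
  .settings(
    Compile / unmanagedSourceDirectories += {
      val shimDir =
        if (targetSparkVersion == "master") "src/main/scala-spark-master"
        else "src/main/scala-spark-3.5"
      baseDirectory.value / shimDir
    }
  )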
