
[SPARK-48241][SQL] CSV parsing failure with char/varchar type columns #46537

Closed
wants to merge 2 commits

Conversation

@liujiayi771 (Contributor) commented May 11, 2024

What changes were proposed in this pull request?

A CSV table containing char or varchar columns produces the following error when selected from:

```
spark-sql (default)> show create table test_csv;
CREATE TABLE default.test_csv (
  id INT,
  name CHAR(10))
USING csv

java.lang.IllegalArgumentException: requirement failed: requiredSchema (struct<id:int,name:string>) should be the subset of dataSchema (struct<id:int,name:string>).
    at scala.Predef$.require(Predef.scala:281)
    at org.apache.spark.sql.catalyst.csv.UnivocityParser.<init>(UnivocityParser.scala:56)
    at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.$anonfun$buildReader$2(CSVFileFormat.scala:127)
    at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:155)
    at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:140)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:231)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:293)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:125)
```

Why are the changes needed?

For char and varchar types, Spark converts them to StringType via CharVarcharUtils.replaceCharVarcharWithStringInSchema and records __CHAR_VARCHAR_TYPE_STRING in the column metadata.

The error above occurs because the StringType columns in UnivocityParser's dataSchema and requiredSchema are inconsistent: the StringType in the dataSchema carries this metadata, while the metadata in the requiredSchema is empty. The metadata needs to be retained when resolving the schema.
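The failure mode can be illustrated with a minimal sketch (these are illustrative stand-ins, not Spark's actual StructField/UnivocityParser classes): field equality includes metadata, so a "string" column that lost its __CHAR_VARCHAR_TYPE_STRING metadata is no longer considered present in the data schema, and the subset requirement fails.

```scala
// Toy model of a schema field; equality (via case class) includes metadata,
// mirroring how Spark's StructField comparison sees the two schemas as different.
case class Field(name: String, dataType: String, metadata: Map[String, String])

object SubsetCheck {
  val charMeta = Map("__CHAR_VARCHAR_TYPE_STRING" -> "char(10)")

  // dataSchema as the CSV source sees it: char(10) became string + metadata.
  val dataSchema = Seq(Field("id", "int", Map.empty), Field("name", "string", charMeta))

  // Before the fix: resolution dropped the metadata from the required schema.
  val requiredBefore = Seq(Field("id", "int", Map.empty), Field("name", "string", Map.empty))

  // After the fix: the metadata is retained.
  val requiredAfter = Seq(Field("id", "int", Map.empty), Field("name", "string", charMeta))

  // Mirrors the requirement in the parser: every required field must appear,
  // identically (metadata included), in the data schema.
  def isSubset(required: Seq[Field], data: Seq[Field]): Boolean =
    required.forall(data.contains)

  def main(args: Array[String]): Unit = {
    println(isSubset(requiredBefore, dataSchema)) // false -> "requirement failed"
    println(isSubset(requiredAfter, dataSchema))  // true
  }
}
```

With the metadata dropped, the check returns false and the parser throws the IllegalArgumentException shown above; retaining the metadata makes the two schemas agree.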

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Add a new test case in CSVSuite.

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the SQL label May 11, 2024
@liujiayi771 (Contributor, Author):

Hi @ulysses-you, could you help review?

```diff
-      case a: AttributeReference => a
+      case a: AttributeReference =>
+        // Keep the metadata in given schema.
+        a.copy(metadata = field.metadata)(exprId = a.exprId, qualifier = a.qualifier)
```
A reviewer (Contributor) suggested:

```scala
a.withMetadata(field.metadata)
```

@ulysses-you (Contributor) left a comment:

lgtm if tests pass, cc @yaooqinn @cloud-fan

@cloud-fan (Contributor) left a comment:

good catch!

@cloud-fan (Contributor) commented May 13, 2024:

thanks, merging to master/3.5!

@cloud-fan closed this in b14abb3 on May 13, 2024
@cloud-fan (Contributor):

it has conflicts with 3.5, can you create a new backport PR?

@liujiayi771 (Contributor, Author):

> it has conflicts with 3.5, can you create a new backport PR?

Created a backport PR in #46565.
