[SPARK-48219][CORE] StreamReader Charset fix with UTF8 #46509

xuzifu666 · 2024-05-09T13:05:27Z

What changes were proposed in this pull request?

Some `InputStreamReader` instances are created without an explicit charset. If the platform default charset cannot represent Chinese characters (for example, Latin-1) and the configuration contains Chinese characters, those characters are not decoded correctly. This PR sets the charset to UTF-8 explicitly when constructing those readers. Other compute frameworks such as Calcite, Hive, and Hudi also construct their stream readers with UTF-8.

Why are the changes needed?

Without an explicit charset, strings may not be decoded as expected.
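The decoding problem described in this PR can be reproduced in isolation. The following is a minimal sketch, not code from the patch: the class `CharsetDecodeDemo`, the helper `firstLine`, and the sample text are hypothetical, and Latin-1 is passed explicitly to stand in for a platform default charset that cannot represent Chinese characters.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Spark or Hive.
final class CharsetDecodeDemo {

  // Reads the first line of `bytes`, decoding with the given charset.
  static String firstLine(byte[] bytes, Charset charset) {
    try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(new ByteArrayInputStream(bytes), charset))) {
      return reader.readLine();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    // "name=" followed by two Chinese characters, written as escapes so the
    // source file itself does not depend on any encoding.
    String original = "name=\u6d4b\u8bd5";
    byte[] utf8Bytes = original.getBytes(StandardCharsets.UTF_8);

    // Explicit UTF-8: the text round-trips.
    System.out.println(
        original.equals(firstLine(utf8Bytes, StandardCharsets.UTF_8)));

    // Latin-1 stands in for a default charset without Chinese characters:
    // every byte becomes its own character, so the text is garbled.
    System.out.println(
        original.equals(firstLine(utf8Bytes, StandardCharsets.ISO_8859_1)));
  }
}
```

The single-argument `InputStreamReader(InputStream)` constructor behaves like the second call whenever the platform default happens to be such a charset, which is why the result depends on the machine the code runs on.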

Does this PR introduce any user-facing change?

Yes

How was this patch tested?

Not needed.

Was this patch authored or co-authored using generative AI tooling?

No

dongjoon-hyun

Do you think you can provide test coverage to protect your contribution from potential future regression, @xuzifu666?

Not needed.

xuzifu666 · 2024-05-10T01:55:17Z

Do you think you can provide test coverage to protect your contribution from potential future regression, @xuzifu666?

Not needed.

@dongjoon-hyun Thanks for your attention. In my opinion this change does not need tests, because it only follows a convention for stream reader usage: if the UTF-8 charset is not set, errors may occur when the system default charset does not cover Chinese characters. You can see the same convention in other frameworks such as Calcite and Hive, which set UTF-8 whenever the `InputStreamReader` constructor is called.

xuzifu666 · 2024-05-10T05:39:11Z

@dongjoon-hyun Could you give a final review? Thanks

dongjoon-hyun · 2024-05-10T05:41:34Z

Sorry but I'll leave this to the other reviewers, @xuzifu666 .

xuzifu666 · 2024-05-10T06:12:54Z

@HyukjinKwon Could you help review this? Thanks

yaooqinn · 2024-05-10T08:18:14Z

The change itself looks reasonable to me. I also agree with @dongjoon-hyun that we should add a simple test, maybe in XSDToSchemaSuite.

BTW, the PR is tagged as CORE, but the changes belong to the XML data source in SQL and to the Hive Thrift server.
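A simple regression check of the kind requested here could look roughly like the sketch below. It is an assumption about what such a test might verify, not an existing test in Spark or Hive: the class `Utf8InitFileCheck`, the method `roundTrips`, and the temporary file name are hypothetical, and the reader is built the same way as in the patched `loadFile` line.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper; not an existing Spark or Hive test.
final class Utf8InitFileCheck {

  // Writes `line` to a temporary file as UTF-8, reads it back through a
  // reader with an explicit UTF-8 charset, and reports whether the text
  // survived unchanged. `line` must not contain line terminators.
  static boolean roundTrips(String line) {
    try {
      Path file = Files.createTempFile("init-file", ".sql");
      try {
        Files.write(file, line.getBytes(StandardCharsets.UTF_8));
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
            new FileInputStream(file.toFile()), StandardCharsets.UTF_8))) {
          return line.equals(reader.readLine());
        }
      } finally {
        Files.deleteIfExists(file);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    // "name=" followed by two Chinese characters.
    System.out.println(roundTrips("name=\u6d4b\u8bd5"));
  }
}
```

Because the charset is passed explicitly, the check gives the same result regardless of the JVM's default charset; a real suite would wrap the same assertion in the project's test framework.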

yaooqinn · 2024-05-10T08:19:26Z

sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/session/HiveSessionImpl.java

@@ -171,7 +172,7 @@ protected BufferedReader loadFile(String fileName) throws IOException {
      FileInputStream initStream = null;
      BufferedReader bufferedReader = null;
      initStream = new FileInputStream(fileName);
-      bufferedReader = new BufferedReader(new InputStreamReader(initStream));
+      bufferedReader = new BufferedReader(new InputStreamReader(initStream, StandardCharsets.UTF_8));


The code here is copied from Hive. Do we have the same issue in the upstream repo? It would be better to cite it here.

OK, I will address it. @yaooqinn

Yeah, I wouldn't touch this.

HyukjinKwon · 2024-05-10T08:21:12Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/xml/XSDToSchema.scala

@@ -48,7 +49,7 @@ object XSDToSchema extends Logging{
    val in = ValidatorUtil.openSchemaFile(xsdPath)
    val xmlSchemaCollection = new XmlSchemaCollection()
    xmlSchemaCollection.setBaseUri(xsdPath.toString)
-    val xmlSchema = xmlSchemaCollection.read(new InputStreamReader(in))
+    val xmlSchema = xmlSchemaCollection.read(new InputStreamReader(in, StandardCharsets.UTF_8))


This is arguable: if you don't specify the encoding, the reader picks up the system default encoding. If your XSD file is written in another encoding but is read here as UTF-8, it will fail.

In other places the encoding has to be UTF-8 because Spark explicitly encodes the data that way.
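The concern about forcing UTF-8 on a file written in another encoding can be illustrated with a small sketch. Everything here is an assumption for illustration: the class `ForcedUtf8Demo`, its two helpers, and the sample XSD fragment are hypothetical, and Latin-1 stands in for "another encoding".

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Spark.
final class ForcedUtf8Demo {

  // Decodes `bytes` through an InputStreamReader with an explicit UTF-8
  // charset; malformed input is silently replaced with U+FFFD.
  static String readAsUtf8(byte[] bytes) {
    StringBuilder out = new StringBuilder();
    try (Reader reader = new InputStreamReader(
        new ByteArrayInputStream(bytes), StandardCharsets.UTF_8)) {
      int c;
      while ((c = reader.read()) != -1) {
        out.append((char) c);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return out.toString();
  }

  // Strict check: a fresh CharsetDecoder reports malformed input by default.
  static boolean isValidUtf8(byte[] bytes) {
    try {
      StandardCharsets.UTF_8.newDecoder().decode(ByteBuffer.wrap(bytes));
      return true;
    } catch (CharacterCodingException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    // An XSD-like fragment containing U+00E9, saved as Latin-1.
    String text = "<xs:documentation>caf\u00e9</xs:documentation>";
    byte[] latin1Bytes = text.getBytes(StandardCharsets.ISO_8859_1);

    System.out.println(isValidUtf8(latin1Bytes));
    System.out.println(readAsUtf8(latin1Bytes).contains("\uFFFD"));
  }
}
```

With a reader the mismatch does not raise an exception; the affected characters are replaced, so the schema text changes silently, while a strict decoder rejects the bytes outright.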

Thanks. Should I raise this in Hive first and then report back on this PR? @HyukjinKwon I will leave the XSD code unchanged and only change the Hive code.

As described above, I wouldn't touch the Hive side.

xuzifu666 · 2024-05-16T04:07:26Z

I will not modify XSDToSchema. The Hive session implementation has also been changed upstream; see the recent PR apache/hive#5243. So I think it is necessary to change it here? @yaooqinn @HyukjinKwon

yaooqinn · 2024-05-16T04:13:36Z

Thank you @xuzifu666.

Merged to master

[SPARK-48219][core] StreamReader Charset fix with UTF8

6c0bf86

github-actions bot added the SQL label May 9, 2024

xuzifu666 changed the title ~~[SPARK-48219][core] StreamReader Charset fix with UTF8~~ [SPARK-48219][CORE] StreamReader Charset fix with UTF8 May 9, 2024

dongjoon-hyun previously requested changes May 9, 2024

yaooqinn reviewed May 10, 2024

HyukjinKwon reviewed May 10, 2024

fix

5814b2f

yaooqinn approved these changes May 16, 2024

yaooqinn closed this in 5e83221 May 16, 2024
