Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-47972][SQL] Restrict CAST expression for collations #46474

Open
wants to merge 9 commits into
base: master
Choose a base branch
from

Conversation

mihailom-db
Copy link
Contributor

What changes were proposed in this pull request?

Block of syntax CAST(value AS STRING COLLATE collation_name).

Why are the changes needed?

Current state of code allows for calls like CAST(1 AS STRING COLLATE UNICODE). We want to restrict CAST expression to only be able to cast to default collation string, and to only allow COLLATE expression to produce explicitly collated strings.

Does this PR introduce any user-facing change?

Yes.

How was this patch tested?

Test in CollationSuite.

Was this patch authored or co-authored using generative AI tooling?

No.

@mihailom-db mihailom-db changed the title [SPARK-47972][SQL] Restrict CAST expression for collations [WIP][SPARK-47972][SQL] Restrict CAST expression for collations May 8, 2024
@github-actions github-actions bot added the SQL label May 8, 2024
@mihailom-db mihailom-db changed the title [WIP][SPARK-47972][SQL] Restrict CAST expression for collations [SPARK-47972][SQL] Restrict CAST expression for collations May 10, 2024

checkAnswer(sql(s"SELECT cast(1 as string)"), Seq(Row("1")))
checkAnswer(sql(s"SELECT cast('A' as string)"), Seq(Row("A")))
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just wondering, how does this cope with something like:

withSQLConf(SqlApiConf.DEFAULT_COLLATION -> "UNICODE") {
  checkAnswer(sql(s"SELECT cast(1 as string)"), Seq(Row("1")))
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can add a test for this, but it should return a string "1" collated with UNICODE collation. The aim of this PR is to block usage of STRING COLLATE collation_name in cast, all previous usage for default collation stay the same.

@mihailom-db
Copy link
Contributor Author

@dbatomic @stefankandic @stevomitric @nikolamand-db could you take a look at this PR

Copy link
Contributor

@uros-db uros-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if the approach of adding this flag to StringType is the way to go, then I'd say lgtm

@mihailom-db perhaps you could discuss below the pros/cons of this versus just blocking cast w/ collate in sql

@stefankandic
Copy link
Contributor

Why do we want to do this? Should it be a parser error? Also, I don't think the error message is very clear - we should at least let the user know how to perform the cast to another collation.

case Seq(_) => SqlApiConf.get.defaultStringType
case Seq(_) =>
val st = SqlApiConf.get.defaultStringType
st.createdAsNonCollated = true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we avoid mutating the state of the defaultStringType?

@mihailom-db
Copy link
Contributor Author

@stefankandic SQL Standard does not support for casting to collated strings. I agree with the message thing. The standard only allows for CAST(1 AS STRING) COLLATE UNICODE, so that is what we were trying to achieve with this restriction on cast expression. The problem is that we have Spark Connect which can call col.cast(StringType(collation_name)) or col.cast("STRING COLLATE UNICODE") and we do not have an easy way of differentiating StringType() from type created like StringType(UTF8_BINARY). In order to preserve meaning of cast to STRING as default, I would need to add this flag. Another solution is to always treat user defined casts as implicit priority. @srielau what do you think of this type of behaviour for user defined casts?

@srielau
Copy link
Contributor

srielau commented May 17, 2024

@stefankandic SQL Standard does not support for casting to collated strings. I agree with the message thing. The standard only allows for CAST(1 AS STRING) COLLATE UNICODE, so that is what we were trying to achieve with this restriction on cast expression. The problem is that we have Spark Connect which can call col.cast(StringType(collation_name)) or col.cast("STRING COLLATE UNICODE") and we do not have an easy way of differentiating StringType() from type created like StringType(UTF8_BINARY). In order to preserve meaning of cast to STRING as default, I would need to add this flag. Another solution is to always treat user defined casts as implicit priority. @srielau what do you think of this type of behaviour for user defined casts?

Let me see if I get this right:
If we wanted to support CAST( .. AS STRING COLLATE...) this would naturally result in an collation with IMPLICIT priority.
(A CAST (.. AS STRING) would be DEFAULT, right?
The alternative would be to (semantically) rewrite it to CAST( .. AS STRING) COLLATE , and treat it as EXPLICIT.

Right?

@mihailom-db
Copy link
Contributor Author

mihailom-db commented May 17, 2024

@srielau I would expect that behaviour, but the problem comes with this rewriting you suggested. When we create expressions, CAST expression does not remember identifier used during parsing, but only type that that identifier produced. I managed to block SQL behaviour with the parsing rules, but pyspark, spark connect and dataframe API pose a problem. They allow for casting with types like StringType("collation_name") and StringType() and in python, as opposed to scala, we cannot differentiate StringType() and StringType("UTF8_BINARY") and protobuf always sends information as StringType("UTF8_BINARY"). Even if we only block SQL syntax problem arises if someone uses dataframe api because we can have session-level default collation changed and then collation resolution rules would be impossible to resolve, as that cast would be translated to cast(expression, StringType(session_level_default_collation)) and we have no way of differentiating whether this was created by STRING identifier or STRING COLLATE session_level_default_collation.

@mihailom-db
Copy link
Contributor Author

So in summary SQL will be fine, but we need to resolve dataframe API as well and think of the best unified solution.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
4 participants