New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-47972][SQL] Restrict CAST expression for collations #46474
base: master
Are you sure you want to change the base?
Conversation
|
||
checkAnswer(sql(s"SELECT cast(1 as string)"), Seq(Row("1"))) | ||
checkAnswer(sql(s"SELECT cast('A' as string)"), Seq(Row("A"))) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just wondering, how does this cope with something like:
withSQLConf(SqlApiConf.DEFAULT_COLLATION -> "UNICODE") {
checkAnswer(sql(s"SELECT cast(1 as string)"), Seq(Row("1")))
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can add a test for this, but it should return a string "1" collated with UNICODE collation. The aim of this PR is to block usage of STRING COLLATE collation_name
in cast, all previous usage for default collation stay the same.
@dbatomic @stefankandic @stevomitric @nikolamand-db could you take a look at this PR |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if the approach of adding this flag to StringType
is the way to go, then I'd say lgtm
@mihailom-db perhaps you could discuss below the pros/cons of this versus just blocking cast w/ collate in sql
Why do we want to do this? Should it be a parser error? Also, I don't think the error message is very clear - we should at least let the user know how to perform the cast to another collation. |
case Seq(_) => SqlApiConf.get.defaultStringType | ||
case Seq(_) => | ||
val st = SqlApiConf.get.defaultStringType | ||
st.createdAsNonCollated = true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we avoid mutating the state of the defaultStringType?
@stefankandic SQL Standard does not support for casting to collated strings. I agree with the message thing. The standard only allows for CAST(1 AS STRING) COLLATE UNICODE, so that is what we were trying to achieve with this restriction on cast expression. The problem is that we have Spark Connect which can call col.cast(StringType(collation_name)) or col.cast("STRING COLLATE UNICODE") and we do not have an easy way of differentiating StringType() from type created like StringType(UTF8_BINARY). In order to preserve meaning of cast to STRING as default, I would need to add this flag. Another solution is to always treat user defined casts as implicit priority. @srielau what do you think of this type of behaviour for user defined casts? |
Let me see if I get this right: Right? |
@srielau I would expect that behaviour, but the problem comes with this rewriting you suggested. When we create expressions, CAST expression does not remember identifier used during parsing, but only type that that identifier produced. I managed to block SQL behaviour with the parsing rules, but pyspark, spark connect and dataframe API pose a problem. They allow for casting with types like StringType("collation_name") and StringType() and in python, as opposed to scala, we cannot differentiate StringType() and StringType("UTF8_BINARY") and protobuf always sends information as StringType("UTF8_BINARY"). Even if we only block SQL syntax problem arises if someone uses dataframe api because we can have session-level default collation changed and then collation resolution rules would be impossible to resolve, as that cast would be translated to cast(expression, StringType(session_level_default_collation)) and we have no way of differentiating whether this was created by STRING identifier or STRING COLLATE session_level_default_collation. |
So in summary SQL will be fine, but we need to resolve dataframe API as well and think of the best unified solution. |
What changes were proposed in this pull request?
Block of syntax CAST(value AS STRING COLLATE collation_name).
Why are the changes needed?
Current state of code allows for calls like CAST(1 AS STRING COLLATE UNICODE). We want to restrict CAST expression to only be able to cast to default collation string, and to only allow COLLATE expression to produce explicitly collated strings.
Does this PR introduce any user-facing change?
Yes.
How was this patch tested?
Test in CollationSuite.
Was this patch authored or co-authored using generative AI tooling?
No.