New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-48221][SQL] Alter string search logic for UTF8_BINARY_LCASE collation (Contains, StartsWith, EndsWith, StringLocate) #46511
base: master
Are you sure you want to change the base?
Conversation
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java
Outdated
Show resolved
Hide resolved
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java
Outdated
Show resolved
Hide resolved
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java
Outdated
Show resolved
Hide resolved
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java
Outdated
Show resolved
Hide resolved
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java
Outdated
Show resolved
Hide resolved
common/unsafe/src/test/java/org/apache/spark/unsafe/types/CollationSupportSuite.java
Outdated
Show resolved
Hide resolved
common/unsafe/src/test/java/org/apache/spark/unsafe/types/CollationSupportSuite.java
Outdated
Show resolved
Hide resolved
common/unsafe/src/test/java/org/apache/spark/unsafe/types/CollationSupportSuite.java
Outdated
Show resolved
Hide resolved
common/unsafe/src/test/java/org/apache/spark/unsafe/types/CollationSupportSuite.java
Outdated
Show resolved
Hide resolved
common/unsafe/src/test/java/org/apache/spark/unsafe/types/UTF8StringSuite.java
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looked only at test cases. They looked good to me. The suggestions are made in order to make the test cases more readable.
Will go over the code separately.
common/unsafe/src/test/java/org/apache/spark/unsafe/types/CollationSupportSuite.java
Outdated
Show resolved
Hide resolved
common/unsafe/src/test/java/org/apache/spark/unsafe/types/CollationSupportSuite.java
Outdated
Show resolved
Hide resolved
common/unsafe/src/test/java/org/apache/spark/unsafe/types/CollationSupportSuite.java
Outdated
Show resolved
Hide resolved
common/unsafe/src/test/java/org/apache/spark/unsafe/types/CollationSupportSuite.java
Outdated
Show resolved
Hide resolved
common/unsafe/src/test/java/org/apache/spark/unsafe/types/CollationSupportSuite.java
Outdated
Show resolved
Hide resolved
common/unsafe/src/test/java/org/apache/spark/unsafe/types/CollationSupportSuite.java
Outdated
Show resolved
Hide resolved
common/unsafe/src/test/java/org/apache/spark/unsafe/types/CollationSupportSuite.java
Outdated
Show resolved
Hide resolved
common/unsafe/src/test/java/org/apache/spark/unsafe/types/CollationSupportSuite.java
Outdated
Show resolved
Hide resolved
common/unsafe/src/test/java/org/apache/spark/unsafe/types/CollationSupportSuite.java
Outdated
Show resolved
Hide resolved
common/unsafe/src/test/java/org/apache/spark/unsafe/types/CollationSupportSuite.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few more comments on the code.
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java
Outdated
Show resolved
Hide resolved
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java
Outdated
Show resolved
Hide resolved
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java
Outdated
Show resolved
Hide resolved
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java
Outdated
Show resolved
Hide resolved
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java
Outdated
Show resolved
Hide resolved
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java
Outdated
Show resolved
Hide resolved
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java
Show resolved
Hide resolved
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java
Outdated
Show resolved
Hide resolved
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java
Outdated
Show resolved
Hide resolved
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java
Outdated
Show resolved
Hide resolved
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java
Outdated
Show resolved
Hide resolved
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java
Outdated
Show resolved
Hide resolved
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java
Outdated
Show resolved
Hide resolved
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java
Show resolved
Hide resolved
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java
Outdated
Show resolved
Hide resolved
common/unsafe/src/test/java/org/apache/spark/unsafe/types/CollationSupportSuite.java
Show resolved
Hide resolved
common/unsafe/src/test/java/org/apache/spark/unsafe/types/CollationSupportSuite.java
Show resolved
Hide resolved
common/unsafe/src/test/java/org/apache/spark/unsafe/types/CollationSupportSuite.java
Show resolved
Hide resolved
@@ -709,12 +774,24 @@ public void testLocate() throws SparkException { | |||
assertLocate("大千", "test大千世界大千世界", 9, "UNICODE_CI", 9); | |||
assertLocate("大千", "大千世界大千世界", 1, "UNICODE_CI", 1); | |||
// Case-variable character length | |||
assertLocate("\u0307", "i̇", 1, "UTF8_BINARY", 2); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
assertLocate("\u0307", "i̇", 1, "UTF8_BINARY", 2); | |
assertLocate("\u0307", "i̇", 1, "UTF8_BINARY_LCASE", 2); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure what the intent of this test case is here. If it is for UTF8_BINARY_LCASE then we have it again below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the intent is: show that "\u0307" is found in "i̇" on the binary level (with respect to UTF8_BINARY)
but "\u0307" is not found in "İ" on the character level (with respect to UTF8_BINARY_LCASE)
@@ -709,12 +774,24 @@ public void testLocate() throws SparkException { | |||
assertLocate("大千", "test大千世界大千世界", 9, "UNICODE_CI", 9); | |||
assertLocate("大千", "大千世界大千世界", 1, "UNICODE_CI", 1); | |||
// Case-variable character length | |||
assertLocate("\u0307", "i̇", 1, "UTF8_BINARY", 2); | |||
assertLocate("\u0307", "İ", 1, "UTF8_BINARY_LCASE", 0); // != UTF8_BINARY |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the comment is wrong here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not wrong, but the intent was a bit different than other cases: instead of comparing differences between UNICODE_CI and UTF8_BINARY_LCASE, I wanted to compare UTF8_BINARY and UTF8_BINARY_LCASE - essentially ensuring that the new UTF8_BINARY_LCASE (character-wise) searching logic does not
equal to applying UTF8_BINARY (byte-wise) searching logic on lowercased strings
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for patiently addressing my comments.
The PR looks good to me.
Please fill in the PR description. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated PR description & title
@cloud-fan ready for review
I am reading this in the PR description:
This is not really the reason we we are making this change (conditional case mapping is the reason why we want to change the comparator for UTF8_BINARY_LCASE, but not related to string search directly). The real reason is that we need to do string search at the UT8 char-by-char level because otherwise indices returned or considered during string search can return strange and unusable results due to the fact that case mappings are one-to-many, that is one UTF8 character can map to two or more UTF8 characters under the case mapping (both for upper and lower casing). |
@@ -430,7 +430,7 @@ public static int execBinary(final UTF8String string, final UTF8String substring | |||
} | |||
public static int execLowercase(final UTF8String string, final UTF8String substring, | |||
final int start) { | |||
return string.toLowerCase().indexOf(substring.toLowerCase(), start); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to confirm, the previous implementation here is correct, right?
int endPos) { | ||
assert endPos <= target.numChars(); | ||
for (int len = 0; len <= endPos; ++len) { | ||
if (target.substring(endPos - len, endPos).toLowerCase().equals(lowercasePattern)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there a way to do this comparison with only one memory copy? Right now it's two: substring
and toLowerCase
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know the details very well. Is it the same if we lower case the entire target string first, and then call substring
repeatedly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it the same if we lower case the entire target string first
no, it's not - consider the example in the PR description: contains("İ", "i")
where haystack = "İ"
& needle = "i"
; then we have: lower(haystack) = lower("İ") = "i\u0307"
would produce substr(lower("İ"), 1, 1) = "i" = needle
however, we can see that these is no substr(haystack, start, len)
such that lower(substr(haystack, start, len)) == needle
- so these 2 behaviours are not actually equivalent (as we first assumed), and the reason for this discrepancy lies in conditional one-to-many case mapping (certain characters have lowercase equivalents consisted of multiple characters)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that said, we are focusing on correctness for now, and are aware of possible performance regression for string searching in UTF8_BINARY_LCASE - we intend to work on perf optimization in a subsequent PR
What changes were proposed in this pull request?
String searching in UTF8_BINARY_LCASE now works on character-level, rather than on byte-level. For example:
contains("İ", "i");
now returns false, because there exists nostart, len
such thatlowercase(substring("İ", start, len)) == "i"
.Why are the changes needed?
Fix functions that give unusable results due to one-to-many conditional case mapping when performing string search under UTF8_BINARY_LCASE (see example above).
Does this PR introduce any user-facing change?
Yes, behaviour of
contains
,startswith
,endswith
, andlocate
/position
expressions is changed for edge cases with conditional case mapping.How was this patch tested?
New unit tests in
CollationSupportSuite
.Was this patch authored or co-authored using generative AI tooling?
No.