Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-48221][SQL] Alter string search logic for UTF8_BINARY_LCASE collation (Contains, StartsWith, EndsWith, StringLocate) #46511

Open
wants to merge 19 commits into
base: master
Choose a base branch
from

Conversation

uros-db
Copy link
Contributor

@uros-db uros-db commented May 9, 2024

What changes were proposed in this pull request?

String searching in UTF8_BINARY_LCASE now works on character-level, rather than on byte-level. For example: contains("İ", "i"); now returns false, because there exists no start, len such that lowercase(substring("İ", start, len)) == "i".

Why are the changes needed?

Fix functions that give unusable results due to one-to-many conditional case mapping when performing string search under UTF8_BINARY_LCASE (see example above).

Does this PR introduce any user-facing change?

Yes, behaviour of contains, startswith, endswith, and locate/position expressions is changed for edge cases with conditional case mapping.

How was this patch tested?

New unit tests in CollationSupportSuite.

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the SQL label May 9, 2024
@uros-db uros-db requested a review from mkaravel May 10, 2024 13:15
Copy link
Contributor

@mkaravel mkaravel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looked only at test cases. They looked good to me. The suggestions are made in order to make the test cases more readable.
Will go over the code separately.

Copy link
Contributor

@mkaravel mkaravel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few more comments on the code.

@uros-db uros-db changed the title [WIP][SPARK-48221][SQL] Alter string search logic for UTF8_BINARY_LCASE collation [WIP][SPARK-48221][SQL] Alter string search logic for UTF8_BINARY_LCASE collation (Contains, StartsWith, EndsWith, StringLocate) May 15, 2024
@uros-db uros-db requested a review from mkaravel May 15, 2024 19:02
@@ -709,12 +774,24 @@ public void testLocate() throws SparkException {
assertLocate("大千", "test大千世界大千世界", 9, "UNICODE_CI", 9);
assertLocate("大千", "大千世界大千世界", 1, "UNICODE_CI", 1);
// Case-variable character length
assertLocate("\u0307", "i̇", 1, "UTF8_BINARY", 2);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
assertLocate("\u0307", "i̇", 1, "UTF8_BINARY", 2);
assertLocate("\u0307", "i̇", 1, "UTF8_BINARY_LCASE", 2);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure what the intent of this test case is here. If it is for UTF8_BINARY_LCASE then we have it again below.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the intent is: show that "\u0307" is found in "i̇" on the binary level (with respect to UTF8_BINARY)

but "\u0307" is not found in "İ" on the character level (with respect to UTF8_BINARY_LCASE)

@@ -709,12 +774,24 @@ public void testLocate() throws SparkException {
assertLocate("大千", "test大千世界大千世界", 9, "UNICODE_CI", 9);
assertLocate("大千", "大千世界大千世界", 1, "UNICODE_CI", 1);
// Case-variable character length
assertLocate("\u0307", "i̇", 1, "UTF8_BINARY", 2);
assertLocate("\u0307", "İ", 1, "UTF8_BINARY_LCASE", 0); // != UTF8_BINARY
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the comment is wrong here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not wrong, but the intent was a bit different than other cases: instead of comparing differences between UNICODE_CI and UTF8_BINARY_LCASE, I wanted to compare UTF8_BINARY and UTF8_BINARY_LCASE - essentially ensuring that the new UTF8_BINARY_LCASE (character-wise) searching logic does not equal to applying UTF8_BINARY (byte-wise) searching logic on lowercased strings

@uros-db uros-db requested a review from mkaravel May 17, 2024 07:46
Copy link
Contributor

@mkaravel mkaravel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for patiently addressing my comments.
The PR looks good to me.

@mkaravel
Copy link
Contributor

Please fill in the PR description.

@uros-db uros-db changed the title [WIP][SPARK-48221][SQL] Alter string search logic for UTF8_BINARY_LCASE collation (Contains, StartsWith, EndsWith, StringLocate) [SPARK-48221][SQL] Alter string search logic for UTF8_BINARY_LCASE collation (Contains, StartsWith, EndsWith, StringLocate) May 21, 2024
Copy link
Contributor Author

@uros-db uros-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated PR description & title
@cloud-fan ready for review

@mkaravel
Copy link
Contributor

@uros-db

I am reading this in the PR description:

Why are the changes needed?
Fix functions that give unexpected results due to the conditional case mapping when performing string comparisons under UTF8_BINARY_LCASE.

This is not really the reason we we are making this change (conditional case mapping is the reason why we want to change the comparator for UTF8_BINARY_LCASE, but not related to string search directly).

The real reason is that we need to do string search at the UT8 char-by-char level because otherwise indices returned or considered during string search can return strange and unusable results due to the fact that case mappings are one-to-many, that is one UTF8 character can map to two or more UTF8 characters under the case mapping (both for upper and lower casing).

@@ -430,7 +430,7 @@ public static int execBinary(final UTF8String string, final UTF8String substring
}
public static int execLowercase(final UTF8String string, final UTF8String substring,
final int start) {
return string.toLowerCase().indexOf(substring.toLowerCase(), start);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to confirm, the previous implementation here is correct, right?

int endPos) {
assert endPos <= target.numChars();
for (int len = 0; len <= endPos; ++len) {
if (target.substring(endPos - len, endPos).toLowerCase().equals(lowercasePattern)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a way to do this comparison with only one memory copy? Right now it's two: substring and toLowerCase

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know the details very well. Is it the same if we lower case the entire target string first, and then call substring repeatedly?

Copy link
Contributor Author

@uros-db uros-db May 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it the same if we lower case the entire target string first

no, it's not - consider the example in the PR description: contains("İ", "i") where haystack = "İ" & needle = "i"; then we have: lower(haystack) = lower("İ") = "i\u0307" would produce substr(lower("İ"), 1, 1) = "i" = needle

however, we can see that these is no substr(haystack, start, len) such that lower(substr(haystack, start, len)) == needle - so these 2 behaviours are not actually equivalent (as we first assumed), and the reason for this discrepancy lies in conditional one-to-many case mapping (certain characters have lowercase equivalents consisted of multiple characters)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that said, we are focusing on correctness for now, and are aware of possible performance regression for string searching in UTF8_BINARY_LCASE - we intend to work on perf optimization in a subsequent PR

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
3 participants