
[WIP][SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce #46526

Closed · wants to merge 44 commits

Conversation


@GideonPotok (Contributor) commented May 10, 2024

What changes were proposed in this pull request?

SPARK-47353

Approaches explored across related pull requests:

  • Scala TreeMap (RB Tree)
  • GroupMapReduce <- most performant (this PR)
  • Comparing Experimental Approaches

Central Change to Mode eval Algorithm:

  • Update to eval Method: The eval method now checks whether the column being evaluated is a string with a non-default collation and, if so, groups the buffer entries by collation key and sums their counts:

// Group buffer entries by collation key, so strings that are equal under the
// collation land in one group; each group keeps one representative
// (value, count) pair whose count is the sum over the group.
buff.toSeq.groupMapReduce {
        case (key: String, _) =>
          CollationFactory.getCollationKey(UTF8String.fromString(key), collationId)
        case (key: UTF8String, _) =>
          CollationFactory.getCollationKey(key, collationId)
        case (key, _) => key
      }(x => x)((x, y) => (x._1, x._2 + y._2)).values
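For intuition, a toy standalone run of the same grouping (not PR code; toLowerCase stands in for CollationFactory.getCollationKey):

  // Counts as Mode's buffer might hold them, where "Def", "def", and "DEF"
  // are equal under a lowercase collation but stored as separate keys.
  val buff = Map("Def" -> 1L, "def" -> 1L, "DEF" -> 1L, "abc" -> 2L)
  val merged = buff.toSeq.groupMapReduce { case (k, _) => k.toLowerCase }(identity) {
    case ((k1, c1), (_, c2)) => (k1, c1 + c2)
  }.values
  // merged holds one pair per group: count 3 for the "def" group (with
  // whichever casing the reduction kept first) and ("abc", 2).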

Minor Change to Mode:

  • Introduction of collationId: A new lazy value collationId is computed from the dataType of the child expression, used to fetch the appropriate collation comparator when collationEnabled is true.
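A minimal sketch of that value, assuming the collation-aware StringType carries its collation id (the exact PR code may differ):

  // collationId of the child column's string type; used to fetch the
  // collation (comparator / collation keys) from CollationFactory.
  private lazy val collationId: Int =
    child.dataType.asInstanceOf[StringType].collationId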

Unit Test Enhancements: Significant additions to CollationStringExpressionsSuite to test the new functionality, including:

  • Tests for the Mode function when handling strings with different collation settings.

Benchmark Updates:

  • Enhanced the CollationBenchmark classes to include benchmarks for the new mode functionality with and without collation settings, as well as for numerical types.

Why are the changes needed?

  1. Ensures consistency in handling string comparisons under various collation settings.
  2. Improves global usability by enabling compatibility with different collation standards.

Does this PR introduce any user-facing change?

Yes, this PR introduces the following user-facing changes:

  1. Adds a new collationEnabled property to the Mode expression.
  2. Users can now specify collation settings for the Mode expression to customize its behavior.

How was this patch tested?

This patch was tested through a combination of new and existing unit and end-to-end SQL tests.

  1. Unit Tests:
    • CollationStringExpressionsSuite:
      • WIP: bring the newly added tests in line with the design pattern of the existing tests
    • Added multiple test cases to verify that the Mode function correctly handles strings with different collation settings.

Tests do not need to cover null handling.

  2. Benchmark Tests:

  3. Manual Testing:

 ./build/mvn -DskipTests clean package
 export SPARK_HOME=/Users/gideon/repos/spark
 $SPARK_HOME/bin/spark-shell

 spark.sqlContext.setConf("spark.sql.collation.enabled", "true")
 import org.apache.spark.sql.types.StringType
 import org.apache.spark.sql.functions
 import org.apache.spark.sql.functions.col   // needed for the bare col(...) below
 import spark.implicits._
 val data = Seq(("Def"), ("def"), ("DEF"), ("abc"), ("abc"))
 val df = data.toDF("word")
 val dfLC = df.withColumn("word",
   col("word").cast(StringType("UTF8_BINARY_LCASE")))
 val dfLCA = dfLC.agg(functions.mode(col("word")).as("count"))
 dfLCA.show()
/*
BEFORE:
+-----+
|count|
+-----+
|  abc|
+-----+

AFTER:
+-----+
|count|
+-----+
|  Def|
+-----+

*/
  4. Continuous Integration (CI):
    • The patch passed all relevant Continuous Integration (CI) checks, including:
      • Unit test suite
      • Benchmark suite [TODO RUN AND ATTACH RESULTS]

Was this patch authored or co-authored using generative AI tooling?

No, but the PR description was co-authored with ChatGPT.

What is left to do before taking off the WIP tag?

  • Discuss/analyze performance in PR Description
  • Make sure benchmarks invoke the function enough times to have non-zero values in the benchmark results
  • Run benchmarks in GHA and add resultant benchmark reports to Pull Request
  • Make the newly added tests follow the same design pattern as the existing tests
  • Make the newly added tests cover all collations
  • Figure out the within-group return value (e.g. with a lowercase collation, given Seq(("Def"), ("def"), ("def"), ("DEF"), ("abc"), ("abc"), ("abc")), is Def, DEF, or def the answer?). Choices are:
    1. First in the data frame (Def)
    2. Within-group mode (def)
    3. Arbitrary/undefined
    4. Physical ordering
    Need to check, but I think it is arbitrary in Mode.

@github-actions bot added the SQL label May 10, 2024
Contributor

I would say that these results are somewhat better, but still not too good.

However, what's imperative right now is that we preserve the performance for UTF8_BINARY (by doing that if/else branch on supportsBinaryEquality). If we don't have a better approach at this moment, and we've already tried a couple of things, then I would say that's fine and we can proceed with the best of what we've got (which is this PR).
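A minimal sketch of that branch, under assumed names (buff maps value -> count; not the exact PR code):

  val counts =
    if (CollationFactory.fetchCollation(collationId).supportsBinaryEquality) {
      // UTF8_BINARY and UNICODE: byte-wise equality already groups keys correctly.
      buff.toSeq
    } else {
      // Collation-aware path: merge counts of keys that are equal under the collation.
      buff.toSeq.groupMapReduce { case (k: UTF8String, _) =>
        CollationFactory.getCollationKey(k, collationId)
      }(identity) { case ((k1, c1), (_, c2)) => (k1, c1 + c2) }.values.toSeq
    }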

Contributor Author

  1. Note, by the way, that because we are relying on supportsBinaryEquality, this is about preserving not only the performance of UTF8_BINARY but also that of UNICODE.
  2. @uros-db check again. I believe the benchmarks are slightly more realistic now, in that for each string there are 3-6 that are equal by collation. E.g.:
 collation unit benchmarks - mode - 30105 elements:  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ---------------------------------------------------------------------------------------------------------------------------------
 UTF8_BINARY_LCASE - mode - 30105 elements                      6              6           0          5.1         195.6       1.0X
 UNICODE - mode - 30105 elements                                3              3           0         11.6          86.0       2.3X
 UTF8_BINARY - mode - 30105 elements                            3              3           0         11.6          85.9       2.3X
 UNICODE_CI - mode - 30105 elements                            12             12           1          2.6         382.9       0.5X
  3. Still a slowdown (though there is more work, so how would we expect any different). I will run these new benchmarks on the other approaches and, assuming this one is best, we can get this one ready for the final stages of cleanup and review.

  4. I will leave one benchmark for mode rather than having three for different input sizes... that was just a temporary setup.

  5. An idea would be, for the case of UTF8_BINARY and UNICODE, to go through the lower operation first. This would be a better way to check that, as the design doc instructs: "Performance regression for case insensitive collation must be no worse than using upper() or ilike() explicitly". Let me know whether to change the benchmark accordingly; there will probably still be a performance degradation, but it would at least be a fairer comparison. A sketch of that fairer baseline is below.
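Sketch (assumed, not PR code): fold case explicitly on a non-collated column and take the mode of the folded values.

  import org.apache.spark.sql.functions.{col, lower, mode}
  // Baseline cost of explicit case folding, per the design doc's comparison;
  // df is assumed to be a DataFrame with a plain UTF8_BINARY "word" column.
  val baseline = df.agg(mode(lower(col("word"))).as("mode_lower"))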

Contributor

Apropos going through lower first, we need to be careful so as not to destroy the original data.

Consider the example of finding the mode in ['a', 'B', 'B', 'A']. Here, correct answers would be:

  • 'a'
  • 'A'
  • 'B'

but NOT:
'b'

because that value is not found in the original data
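Sketched with toy data (toLowerCase standing in for the collation key; not PR code), this is why the grouping carries an original value through the reduction:

  val data = Seq("a", "B", "B", "A")
  val counts = data.groupMapReduce(_.toLowerCase)(v => (v, 1L)) {
    case ((v1, c1), (_, c2)) => (v1, c1 + c2)
  }
  // counts == Map("a" -> ("a", 2), "b" -> ("B", 2)): the mode candidates are
  // "a" and "B" -- never "b", which does not occur in the original data.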

Contributor

Apropos altering the benchmark to yield better results for this particular expression, I'm not sure that's something we should encourage - the benchmark is not perfect and should only be used for rough estimates. It's fine to consider the worst-case scenario (all different elements), and I think we should look for the best approach anyway.

Contributor

All things considered, I would say proceed with this approach - clean everything up and get it running with your tests and all CI checks in order. Then we can call in other reviewers and see where it goes.

  modeMap
} else {
  buff
}
Contributor

Going back to the original issue (why Mode doesn't work already, while aggregation otherwise generally works with collated strings in Spark), here's what I'm interested in: why does PhysicalDataType.ordering(child.dataType).asInstanceOf[Ordering[AnyRef]] not work here automatically?

AFAIK, ordering for PhysicalStringType is defined correctly:

private[sql] val ordering = CollationFactory.fetchCollation(collationId).comparator.compare(_, _)

so one would naturally expect Mode to work "as is".

Did you investigate this, maybe?

Contributor Author

@uros-db While the ordering for PhysicalStringType is correctly established via private[sql] val ordering = CollationFactory.fetchCollation(collationId).comparator.compare(_, _), this does not automatically resolve the issue with Mode. To illustrate, consider UTF8_BINARY_LCASE with an input buffer of Map("a" -> 3L, "b" -> 2L, "B" -> 2L): Mode ends up evaluating the maximum over the tuples (2L, "B"), (2L, "b"), (3L, "a"), rather than over the expected merged tuples (3L, "a"), (4L, "b"). A collation-aware ordering changes how entries compare, but it does not merge the counts of keys that are equal under the collation, which is what Mode needs. Unit tests confirm that Mode won't work for such cases otherwise.
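A toy illustration of the gap (assumed shapes, not PR code):

  // Buffer as Mode builds it under UTF8_BINARY_LCASE: "b" and "B" are equal
  // under the collation but still sit in separate entries.
  val buff = Map("a" -> 3L, "b" -> 2L, "B" -> 2L)

  // A collation-aware Ordering only changes how individual entries compare;
  // taking the max by count still sees "b" and "B" as two entries of count 2:
  val naive = buff.maxBy { case (_, count) => count }   // ("a", 3)

  // The collation-correct answer merges the group first: "b"/"B" occur
  // 2 + 2 = 4 times, so the mode should be "b" or "B", not "a".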
