Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP][SPARK-47353][SQL][Prototype of alternative algorithm] Enable collation support for the Mode expression using multiple experimental approaches #46488

Draft
wants to merge 21 commits into
base: master
Choose a base branch
from

Conversation

GideonPotok
Copy link
Contributor

@GideonPotok GideonPotok commented May 8, 2024

Here is the PR description for the alternative PR:


PR Description

Introduction

This PR proposes an alternative approach to the original implementation using TreeMap with a custom comparator for collation-sensitive grouping. The primary objective is to improve performance by leveraging OpenHashMap with a custom hashing strategy.

Benchmark Results

The initial TreeMap approach led to significant performance degradation, especially for unicode collations. After implementing a proof of concept using OpenHashMap, the slowdown was reduced considerably.

Benchmark Results Overview

Approach Slowdown (UTF8_BINARY Reference)
TreeMap 16.9x - 56x
OpenHashMap 9.5x - 15x

Details:

  • UTF8_BINARY (Baseline)
    • Red-Black Tree / TreeMap Implementation: Slowdown ranges from 16.9x to 56x
    • OpenHashMap Implementation: Slowdown ranges from 9.5x to 15x
  • The numerical benchmark case has been ignored for this analysis.

Proposed Implementation

  • Custom Hasher: Introduced a custom Hasher to allow collation-sensitive grouping.
    • Modified org.apache.spark.util.collection.OpenHashSet.Hasher.hash(), specifically an override of hash() within Hasher[String with Collation].
    • This new Hasher branches to an alternative hashing method that is collation-sensitive.
  • Benchmark Enhancement: Benchmarks now include multiple collation types and evaluate different algorithms.

Approach 3 (Prototype)
In addition to TreeMap and OpenHashMap, a third approach has been introduced:

private def impl3(buff: OpenHashMap[AnyRef, Long]) = {
  val modeMap = buff.toSeq.groupMapReduce {
    case (key: String, _) =>
      CollationFactory.getCollationKey(UTF8String.fromString(key), collationId)
    case (key: UTF8String, _) =>
      CollationFactory.getCollationKey(key, collationId)
    case (key, _) => key
  }(x => x)((x, y) => (x._1, x._2 + y._2)).values
  modeMap
}

Next Steps

  • Cardinality and Duplicates: Further analysis needed to determine expected cardinalities in realistic scenarios. For example, even in large datasets with 1 trillion rows, it's unlikely that all rows will have unique values.
  • Benchmark Diversity: More varied benchmarks needed, especially in the experimental branch, to better evaluate different algorithms with varied duplicate values.

OpenJDK Benchmark Results

Configuration:

  • Java Version: OpenJDK 64-Bit Server VM 21.0.2+13-LTS
  • OS: Mac OS X 14.4.1
  • CPU: Apple M3 Max

Collation Unit Benchmarks - Mode - 2000 Elements

Mode Map Type Best Time (ms) Avg Time (ms) Stdev (ms) Rate (M/s) Per Row (ns) Relative
Numerical Type N/A 0 0 0 30.5 32.8 9.7X
UTF8_BINARY N/A 0 0 0 16.4 60.8 5.2X
UTF8_BINARY_LCASE treemap 1 1 0 3.1 318.3 1.0X
UNICODE treemap 2 2 0 1.2 829.7 0.4X
UNICODE_CI treemap 2 2 0 1.1 877.5 0.4X
UTF8_BINARY_LCASE hashmap 1 1 0 3.4 298.3 1.1X
UNICODE hashmap 1 1 0 2.2 452.9 0.7X
UNICODE_CI hashmap 1 1 0 2.1 467.3 0.7X
UTF8_BINARY_LCASE mapreduce 0 0 0 5.3 187.6 1.7X
UNICODE mapreduce 0 0 0 5.9 169.6 1.9X
UNICODE_CI mapreduce 1 1 0 3.1 327.2 1.0X

Collation Unit Benchmarks - Mode - 4000 Elements

Mode Map Type Best Time (ms) Avg Time (ms) Stdev (ms) Rate (M/s) Per Row (ns) Relative
UTF8_BINARY_LCASE treemap 2 2 0 2.4 409.3 1.0X
UNICODE treemap 5 5 0 0.8 1190.5 0.3X
UTF8_BINARY N/A 0 0 0 14.4 69.2 5.9X
UNICODE_CI treemap 5 5 0 0.8 1213.4 0.3X
Numerical Type N/A 0 0 0 24.5 40.8 10.0X
UTF8_BINARY_LCASE hashmap 1 1 0 3.4 295.9 1.4X
UNICODE hashmap 2 2 0 1.8 556.8 0.7X
UNICODE_CI hashmap 2 2 0 2.2 459.8 0.9X
UTF8_BINARY_LCASE mapreduce 1 1 0 4.5 220.3 1.9X
UNICODE mapreduce 1 1 0 5.1 196.4 2.1X
UNICODE_CI mapreduce 1 2 0 2.7 364.8 1.1X
Mode Map Type Best Time (ms) Avg Time (ms) Stdev (ms) Rate (M/s) Per Row (ns) Relative
UTF8_BINARY_LCASE treemap 4 4 0 2.1 473.8 1.0X
UNICODE treemap 11 12 1 0.7 1385.7 0.3X
UTF8_BINARY treemap 1 1 0 13.9 71.7 6.6X
UNICODE_CI treemap 11 11 0 0.7 1348.2 0.4X
Numerical Type treemap 0 0 0 22.7 44.0 10.8X
UTF8_BINARY_LCASE hashmap 3 3 0 2.9 339.9 1.4X
UNICODE hashmap 5 5 0 1.8 568.0 0.8X
UTF8_BINARY hashmap 1 1 0 13.5 74.3 6.4X
UNICODE_CI hashmap 4 4 0 2.1 470.2 1.0X
Numerical Type hashmap 0 0 0 23.2 43.2 11.0X
UTF8_BINARY_LCASE mapreduce 2 2 0 3.9 255.7 1.9X
UNICODE mapreduce 2 2 0 4.3 230.2 2.1X
UTF8_BINARY mapreduce 1 1 0 14.8 67.5 7.0X
UNICODE_CI mapreduce 3 3 0 2.5 396.6 1.2X
Numerical Type mapreduce 0 0 0 23.7 42.2 11.2X

Thought: These benchmarks aren't on populations where the collation makes any difference.

OpenJDK Benchmark Results

Environment

  • JVM: OpenJDK 64-Bit Server VM 21.0.2+13-LTS
  • OS: Mac OS X 14.4.1
  • Hardware: Apple M3 Max

Collation Unit Benchmarks (6000 Elements)

Mode Implementation Best Time (ms) Avg Time (ms) Stdev (ms) Rate (M/s) Per Row (ns) Relative
UTF8_BINARY_LCASE treemap 3 3 0 2.2 456.0 1.0X
UNICODE treemap 8 8 0 0.8 1274.2 0.4X
UTF8_BINARY treemap 0 1 1 14.1 71.0 6.4X
UNICODE_CI treemap 8 8 0 0.8 1273.0 0.4X
Numerical Type N/A 0 0 0 23.8 42.0 10.9X
UTF8_BINARY_LCASE hashmap 2 3 2 2.7 368.3 1.2X
UNICODE_CI hashmap 3 4 0 1.8 550.0 0.8X
UTF8_BINARY_LCASE mapreduce 1 2 0 4.1 242.4 1.9X
UNICODE_CI mapreduce 2 3 0 2.6 390.8 1.2X
UTF8_BINARY_LCASE treemap 0.05% 3 3 0 1.9 516.8 0.9X
UTF8_BINARY treemap 0.05% 0 1 0 13.2 76.0 6.0X
UNICODE_CI treemap 0.05% 8 9 2 0.8 1322.2 0.3X
UTF8_BINARY_LCASE hashmap 0.05% 2 2 0 2.8 362.8 1.3X
UNICODE_CI hashmap 0.05% 3 4 0 1.9 530.0 0.9X
UTF8_BINARY_LCASE mapreduce 0.05% 2 2 0 3.8 263.1 1.7X
UNICODE_CI mapreduce 0.05% 2 3 0 2.7 375.7 1.2X
UTF8_BINARY_LCASE treemap 0.1% 3 3 0 1.9 521.4 0.9X
UTF8_BINARY treemap 0.1% 0 1 0 12.6 79.4 5.7X
UNICODE_CI treemap 0.1% 8 9 1 0.7 1346.7 0.3X
UTF8_BINARY_LCASE hashmap 0.1% 2 2 0 2.8 356.8 1.3X
UNICODE_CI hashmap 0.1% 3 4 0 1.9 516.3 0.9X
UTF8_BINARY_LCASE mapreduce 0.1% 2 2 0 3.7 269.0 1.7X
UNICODE_CI mapreduce 0.1% 2 3 0 2.6 388.0 1.2X

Collation Unit Benchmarks (12000 Elements)

Mode Implementation Best Time (ms) Avg Time (ms) Stdev (ms) Rate (M/s) Per Row (ns) Relative
UTF8_BINARY_LCASE treemap 7 7 0 1.8 557.2 1.0X
UNICODE treemap 18 20 1 0.7 1492.5 0.4X
UTF8_BINARY treemap 1 1 0 11.9 84.3 6.6X
UNICODE_CI treemap 17 20 1 0.7 1433.1 0.4X
Numerical Type N/A 1 1 0 21.7 46.1 12.1X
UTF8_BINARY_LCASE hashmap 5 5 0 2.6 379.8 1.5X
UNICODE_CI hashmap 7 7 1 1.8 571.0 1.0X
UTF8_BINARY_LCASE mapreduce 3 4 0 3.6 279.3 2.0X
UNICODE_CI mapreduce 5 5 0 2.4 411.4 1.4X
UTF8_BINARY_LCASE treemap 0.05% 7 7 0 1.7 585.1 1.0X
UTF8_BINARY treemap 0.05% 1 1 0 10.8 92.5 6.0X
UNICODE_CI treemap 0.05% 17 19 1 0.7 1454.2 0.4X
UTF8_BINARY_LCASE hashmap 0.05% 5 5 0 2.5 400.6 1.4X
UNICODE_CI hashmap 0.05% 7 8 1 1.7 593.0 0.9X
UTF8_BINARY_LCASE mapreduce 0.05% 4 4 0 3.3 304.8 1.8X
UNICODE_CI mapreduce 0.05% 5 6 1 2.3 433.9 1.3X
UTF8_BINARY_LCASE treemap 0.1% 7 8 0 1.7 591.2 0.9X
UTF8_BINARY treemap 0.1% 1 1 0 11.1 90.5 6.2X
UNICODE_CI treemap 0.1% 18 19 1 0.7 1482.1 0.4X
UTF8_BINARY_LCASE hashmap 0.1% 4 5 0 2.7 373.5 1.5X
UNICODE_CI hashmap 0.1% 7 7 0 1.7 573.7 1.0X
UTF8_BINARY_LCASE mapreduce 0.1% 4 4 0 3.3 300.8 1.9X
UNICODE_CI mapreduce 0.1% 5 6 0 2.3 437.3 1.3X

@github-actions github-actions bot added the SQL label May 8, 2024
@GideonPotok GideonPotok changed the title Spark 47353 2 experimental [WIP][SPARK-47353][SQL][Prototype of alternative algorithm] Enable collation support for the Mode expression using new hashing function May 8, 2024
@GideonPotok GideonPotok changed the title [WIP][SPARK-47353][SQL][Prototype of alternative algorithm] Enable collation support for the Mode expression using new hashing function [WIP][SPARK-47353][SQL][Prototype of alternative algorithm] Enable collation support for the Mode expression using multiple experimental approaches May 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
1 participant