[WIP][SPARK-47353][SQL][Prototype of alternative algorithm] Enable collation support for the Mode expression using multiple experimental approaches #46488

GideonPotok · 2024-05-08T21:51:32Z

Here is the PR description for the alternative PR:

PR Description

Introduction

This PR proposes an alternative approach to the original implementation using TreeMap with a custom comparator for collation-sensitive grouping. The primary objective is to improve performance by leveraging OpenHashMap with a custom hashing strategy.

Benchmark Results

The initial TreeMap approach led to significant performance degradation, especially for unicode collations. After implementing a proof of concept using OpenHashMap, the slowdown was reduced considerably.

Benchmark Results Overview

Approach	Slowdown (UTF8_BINARY Reference)
`TreeMap`	16.9x - 56x
`OpenHashMap`	9.5x - 15x

Details:

UTF8_BINARY (Baseline)
- Red-Black Tree / TreeMap Implementation: Slowdown ranges from 16.9x to 56x
- OpenHashMap Implementation: Slowdown ranges from 9.5x to 15x
The numerical benchmark case has been ignored for this analysis.

Proposed Implementation

Custom Hasher: Introduced a custom Hasher to allow collation-sensitive grouping.
- Modified org.apache.spark.util.collection.OpenHashSet.Hasher.hash(), specifically an override of hash() within Hasher[String with Collation].
- This new Hasher branches to an alternative hashing method that is collation-sensitive.
Benchmark Enhancement: Benchmarks now include multiple collation types and evaluate different algorithms.

Approach 3 (Prototype)
In addition to TreeMap and OpenHashMap, a third approach has been introduced:

private def impl3(buff: OpenHashMap[AnyRef, Long]) = {
  val modeMap = buff.toSeq.groupMapReduce {
    case (key: String, _) =>
      CollationFactory.getCollationKey(UTF8String.fromString(key), collationId)
    case (key: UTF8String, _) =>
      CollationFactory.getCollationKey(key, collationId)
    case (key, _) => key
  }(x => x)((x, y) => (x._1, x._2 + y._2)).values
  modeMap
}

Next Steps

Cardinality and Duplicates: Further analysis needed to determine expected cardinalities in realistic scenarios. For example, even in large datasets with 1 trillion rows, it's unlikely that all rows will have unique values.
Benchmark Diversity: More varied benchmarks needed, especially in the experimental branch, to better evaluate different algorithms with varied duplicate values.

OpenJDK Benchmark Results

Configuration:

Java Version: OpenJDK 64-Bit Server VM 21.0.2+13-LTS
OS: Mac OS X 14.4.1
CPU: Apple M3 Max

Collation Unit Benchmarks - Mode - 2000 Elements

Mode	Map Type	Best Time (ms)	Avg Time (ms)	Rate (M/s)	Per Row (ns)	Relative
Numerical Type	N/A	0	0	30.5	32.8	9.7X
UTF8_BINARY	N/A	0	0	16.4	60.8	5.2X
UTF8_BINARY_LCASE	treemap	1	1	3.1	318.3	1.0X
UNICODE	treemap	2	2	1.2	829.7	0.4X
UNICODE_CI	treemap	2	2	1.1	877.5	0.4X
UTF8_BINARY_LCASE	hashmap	1	1	3.4	298.3	1.1X
UNICODE	hashmap	1	1	2.2	452.9	0.7X
UNICODE_CI	hashmap	1	1	2.1	467.3	0.7X
UTF8_BINARY_LCASE	mapreduce	0	0	5.3	187.6	1.7X
UNICODE	mapreduce	0	0	5.9	169.6	1.9X
UNICODE_CI	mapreduce	1	1	3.1	327.2	1.0X

Collation Unit Benchmarks - Mode - 4000 Elements

Mode	Map Type	Best Time (ms)	Avg Time (ms)	Rate (M/s)	Per Row (ns)	Relative
UTF8_BINARY_LCASE	treemap	2	2	2.4	409.3	1.0X
UNICODE	treemap	5	5	0.8	1190.5	0.3X
UTF8_BINARY	N/A	0	0	14.4	69.2	5.9X
UNICODE_CI	treemap	5	5	0.8	1213.4	0.3X
Numerical Type	N/A	0	0	24.5	40.8	10.0X
UTF8_BINARY_LCASE	hashmap	1	1	3.4	295.9	1.4X
UNICODE	hashmap	2	2	1.8	556.8	0.7X
UNICODE_CI	hashmap	2	2	2.2	459.8	0.9X
UTF8_BINARY_LCASE	mapreduce	1	1	4.5	220.3	1.9X
UNICODE	mapreduce	1	1	5.1	196.4	2.1X
UNICODE_CI	mapreduce	1	2	2.7	364.8	1.1X

Mode	Map Type	Best Time (ms)	Avg Time (ms)	Stdev (ms)	Rate (M/s)	Per Row (ns)	Relative
UTF8_BINARY_LCASE	treemap	4	4	0	2.1	473.8	1.0X
UNICODE	treemap	11	12	1	0.7	1385.7	0.3X
UTF8_BINARY	treemap	1	1	0	13.9	71.7	6.6X
UNICODE_CI	treemap	11	11	0	0.7	1348.2	0.4X
Numerical Type	treemap	0	0	0	22.7	44.0	10.8X
UTF8_BINARY_LCASE	hashmap	3	3	0	2.9	339.9	1.4X
UNICODE	hashmap	5	5	0	1.8	568.0	0.8X
UTF8_BINARY	hashmap	1	1	0	13.5	74.3	6.4X
UNICODE_CI	hashmap	4	4	0	2.1	470.2	1.0X
Numerical Type	hashmap	0	0	0	23.2	43.2	11.0X
UTF8_BINARY_LCASE	mapreduce	2	2	0	3.9	255.7	1.9X
UNICODE	mapreduce	2	2	0	4.3	230.2	2.1X
UTF8_BINARY	mapreduce	1	1	0	14.8	67.5	7.0X
UNICODE_CI	mapreduce	3	3	0	2.5	396.6	1.2X
Numerical Type	mapreduce	0	0	0	23.7	42.2	11.2X

Thought: These benchmarks aren't on populations where the collation makes any difference.

OpenJDK Benchmark Results

Environment

JVM: OpenJDK 64-Bit Server VM 21.0.2+13-LTS
OS: Mac OS X 14.4.1
Hardware: Apple M3 Max

Collation Unit Benchmarks (6000 Elements)

Mode	Implementation	Best Time (ms)	Avg Time (ms)	Stdev (ms)	Rate (M/s)	Per Row (ns)	Relative
UTF8_BINARY_LCASE	treemap	3	3	0	2.2	456.0	1.0X
UNICODE	treemap	8	8	0	0.8	1274.2	0.4X
UTF8_BINARY	treemap	0	1	1	14.1	71.0	6.4X
UNICODE_CI	treemap	8	8	0	0.8	1273.0	0.4X
Numerical Type	N/A	0	0	0	23.8	42.0	10.9X
UTF8_BINARY_LCASE	hashmap	2	3	2	2.7	368.3	1.2X
UNICODE_CI	hashmap	3	4	0	1.8	550.0	0.8X
UTF8_BINARY_LCASE	mapreduce	1	2	0	4.1	242.4	1.9X
UNICODE_CI	mapreduce	2	3	0	2.6	390.8	1.2X
UTF8_BINARY_LCASE	treemap 0.05%	3	3	0	1.9	516.8	0.9X
UTF8_BINARY	treemap 0.05%	0	1	0	13.2	76.0	6.0X
UNICODE_CI	treemap 0.05%	8	9	2	0.8	1322.2	0.3X
UTF8_BINARY_LCASE	hashmap 0.05%	2	2	0	2.8	362.8	1.3X
UNICODE_CI	hashmap 0.05%	3	4	0	1.9	530.0	0.9X
UTF8_BINARY_LCASE	mapreduce 0.05%	2	2	0	3.8	263.1	1.7X
UNICODE_CI	mapreduce 0.05%	2	3	0	2.7	375.7	1.2X
UTF8_BINARY_LCASE	treemap 0.1%	3	3	0	1.9	521.4	0.9X
UTF8_BINARY	treemap 0.1%	0	1	0	12.6	79.4	5.7X
UNICODE_CI	treemap 0.1%	8	9	1	0.7	1346.7	0.3X
UTF8_BINARY_LCASE	hashmap 0.1%	2	2	0	2.8	356.8	1.3X
UNICODE_CI	hashmap 0.1%	3	4	0	1.9	516.3	0.9X
UTF8_BINARY_LCASE	mapreduce 0.1%	2	2	0	3.7	269.0	1.7X
UNICODE_CI	mapreduce 0.1%	2	3	0	2.6	388.0	1.2X

Collation Unit Benchmarks (12000 Elements)

Mode	Implementation	Best Time (ms)	Avg Time (ms)	Stdev (ms)	Rate (M/s)	Per Row (ns)	Relative
UTF8_BINARY_LCASE	treemap	7	7	0	1.8	557.2	1.0X
UNICODE	treemap	18	20	1	0.7	1492.5	0.4X
UTF8_BINARY	treemap	1	1	0	11.9	84.3	6.6X
UNICODE_CI	treemap	17	20	1	0.7	1433.1	0.4X
Numerical Type	N/A	1	1	0	21.7	46.1	12.1X
UTF8_BINARY_LCASE	hashmap	5	5	0	2.6	379.8	1.5X
UNICODE_CI	hashmap	7	7	1	1.8	571.0	1.0X
UTF8_BINARY_LCASE	mapreduce	3	4	0	3.6	279.3	2.0X
UNICODE_CI	mapreduce	5	5	0	2.4	411.4	1.4X
UTF8_BINARY_LCASE	treemap 0.05%	7	7	0	1.7	585.1	1.0X
UTF8_BINARY	treemap 0.05%	1	1	0	10.8	92.5	6.0X
UNICODE_CI	treemap 0.05%	17	19	1	0.7	1454.2	0.4X
UTF8_BINARY_LCASE	hashmap 0.05%	5	5	0	2.5	400.6	1.4X
UNICODE_CI	hashmap 0.05%	7	8	1	1.7	593.0	0.9X
UTF8_BINARY_LCASE	mapreduce 0.05%	4	4	0	3.3	304.8	1.8X
UNICODE_CI	mapreduce 0.05%	5	6	1	2.3	433.9	1.3X
UTF8_BINARY_LCASE	treemap 0.1%	7	8	0	1.7	591.2	0.9X
UTF8_BINARY	treemap 0.1%	1	1	0	11.1	90.5	6.2X
UNICODE_CI	treemap 0.1%	18	19	1	0.7	1482.1	0.4X
UTF8_BINARY_LCASE	hashmap 0.1%	4	5	0	2.7	373.5	1.5X
UNICODE_CI	hashmap 0.1%	7	7	0	1.7	573.7	1.0X
UTF8_BINARY_LCASE	mapreduce 0.1%	4	4	0	3.3	300.8	1.9X
UNICODE_CI	mapreduce 0.1%	5	6	0	2.3	437.3	1.3X

collagtion enabled or not Factored in null count.: benchmark benchmark ready for review ready for review ready for review ready for review ready for review ready for review use collation id tests pass tidy implementation idea: tree map tests tests support mode eval test passes

…st can then be removed

GideonPotok added 17 commits May 7, 2024 08:41

fixed up tests according to expectations surrounding nulls.

0f0eedf

mode benchmark

e1a533f

mode benchmark

a98eebe

remove class member for collatrion enabled

e4bf907

remove class member for collatrion enabled

c89af54

remove class member for collatrion enabled

794b20a

dataType check can be incorporated into the previous test, so this te…

3e891df

…st can then be removed

dataType check can be incorporated into the previous test, so this te…

0849a21

…st can then be removed

scalastyle

e79e14e

scalastyle

9a32243

fix add back ._1

506f7fc

fix up

16ed98f

tests pass

6891c9b

tests pass

3a0edac

Experimental

f96fb02

tests pass

b05e172

github-actions bot added the SQL label May 8, 2024

GideonPotok mentioned this pull request May 8, 2024

[WIP][SPARK-47353][SQL] Enable collation support for the Mode expression using a scala TreeMap (RB Tree) #46404

Open

6 tasks

GideonPotok changed the title ~~Spark 47353 2 experimental~~ [WIP][SPARK-47353][SQL][Prototype of alternative algorithm] Enable collation support for the Mode expression using new hashing function May 8, 2024

GideonPotok added 4 commits May 9, 2024 09:24

third approach

7203a44

third approach

07bcb7a

third approach

496a2f3

third approach

3a2b586

This was referenced May 10, 2024

[WIP][SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce #46526

Closed

[WIP][SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce #46597

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[WIP][SPARK-47353][SQL][Prototype of alternative algorithm] Enable collation support for the Mode expression using multiple experimental approaches #46488

[WIP][SPARK-47353][SQL][Prototype of alternative algorithm] Enable collation support for the Mode expression using multiple experimental approaches #46488

GideonPotok commented May 8, 2024 •

edited

[WIP][SPARK-47353][SQL][Prototype of alternative algorithm] Enable collation support for the Mode expression using multiple experimental approaches #46488

Are you sure you want to change the base?

[WIP][SPARK-47353][SQL][Prototype of alternative algorithm] Enable collation support for the Mode expression using multiple experimental approaches #46488

Conversation

GideonPotok commented May 8, 2024 • edited

PR Description

Introduction

Benchmark Results

Proposed Implementation

Next Steps

OpenJDK Benchmark Results

Collation Unit Benchmarks - Mode - 2000 Elements

Collation Unit Benchmarks - Mode - 4000 Elements

OpenJDK Benchmark Results

Environment

Collation Unit Benchmarks (6000 Elements)

Collation Unit Benchmarks (12000 Elements)

GideonPotok commented May 8, 2024 •

edited