[WIP][SPARK-47353][SQL][Prototype of alternative algorithm] Enable collation support for the Mode expression using multiple experimental approaches #46488
Draft
GideonPotok wants to merge 21 commits into apache:master from GideonPotok:spark_47353_2_experimental
+324
−13
collagtion enabled or not Factored in null count.: benchmark benchmark ready for review ready for review ready for review ready for review ready for review ready for review use collation id tests pass tidy implementation idea: tree map tests tests support mode eval test passes
…st can then be removed
…st can then be removed
6 tasks
GideonPotok changed the title from "Spark 47353 2 experimental" to "[WIP][SPARK-47353][SQL][Prototype of alternative algorithm] Enable collation support for the Mode expression using new hashing function" on May 8, 2024
GideonPotok changed the title from "[WIP][SPARK-47353][SQL][Prototype of alternative algorithm] Enable collation support for the Mode expression using new hashing function" to "[WIP][SPARK-47353][SQL][Prototype of alternative algorithm] Enable collation support for the Mode expression using multiple experimental approaches" on May 10, 2024
Here is the PR description for the alternative PR:
PR Description
Introduction
This PR proposes an alternative to the original implementation, which used TreeMap with a custom comparator for collation-sensitive grouping. The primary objective is to improve performance by leveraging OpenHashMap with a custom hashing strategy.
Benchmark Results
The initial TreeMap approach led to significant performance degradation, especially for Unicode collations. After implementing a proof of concept using OpenHashMap, the slowdown was reduced considerably.
Benchmark Results Overview
TreeMap
OpenHashMap
Details:
OpenHashMap Implementation: Slowdown ranges from 9.5x to 15x
Proposed Implementation
Hasher to allow collation-sensitive grouping.
org.apache.spark.util.collection.OpenHashSet.Hasher.hash(), specifically an override of hash() within Hasher[String with Collation].
Hasher branches to an alternative hashing method that is collation-sensitive.
Approach 3 (Prototype)
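As a rough illustration of the collation-sensitive hashing idea described under Proposed Implementation, the sketch below counts values in a hash map keyed by collation key, so that collation-equal strings fall into the same group. It is not the PR's code: java.util.HashMap stands in for Spark's OpenHashMap, java.text.Collator stands in for Spark's collation support, and the class and method names (CollationHashModeSketch, caseInsensitive, mode) are hypothetical. Skipping nulls is also an assumption of the sketch.

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Illustrative only: hypothetical names; HashMap stands in for Spark's OpenHashMap.
class CollationHashModeSketch {

    // A collator that treats strings differing only by case (or accents) as equal.
    static Collator caseInsensitive() {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        collator.setStrength(Collator.PRIMARY);
        return collator;
    }

    // Returns a most frequent value under the given collator, or null if there is none.
    // Null elements are skipped (an assumption of this sketch).
    static String mode(List<String> values, Collator collator) {
        Map<CollationKey, Long> counts = new HashMap<>();
        for (String value : values) {
            if (value == null) {
                continue;
            }
            // Collation-equal strings yield equal keys with equal hash codes,
            // so they are counted under a single map entry.
            counts.merge(collator.getCollationKey(value), 1L, Long::sum);
        }
        CollationKey best = null;
        long bestCount = 0;
        for (Map.Entry<CollationKey, Long> entry : counts.entrySet()) {
            if (entry.getValue() > bestCount) {
                best = entry.getKey();
                bestCount = entry.getValue();
            }
        }
        // The map retains the first key inserted for each group,
        // so the result is the first-seen spelling of the winning group.
        return best == null ? null : best.getSourceString();
    }
}
```

The design point is that equality and hashing are both derived from the collation key, which keeps lookups at expected constant time instead of the logarithmic comparisons a sorted map needs.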
In addition to TreeMap and OpenHashMap, a third approach has been introduced:
Next Steps
OpenJDK Benchmark Results
Configuration:
Collation Unit Benchmarks - Mode - 2000 Elements
Collation Unit Benchmarks - Mode - 4000 Elements
Thought: These benchmarks aren't on populations where the collation makes any difference.
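To make the thought above concrete, here is a small sketch of an input where collation does change the answer, evaluated with a TreeMap ordered by a comparator (the shape of the original approach). The sample data, the class name CollationSensitiveDataSketch, and its helpers are hypothetical, and java.text.Collator is used as a stand-in for Spark's collations; none of this is taken from the PR's benchmark code.

```java
import java.text.Collator;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: hypothetical names and data, not the PR's benchmark.
class CollationSensitiveDataSketch {

    // Under binary comparison "b" wins with 3 occurrences;
    // ignoring case, "a" and "A" merge into one group of 4.
    static final List<String> SAMPLE = List.of("b", "b", "b", "a", "A", "a", "A");

    // A collator that treats strings differing only by case (or accents) as equal.
    static Collator caseInsensitive() {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        collator.setStrength(Collator.PRIMARY);
        return collator;
    }

    // Counts values in a TreeMap whose comparator decides which strings are "the same",
    // then returns the key with the highest count (first in comparator order on ties).
    static String mode(List<String> values, Comparator<? super String> order) {
        TreeMap<String, Long> counts = new TreeMap<>(order);
        for (String value : values) {
            counts.merge(value, 1L, Long::sum);
        }
        String best = null;
        long bestCount = 0;
        for (Map.Entry<String, Long> entry : counts.entrySet()) {
            if (entry.getValue() > bestCount) {
                best = entry.getKey();
                bestCount = entry.getValue();
            }
        }
        return best;
    }
}
```

Calling mode(SAMPLE, Comparator.<String>naturalOrder()) gives the binary result, while mode(SAMPLE, caseInsensitive()) gives the collation-aware one. A benchmark population built along these lines would exercise the grouping logic rather than only its overhead.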
OpenJDK Benchmark Results
Environment
Collation Unit Benchmarks (6000 Elements)
Collation Unit Benchmarks (12000 Elements)