[Draft] Add debugging operator to identify skew in datasets inline #46490
What changes were proposed in this pull request?
This introduces a new debugging method for identifying which values are producing skew in a dataset. It was written with joins in mind, but it can be used with any dataset.
Why are the changes needed?
Debugging a skewed join is a pain point. Once the skew is identified from the metrics (e.g. one partition has much more data than the others), it is cumbersome to figure out exactly which values that skew is coming from. Usually you need to write a manual aggregation and dump its output to the logs. This streamlines the process by adding a method, `inlineColumnsCount`, that does this automatically, inferring the join keys to count.

The current solution still requires a code change. A useful extension would be a configuration such as `spark.sql.join.countkeyvals=true`, so the feature could be enabled without any code changes. It would have to be enabled for all joins, but for debugging that is acceptable. This can be useful for prod jobs and, more generally, whenever uploading a code change to test means a longer iteration cycle. I would also like to add a SQL hint for pure SQL jobs.
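The core idea, counting occurrences of the join-key values to surface the hot ones, can be illustrated without Spark. Below is a minimal Python sketch of what the operator conceptually does on each executor; the function name and data layout here are hypothetical, not the PR's actual API:

```python
from collections import Counter

def count_key_values(partitions, key, top_n=3):
    """Count occurrences of each join-key value across partitions.

    `partitions` is a list of lists of row dicts; `key` is the column
    suspected of being skewed. Returns the most common (value, count)
    pairs, mimicking what the proposed operator would periodically
    write to the executor logs.
    """
    counts = Counter()
    for rows in partitions:
        for row in rows:
            counts[row[key]] += 1
    return counts.most_common(top_n)

# A toy dataset where the value "a" dominates one partition (skew).
partitions = [
    [{"id": "a"}] * 8,           # hot partition
    [{"id": "b"}, {"id": "c"}],  # normal partitions
    [{"id": "d"}],
]
print(count_key_values(partitions, "id", top_n=2))
# → [('a', 8), ('b', 1)]
```

In Spark itself the counting happens incrementally inside the operator rather than as a separate aggregation job, which is what saves the manual debugging round-trip.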
Does this PR introduce any user-facing change?
Yes. It adds a new operator that counts the occurrences of values for a combination of columns. The counts are periodically written to the executor logs.
How was this patch tested?
Manual testing. Here is an example of what the plan looks like:
Was this patch authored or co-authored using generative AI tooling?
No