feat: Add support for complex types in native shuffle #1655

andygrove · 2025-04-17T15:15:25Z

Which issue does this PR close?

Closes #.

Rationale for this change

What changes are included in this PR?

Use common supportedDataType method for columnar and native shuffle
Add TimestampNTZType as a supported type
Add fuzz test for shuffle that asserts that shuffle is native when experimental native scans are enabled

How are these changes tested?

andygrove · 2025-04-17T17:35:33Z

spark/src/main/scala/org/apache/comet/serde/QueryPlanSerde.scala

@@ -2889,6 +2855,31 @@ object QueryPlanSerde extends Logging with CometExprShim {
    }
  }

+  def supportedShuffleDataType(dt: DataType): Boolean = dt match {
+    case _: ByteType | _: ShortType | _: IntegerType | _: LongType | _: FloatType |
+        _: DoubleType | _: StringType | _: BinaryType | _: TimestampType | _: TimestampNTZType |


This code was moved and is not new. I added TimestampNTZType.

Do we have a test with TimestampNTZType for shuffle?

Yes. I discovered that TimestampNTZType was not supported because the fuzz test was initially failing. The fuzz test generates a file with all supported types (although maps are currently explicitly disabled in this test suite).

andygrove · 2025-04-17T17:36:25Z

spark/src/main/scala/org/apache/comet/serde/QueryPlanSerde.scala

-          _: DateType | _: BooleanType =>
-        true
-      case _ =>
-        // Native shuffle doesn't support struct/array yet


Yes, it does! This method is removed, and we now have a single supportedShuffleDataType method that is used for both native and columnar shuffle type checks.

Thanks for that. I was so confused about having this supported-type check in at least 3 places.
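
To illustrate the idea of a single recursive type check shared by both shuffle implementations, here is a minimal sketch. It is not the code from this PR: the `DataType` hierarchy below is a small stand-in defined only so the example runs without Spark on the classpath, `UnsupportedType` is a hypothetical placeholder, and the set of accepted leaf types is an assumption for illustration.

```scala
// Stand-in type model so the sketch runs without Spark.
// These are NOT Spark's classes; they only mirror the general shape of DataType.
sealed trait DataType
case object IntegerType extends DataType
case object StringType extends DataType
case object TimestampNTZType extends DataType
case object UnsupportedType extends DataType // hypothetical placeholder for any rejected type
final case class ArrayType(elementType: DataType) extends DataType
final case class StructType(fields: Seq[(String, DataType)]) extends DataType

object ShuffleTypeCheckSketch {
  // One predicate that every caller can share. Complex types are accepted
  // only when all of their nested types are accepted.
  def supportedShuffleDataType(dt: DataType): Boolean = dt match {
    case IntegerType | StringType | TimestampNTZType => true
    case ArrayType(elementType) => supportedShuffleDataType(elementType)
    case StructType(fields) => fields.forall { case (_, t) => supportedShuffleDataType(t) }
    case _ => false
  }
}
```

With this shape, a schema such as `StructType(Seq("_1" -> IntegerType, "_2" -> StructType(Seq("_1" -> IntegerType, "_2" -> StringType))))` is accepted because every leaf is accepted, while any type that nests `UnsupportedType` is rejected.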

comphead · 2025-04-17T18:20:46Z

spark/src/main/scala/org/apache/comet/serde/QueryPlanSerde.scala

+  /**
+   * Determine which data types are supported as hash-partition keys in a shuffle.
+   */
+  def supportedShufflePartitionDataType(dt: DataType): Boolean = dt match {


Suggested change

-  def supportedShufflePartitionDataType(dt: DataType): Boolean = dt match {
+  def supportedShufflePartitionKeyDataType(dt: DataType): Boolean = dt match {

I applied this change myself since I had to update the caller sites as well.

comphead · 2025-04-17T18:23:18Z

spark/src/main/scala/org/apache/comet/serde/QueryPlanSerde.scala

@@ -2889,6 +2857,48 @@ object QueryPlanSerde extends Logging with CometExprShim {
    }
  }

+  /**
+   * Determine which data types are supported as hash-partition keys in a shuffle.


Suggested change

-   * Determine which data types are supported as hash-partition keys in a shuffle.
+   * Determine which data types are supported as hash-partition keys in a shuffle.
+   *
+   * Hash Partition Key determines how data should be collocated for operations like `groupByKey`, `reduceByKey` or `join`
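
As a rough illustration of why the hash-partition key matters, the sketch below routes rows to partitions by hashing the key, so all rows with the same key end up in the same partition. It is an assumption-laden toy: `HashPartitionSketch`, `partitionFor`, and `shuffle` are hypothetical names, and `hashCode` stands in for whatever hash function the real shuffle uses.

```scala
object HashPartitionSketch {
  // Map a key to a partition index in [0, numPartitions).
  // hashCode can be negative, so the remainder is shifted into range.
  def partitionFor(key: Any, numPartitions: Int): Int = {
    val raw = key.hashCode % numPartitions
    if (raw < 0) raw + numPartitions else raw
  }

  // Group rows by the partition their key hashes to.
  def shuffle[K, V](rows: Seq[(K, V)], numPartitions: Int): Map[Int, Seq[(K, V)]] =
    rows.groupBy { case (k, _) => partitionFor(k, numPartitions) }
}
```

Because the partition index depends only on the key, an aggregation or join that follows the shuffle can process each key entirely within one partition. This is also why the key type must be one the shuffle can hash.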

andygrove · 2025-04-17T20:00:02Z

I'm seeing a number of failures like this:

2025-04-17T19:12:54.7914570Z - columnar shuffle on struct including nulls *** FAILED *** (352 milliseconds)
2025-04-17T19:12:54.7916470Z   List() had length 0 instead of expected length 1 Sort [_1#4948 ASC NULLS FIRST], false, 0
2025-04-17T19:12:54.7917940Z   +- Exchange hashpartitioning(_1#4948, _2#4949, 10), REPARTITION_BY_NUM, [plan_id=13332]
2025-04-17T19:12:54.7919190Z      +- Filter (isnotnull(_1#4948) AND (_1#4948 > 1))
2025-04-17T19:12:54.7922160Z         +- FileScan parquet [_1#4948,_2#4949] Batched: true, DataFilters: [isnotnull(_1#4948), (_1#4948 > 1)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/Users/runner/work/datafusion-comet/datafusion-comet/spark/target..., PartitionFilters: [], PushedFilters: [IsNotNull(_1), GreaterThan(_1,1)], ReadSchema: struct<_1:int,_2:struct<_1:int,_2:string>>

codecov-commenter · 2025-04-18T04:02:44Z

Codecov Report

Attention: Patch coverage is 88.46154% with 6 lines in your changes missing coverage. Please review.

Project coverage is 58.80%. Comparing base (f09f8af) to head (b5b4d27).
Report is 149 commits behind head on main.

Files with missing lines	Patch %	Lines
.../scala/org/apache/comet/serde/QueryPlanSerde.scala	90.69%	0 Missing and 4 partials ⚠️
...org/apache/comet/CometSparkSessionExtensions.scala	77.77%	0 Missing and 2 partials ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##               main    #1655      +/-   ##
============================================
+ Coverage     56.12%   58.80%   +2.68%     
- Complexity      976     1082     +106     
============================================
  Files           119      125       +6     
  Lines         11743    12592     +849     
  Branches       2251     2362     +111     
============================================
+ Hits           6591     7405     +814     
- Misses         4012     4015       +3     
- Partials       1140     1172      +32


andygrove · 2025-04-18T12:37:15Z

@parthchandra @mbutrovich This PR is now ready for review.

mbutrovich · 2025-04-18T14:10:33Z

Shout out to @Kontinuation! #1511 removed a lot of the custom logic in the shuffle writer that would have needed to be extended to support complex types. Instead, we now rely on Arrow functions that already support complex types.

comphead

lgtm thanks @andygrove

kazuyukitanimura · 2025-04-18T19:40:59Z

spark/src/test/scala/org/apache/comet/CometFuzzTestSuite.scala

@@ -161,6 +162,18 @@ class CometFuzzTestSuite extends CometTestBase with AdaptiveSparkPlanHelper {
    }
  }

+  test("shuffle") {
+    val df = spark.read.parquet(filename)


Does the data have complex type?

Yes, the data has arrays and structs but not maps yet

kazuyukitanimura · 2025-04-18T19:42:03Z

spark/src/main/scala/org/apache/comet/serde/QueryPlanSerde.scala

@@ -2889,6 +2855,31 @@ object QueryPlanSerde extends Logging with CometExprShim {
    }
  }

+  def supportedShuffleDataType(dt: DataType): Boolean = dt match {
+    case _: ByteType | _: ShortType | _: IntegerType | _: LongType | _: FloatType |
+        _: DoubleType | _: StringType | _: BinaryType | _: TimestampType | _: TimestampNTZType |


Do we have a test with TimestampNTZType for shuffle?

kazuyukitanimura · 2025-04-18T19:46:06Z

spark/src/main/scala/org/apache/comet/serde/QueryPlanSerde.scala

+   * Determine which data types are supported in a shuffle.
+   */
+  def supportedShuffleDataType(dt: DataType): Boolean = dt match {
+    case _: BooleanType => true


nit: Was BooleanType moved here on its own because of the code style checks?

No. At one point I was seeing errors related to boolean and had made functional changes here that I later reverted.

I reverted the style change

andygrove · 2025-04-18T20:30:16Z

spark/src/main/scala/org/apache/comet/serde/QueryPlanSerde.scala

-            _: DoubleType | _: TimestampType | _: TimestampType | _: DecimalType | _: DateType =>
+            _: DoubleType | _: TimestampType | _: TimestampNTZType | _: DecimalType |


This is unrelated to the goal of the PR, but I noticed we had TimestampType twice and no TimestampNTZType.

add complex type support to shuffle

eec034a

andygrove force-pushed the shuffle-type-checks branch from 0365fb9 to eec034a on April 17, 2025 17:08

re-enable all fuzz tests

307beef

andygrove commented Apr 17, 2025

andygrove added 5 commits April 17, 2025 13:49

fix inadvertent refactor

ae416b3

fix regression

7811f61

fix

788801b

fix

4ff9f2e

fix

1707161

comphead reviewed Apr 17, 2025

address feedback

da79d4e

andygrove marked this pull request as ready for review April 17, 2025 19:02

fix

e9d0029

small refactor

a9aa537

comphead approved these changes Apr 18, 2025

kazuyukitanimura reviewed Apr 18, 2025

revert style change and fix a typo

b5b4d27

andygrove commented Apr 18, 2025

kazuyukitanimura approved these changes Apr 18, 2025

andygrove merged commit c04784a into apache:main Apr 19, 2025
78 checks passed
