fix: add proto roundtrips for Spark tests and fix issues it surfaces #315

Blizzara · 2024-10-28T01:31:47Z

Adds testing to substrait-spark that going from POJO (i.e. a substrait-java plan) -> Proto -> POJO results in the same POJO.

The test revealed a number of cases where that assertion fails, mainly because the Java POJOs contain a derived outputType that was in many cases incorrect when created from the proto.
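To make the failure mode concrete, here is a minimal, self-contained sketch of a POJO -> Proto -> POJO check. It is not substrait-java code: `ToyRel`, `ToyProto`, `toProto`, `fromProto` and `roundTrips` are hypothetical stand-ins, and the single boolean stands in for the derived outputType. The assumption is only that the derived value is stored on the POJO but not written to the proto, so the roundtrip fails whenever the two sides compute it differently.

```java
// Hypothetical stand-ins; none of these names exist in substrait-java.
final class RoundTripSketch {
  // Toy POJO: outputNullable is derived from inputNullable but stored as a field.
  record ToyRel(String name, boolean inputNullable, boolean outputNullable) {}

  // Toy proto: only the primary fields are serialized.
  record ToyProto(String name, boolean inputNullable) {}

  static ToyProto toProto(ToyRel rel) {
    return new ToyProto(rel.name(), rel.inputNullable());
  }

  // The reader recomputes the derived field from what was serialized.
  static ToyRel fromProto(ToyProto proto) {
    return new ToyRel(proto.name(), proto.inputNullable(), proto.inputNullable());
  }

  // POJO -> proto -> POJO must give back an equal POJO.
  static boolean roundTrips(ToyRel rel) {
    return rel.equals(fromProto(toProto(rel)));
  }

  public static void main(String[] args) {
    ToyRel consistent = new ToyRel("agg", true, true);
    ToyRel inconsistent = new ToyRel("agg", true, false);
    System.out.println(roundTrips(consistent)); // true
    System.out.println(roundTrips(inconsistent)); // false
  }
}
```

The second case fails even though nothing that is actually serialized differs, which is the kind of mismatch the roundtrip assertion is meant to surface.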

Blizzara · 2024-10-28T12:57:24Z

core/src/main/java/io/substrait/relation/ProtoRelConverter.java

-      // count only needs to be set when it is not -1
-      builder.count(rel.getCount());
-    }
+    var builder = Fetch.builder().input(input).offset(rel.getOffset()).count(rel.getCount());


While the idea of not setting count if it's -1 is fine, it makes roundtrip tests fail if count is set in the POJO. An alternative fix is to ensure that it's never set in the POJO if it is -1.
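The trade-off can be sketched in a few lines of Java. This is a toy model, not the real classes: `ToyFetch`, `ToyFetchProto` and the two reader methods are hypothetical, and modelling the POJO's count as an `OptionalLong` is an assumption made for illustration. The only thing taken from the thread is that -1 is the "no count" value in the proto.

```java
import java.util.OptionalLong;

// Hypothetical stand-ins for the Fetch POJO and its proto message.
final class FetchCountSketch {
  record ToyFetch(long offset, OptionalLong count) {}

  record ToyFetchProto(long offset, long count) {}

  // An unset count is written as the -1 sentinel.
  static ToyFetchProto toProto(ToyFetch fetch) {
    return new ToyFetchProto(fetch.offset(), fetch.count().orElse(-1));
  }

  // Reader that leaves count unset when the proto holds -1.
  static ToyFetch fromProtoSkippingSentinel(ToyFetchProto proto) {
    OptionalLong count =
        proto.count() == -1 ? OptionalLong.empty() : OptionalLong.of(proto.count());
    return new ToyFetch(proto.offset(), count);
  }

  // Reader that always copies count, including -1.
  static ToyFetch fromProtoAlwaysSetting(ToyFetchProto proto) {
    return new ToyFetch(proto.offset(), OptionalLong.of(proto.count()));
  }

  public static void main(String[] args) {
    ToyFetch unset = new ToyFetch(0, OptionalLong.empty());
    ToyFetch explicit = new ToyFetch(0, OptionalLong.of(-1));
    System.out.println(unset.equals(fromProtoSkippingSentinel(toProto(unset)))); // true
    System.out.println(explicit.equals(fromProtoSkippingSentinel(toProto(explicit)))); // false
    System.out.println(explicit.equals(fromProtoAlwaysSetting(toProto(explicit)))); // true
    System.out.println(unset.equals(fromProtoAlwaysSetting(toProto(unset)))); // false
  }
}
```

In this toy model each reader roundtrips exactly one of the two POJO representations of "no count", so the reader and the producers of the POJO have to agree on a single representation.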

Blizzara · 2024-10-28T12:57:55Z

spark/src/main/scala/io/substrait/spark/logical/ToSubstraitRel.scala

@@ -131,7 +131,7 @@ class ToSubstraitRel extends AbstractLogicalPlanVisitor with Logging {
    val aggregates = collectAggregates(actualResultExprs, aggExprToOutputOrdinal)
    val aggOutputMap = aggregates.zipWithIndex.map {
      case (e, i) =>
-        AttributeReference(s"agg_func_$i", e.dataType)() -> e
+        AttributeReference(s"agg_func_$i", e.dataType, nullable = e.nullable)() -> e


These were causing wrong nullability for the type in the created POJOs. I don't think that type field is used anywhere, so it didn't cause harm, but it still failed the roundtrip tests: the type isn't written to the proto, so on read it was evaluated correctly from other fields.

Blizzara · 2025-03-05T16:29:56Z

@vbarua @andrew-coleman this has been open for a while, but now finally ready for review! The testing change collides a bit with Andrew's #333, but either should be trivial to rebase once the other is in.

andrew-coleman · 2025-03-06T13:39:20Z

core/src/main/java/io/substrait/relation/Set.java

@@ -42,7 +46,38 @@ public static Set.SetOp fromProto(SetRel.SetOp proto) {

  @Override
  protected Type.Struct deriveRecordType() {
-    return getInputs().get(0).getRecordType();
+    // The different inputs may have schemas that differ in nullability, but not in type.
+    // In that case we should return a schema that is nullable where any of the inputs is nullable.


Looking at the docs for this (https://substrait.io/relations/logical_relations/#set-operation-types), the output nullability depends on which set operation is being performed.

Yep, I realized that as well but forgot to fix it 😅 I'll try to do so tomorrow.
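As a rough illustration of the rule in the diff comment above (an output field is nullable where any input is nullable), here is a minimal Java sketch. Schemas are reduced to lists of nullability flags, and `AnyNullableSketch` and `nullableIfAnyInputNullable` are hypothetical names, not substrait-java API. As the review comment points out, the spec makes the output nullability depend on the set operation, so this single rule should not be assumed to fit every operation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: each schema is a list of flags, true meaning "nullable".
final class AnyNullableSketch {
  // A field is nullable in the output if it is nullable in at least one input.
  static List<Boolean> nullableIfAnyInputNullable(List<List<Boolean>> inputs) {
    List<Boolean> first = inputs.get(0);
    List<Boolean> output = new ArrayList<>();
    for (int i = 0; i < first.size(); i++) {
      final int index = i; // lambdas can only capture effectively final locals
      output.add(inputs.stream().anyMatch(schema -> schema.get(index)));
    }
    return output;
  }

  public static void main(String[] args) {
    List<List<Boolean>> inputs =
        List.of(List.of(false, false, true), List.of(false, true, false));
    System.out.println(nullableIfAnyInputNullable(inputs)); // [false, true, true]
  }
}
```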

vbarua

Found some time to actually look at this. I have one comment about the nullability of Scalar Subqueries, and one request for tests for the Set output type derivation logic.

vbarua · 2025-03-14T17:42:22Z

core/src/main/java/io/substrait/relation/Set.java

+
+    // As defined in https://substrait.io/relations/logical_relations/#set-operation-types
+    return switch (getSetOp()) {
+      case UNKNOWN -> first; // alternative would be to throw an exception


meta: out of scope for this PR, but given that there is no default operation for when this is not specified, it might make sense to not allow UNKNOWN in the POJOs. That is, we should force the user to set the operation field.

vbarua · 2025-03-20T18:57:09Z

core/src/main/java/io/substrait/expression/proto/ProtoExpressionConverter.java

+                                      "Scalar subquery must have exactly one field");
+                                }
+                                // Result can be null if the query returns no rows
+                                return TypeCreator.asNullable(type.fields().get(0));


// Result can be null if the query returns no rows

Is this actually the case? If you have a non-nullable column, but there are no values, that column is still non-nullable as far as I understand it.

In any case, the spec indicates:

// A subquery with one row and one column. This is often an aggregate
// though not required to be.

So I think it would be safe to just use the nullability of the field as is. If there is no row returned by the subquery that's a violation of the spec as written.

Issue was that Spark was always reporting nullable=true, which meant the from-proto didn't match the from-spark version. However this "fix" I had was wrong indeed, I made a more correct one here: 4f7f274. Thanks!
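The reviewer's suggestion (keep the field's own nullability rather than forcing the result to be nullable) can be sketched as follows. This is a toy model under stated assumptions: `ToyType` and `resultType` are hypothetical names, and the only detail taken from the diff is the "exactly one field" check and its message. It is not a copy of what commit 4f7f274 does.

```java
import java.util.List;

// Hypothetical sketch of deriving a scalar subquery's result type.
final class ScalarSubqueryTypeSketch {
  record ToyType(String name, boolean nullable) {}

  // Returns the single field's type unchanged, so its nullability is preserved.
  static ToyType resultType(List<ToyType> subqueryFields) {
    if (subqueryFields.size() != 1) {
      throw new IllegalArgumentException("Scalar subquery must have exactly one field");
    }
    return subqueryFields.get(0);
  }

  public static void main(String[] args) {
    ToyType required = new ToyType("i64", false);
    System.out.println(resultType(List.of(required)).nullable()); // false
  }
}
```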

vbarua · 2025-03-20T19:09:44Z

spark/src/test/scala/io/substrait/spark/SubstraitPlanTestBase.scala

    val protoPlan = io.substrait.proto.Rel.parseFrom(bytes)
    val substraitPlan2 =
      new ProtoRelConverter(extensionCollector, SparkExtension.COLLECTION).from(protoPlan)
+    substraitPlan2.shouldEqualPlainly(substraitPlan)


This is a good check to have generally ✨

Blizzara · 2025-03-27T10:49:04Z

spark/src/main/scala/io/substrait/spark/logical/ToSubstraitRel.scala

@@ -550,10 +550,12 @@ private[logical] class WithLogicalSubQuery(toSubstraitRel: ToSubstraitRel)
    expr match {
      case s: ScalarSubquery if s.outerAttrs.isEmpty && s.joinCond.isEmpty =>
        val rel = toSubstraitRel.visit(s.plan)
+        val t =
+          s.plan.schema.fields.head // Using this instead of s.dataType/s.nullable to get correct nullability


There are two ScalarSubquery classes in Spark. The one we're using here is https://github.com/apache/spark/blob/9fe78e3d33499060467ecdc0c2631beae0b0316c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/subquery.scala#L411, which always sets nullability to true. However, the other one picks the nullability based on the plan's schema: https://github.com/apache/spark/blob/9fe78e3d33499060467ecdc0c2631beae0b0316c/sql/core/src/main/scala/org/apache/spark/sql/execution/subquery.scala#L68. I'm not sure why there's a difference, but picking it from the plan seems to align better with what we expect here (and makes the tests pass).

They both have override def nullable: Boolean = true (line 69 in your second link)?

Woops indeed, I can't read 🤦. Still, I think the change here is fine nonetheless.

andrew-coleman · 2025-03-27T11:32:19Z

core/src/main/java/io/substrait/relation/Set.java

+      int finalI = i;
+      boolean anyOtherIsRequired = rest.stream().anyMatch(t -> !t.fields().get(finalI).nullable());
+      fields.add(anyOtherIsRequired ? TypeCreator.asNotNullable(typeA) : typeA);
+    }


Looks good. Tiny nit - both of these functions share a lot of common code (only 2 lines differ) - could they be refactored somehow? Not important though :)

I had it before as a single function, but it ends up making it harder to read since while the difference is small, it's non-trivial :/
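For readers wondering what the single-function variant discussed here could look like, below is one possible shape, written as a self-contained toy rather than as the code that was tried in the PR. Schemas are lists of nullability flags, and `derive` and `REQUIRED_IF_ANY_OTHER_REQUIRED` are hypothetical names. Only the rule itself (a field becomes required as soon as any other input has it required) mirrors the diff lines quoted above; the second function mentioned in the nit is not shown in the thread, so it is not reproduced.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiPredicate;
import java.util.stream.Collectors;

// Hypothetical sketch: shared loop, with the per-field rule passed in.
final class SetNullabilityRefactorSketch {
  // rule.test(firstNullable, othersNullable) decides one output field's nullability.
  static List<Boolean> derive(
      List<Boolean> first, List<List<Boolean>> rest, BiPredicate<Boolean, List<Boolean>> rule) {
    List<Boolean> output = new ArrayList<>();
    for (int i = 0; i < first.size(); i++) {
      final int index = i;
      List<Boolean> others =
          rest.stream().map(schema -> schema.get(index)).collect(Collectors.toList());
      output.add(rule.test(first.get(i), others));
    }
    return output;
  }

  // Mirrors the diff: nullable only if the first input is nullable and no other input is required.
  static final BiPredicate<Boolean, List<Boolean>> REQUIRED_IF_ANY_OTHER_REQUIRED =
      (firstNullable, others) -> firstNullable && !others.contains(false);

  public static void main(String[] args) {
    List<Boolean> first = List.of(true, true, false);
    List<List<Boolean>> rest = List.of(List.of(true, false, true));
    System.out.println(derive(first, rest, REQUIRED_IF_ANY_OTHER_REQUIRED)); // [true, false, false]
  }
}
```

Whether the extra indirection reads better than two near-identical functions is the judgment call made in the reply above.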

vbarua

Changes look good. Will merge them shortly.

Blizzara force-pushed the avo/stronger-testing branch 2 times, most recently from 623ee12 to e374c85 Compare November 21, 2024 17:31

Blizzara added 3 commits March 5, 2025 11:07

fix: converting proto to pojo should take into account join type for column matching

f25c736

fix: support treestring for VirtualTableScan

16a7694

fix: correctly set nullability for aggregate references

44c83dd

Blizzara force-pushed the avo/stronger-testing branch from 13a3b99 to ad8c73b Compare March 5, 2025 10:07

Blizzara added 6 commits March 5, 2025 16:22

fix: correctly set nullability for aggregate grouping exprs

7fe9564

fix: correctly set type for scalar subquery when converting proto to pojo

88d95d1

fix: spotless

8c9c677

fix: add assert to check the pojo-proto-roundtrip

00eb750

fix: handle fetch's count in a way that matches roundtrip

f6494e2

fix: spotless

fb36197

Blizzara force-pushed the avo/stronger-testing branch from b6b2307 to e7be830 Compare March 5, 2025 15:23

Blizzara changed the title ~~[wip] fix: add proto roundtrips for Spark tests and fix issues it surfaces~~ fix: add proto roundtrips for Spark tests and fix issues it surfaces Mar 5, 2025

Blizzara force-pushed the avo/stronger-testing branch 2 times, most recently from 58ac64e to 4451193 Compare March 5, 2025 15:43

Blizzara marked this pull request as ready for review March 5, 2025 15:48

Blizzara force-pushed the avo/stronger-testing branch 2 times, most recently from 2945e97 to fb788e1 Compare March 5, 2025 16:16

andrew-coleman reviewed Mar 6, 2025

Blizzara force-pushed the avo/stronger-testing branch 2 times, most recently from 2dc72e8 to 1925efc Compare March 7, 2025 13:32

fix: set proto-to-rel type nullability handling

f3fee70

Blizzara force-pushed the avo/stronger-testing branch from 1925efc to f3fee70 Compare March 7, 2025 13:38

Blizzara mentioned this pull request Mar 10, 2025

feat(spark): support ExistenceJoin internal join type #333

Merged

Blizzara mentioned this pull request Mar 14, 2025

feat(spark): support Struct type and literal, and include names for struct fields #342

Merged

andrew-coleman approved these changes Mar 14, 2025

Blizzara added 4 commits March 14, 2025 16:00

Merge branch 'main' into avo/stronger-testing

e0ee3e2

fix: use shouldEqualPlainly

75f2c8f

fix: mark ScalarSubquery result type as nullable

0bf2d9d

fix: fix ScalarSubquery result type in test

9cc4d13

vbarua reviewed Mar 20, 2025

Blizzara mentioned this pull request Mar 26, 2025

fix(spark): enable aliased expressions to round-trip #348

Merged

Blizzara added 3 commits March 27, 2025 11:12

fix: add testing for Set deriveRecordType and fix the logic..

a67d4a3

fix: properly fix scalar subquery nullability

4f7f274

Merge branch 'main' into avo/stronger-testing

1bb111e

andrew-coleman reviewed Mar 27, 2025

vbarua approved these changes Mar 28, 2025

vbarua merged commit fd74922 into substrait-io:main Mar 28, 2025
12 of 13 checks passed

Blizzara deleted the avo/stronger-testing branch March 28, 2025 07:06