
feat(spark): add support for more types and literals (binary, list, map, intervals, timestamps) #311

Merged

Conversation

@Blizzara Blizzara (Contributor) commented Oct 23, 2024:

45d9387 adds support for EmptyMap into core - if preferred, I can split that off into its own PR
ea7bdb1 adds support for a bunch of types and literals into substrait-spark, as well as tests for all of the currently supported types
c43f399 adds Byte and Short (taken from #309, thanks @andrew-coleman) - this is just so the tests are set up fully; it should be fine to merge these in either order

@Blizzara Blizzara force-pushed the avo/binary-list-map-interval-timestamp-types branch from 9d56a80 to 5ae5dbf on October 24, 2024 12:44
case BinaryType => Some(creator.BINARY)
case DateType => Some(creator.DATE)
case TimestampNTZType => Some(creator.precisionTimestamp(Util.MICROSECOND_PRECISION))
@Blizzara Blizzara (Contributor, Author) commented Oct 24, 2024:

This changes producing from the deprecated timestamp type to the new precision-aware version. That may break some consumers if they haven't been updated yet.
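
For reference, a minimal sketch of the matching consumer-side rule (Substrait -> Spark), assuming substrait-java exposes a Type.PrecisionTimestamp with a precision() accessor, mirroring the IntervalDay handling later in this review; this is illustration only, not the PR's exact code:

```scala
// Hypothetical sketch: map the precision-aware Substrait timestamp back to
// Spark, accepting only microsecond precision for now.
override def visit(expr: Type.PrecisionTimestamp): DataType = {
  if (expr.precision() != Util.MICROSECOND_PRECISION) {
    throw new UnsupportedOperationException(
      s"Unsupported timestamp precision: ${expr.precision()}")
  }
  TimestampNTZType
}
```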

@Blizzara Blizzara force-pushed the avo/binary-list-map-interval-timestamp-types branch from 5ae5dbf to 721a62b on October 24, 2024 12:51
if (elements.isEmpty) {
  return emptyList(false, ToSubstraitType.convert(elementType, nullable = containsNull).get)
}
list(false, JavaConverters.asJavaIterable(elements)) // TODO: handle containsNull
@Blizzara Blizzara (Contributor, Author) commented Oct 24, 2024:

Handling containsNull here is non-trivial, since Substrait doesn't have such a concept. The nullability of the non-empty list is taken from the nullability of the element type, which is gotten from the first element. So we'll need to encode containsNull into the elementType. It can be done, but requires some more work (as there's no "withNullable" method or anything) and this PR is already big enough, so I'll do it as a follow-up.

The same applies to map below.

Member:

I'm not entirely sure what you mean here.

> The nullability of the non-empty list is taken from the nullability of the element type, which is gotten from the first element.

Is this when going from Substrait to Spark or from Spark to Substrait?

@Blizzara (Contributor, Author):

Substrait itself doesn't contain the "containsNull" arg, so the Substrait -> Spark conversion has to infer it, and the Spark -> Substrait conversion doesn't provide the information needed to infer it.

Easiest to see from the commented-out test:
Starting with Spark: value: [1,null,3] type: ArrayType(IntegerType,true)
Gets converted into Substrait literal: ListLiteral{nullable=false, values=[I32Literal{nullable=false, value=1}, NullLiteral{nullable=false, type=I32{nullable=true}}, I32Literal{nullable=false, value=3}]}

Note that there is no "containsNull" anywhere. So the way we infer it when converting back into Spark is to look at the first element and its nullable arg, which in today's world is always nullable=false, leading to:

Spark: value: [1,null,3] type: ArrayType(IntegerType,false)

(Funnily enough, that's the case even if the first element is a null... but fixing that in itself wouldn't be enough, since the null might be elsewhere in the list or even in a different row, and we should have the "type" of the literal match across rows.)

I think fixing this requires changing the ToSubstraitLiteral.convert/convertWithValue so that the nullability they set can be overridden in this case.
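
To make the limitation concrete, here is a hedged sketch of what "look at more than the first element" could look like on the Substrait -> Spark side; accessor names are assumed from the literal dump above, and this is illustration only, not the planned fix:

```scala
// Illustrative only: treat the list as containing nulls if any element's type
// is nullable (e.g. the NullLiteral above carries I32{nullable=true}).
def inferContainsNull(values: java.util.List[SExpression.Literal]): Boolean =
  JavaConverters.asScalaBuffer(values).exists(v => v.getType.nullable())
```

Even so, a null might only appear in a different row, so encoding containsNull into the element type (by letting the converter override the nullability it sets) remains the proper follow-up.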

@Blizzara Blizzara force-pushed the avo/binary-list-map-interval-timestamp-types branch from 721a62b to 798ac88 on October 24, 2024 13:08
@Blizzara Blizzara changed the title from "feat(spark): add support for more types and literals (binary, list, array, intervals, timestamps)" to "feat(spark): add support for more types and literals (binary, list, map, intervals, timestamps)" on Oct 24, 2024
@Blizzara Blizzara force-pushed the avo/binary-list-map-interval-timestamp-types branch from 798ac88 to ea7bdb1 on October 24, 2024 13:10
@vbarua vbarua (Member) left a comment:

Overall this is looking pretty good. The tests help a lot. Left a couple of questions.

}

override def visit(expr: Type.IntervalDay): DataType = {
  if (expr.precision() != Util.MICROSECOND_PRECISION) {
Member:

To confirm, the Spark DayTimeIntervalType encodes time down to seconds, but can store fractional seconds.

From the Spark code

> Internally, values of day-time intervals are stored in Long values as amount of time in terms of microseconds that are calculated by the formula: -/+ (24*60*60 * DAY + 60*60 * HOUR + 60 * MINUTE + SECOND) * 1000000

So the default / supported precision is microseconds?

@Blizzara (Contributor, Author):

Yea, Spark always stores the value as microseconds.

It's possible this is unnecessarily strict - I checked that on DataFusion we don't care about the precision at all for the type conversion... but I guess it doesn't hurt to be strict to begin with; we can relax it later if there's need.
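
For reference, the arithmetic implied by that storage format, splitting the single microsecond count into day / second / subsecond components at microsecond precision (constant and function names here are illustrative, not the PR's code):

```scala
// Illustrative only: Spark stores DayTimeIntervalType values as a Long count
// of microseconds; Substrait's IntervalDay wants days/seconds/subseconds.
val MicrosPerSecond = 1000000L
val MicrosPerDay = 24L * 60 * 60 * MicrosPerSecond

def splitDayTimeMicros(micros: Long): (Long, Long, Long) = {
  val days = micros / MicrosPerDay
  val rest = micros % MicrosPerDay
  (days, rest / MicrosPerSecond, rest % MicrosPerSecond) // days, seconds, microseconds
}
```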

Member:

Let's start strict and see how we go.

> I checked that on DataFusion we don't care about the precision at all for the type conversion.

That's the case for now, but that may also change if we want to enforce semantics with Substrait strictly.

@Blizzara (Contributor, Author):

Yea, I was thinking about that today - the problem is I do need Spark -> DataFusion to play nicely together, and that's tough if Spark insists strictly on one precision and DF on another 😅 But what's the risk of things going wrong if the precisions don't match? That I'm not sure about... for literals it all makes sense, but for the types themselves?

override def visit(expr: SExpression.DateLiteral): Expression = {
  Literal(expr.value(), ToSubstraitType.convert(expr.getType))
}

override def visit(expr: SExpression.PrecisionTimestampLiteral): Literal = {
  Literal(
    Util.toMicroseconds(expr.value(), expr.precision()),
Member:

Is it generally safe to convert timestamp values like this? If we're going from nanoseconds to microseconds, this will be lossy. Would it be better to reject plans with non-microsecond precision and force callers to make their plans compatible by using microseconds explicitly?

@Blizzara (Contributor, Author):

Yeah, maybe not, I changed it in fd01c2f to be an assert instead.

Though we may want to make it e.g. a user-selectable option later, since e.g. Spark and DataFusion sometimes disagree on which precision to use (for IntervalDay, Spark uses micros and DF millis), and it'd be sad if the two were fundamentally incompatible. But for now we can just be strict.
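
A rough sketch of the strict behaviour described above; whether the assert lives in Util.toMicroseconds or at the call site in fd01c2f may differ from this:

```scala
// Hedged approximation: reject non-microsecond precision rather than silently
// rescaling the value.
def toMicroseconds(value: Long, precision: Int): Long = {
  assert(
    precision == Util.MICROSECOND_PRECISION,
    s"Only microsecond precision is supported, got $precision")
  value
}
```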

spark/src/main/scala/io/substrait/utils/Util.scala (review comment marked outdated and resolved)

@Blizzara Blizzara force-pushed the avo/binary-list-map-interval-timestamp-types branch from c43f399 to 410475f on October 25, 2024 09:06
@vbarua vbarua (Member) left a comment:

LGTM. I will merge this today.

@vbarua vbarua merged commit 513a049 into substrait-io:main Oct 25, 2024
13 checks passed
@Blizzara Blizzara deleted the avo/binary-list-map-interval-timestamp-types branch October 25, 2024 19:43