feat(spark): add support for more types and literals (binary, list, map, intervals, timestamps) #311
Conversation
Force-pushed from 9d56a80 to 5ae5dbf
case BinaryType => Some(creator.BINARY)
case DateType => Some(creator.DATE)
case TimestampNTZType => Some(creator.precisionTimestamp(Util.MICROSECOND_PRECISION))
This changes the producer from emitting the deprecated timestamp type to the new precision-aware version. That may break some consumers if they haven't been updated yet.
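For illustration, here is what the consumer side has to handle once producers emit the precision-aware type. This is a minimal sketch assuming substrait-java's type visitor; the PrecisionTimestamp case and the microsecond restriction here are my assumptions, not code from this PR.

// Sketch, not from this PR: a Substrait -> Spark type visitor now needs a
// case for PrecisionTimestamp alongside the deprecated Timestamp type.
override def visit(expr: Type.Timestamp): DataType = TimestampNTZType // deprecated path
override def visit(expr: Type.PrecisionTimestamp): DataType = {
  // assumed restriction mirroring the producer side: only microseconds
  require(expr.precision() == Util.MICROSECOND_PRECISION)
  TimestampNTZType
}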
Force-pushed from 5ae5dbf to 721a62b
if (elements.isEmpty) {
  return emptyList(false, ToSubstraitType.convert(elementType, nullable = containsNull).get)
}
list(false, JavaConverters.asJavaIterable(elements)) // TODO: handle containsNull
Handling containsNull here is non-trivial, since Substrait doesn't have such a concept. The nullability of the non-empty list is taken from the nullability of the element type, which is gotten from the first element. So we'll need to encode containsNull into the elementType. It can be done, but requires some more work (as there's no "withNullable" method or anything) and this PR is already big enough, so I'll do it as a followup. Same applies for map below.
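To make that followup concrete, here's a minimal sketch of encoding containsNull into the element type, assuming substrait-java's TypeCreator; the withNullability helper is hypothetical and only two cases are spelled out.

// Hypothetical helper (no such method exists in substrait-java today):
// rebuild a type with the requested nullability so that containsNull can be
// carried on the element type itself.
def withNullability(t: Type, nullable: Boolean): Type = {
  val creator = if (nullable) TypeCreator.NULLABLE else TypeCreator.REQUIRED
  t match {
    case _: Type.I32 => creator.I32
    case _: Type.Str => creator.STRING
    case other => other // every remaining type would need the same treatment
  }
}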
I'm not entirely sure what you mean here.

"The nullability of the non-empty list is taken from the nullability of the element type, which is gotten from the first element."

Is this when going from Substrait to Spark or from Spark to Substrait?
Substrait itself doesn't contain the "containsNull" arg, so the Substrait -> Spark conversion has to infer it, and the Spark -> Substrait conversion doesn't provide the correct information to infer it from.

Easiest to see from the commented-out test. Starting with Spark:

value: [1,null,3] type: ArrayType(IntegerType,true)

This gets converted into the Substrait literal:

ListLiteral{nullable=false, values=[I32Literal{nullable=false, value=1}, NullLiteral{nullable=false, type=I32{nullable=true}}, I32Literal{nullable=false, value=3}]}

Note that there is no "containsNull" anywhere. So the way we infer it when converting back into Spark is to look at the first element and its nullable arg, which in today's world is always nullable=false, leading to:

value: [1,null,3] type: ArrayType(IntegerType,false)

(Funnily enough that's the case even if the first element is a null... but fixing that by itself wouldn't be enough, since the null might be elsewhere in the list or even in a different row, and we should have the "type" of the literal match across rows.)

I think fixing this requires changing ToSubstraitLiteral.convert/convertWithValue so that the nullability they set can be overridden in this case.
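To pin down the Substrait -> Spark side described above, a minimal sketch of the inference; the method names are assumed from substrait-java and this is illustrative, not the PR's code.

import scala.collection.JavaConverters

// Reconstruct containsNull from the first element's nullability. As noted,
// today every element literal is produced with nullable=false, so a list like
// [1, null, 3] round-trips with containsNull=false.
def inferContainsNull(list: SExpression.ListLiteral): Boolean =
  JavaConverters.asScalaBuffer(list.values()).headOption.exists(_.getType.nullable())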
Force-pushed from 721a62b to 798ac88
Force-pushed from 798ac88 to ea7bdb1
Overall this is looking pretty good. The tests help a lot. Left a couple of questions.
override def visit(expr: Type.IntervalDay): DataType = {
  if (expr.precision() != Util.MICROSECOND_PRECISION) {
To confirm: the Spark DayTimeIntervalType encodes time down to seconds, but can store fractional seconds. From the Spark code:

Internally, values of day-time intervals are stored in Long values as amount of time in terms of microseconds that are calculated by the formula: -/+ (24*60*60 * DAY + 60*60 * HOUR + 60 * MINUTE + SECOND) * 1000000

So the default / supported precision is microseconds?
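As a sanity check of that formula (my own arithmetic, not from the Spark source):

// 1 day, 2 hours, 3 minutes, 4 seconds, stored as a single Long of microseconds:
val micros = (24L*60*60 * 1 + 60L*60 * 2 + 60L * 3 + 4) * 1000000L
// = 93784 seconds * 1000000 = 93784000000 microseconds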
Yea, Spark always stores the value as microseconds.

It's possible this is unnecessarily strict; I checked that on DataFusion we don't care about the precision at all for the type conversion. But I guess it doesn't hurt to be strict to begin with; we can relax it later if there's need.
Let's start strict and see how we go.

"I checked that on DataFusion we don't care about the precision at all for the type conversion."

That's the case for now, but that may also change if we want to enforce semantics with Substrait strictly.
Yea, I was thinking about that today. The problem is I do need Spark -> DataFusion to play nicely together, and that's tough if Spark insists strictly on one precision and DF on another 😅 But what the risk is of things going wrong if the precisions don't match, that I'm not sure about... on literals it all makes sense, but for the types themselves?
override def visit(expr: SExpression.DateLiteral): Expression = {
  Literal(expr.value(), ToSubstraitType.convert(expr.getType))
}

override def visit(expr: SExpression.PrecisionTimestampLiteral): Literal = {
  Literal(
    Util.toMicroseconds(expr.value(), expr.precision()),
Is it generally safe to convert timestamp values like this? If we're going from nanoseconds to microseconds, this will be lossy. Would it be better to reject plans with non-microsecond precision and force the callers to make their plans compatible by using microseconds explicitly?
Yeah, maybe not, I changed it in fd01c2f to be an assert instead.
Though we may want to make it e.g. a user-choosable option later, or something, since e.g. Spark and DataFusion sometimes disagree on which precision to use (for IntervalDay, Spark uses micros and DF millis), and it'd be sad if the two would then be fundamentally incompatible. But for now we can just be strict.
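For reference, a minimal sketch of that stricter shape, assuming the precision argument counts sub-second digits; the PR's actual Util.toMicroseconds and the fd01c2f change may look different.

// Illustrative only: scale a timestamp value to microseconds, asserting
// instead of silently dropping sub-microsecond digits.
def toMicroseconds(value: Long, precision: Int): Long = {
  assert(
    precision <= Util.MICROSECOND_PRECISION,
    s"cannot losslessly convert precision $precision to microseconds")
  value * math.pow(10, Util.MICROSECOND_PRECISION - precision).toLong
}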
Force-pushed from c43f399 to 410475f
LGTM. I will merge this today.
45d9387 adds support for EmptyMap into core - if preferred, I can split that off into its own PR
ea7bdb1 adds support for a bunch of types and literals into substrait-spark, as well as testing for all of the currently supported types
c43f399 adds Byte and Short (taken from #309, thanks @andrew-coleman - this is just so that it's fine to merge these in either order and we'll have the tests fully set up)