
feat(spark): add support for more types and literals (binary, list, map, intervals, timestamps) #311

Merged

Conversation

@Blizzara Blizzara (Contributor) commented Oct 23, 2024:

45d9387 adds support for EmptyMap into core - if preferred, I can split that off into its own PR
ea7bdb1 adds support for a bunch of types and literals into substrait-spark, as well as tests for all of the currently supported types
c43f399 adds Byte and Short (taken from #309, thanks @andrew-coleman) - this is just so the tests are set up fully; it should be fine to merge these in either order

@Blizzara Blizzara force-pushed the avo/binary-list-map-interval-timestamp-types branch from 9d56a80 to 5ae5dbf on October 24, 2024 12:44
case BinaryType => Some(creator.BINARY)
case DateType => Some(creator.DATE)
case TimestampNTZType => Some(creator.precisionTimestamp(Util.MICROSECOND_PRECISION))
@Blizzara Blizzara (Contributor, Author) commented Oct 24, 2024:

This changes producing from the deprecated timestamp type to the new precision-aware version. That may break some consumers if they haven't been updated yet.
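
For reference, a minimal sketch of the matching consumer-side rule (Substrait -> Spark), assuming substrait-java exposes a Type.PrecisionTimestamp with a precision() accessor, mirroring the IntervalDay handling later in this review; this is illustration only, not the PR's exact code:

```scala
// Hypothetical sketch: map the precision-aware Substrait timestamp back to
// Spark, accepting only microsecond precision for now.
override def visit(expr: Type.PrecisionTimestamp): DataType = {
  if (expr.precision() != Util.MICROSECOND_PRECISION) {
    throw new UnsupportedOperationException(
      s"Unsupported timestamp precision: ${expr.precision()}")
  }
  TimestampNTZType
}
```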

@Blizzara Blizzara force-pushed the avo/binary-list-map-interval-timestamp-types branch from 5ae5dbf to 721a62b on October 24, 2024 12:51
if (elements.isEmpty) {
  return emptyList(false, ToSubstraitType.convert(elementType, nullable = containsNull).get)
}
list(false, JavaConverters.asJavaIterable(elements)) // TODO: handle containsNull
@Blizzara Blizzara (Contributor, Author) commented Oct 24, 2024:

Handling containsNull here is non-trivial, since Substrait doesn't have such a concept. The nullability of the non-empty list is taken from the nullability of the element type, which is gotten from the first element. So we'll need to encode containsNull into the elementType. It can be done, but requires some more work (as there's no "withNullable" method or anything) and this PR is already big enough, so I'll do it as a follow-up.

The same applies to map below.

Member:

I'm not entirely sure what you mean here.

> The nullability of the non-empty list is taken from the nullability of the element type, which is gotten from the first element.

Is this when going from Substrait to Spark or from Spark to Substrait?

@Blizzara (Contributor, Author):

Substrait itself doesn't contain the "containsNull" arg, so the Substrait -> Spark conversion has to infer it, and the Spark -> Substrait conversion doesn't provide the information needed to infer it.

Easiest to see from the commented-out test:
Starting with Spark: value: [1,null,3] type: ArrayType(IntegerType,true)
Gets converted into Substrait literal: ListLiteral{nullable=false, values=[I32Literal{nullable=false, value=1}, NullLiteral{nullable=false, type=I32{nullable=true}}, I32Literal{nullable=false, value=3}]}

Note that there is no "containsNull" anywhere. So the way we infer it when converting back into Spark is to look at the first element and its nullable arg, which in today's world is always nullable=false, leading to:

Spark: value: [1,null,3] type: ArrayType(IntegerType,false)

(Funnily enough, that's the case even if the first element is a null... but fixing that in itself wouldn't be enough, since the null might be elsewhere in the list or even in a different row, and we should have the "type" of the literal match across rows.)

I think fixing this requires changing the ToSubstraitLiteral.convert/convertWithValue so that the nullability they set can be overridden in this case.
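
To make the limitation concrete, here is a hedged sketch of what "look at more than the first element" could look like on the Substrait -> Spark side; accessor names are assumed from the literal dump above, and this is illustration only, not the planned fix:

```scala
// Illustrative only: treat the list as containing nulls if any element's type
// is nullable (e.g. the NullLiteral above carries I32{nullable=true}).
def inferContainsNull(values: java.util.List[SExpression.Literal]): Boolean =
  JavaConverters.asScalaBuffer(values).exists(v => v.getType.nullable())
```

Even so, a null might only appear in a different row, so encoding containsNull into the element type (by letting the converter override the nullability it sets) remains the proper follow-up.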

@Blizzara Blizzara force-pushed the avo/binary-list-map-interval-timestamp-types branch from 721a62b to 798ac88 on October 24, 2024 13:08
@Blizzara Blizzara changed the title from "feat(spark): add support for more types and literals (binary, list, array, intervals, timestamps)" to "feat(spark): add support for more types and literals (binary, list, map, intervals, timestamps)" on Oct 24, 2024
@Blizzara Blizzara force-pushed the avo/binary-list-map-interval-timestamp-types branch from 798ac88 to ea7bdb1 on October 24, 2024 13:10
@vbarua vbarua (Member) left a comment:

Overall this is looking pretty good. The tests help a lot. Left a couple of questions.

}

override def visit(expr: Type.IntervalDay): DataType = {
  if (expr.precision() != Util.MICROSECOND_PRECISION) {
Member:

To confirm, the Spark DayTimeIntervalType encodes time down to seconds, but can store fractional seconds.

From the Spark code

> Internally, values of day-time intervals are stored in Long values as amount of time in terms of microseconds that are calculated by the formula: -/+ (24*60*60 * DAY + 60*60 * HOUR + 60 * MINUTE + SECOND) * 1000000

So the default / supported precision is microseconds?

@Blizzara (Contributor, Author):

Yea, Spark always stores the value as microseconds.

It's possible this is unnecessarily strict - I checked that on DataFusion we don't care about the precision at all for the type conversion... but I guess it doesn't hurt to be strict to begin with; we can relax it later if there's need.
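
For reference, the arithmetic implied by that storage format, splitting the single microsecond count into day / second / subsecond components at microsecond precision (constant and function names here are illustrative, not the PR's code):

```scala
// Illustrative only: Spark stores DayTimeIntervalType values as a Long count
// of microseconds; Substrait's IntervalDay wants days/seconds/subseconds.
val MicrosPerSecond = 1000000L
val MicrosPerDay = 24L * 60 * 60 * MicrosPerSecond

def splitDayTimeMicros(micros: Long): (Long, Long, Long) = {
  val days = micros / MicrosPerDay
  val rest = micros % MicrosPerDay
  (days, rest / MicrosPerSecond, rest % MicrosPerSecond) // days, seconds, microseconds
}
```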

Member:

Let's start strict and see how we go.

> I checked that on DataFusion we don't care about the precision at all for the type conversion.

That's the case for now, but that may also change if we want to enforce semantics with Substrait strictly.

@Blizzara (Contributor, Author):

Yea, I was thinking about that today - the problem is I do need Spark -> DataFusion to play nicely together, and that's tough if Spark insists strictly on one precision and DF on another 😅 But what's the risk of things going wrong if the precisions don't match? That I'm not sure about... for literals it all makes sense, but for the types themselves?

override def visit(expr: SExpression.DateLiteral): Expression = {
  Literal(expr.value(), ToSubstraitType.convert(expr.getType))
}

override def visit(expr: SExpression.PrecisionTimestampLiteral): Literal = {
  Literal(
    Util.toMicroseconds(expr.value(), expr.precision()),
Member:

Is it generally safe to convert timestamp values like this? If we're going from nanoseconds to microseconds, this will be lossy. Would it be better to reject plans with non-microsecond precision and force callers to make their plans compatible by using microseconds explicitly?

@Blizzara (Contributor, Author):

Yeah, maybe not, I changed it in fd01c2f to be an assert instead.

Though we may want to make it e.g. a user-selectable option later, since e.g. Spark and DataFusion sometimes disagree on which precision to use (for IntervalDay, Spark uses micros and DF millis), and it'd be sad if the two were fundamentally incompatible. But for now we can just be strict.
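
A rough sketch of the strict behaviour described above; whether the assert lives in Util.toMicroseconds or at the call site in fd01c2f may differ from this:

```scala
// Hedged approximation: reject non-microsecond precision rather than silently
// rescaling the value.
def toMicroseconds(value: Long, precision: Int): Long = {
  assert(
    precision == Util.MICROSECOND_PRECISION,
    s"Only microsecond precision is supported, got $precision")
  value
}
```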

spark/src/main/scala/io/substrait/utils/Util.scala (review comment marked outdated and resolved)

@Blizzara Blizzara force-pushed the avo/binary-list-map-interval-timestamp-types branch from c43f399 to 410475f on October 25, 2024 09:06
@vbarua vbarua (Member) left a comment:

LGTM. I will merge this today.

@vbarua vbarua merged commit 513a049 into substrait-io:main Oct 25, 2024
13 checks passed
@Blizzara Blizzara deleted the avo/binary-list-map-interval-timestamp-types branch October 25, 2024 19:43