[datafusion-spark] Add Spark-compatible hex function #15947
Conversation
| DataType::LargeUtf8
| DataType::Binary
| DataType::LargeBinary => Ok(vec![arg_types[0].clone()]),
DataType::Dictionary(key_type, value_type) => match value_type.as_ref() {
@shehabgamin @alamb Is this the correct way to handle dictionary-encoding coercion? I did not see this handled in the expm1 function.
I think it would be simpler if you could use some of the existing signatures and then return something for signature():

&self.signature

That way the built-in coercion logic would take over. If that doesn't work for Spark functions, perhaps we can define some Spark-compatible signatures (i.e., implement a SparkSignature that would return a https://docs.rs/datafusion/latest/datafusion/logical_expr/enum.TypeSignature.html)
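The shape of this suggestion could be sketched as follows. All types here are simplified stand-ins (the real Signature and TypeSignature live in DataFusion's logical_expr module, and SparkHex's actual fields are in the PR); the point is only to show storing a Signature on the function struct and returning a reference from signature(), so the planner's built-in coercion logic handles input types instead of hand-written coerce_types code:

```rust
// Stand-in types for illustration only; the real ones are
// datafusion_expr::{Signature, TypeSignature}.
#[derive(Debug)]
enum TypeSignature {
    // Loosely mirrors DataFusion's TypeSignature::String(usize):
    // N arguments that are coerced to a string type.
    String(usize),
}

#[derive(Debug)]
struct Signature {
    type_signature: TypeSignature,
}

#[derive(Debug)]
struct SparkHex {
    signature: Signature,
}

impl SparkHex {
    fn new() -> Self {
        Self {
            signature: Signature {
                type_signature: TypeSignature::String(1),
            },
        }
    }

    // The suggestion: expose the stored signature rather than
    // implementing custom coercion for each function.
    fn signature(&self) -> &Signature {
        &self.signature
    }
}
```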
@andygrove This seems fine to me. If the logic is unique to hex, it probably doesn't make sense to extract it into a reusable function.

If the logic is reusable, we could add SparkSignature like @alamb suggested. I've found that the existing DataFusion signatures are not always compatible with Spark's behavior. I've also found that a lot of Spark functions have custom and unique coercion behavior.
We can also wait for a few more PRs to see if there is a common pattern emerging before refactoring
> @shehabgamin @alamb Is this the correct way to handle dictionary-encoding coercion? I did not see this handled in the expm1 function.

FWIW I think it is the correct way to handle the dictionary coercion.
Thanks @andygrove. Looks reasonable to me. I had a few suggestions, but I think they could also be done as a follow-on PR.
&self,
_arg_types: &[DataType],
) -> datafusion_common::Result<DataType> {
Ok(DataType::Utf8)
If this function outputs Utf8View rather than Utf8, it will likely be more performant.
DataType::Int64 => {
let array = as_int64_array(array)?;

let hexed_array: StringArray =
You can probably make this much faster by avoiding the allocation and using StringBuilder directly.

If you are using StringArray, something like this: https://docs.rs/arrow/latest/arrow/array/type.GenericStringBuilder.html#example-incrementally-writing-strings-with-stdfmtwrite

If you are using StringView, I think you'll have to provide a fast path for shorter strings and use the StringView builder for longer strings.
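A std-only sketch of the pattern being suggested (the function name hex_all and the test values are illustrative, not from the PR): instead of building one heap-allocated String per row, write each hex value into a single contiguous buffer and track offsets. This is essentially what arrow's GenericStringBuilder does; it implements std::fmt::Write, so write!(builder, "{:X}", v) can append to it without a temporary String per value.

```rust
use std::fmt::Write;

// Hex-encode a slice of i64 into one contiguous buffer plus offsets,
// mimicking the layout a StringArray builder maintains internally.
fn hex_all(values: &[i64]) -> (String, Vec<usize>) {
    let mut data = String::new(); // single growable buffer for all rows
    let mut offsets = vec![0usize]; // row i spans data[offsets[i]..offsets[i + 1]]
    for v in values {
        // {:X} prints the two's-complement bits in uppercase hex,
        // matching Spark's hex() behavior for integers.
        write!(data, "{:X}", v).expect("writing to a String cannot fail");
        offsets.push(data.len());
    }
    (data, offsets)
}
```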
)?,
};

let new_values: Vec<Option<String>> = dict
You can likely make this more efficient by avoiding the Strings here and reusing the dictionary keys with the new values. Something like:

let new_dict_array = DictionaryArray::new(dict.keys().clone(), values)
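A std-only analogue of that suggestion (the name hex_dict and the argument shapes are illustrative): a dictionary-encoded column is a list of key indices plus a small array of distinct values, so only the distinct values need to be hex-encoded; the keys are reused unchanged, mirroring DictionaryArray::new(dict.keys().clone(), new_values) in arrow.

```rust
// Hex-encode only the distinct values of a dictionary-encoded column,
// leaving the key indices untouched.
fn hex_dict(keys: &[usize], values: &[i64]) -> (Vec<usize>, Vec<String>) {
    // Each distinct value is encoded once, not once per row.
    let new_values = values.iter().map(|v| format!("{:X}", v)).collect();
    // Row i still points at new_values[keys[i]], so the keys carry over as-is.
    (keys.to_vec(), new_values)
}
```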
# specific language governing permissions and limitations
# under the License.

query T
It would be nice to add some array-level tests, as described in https://github.com/apache/datafusion/blob/main/datafusion/sqllogictest/test_files/spark/README.md
+1 on the array level tests
I pushed a few new tests
Thanks @andygrove! Just one tiny suggestion about adding array-level tests in the .slt file. Otherwise, everything LGTM!
}

#[cfg(test)]
mod test {
These tests are great!
Credit for the tests goes to @tshauck, who originally implemented hex in apache/datafusion-comet#449.
Thanks @shehabgamin and @andygrove. I think we should merge this PR and file a follow-on ticket for any performance optimizations (i.e., avoiding string allocations).
I filed #15986.
I agree with @alamb that we should go ahead and merge this, and then I can update Comet to use it so we have confidence that this approach is working (I think it is). Thanks for the reviews @alamb and @shehabgamin.
Which issue does this PR close?
Part of #15914 and apache/datafusion-comet#1704
Rationale for this change
Test the process for contributing a Comet function to datafusion-spark/
What changes are included in this PR?
Add new hex function.
Are these changes tested?
Yes, unit tests and slt.
Are there any user-facing changes?
No.