Skip to content

[datafusion-spark] Add Spark-compatible hex function #15947

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
May 8, 2025

Conversation

andygrove
Copy link
Member

Which issue does this PR close?

Part of #15914 and apache/datafusion-comet#1704

Rationale for this change

Test the process for contributing a Comet function to datafusion-spark/

What changes are included in this PR?

Add new hex function.

Are these changes tested?

Yes, unit tests and slt.

Are there any user-facing changes?

No.

@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label May 5, 2025
| DataType::LargeUtf8
| DataType::Binary
| DataType::LargeBinary => Ok(vec![arg_types[0].clone()]),
DataType::Dictionary(key_type, value_type) => match value_type.as_ref() {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shehabgamin @alamb Is the correct way to handle dictionary-encoding coercion? I did not see this handled in the expm1 function.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be simpler if you could use some of the existing signatures and then return something for signature():

That way the built in coercion logic would take over. If that doesn't work for spark functions, perhaps we can define some spark compatible signatures (as in implement SparkSignature that would return a https://docs.rs/datafusion/latest/datafusion/logical_expr/enum.TypeSignature.html)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andygrove This seems fine to me. If the logic is unique to hex, it probably doesn't make sense to extract it into a reusable function.

If the logic is reusable, we could add SparkSignature like @alamb suggested. I've found that the existing DataFusion signatures are not always compatible with Spark's behavior. I've also found that a lot of Spark functions have custom and unique coercion behavior.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can also wait for a few more PRs to see if there is a common pattern emerging before refactoring

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shehabgamin @alamb Is the correct way to handle dictionary-encoding coercion? I did not see this handled in the expm1 function.

FWIW I think it is the correct way to handle the dictionary coercion

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @andygrove

Looks reasonable to me. I had a few suggestions, but I think they could also be done as a follow on PR

&self,
_arg_types: &[DataType],
) -> datafusion_common::Result<DataType> {
Ok(DataType::Utf8)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this function outputs Utf8View it will likely be more performant rather than Utf8

| DataType::LargeUtf8
| DataType::Binary
| DataType::LargeBinary => Ok(vec![arg_types[0].clone()]),
DataType::Dictionary(key_type, value_type) => match value_type.as_ref() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be simpler if you could use some of the existing signatures and then return something for signature():

That way the built in coercion logic would take over. If that doesn't work for spark functions, perhaps we can define some spark compatible signatures (as in implement SparkSignature that would return a https://docs.rs/datafusion/latest/datafusion/logical_expr/enum.TypeSignature.html)

DataType::Int64 => {
let array = as_int64_array(array)?;

let hexed_array: StringArray =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can probably make this much faster by avoiding the allocation by using StringBuilder directly

If you are using StringArray something like this:
https://docs.rs/arrow/latest/arrow/array/type.GenericStringBuilder.html#example-incrementally-writing-strings-with-stdfmtwrite

If you are using StringView I think you'll have to provide a fast path for shorter strings and use the StringView builder for longer strings

)?,
};

let new_values: Vec<Option<String>> = dict
Copy link
Contributor

@alamb alamb May 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can likely make this more efficient by avoiding the Strings here and reuse the dictionary keys with the new values

Something like

let new_dict_array = DictionaryArray::new(dict.keys().clone(), values)

# specific language governing permissions and limitations
# under the License.

query T
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to add some array level tests, which is described in https://github.com/apache/datafusion/blob/main/datafusion/sqllogictest/test_files/spark/README.md

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 on the array level tests

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed a few new tests

Copy link
Contributor

@shehabgamin shehabgamin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @andygrove!

Just one tiny suggestion on adding array level tests in the .slt file. Otherwise, everything LGTM!

}

#[cfg(test)]
mod test {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tests are great!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Credit for the tests goes to @tshauck who originally implemented hex in apache/datafusion-comet#449

# specific language governing permissions and limitations
# under the License.

query T
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 on the array level tests

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @shehabgamin and @andygrove

What I think we should do is to merge this PR and file a follow on ticket for any follow on performance optimizations (aka avoid allocating strings)

# specific language governing permissions and limitations
# under the License.

query T
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed a few new tests

@andygrove
Copy link
Member Author

Thanks @shehabgamin and @andygrove

What I think we should do is to merge this PR and file a follow on ticket for any follow on performance optimizations (aka avoid allocating strings)

I filed #15986

@andygrove
Copy link
Member Author

I agree with @alamb that we should go ahead and merge this, and then I can update Comet to use it so we have confidence that this approach is working (I think it is). Thanks for the reviews @alamb and @shehabgamin.

@andygrove andygrove merged commit d01082e into apache:main May 8, 2025
27 checks passed
@andygrove andygrove deleted the spark-hex branch May 8, 2025 14:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants