Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(function): add greatest function #12474

Draft
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

rluvaton
Copy link

Which issue does this PR close?

Closes #12472

Rationale for this change

Support more operators

What changes are included in this PR?

Implement the Spark implementation for greatest:
https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.greatest.html

Are these changes tested?

I was not able to find where you put the functions test, the coalesce for example has only one test for the return type, I must missing something

mod test {
use arrow::datatypes::DataType;
use datafusion_expr::ScalarUDFImpl;
use crate::core;
#[test]
fn test_coalesce_return_types() {
let coalesce = core::coalesce::CoalesceFunc::new();
let return_type = coalesce
.return_type(&[DataType::Date32, DataType::Date32])
.unwrap();
assert_eq!(return_type, DataType::Date32);
}
}

Are there any user-facing changes?

Yes, I need to add documentation, but first check if this feature is desired

@Weijun-H
Copy link
Member

FYI #12357 (comment)

@comphead
Copy link
Contributor

Related to #6531

@rluvaton rluvaton closed this Sep 23, 2024
@rluvaton rluvaton deleted the add-greatest branch September 23, 2024 12:22
@rluvaton
Copy link
Author

rluvaton commented Sep 23, 2024

I was not able to find where you put the functions test, the coalesce for example has only one test for the return type, I must missing something

Hey @alamb, we talked about the test location at the beginning of the CMU talk you gave, where the tests are located?

@alamb
Copy link
Contributor

alamb commented Sep 25, 2024

I think the tests I was talking about are described here:

https://github.com/apache/datafusion/tree/4a3c09a9316fb8940aeb1c5b9b48bc3b7259d5d4/datafusion/sqllogictest#readme

If you want to test the type of a function you can do it like

select arrow_typeif(my_func(foo)))

@rluvaton
Copy link
Author

I see, thank you, how is your experience with debugging this kind of test?

@alamb
Copy link
Contributor

alamb commented Sep 25, 2024

It is great!

Copy link
Member

@findepi findepi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice!

Comment on lines +127 to +130
let can_coerce_to_all = non_null_types.iter().all(|t| can_coerce_from(data_type, t));

if can_coerce_to_all {
return Ok(data_type);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be looking for the common super type between types rather than hoping one arg has a type that's common super type for other

for example greatest(a_decimal_10_3, a decimal_10_4) should return decimal(11,4)

Copy link
Author

@rluvaton rluvaton Nov 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should I use ‎comparison_coercion‎ or is there a better suited function?

found type_union_resolution which states in the function description it should be used for greatest

// do not accept less than 2 arguments.
if args.len() < 2 {
return exec_err!(
"greatest was called with {} arguments. It requires at least 2.",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some engines (eg SQL Server) allow greatest with single arg.
(It's no-op of course)


let cmp = make_comparator(lhs, rhs, SORT_OPTIONS)?;

let len = lhs.len().min(rhs.len());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

they array lengths should be equal (otherwise we would be losing data)

// If both arrays are not nested, have the same length and no nulls, we can use the faster vectorised kernel
// - If both arrays are not nested: Nested types, such as lists, are not supported as the null semantics are not well-defined.
// - both array does not have any nulls: cmp::gt_eq will return null if any of the input is null while we want to return false in that case
if !lhs.data_type().is_nested() && lhs.null_count() == 0 && rhs.null_count() == 0 {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this probably should use logical null count

// - If both arrays are not nested: Nested types, such as lists, are not supported as the null semantics are not well-defined.
// - both array does not have any nulls: cmp::gt_eq will return null if any of the input is null while we want to return false in that case
if !lhs.data_type().is_nested() && lhs.null_count() == 0 && rhs.null_count() == 0 {
return cmp::gt_eq(&lhs, &rhs).map_err(|e| e.into());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add a test with float NaN values.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add greatest function
5 participants