-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(function): add greatest function #12474
base: main
Are you sure you want to change the base?
Conversation
This match the Spark implementation for greatest: https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.greatest.html
FYI #12357 (comment) |
Related to #6531 |
Hey @alamb, we talked about the test location at the beginning of the CMU talk you gave, where the tests are located? |
I think the tests I was talking about are described here: If you want to test the type of a function you can do it like select arrow_typeif(my_func(foo))) |
I see, thank you, how is your experience with debugging this kind of test? |
It is great! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice!
let can_coerce_to_all = non_null_types.iter().all(|t| can_coerce_from(data_type, t)); | ||
|
||
if can_coerce_to_all { | ||
return Ok(data_type); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be looking for the common super type between types rather than hoping one arg has a type that's common super type for other
for example greatest(a_decimal_10_3, a decimal_10_4)
should return decimal(11,4)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should I use comparison_coercion
or is there a better suited function?
found type_union_resolution
which states in the function description it should be used for greatest
// do not accept less than 2 arguments. | ||
if args.len() < 2 { | ||
return exec_err!( | ||
"greatest was called with {} arguments. It requires at least 2.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some engines (eg SQL Server) allow greatest with single arg.
(It's no-op of course)
|
||
let cmp = make_comparator(lhs, rhs, SORT_OPTIONS)?; | ||
|
||
let len = lhs.len().min(rhs.len()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
they array lengths should be equal (otherwise we would be losing data)
// If both arrays are not nested, have the same length and no nulls, we can use the faster vectorised kernel | ||
// - If both arrays are not nested: Nested types, such as lists, are not supported as the null semantics are not well-defined. | ||
// - both array does not have any nulls: cmp::gt_eq will return null if any of the input is null while we want to return false in that case | ||
if !lhs.data_type().is_nested() && lhs.null_count() == 0 && rhs.null_count() == 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this probably should use logical null count
// - If both arrays are not nested: Nested types, such as lists, are not supported as the null semantics are not well-defined. | ||
// - both array does not have any nulls: cmp::gt_eq will return null if any of the input is null while we want to return false in that case | ||
if !lhs.data_type().is_nested() && lhs.null_count() == 0 && rhs.null_count() == 0 { | ||
return cmp::gt_eq(&lhs, &rhs).map_err(|e| e.into()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please add a test with float NaN values.
Which issue does this PR close?
Closes #12472
Rationale for this change
Support more operators
What changes are included in this PR?
Implement the Spark implementation for
greatest
:https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.greatest.html
Are these changes tested?
I was not able to find where you put the functions test, the
coalesce
for example has only one test for the return type, I must missing somethingdatafusion/datafusion/functions/src/core/coalesce.rs
Lines 142 to 157 in 7bd7747
Are there any user-facing changes?
Yes, I need to add documentation, but first check if this feature is desired