Add Extension Type / Metadata support for Scalar UDFs #15646
Conversation
Some of the concerns I have:

Just a link to my experiments starting from the Expr enum in case they are useful! #15036

I have updated the PR to use

I think Aggregate and Window UDFs should come as a separate PR. I did notice however that for Aggregates the input portion is already viable with this PR. Since
This sounds like a good idea to me, but I suggest we wait until DF 47 is shipped to merge it |
TLDR: while this will be a painful downstream change for other crates, I don't really see any other way to support user-defined / extension types, and thus I think we should make the change.

I feel like this particular PR still leaves the code in an inconsistent state -- data type is still used somewhat inconsistently. If we are going to make this change, I think we should fully update the API to be in terms of Field instead of DataType.
FYI @rluvaton as I think this is very related to

Shall we just change functions to return Field? I think that would be the cleanest (though disruptive) solution.
Just chiming in here. I also encountered similar issues with UDFs not having access to (and not being able to return) fields during my experiments. Thanks for working on this 🚀! I'd really love to help out here, as I am craving better UDT support in DF. Unfortunately, I am a bit swamped at the moment, but I'll find a few hours if there is something I can help with. I also believe that we should update the return API to Field as well.
This is awesome! I have some questions primarily from the extension type/user-defined type angle, although I know that the primary motivation here is just metadata. It does seem like a thoughtful breaking change might result in a cleaner overall final design (but I'm very new here). In particular, I wonder if a structure defined in datafusion (maybe ArgType or ExprType) with a to_field() would be less confusing/more flexible than a Field whose name and nullability are mostly ignored.
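A std-only sketch of the idea floated above. The `ArgType` name and `to_field()` signature are hypothetical, and `Field`/`DataType` here are minimal mocks rather than the real arrow-rs types:

```rust
use std::collections::HashMap;

// Mocked stand-ins for arrow's DataType and Field (not the real types).
#[derive(Clone, Debug, PartialEq)]
enum DataType {
    Utf8,
    UInt64,
}

#[derive(Clone, Debug, PartialEq)]
struct Field {
    name: String,
    data_type: DataType,
    nullable: bool,
    metadata: HashMap<String, String>,
}

// Hypothetical wrapper carrying only what UDF planning actually consumes:
// the logical type and metadata, without a name/nullability that would be
// ignored or overwritten later.
struct ArgType {
    data_type: DataType,
    metadata: HashMap<String, String>,
}

impl ArgType {
    // Convert to a concrete Field only at the boundary where a name and
    // nullability are actually decided.
    fn to_field(&self, name: &str, nullable: bool) -> Field {
        Field {
            name: name.to_string(),
            data_type: self.data_type.clone(),
            nullable,
            metadata: self.metadata.clone(),
        }
    }
}

fn main() {
    let mut md = HashMap::new();
    md.insert("ARROW:extension:name".to_string(), "geoarrow.wkb".to_string());
    let arg = ArgType { data_type: DataType::Utf8, metadata: md };
    let field = arg.to_field("geometry", true);
    println!("{} {:?} nullable={}", field.name, field.data_type, field.nullable);
}
```

The trade-off, as discussed below, is that reusing arrow's `Field` avoids asking users to pass all the other parts individually.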
```rust
let array_ref = Arc::new(StringArray::from(array_values)) as ArrayRef;
Ok(ColumnarValue::Array(array_ref))
}
ColumnarValue::Scalar(value) => {
```
I wonder how extension scalars/literals would fit in here (perhaps Literal will also need a lit_field(), or a ScalarValue::Extension would be needed).
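A toy sketch of one of the two hypothetical options mentioned above (neither exists in DataFusion as of this PR): a `ScalarValue` variant that wraps an ordinary scalar together with extension metadata. The types here are minimal mocks, not DataFusion's real `ScalarValue`:

```rust
use std::collections::HashMap;

// Mocked, greatly simplified stand-in for DataFusion's ScalarValue enum.
#[derive(Clone, Debug)]
enum ScalarValue {
    UInt64(Option<u64>),
    // Hypothetical variant: any scalar plus extension-type metadata.
    Extension {
        value: Box<ScalarValue>,
        metadata: HashMap<String, String>,
    },
}

impl ScalarValue {
    // Strip any extension wrapper(s) to reach the underlying storage value.
    fn storage(&self) -> &ScalarValue {
        match self {
            ScalarValue::Extension { value, .. } => value.storage(),
            other => other,
        }
    }
}

fn main() {
    let mut md = HashMap::new();
    md.insert("ARROW:extension:name".into(), "example.uuid".into());
    let v = ScalarValue::Extension {
        value: Box::new(ScalarValue::UInt64(Some(7))),
        metadata: md,
    };
    println!("{:?}", v.storage());
}
```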
Interesting. I don't know. As a workaround, someone could make a light UDF that returns a scalar value but with metadata. Alternatively, we could add methods to the literal expression to give the ability to attach metadata to it. I don't have a clear use case in my head, though, so I'm mostly spitballing here.
It is a good call that we'll need to add the metadata somewhere. I think we'll have to figure out the best place as a follow-on PR (we can defer until we figure out the right place).
#15797 for tracking
datafusion/core/tests/user_defined/user_defined_scalar_functions.rs
I need to take some time to review these comments and think more about it, likely next week. Also, I'm dropping a note for myself that the current implementation isn't sufficient for my needs, because I need the UDF to be able to compute the output field based on the arguments and my function signature. Case in point:
What I was thinking is that if we are going to be changing ScalarFunction again, so that instead of returning a
I'm sorry the scope of this PR keeps growing, but I've tried to address the comments and I believe this is a cleaner interface. Now all physical expressions are required to implement
@alamb Please let me know if you want a re-review or to merge in after CI finishes. I did make some decent-sized changes per your suggestions. A lot of what changed is unit tests, just for the way we handle borrowed
Thank you again for working on this!
While I was taking a look the first time, you fixed my main question (the relationship between return_type and output_field) 🙂. I did my best to look through the whole change to make sure the functions I'm hoping to express can be expressed in this way, and I believe that they can!
```rust
fn return_type(&self, _args: &[DataType]) -> Result<DataType> {
    unimplemented!(
        "this should never be called since return_field_from_args is implemented"
    );
}

fn return_field_from_args(&self, _args: ReturnFieldArgs) -> Result<Field> {
    Ok(Field::new(self.name(), DataType::UInt64, true)
        .with_metadata(self.metadata.clone()))
}
```
Thank you for this!
@paleolimbot I have an end to end example working on my company's code, so if there are still gaps from what you need please let me know!
Thanks - I'll do testing as well on our suite when this merges and see if we can remove our internal workaround 🙂
Hi @timsaucer. I'd like to review this before it gets merged. If it's not blocking anything, would it be okay to wait until the weekend?

I would greatly appreciate the review, and I think it is worth waiting to get another look.
The changes are impressive and it should open up more opportunities for users to customize their usage! Hopefully we don't need to make these breaking changes too often as all the projects depending on datafusion will have a hard time updating their code :)
I can imagine that we will need to update scalar udf in datafusion-python once datafusion v48 gets released, and it's exciting!
repackaging of the output data into a `Field`. The name you specify on the field is not important. It will be overwritten during planning. `ReturnInfo`
Just wondering, it seems a little weird to me here that we ask users to provide a name value but then overwrite it. Is it considered acceptable (because it is part of arrow's Field class), or could we maybe hide this field from the user?
I think it's more work to ask them to pass in all of the other parts rather than reuse the Field.
Yeah that's true. Sounds good to me
```rust
/// The output field associated with this expression
fn return_field(&self, input_schema: &Schema) -> Result<Field> {
    Ok(Field::new(
        format!("{self}"),
```
If I understand correctly, this is one of the places where we overwrite the name of the field?
I believe the place it is computed is here and then used later down in the same function.
datafusion/datafusion/expr/src/expr_schema.rs, line 375 in e4d5846:

```rust
let (relation, schema_name) = self.qualified_name();
```
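A toy illustration of the pattern being discussed, where the planner overwrites the UDF-provided field name with the expression's schema name. The types here are minimal mocks (the real logic lives in DataFusion's expr_schema.rs, and `Field` is arrow's type):

```rust
// Minimal mock of arrow's Field, just enough to show the rename step.
#[derive(Clone, Debug, PartialEq)]
struct Field {
    name: String,
    nullable: bool,
}

impl Field {
    // Mirrors the idea of arrow's Field::with_name: same field, new name.
    fn with_name(mut self, name: &str) -> Field {
        self.name = name.to_string();
        self
    }
}

// The UDF returns a field with whatever placeholder name it likes...
fn udf_return_field() -> Field {
    Field { name: "placeholder".to_string(), nullable: true }
}

// ...and the planner replaces that name with the expression's qualified
// schema name, keeping everything else (type, nullability, metadata).
fn plan_output_field(schema_name: &str) -> Field {
    udf_return_field().with_name(schema_name)
}

fn main() {
    let field = plan_output_field("my_udf(col_a)");
    println!("{field:?}");
}
```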
@berkaysynnada do you think you'll be able to review soon? I know we wanted to get this in earlier in the 48 cycle to shake out any bugs, since it is a big change.

Sorry, I wasn't able to finish it yet, and it doesn't look like I can in the next few days either. I don't want to block progress, so I'll do a post-merge review.

Thank you, everyone, for the thoughtful discussions and reviews.

The march to supporting user defined types has begun!
* Add in plumbing to pass around metadata for physical expressions
* Adding argument metadata to scalar argument struct
* Since everywhere we use this we immediately clone, go ahead and returned an owned version of the metadata for simplicity
* Cargo fmt
* Benchmarks required args_metadata in tests
* Clippy warnings
* Switching over to passing Field around instead of metadata so we can handle extension types directly
* Switching return_type_from_args to return_field_from_args
* Updates to unit tests for switching to field instead of data_type
* Resolve unit test issues
* Update after rebase on main
* GetFieldFunc should return the field it finds instead of creating a new one
* Get metadata from scalar functions
* Change expr_schema to use to_field primarily instead of individual calls for getting data type, nullability, and schema
* Scalar function arguments should take return field instead of return data type now
* subquery should just get the field from below and not lose potential metadata
* Update comment
* Remove output_field now that we've determined it using return_field_from_args
* Change name to_field to field_from_column to be more consistent with the usage and prevent misconception about if we are doing some conversion
* Minor moving around of the explicit lifetimes in the struct definition
* Change physical expression to require to output a field which requires a lot of unit test updates, especially because the scalar arguments pass around borrowed values
* Change name from output_field to return_field to be more consistent
* Update migration guide for DF48 with user defined functions
* Whitespace
* Docstring correction
Working `ST_AsBinary` even persisting CRS! Enabled by apache/datafusion#15646

### Change list

- Restore `geodatafusion` crate and restore `datafusion` dependency (pinned to upstream main)
- Add `with_metadata` to all concrete arrays and to the `GeoArrowArray` trait so that we can replace the metadata (e.g. CRS) of an existing array.
- Remove `metadata` field from `MixedGeometryArray` because the inner array of the GeometryCollectionArray doesn't need to store GeoArrow metadata.
Which issue does this PR close?

Related to change `ReturnTypeInfo` to return a `Field` rather than `DataType` #14247, but it is a different solution than requested.

Rationale for this change

We have many users who wish to use extension data or other metadata when writing user defined functions. This PR enables two features for Scalar UDFs.

What changes are included in this PR?

This is a fairly large change, but at a high level we add a vector of argument fields to the `ScalarFunctionArgs` that gets passed when invoked. Additionally, all physical expressions are now required to implement a new function to output their field. The rest of the work included is around plumbing these changes through the system, extracting the field from the input schema, and setting it as output.

Are these changes tested?

datafusion/core/tests/user_defined/user_defined_scalar_functions.rs
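A std-only sketch of the metadata flow the PR description outlines. The structs below only echo the names `Field` and `ScalarFunctionArgs` from the PR; they are minimal mocks, not the real arrow/DataFusion types, and `plan_return_field` is a hypothetical helper:

```rust
use std::collections::HashMap;

// Minimal mock of arrow's Field: a name plus key/value metadata.
#[derive(Clone, Debug)]
struct Field {
    name: String,
    metadata: HashMap<String, String>,
}

// Per the PR description, invocation args now carry the input fields
// (and thus their metadata) plus the pre-computed return field.
struct ScalarFunctionArgs {
    arg_fields: Vec<Field>,
    return_field: Field,
}

// A UDF can inspect input metadata and propagate it to its output field,
// e.g. forwarding an Arrow extension-type annotation.
fn plan_return_field(arg_fields: &[Field]) -> Field {
    let mut metadata = HashMap::new();
    if let Some(first) = arg_fields.first() {
        if let Some(ext) = first.metadata.get("ARROW:extension:name") {
            metadata.insert("ARROW:extension:name".to_string(), ext.clone());
        }
    }
    Field { name: "output".to_string(), metadata }
}

fn main() {
    let mut md = HashMap::new();
    md.insert("ARROW:extension:name".into(), "geoarrow.point".into());
    let input = Field { name: "geom".into(), metadata: md };
    let args = ScalarFunctionArgs {
        return_field: plan_return_field(std::slice::from_ref(&input)),
        arg_fields: vec![input],
    };
    println!(
        "{} args, ext = {:?}",
        args.arg_fields.len(),
        args.return_field.metadata.get("ARROW:extension:name")
    );
}
```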