
Add Extension Type / Metadata support for Scalar UDFs #15646


Merged

merged 26 commits into apache:main on Apr 29, 2025

Conversation

timsaucer
Contributor

@timsaucer timsaucer commented Apr 8, 2025

Which issue does this PR close?

Rationale for this change

We have many users who wish to use extension data or other metadata when writing user defined functions. This PR enables two features for Scalar UDFs:

  • When invoking the UDF, the Field of each input column will be available, if it exists
  • The UDF can also specify the output Field that is attached to the schema of the record batch
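The flow those two bullets describe can be sketched in plain Rust. The struct shapes below are illustrative stand-ins, not DataFusion's exact API (the real `Field` and `ScalarFunctionArgs` carry more, such as data types and row counts); only the metadata plumbing is shown.

```rust
use std::collections::HashMap;

// Illustrative stand-ins for arrow-rs `Field` and DataFusion's
// `ScalarFunctionArgs`; names and shapes are assumptions for this sketch.
#[derive(Clone, Debug, PartialEq)]
pub struct Field {
    pub name: String,
    pub metadata: HashMap<String, String>,
}

pub struct ScalarFunctionArgs {
    /// One field per input column, carrying any extension metadata.
    pub arg_fields: Vec<Field>,
}

/// Sketch of a UDF that propagates the first input's metadata
/// onto the field it attaches to the output schema.
pub fn return_field(args: &ScalarFunctionArgs, output_name: &str) -> Field {
    let metadata = args
        .arg_fields
        .first()
        .map(|f| f.metadata.clone())
        .unwrap_or_default();
    Field {
        name: output_name.to_string(),
        metadata,
    }
}
```

A UDF implementing, say, a UUID transform could use this shape to read the extension name off its input field and stamp the same (or a new) extension name onto its output.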

What changes are included in this PR?

This is a fairly large change, but at a high level we add a vector of argument fields to the ScalarFunctionArgs that gets passed on invocation. Additionally, all physical expressions are now required to implement a new function that outputs their field. The rest of the work is plumbing these changes through the system: extracting the field from the input schema and setting it on the output.

Are these changes tested?

  • All existing unit tests pass.
  • Additional unit tests exercising this feature are added in datafusion/core/tests/user_defined/user_defined_scalar_functions.rs

@github-actions github-actions bot added the logical-expr (Logical plan and expressions), physical-expr (Changes to the physical-expr crates), core (Core DataFusion crate), proto (Related to proto crate), functions (Changes to functions implementation), and ffi (Changes to the ffi crate) labels Apr 8, 2025
@timsaucer
Contributor Author

Some of the concerns I have:

  • Canonical extension types in arrow-rs have an implementation of TryFrom<&Field>, which supports the original issue's suggestion of passing a Field rather than just the metadata.
  • I need to add unit tests to make sure other operations don't throw away the metadata. I don't know if there is a one-size-fits-all rule for which operations should and should not pass metadata through. For example, alias should definitely keep the metadata IMO, but what about functions that take two inputs? And all of the other single-input functions - it's not clear whether we need to handle this on a case-by-case basis or not.
  • If we do switch over to Field, it isn't immediately obvious to me what the name of the field should be. I need to investigate this a little more because there might already be a solution.

@paleolimbot
Member

Just a link to my experiments starting from the Expr enum in case they are useful! #15036

@timsaucer
Contributor Author

I have updated the PR to use Field instead of a metadata HashMap<String, String>. In doing so we can now use extension types directly. I've added a second unit test that uses extension types, both a canonical one and a user defined one.

@timsaucer timsaucer changed the title Add metadata support for Scalar UDFs Add Extension Type / Metadata support for Scalar UDFs Apr 9, 2025
@timsaucer timsaucer self-assigned this Apr 9, 2025
@timsaucer
Contributor Author

I think Aggregate and Window UDFs should come in a separate PR. I did notice, however, that for Aggregates the input portion is already workable with this PR. Since AccumulatorArgs already passes in the input physical expression and input schema, we would be able to compute the input fields. I've tested this locally with success. For Window functions we will want to add in the input schema.

@timsaucer timsaucer marked this pull request as ready for review April 9, 2025 19:14
@paleolimbot
Member

I'll take a look this evening!

It's mostly updating to Arrow 55, but #15663 (in particular e37ef60) is the change required to support extension types if arrow-rs supported them (minus reading them from files, which is admittedly a big minus).

@alamb
Contributor

alamb commented Apr 10, 2025

This sounds like a good idea to me, but I suggest we wait until DF 47 is shipped to merge it

Contributor

@alamb alamb left a comment

TLDR: while this will be a painful downstream change for other crates, I don't really see any other way to support user defined types / extension types, and thus I think we should make the change.

I feel like this particular PR still leaves the code in an inconsistent state -- data type is still used somewhat inconsistently. If we are going to make this change, I think we should fully update the API to be in terms of Field instead of DataType.

FYI @rluvaton as I think this is very related to

Shall we just change functions to return Field? I think that would be the cleanest (though disruptive) solution

@tobixdev
Contributor

Just chiming in here. I also encountered similar issues with UDFs not having access to (and not being able to return) fields during my experiments.

Thanks for working on this 🚀! I'd really love to help out here, as I am craving better UDT support in DF. Unfortunately, I am a bit swamped at the moment, but I'll find a few hours if there is something I can help with.

I also believe that we should update the return API to Field as well.

Member

@paleolimbot paleolimbot left a comment

This is awesome! I have some questions, primarily from the extension type/user-defined type angle, although I know that the primary motivation here is just metadata. It does seem like a thoughtful breaking change might result in a cleaner overall final design (but I'm very new here). In particular, I wonder if a structure defined in datafusion (maybe ArgType or ExprType) with a to_field() would be less confusing/more flexible than a Field whose name and nullability are mostly ignored.
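One way to read that suggestion is sketched below. The names `ArgType` and `to_field` come from the comment above and are hypothetical; the trimmed `Field` is likewise an illustrative stand-in for arrow-rs, not the real type.

```rust
use std::collections::HashMap;

// Illustrative stand-in; arrow-rs' real Field has more members.
#[derive(Clone, Debug, PartialEq)]
pub struct Field {
    pub name: String,
    pub nullable: bool,
    pub metadata: HashMap<String, String>,
}

/// Hypothetical wrapper carrying only what a UDF argument needs
/// (nullability plus metadata), converting to a Field on demand so
/// the mostly-ignored field name never confuses UDF authors.
pub struct ArgType {
    pub nullable: bool,
    pub metadata: HashMap<String, String>,
}

impl ArgType {
    /// Produce a Field only at the boundary where one is required,
    /// supplying whatever name the planner wants.
    pub fn to_field(&self, name: &str) -> Field {
        Field {
            name: name.to_string(),
            nullable: self.nullable,
            metadata: self.metadata.clone(),
        }
    }
}
```

The design question is whether hiding the name behind such a wrapper is worth a new type, versus reusing `Field` directly as the PR ultimately does.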

let array_ref = Arc::new(StringArray::from(array_values)) as ArrayRef;
Ok(ColumnarValue::Array(array_ref))
}
ColumnarValue::Scalar(value) => {
Member

I wonder how extension scalars/literals would fit in here (perhaps Literal will also need a lit_field() or ScalarValue::Extension would be needed)

Contributor Author

Interesting. I don't know. As a workaround, someone could make a light UDF that returns a scalar value but with metadata. Alternatively, we could add methods to the literal expression to give the ability to add metadata to it. I don't have a clear use case in my head, though, so I'm mostly spitballing here.

Contributor

It is a good call that we'll need to add the metadata somewhere. I think we'll have to figure out the best place in a follow-on PR (we can defer until we figure out the right place)

Contributor Author

#15797 for tracking

@rluvaton
Contributor

FYI @rluvaton as I think this is very related to

Shall we just change functions to return Field? I think that would be the cleanest (though disruptive) solution

@alamb Not sure how this solves the problem I described in the issue

@timsaucer
Contributor Author

I need to take some time to review these comments and think more about it, likely next week. Also, I'm dropping a note for myself that the current implementation isn't sufficient for my needs, because I need the UDF to be able to compute the output field based on the arguments, and my function signature for output_field will not enable that.

Case in point: GetFieldFunc should return the field with metadata for the subfield of the struct, so it needs to know which field it was called for.
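A sketch of why the output field must depend on the arguments. The types here are illustrative stand-ins, not the real GetFieldFunc or arrow-rs Field: a struct-field accessor must return the *child* field, metadata intact, and it cannot know that child without inspecting its argument.

```rust
use std::collections::HashMap;

// Illustrative struct field with named children, each carrying metadata.
#[derive(Clone, Debug, PartialEq)]
pub struct Field {
    pub name: String,
    pub metadata: HashMap<String, String>,
    pub children: Vec<Field>,
}

/// get_field(struct, key): the return field is the child field looked up
/// by name, with its metadata preserved. A fixed, argument-free
/// output_field() signature could not express this.
pub fn return_field_from_args(struct_field: &Field, key: &str) -> Option<Field> {
    struct_field
        .children
        .iter()
        .find(|child| child.name == key)
        .cloned()
}
```

This is the shape the final PR lands on: the return field is computed from the argument fields rather than declared up front.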

@timsaucer timsaucer marked this pull request as draft April 11, 2025 18:35
@alamb
Contributor

alamb commented Apr 14, 2025

FYI @rluvaton as I think this is very related to

Shall we just change functions to return Field? I think that would be the cleanest (though disruptive) solution

@alamb Not sure how this solves the problem I described in the issue

What I was thinking is that if we are going to change ScalarFunction again, then instead of returning a DataType from return_type_from_args it would return a Field.

The Field includes nullability information (as well as extension type information)
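That difference can be made concrete. The enum and struct below are illustrative, heavily trimmed versions of arrow's types (and `example.uuid` is a made-up extension name): a bare DataType says nothing about nullability or extension metadata, while a Field carries both.

```rust
use std::collections::HashMap;

// Illustrative, heavily trimmed versions of arrow's types.
#[derive(Clone, Debug, PartialEq)]
pub enum DataType {
    UInt64,
    Utf8,
}

#[derive(Clone, Debug, PartialEq)]
pub struct Field {
    pub name: String,
    pub data_type: DataType,
    pub nullable: bool,                    // lost if only a DataType is returned
    pub metadata: HashMap<String, String>, // extension info, also lost
}

/// Returning a Field keeps nullability and extension metadata
/// that returning a bare DataType would drop.
pub fn return_field_example() -> Field {
    let mut metadata = HashMap::new();
    metadata.insert(
        "ARROW:extension:name".to_string(),
        "example.uuid".to_string(), // hypothetical extension name
    );
    Field {
        name: "result".to_string(),
        data_type: DataType::Utf8,
        nullable: false,
        metadata,
    }
}
```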

@github-actions github-actions bot added the documentation (Improvements or additions to documentation) label Apr 21, 2025
@timsaucer
Contributor Author

I'm sorry the scope of this PR keeps growing, but I've tried to address the comments, and I believe this is a cleaner interface. Now all physical expressions are required to implement return_field (which I renamed from output_field for consistency). It meant updating a lot of unit tests.

@timsaucer
Contributor Author

@alamb Please let me know if you want a re-review or to merge after CI finishes. I did make some decent sized changes per your suggestions. A lot of what changed is unit tests, just because of the way we handle borrowed Field in the invocation arguments.

Member

@paleolimbot paleolimbot left a comment

Thank you again for working on this!

While I was taking a look the first time, you fixed my main question (the relationship between return_type and output_field) 🙂. I did my best to look through the whole change to make sure the functions I'm hoping to express can be expressed this way, and I believe they can!

Comment on lines +1410 to +1419
fn return_type(&self, _args: &[DataType]) -> Result<DataType> {
unimplemented!(
"this should never be called since return_field_from_args is implemented"
);
}

fn return_field_from_args(&self, _args: ReturnFieldArgs) -> Result<Field> {
Ok(Field::new(self.name(), DataType::UInt64, true)
.with_metadata(self.metadata.clone()))
}
Member

Thank you for this!

Contributor Author

@paleolimbot I have an end to end example working on my company's code, so if there are still gaps from what you need please let me know!

Member

Thanks - I'll do testing as well on our suite when this merges and see if we can remove our internal workaround 🙂

@berkaysynnada
Contributor

Hi @timsaucer. I'd like to review this before getting merged. If it's not blocking anything, would it be okay to wait until weekend?

@timsaucer
Contributor Author

Hi @timsaucer. I'd like to review this before getting merged. If it's not blocking anything, would it be okay to wait until weekend?

I would greatly appreciate the review, and I think it is worth waiting to get another look.

@crystalxyz crystalxyz left a comment

The changes are impressive, and they should open up more opportunities for users to customize their usage! Hopefully we don't need to make breaking changes like these too often, as all the projects depending on datafusion will have a hard time updating their code :)
I can imagine that we will need to update the scalar UDF support in datafusion-python once datafusion v48 gets released, and it's exciting!

Comment on lines +34 to +35
repackaging of the output data into a `Field`. The name you specify on the
field is not important. It will be overwritten during planning. `ReturnInfo`


Just wondering, it seems a little weird to me here that we ask users to provide a name value but then overwrite it. Is it considered acceptable (because it is part of arrow's Field class), or could we maybe hide this field from the user?

Contributor Author

I think it's more work to ask them to pass in all of the other parts rather than reuse the Field


Yeah that's true. Sounds good to me

/// The output field associated with this expression
fn return_field(&self, input_schema: &Schema) -> Result<Field> {
Ok(Field::new(
format!("{self}"),


If I understand correctly, this is one of the places where we overwrite the name of the field?

Contributor Author

@timsaucer timsaucer Apr 24, 2025

I believe the place it is computed is here and then used later down in the same function.

let (relation, schema_name) = self.qualified_name();

@timsaucer
Contributor Author

@berkaysynnada do you think you'll be able to review soon? I know we wanted to get this in earlier in the 48 cycle to shake out any bugs since it is a big change

@berkaysynnada
Contributor

@berkaysynnada do you think you'll be able to review soon? I know we wanted to get this in earlier in the 48 cycle to shake out any bugs since it is a big change

Sorry, I wasn't able to finish it yet, and it doesn't look like I can do it in the next few days either. I don't want to block progress, so I'll do a post-merge review.

@timsaucer
Contributor Author

Thank you, everyone, for the thoughtful discussions and reviews.

@timsaucer timsaucer merged commit 2d80194 into apache:main Apr 29, 2025
31 checks passed
@timsaucer timsaucer deleted the feat/physical_expr_evaluate_metadata branch April 29, 2025 13:33
@alamb
Contributor

alamb commented Apr 29, 2025

The march to supporting user defined types has begun!

nirnayroy pushed a commit to nirnayroy/datafusion that referenced this pull request May 2, 2025
* Add in plumbing to pass around metadata for physical expressions

* Adding argument metadata to scalar argument struct

* Since everywhere we use this we immediately clone, go ahead and return an owned version of the metadata for simplicity

* Cargo fmt

* Benchmarks required args_metadata in tests

* Clippy warnings

* Switching over to passing Field around instead of metadata so we can handle extension types directly

* Switching return_type_from_args to return_field_from_args

* Updates to unit tests for switching to field instead of data_type

* Resolve unit test issues

* Update after rebase on main

* GetFieldFunc should return the field it finds instead of creating a new one

* Get metadata from scalar functions

* Change expr_schema to use to_field primarily instead of individual calls for getting data type, nullability, and schema

* Scalar function arguments should take return field instead of return data type now

* subquery should just get the field from below and not lose potential metadata

* Update comment

* Remove output_field now that we've determined it using return_field_from_args

* Change name to_field to field_from_column to be more consistent with the usage and prevent misconception about if we are doing some conversion

* Minor moving around of the explicit lifetimes in the struct definition

* Change physical expression to require to output a field which requires a lot of unit test updates, especially because the scalar arguments pass around borrowed values

* Change name from output_field to return_field to be more consistent

* Update migration guide for DF48 with user defined functions

* Whitespace

* Docstring correction
kylebarron added a commit to geoarrow/geoarrow-rs that referenced this pull request May 7, 2025
Working `ST_AsBinary` even persisting CRS!

Enabled by apache/datafusion#15646

### Change list

- Restore `geodatafusion` crate and restore `datafusion` dependency
(pinned to upstream main)
- Add `with_metadata` to all concrete arrays and to the `GeoArrowArray`
trait so that we can replace the metadata (e.g. CRS) of an existing
array.
- Remove `metadata` field from `MixedGeometryArray` because the inner
array of the GeometryCollectionArray doesn't need to store GeoArrow
metadata.
Labels

  • api change: Changes the API exposed to users of the crate
  • common: Related to common crate
  • core: Core DataFusion crate
  • documentation: Improvements or additions to documentation
  • ffi: Changes to the ffi crate
  • functions: Changes to functions implementation
  • logical-expr: Logical plan and expressions
  • physical-expr: Changes to the physical-expr crates
  • proto: Related to proto crate
Development

Successfully merging this pull request may close these issues.

Change ReturnTypeInfo to return a Field rather than DataType
7 participants