[Proposal] String function data type handling requirements #13552
Comments
is a "config" argument well defined (I assume it means something like
This makes sense to me. I also did a little digging and found this: datafusion/datafusion/expr-common/src/signature.rs Lines 141 to 145 in 7553b3b
So maybe best practice would be that string functions used the
It might make sense to define what "non-contiguous" means in this context (does it mean having special implementations when there are multiple input arguments that are not exactly the same type, e.g. Utf8View and Utf8?)
FYI @findepi and @jayzhan211 in case you have opinions in this area
Handling
Why is it not recommended for >=3 args? For non-contiguous, is it data types like int and string (logical level)?
Maybe it is due to the combinatorial explosion of code (e.g. if you have to handle all combinations of string types (
Yes. You can see the result of allowing that with something like the string_to_array function. I think in cases like that it makes more sense to just coerce the data arguments to the same type than to try to handle all the types. Preferably this would be done via the signature rather than in code in each function.
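To make the combinatorial growth mentioned above concrete, here is a minimal sketch. It assumes three string types (Utf8, LargeUtf8, Utf8View) and that a function specializes on the exact type of every data argument; the helper name `specializations` is hypothetical and is not part of DataFusion.

```rust
/// Hypothetical helper: counts the distinct argument-type combinations a
/// function must handle when each of `num_args` data arguments can
/// independently be any one of `num_types` string types.
fn specializations(num_types: usize, num_args: u32) -> usize {
    // Each argument picks a type independently, so the count is
    // num_types raised to the power num_args.
    num_types.pow(num_args)
}
```

Under these assumptions, one argument needs 3 code paths, two arguments need 9, and three arguments need 27, whereas coercing all data arguments to one common type first keeps it at 3 regardless of the argument count.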
👍🏻
Updated to add descriptions.
Updated.
Updated.
Is your feature request related to a problem or challenge?
One of the things I've been thinking about when working on Utf8View support in UDFs is what exactly DataFusion should support in terms of function signature types. Currently we haven't formalized what we expect functions to support, and thus string functions are inconsistent in what they accept and what they produce.
@alamb also asked whether the level of specialization of a function was indeed required in #13403 (comment) and if a proposal to have guidelines for string functions should be made. This is my attempt at such a proposal.
Describe the solution you'd like
In the context of this proposal, string functions are UDFs that accept and produce strings. This means exclusively the UDFs in functions/string and functions/unicode.
Data arguments are arguments that contain the actual data to be processed.
Config arguments are arguments that hold values that adjust how processing will occur. Examples include regexes and concat separators.
I would like to propose the following for DataFusion:
Dict(_, StringType). To ease implementation the type for all data arguments SHOULD be coerced to be the largest type among all the data arguments.
Signature::String or equivalent here.
schema_force_view_types == true. Otherwise string functions SHOULD output string results in the same type as the received primary data argument.
Describe alternatives you've considered
No response
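The rule of coercing every data argument to the largest type among them can be sketched as follows. This is only an illustration under stated assumptions: `StringType` is a hypothetical stand-in for Arrow's string data types rather than DataFusion's `DataType`, `coerce_data_args` is a hypothetical helper, and the ordering Utf8 < LargeUtf8 < Utf8View is an assumed reading of "largest", which the proposal does not define.

```rust
/// Hypothetical stand-in for the Arrow string types (not DataFusion's `DataType`).
/// The derived `Ord` follows declaration order, which encodes the ASSUMED
/// ranking Utf8 < LargeUtf8 < Utf8View.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum StringType {
    Utf8,
    LargeUtf8,
    Utf8View,
}

/// Hypothetical helper: picks the single type that all data arguments would
/// be coerced to, i.e. the maximum under the assumed ordering.
/// Returns `None` when there are no data arguments.
fn coerce_data_args(args: &[StringType]) -> Option<StringType> {
    args.iter().copied().max()
}
```

For example, `coerce_data_args(&[StringType::Utf8, StringType::LargeUtf8])` yields `Some(StringType::LargeUtf8)`, so the function body only needs an implementation per common type rather than per combination. In DataFusion itself this decision would preferably live in the function's signature, as suggested in the comments, rather than in a helper like this one.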