Skip to content

[Proposal] String function data type handling requirements #13552

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
Omega359 opened this issue Nov 24, 2024 · 7 comments
Open

[Proposal] String function data type handling requirements #13552

Omega359 opened this issue Nov 24, 2024 · 7 comments
Labels
enhancement New feature or request

Comments

@Omega359
Copy link
Contributor

Omega359 commented Nov 24, 2024

Is your feature request related to a problem or challenge?

One of the things I've been thinking about when working on utf8view support in udfs is what exactly datafusion should support in terms of function signature types. Currently we haven't formalized what we expect functions to support and thus string functions are not consistent in terms of what they accept and what they generate.

@alamb also asked whether the level of specialization of a function was indeed required in #13403 (comment) and if a proposal to have guidelines for string functions should be made. This is my attempt at such a proposal.

Describe the solution you'd like

In the context of this proposal string functions are UDF's that accept and produce strings. This does exclusively mean udf's in functions/string and functions/unicode.

Data arguments are arguments that contain actual data that will be processed.
Config arguments are arguments that hold values that adjust how processing will occur. This could be regex's, concat separator, etc.

I would like to propose the following for DataFusion:

  1. String functions MUST accept both scalar and array values for all data arguments (vs config such as regex's 'flags' arguments).
  2. String functions MUST accept scalar values for all config arguments but MAY accept both scalar and array if appropriate for the function.
  3. String functions MUST accept all valid string types for all data arguments including Dict(_, StringType). To ease implementation the type for all data arguments SHOULD be coerced to be the largest type among all the data arguments.
  4. String functions MAY choose to allow non-contiguous data types for data arguments but it is NOT RECOMMENDED for functions with 3 or more arguments. Non-contiguous here means data arguments that are not of the same type (for example, example_fn(utf8, LargeUtf8, utf8)), Best practice is to use Signature::String or equivalent here.
  5. String functions MAY choose to output in Utf8View instead of Utf8 if DataFusion is configured with schema_force_view_types == true. Otherwise string functions SHOULD output string results in the same type as the received primary data argument.
  6. String functions SHOULD rely on type coercion to handle non-string data. For example, concat('ab', 2, 'cc').
  7. String functions MUST handle non-control unicode textual character classes unless the function explicitly is designed for a particular character set (ascii for example)
  8. String functions SHOULD NOT attempt to specially handle unicode grapheme characters unless it's directly related to the function requirements.

Describe alternatives you've considered

No response

@Omega359 Omega359 added the enhancement New feature or request label Nov 24, 2024
@alamb
Copy link
Contributor

alamb commented Nov 25, 2024

String functions MUST accept scalar values for all config arguments but MAY accept both scalar and array if appropriate for the function.

is a "config" argument well defined (I assume it means something like concat_ws(column, '**') where the function mostly follows

String functions SHOULD rely on type coercion to handle non-string data. For example, concat('ab', 2, 'cc').

This makes sense to me. I also did a little digging and found this:

/// Fixed number of arguments of all the same string types.
/// The precedence of type from high to low is Utf8View, LargeUtf8 and Utf8.
/// Null is considered as `Utf8` by default
/// Dictionary with string value type is also handled.
String(usize),

So maybe best practice would be that string functions used the Signature::String or equivalent coercion rules

String functions MAY choose to allow non-contiguous data types for data arguments but it is NOT RECOMMENDED for functions with 3 or more arguments.

It might make sense to define what "non-contiguous" means in this context (does it mean having special implementations when there are multiple input arguments that are not exactly the same type -- e.g. Utf8View and Utf8)?

@alamb
Copy link
Contributor

alamb commented Nov 25, 2024

FYI @findepi and @jayzhan211 in case you have opinions in this area

@jayzhan211
Copy link
Contributor

jayzhan211 commented Nov 26, 2024

Handling Dict(_, String) is no different from handling String

String functions MAY choose to allow non-contiguous data types for data arguments but it is NOT RECOMMENDED for functions with 3 or more arguments.

why is it not recommended for >=3 args? For non-contiguous, is it data types like int and string (logical level)?

@alamb
Copy link
Contributor

alamb commented Nov 26, 2024

Handling Dict(_, String) is no different from handling String

String functions MAY choose to allow non-contiguous data types for data arguments but it is NOT RECOMMENDED for functions with 3 or more arguments.

why is it not recommended for >=3 args? For non-contiguous, is it data types like int and string (logical level)?

Maybe it is due to the combinatorial explosion of code (e.g. if you have to handle all combinations of string types (Utf8, LargeUtf8 and Utf8Vew) for even three arguments, you end up with 3*3*3=27 specialized versions of the function 🤔

@Omega359
Copy link
Contributor Author

Maybe it is due to the combinatorial explosion of code (e.g. if you have to handle all combinations of string types (Utf8, LargeUtf8 and Utf8Vew) for even three arguments, you end up with 3*3*3=27 specialized versions of the function 🤔

Yes. You can see the result of allowing that with something like the string_to_array function.

I think in cases like that it makes more sense to just coerce the data arguments to the same type than to try and handle all the types. Preferably this would be done via the signature vs code in each function.

@Omega359
Copy link
Contributor Author

Handling Dict(_, String) is no different from handling String

👍🏻

@Omega359
Copy link
Contributor Author

String functions MUST accept scalar values for all config arguments but MAY accept both scalar and array if appropriate for the function.

is a "config" argument well defined (I assume it means something like concat_ws(column, '**') where the function mostly follows

Updated to add descriptions.

String functions SHOULD rely on type coercion to handle non-string data. For example, concat('ab', 2, 'cc').

This makes sense to me. I also did a little digging and found this:

/// Fixed number of arguments of all the same string types.
/// The precedence of type from high to low is Utf8View, LargeUtf8 and Utf8.
/// Null is considered as `Utf8` by default
/// Dictionary with string value type is also handled.
String(usize),

So maybe best practice would be that string functions used the Signature::String or equivalent coercion rules

Updated.

String functions MAY choose to allow non-contiguous data types for data arguments but it is NOT RECOMMENDED for functions with 3 or more arguments.

It might make sense to define what "non-contiguous" means in this context (does it mean having special implementations when there are multiple input arguments that are not exactly the same type -- e.g. Utf8View and Utf8)?

Updated.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants