Skip to content

refactor: use TypeSignature::Coercible for crypto functions #14826

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 15 commits into from
Mar 27, 2025

Conversation

Chen-Yuan-Lai
Copy link
Contributor

@Chen-Yuan-Lai Chen-Yuan-Lai commented Feb 22, 2025

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

  1. Use TypeSignature::Coercible and logical type (string & binary) for crypto functions
  2. Add DataType::BinaryView for utf8_or_binary_to_binary_type to support binaryView
  3. Make digest_binary_array can digest binary and binaryView
  4. Implement digest_binary_array_impl, which can be used by digest_binary_array
  5. Make digest_process to support binaryView

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added functions Changes to functions implementation common Related to common crate labels Feb 22, 2025
@Chen-Yuan-Lai Chen-Yuan-Lai force-pushed the use_coercible_for_crypto branch 3 times, most recently from 3cb1d36 to 2dcee88 Compare February 23, 2025 16:25
@github-actions github-actions bot removed the common Related to common crate label Feb 23, 2025
@Chen-Yuan-Lai
Copy link
Contributor Author

Hi @jayzhan211, I encountered some CI failures in the digest function. It seems the Exact(vec![Utf8View, Utf8View]) case can't be replaced with TypeSignature::Coercible because TypeSignature::Coercible processes each parameter independently, without considering the types of other parameters

[SQL] EXPLAIN SELECT
  digest(column1_utf8view, 'md5') as c
FROM test;
[Diff] (-expected|+actual)
    logical_plan
-   01)Projection: digest(test.column1_utf8view, Utf8View("md5")) AS c
+   01)Projection: digest(test.column1_utf8view, Utf8("md5")) AS c
    02)--TableScan: test projection=[column1_utf8view]

Should we:

  1. Implement a new TypeSignature::Coercible method to support this feature?
  2. Mixing the usage TypeSignature::Coercible and Exact(vec![Utf8View, Utf8View]) in digest function?

I'm not sure which one is better, could you give me some suggestions or other best practices? thanks very much!!

@jayzhan211
Copy link
Contributor

jayzhan211 commented Feb 25, 2025

[SQL] EXPLAIN SELECT
  digest(column1_utf8view, 'md5') as c
FROM test;
[Diff] (-expected|+actual)
    logical_plan
-   01)Projection: digest(test.column1_utf8view, Utf8View("md5")) AS c
+   01)Projection: digest(test.column1_utf8view, Utf8("md5")) AS c
    02)--TableScan: test projection=[column1_utf8view]

This looks like an improvement, we don't need to cast it to utf8view

@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Feb 25, 2025
@alamb
Copy link
Contributor

alamb commented Mar 3, 2025

CI appears to be failing, so marking this one as a draft

@alamb alamb marked this pull request as draft March 3, 2025 21:31
@Chen-Yuan-Lai Chen-Yuan-Lai force-pushed the use_coercible_for_crypto branch from d3efb81 to 03b40f4 Compare March 6, 2025 15:49
@Chen-Yuan-Lai
Copy link
Contributor Author

Chen-Yuan-Lai commented Mar 9, 2025

Hi @jayzhan211, It seems all the CI checks were passed (including sqlogicaltest), but when I created and printed a table by datafusion-cli , I got empty result

DataFusion CLI v45.0.0
> CREATE TABLE t_source(
    column1 String,
    column2 String,
    column3 String,
    column4 String
) AS VALUES
('one', 'one', 'one', 'one'),
('two', 'two', '', 'two'),
(NULL, NULL, NULL, NULL),
('', '', '', ''),
('three', 'three', 'three', '');

0 row(s) fetched. 
Elapsed 0.005 seconds.

> select * from t_source;
+---------+---------+---------+---------+
| column1 | column2 | column3 | column4 |
+---------+---------+---------+---------+
+---------+---------+---------+---------+
5 row(s) fetched. 
Elapsed 0.001 seconds.

That is weird because I only modified the implementation of crypto functions. Why is the display for the basic operation is changed? Could you give me some help or suggestions? Thx!

@Chen-Yuan-Lai
Copy link
Contributor Author

Chen-Yuan-Lai commented Mar 9, 2025

Hi @jayzhan211, It seems all the CI checks were passed (including sqlogicaltest), but when I created and printed a table by datafusion-cli , I got empty result

Oh! I found the problem has solved after I reinstalled datafusion-cli 😆

Comment on lines +326 to +355
pub fn digest_binary_array_impl<'a, BinaryArrType>(
self,
input_value: BinaryArrType,
) -> ArrayRef
where
BinaryArrType: BinaryArrayType<'a>,
{
match self {
Self::Md5 => digest_to_array!(Md5, input_value),
Self::Sha224 => digest_to_array!(Sha224, input_value),
Self::Sha256 => digest_to_array!(Sha256, input_value),
Self::Sha384 => digest_to_array!(Sha384, input_value),
Self::Sha512 => digest_to_array!(Sha512, input_value),
Self::Blake2b => digest_to_array!(Blake2b512, input_value),
Self::Blake2s => digest_to_array!(Blake2s256, input_value),
Self::Blake3 => {
let binary_array: BinaryArray = input_value
.iter()
.map(|opt| {
opt.map(|x| {
let mut digest = Blake3::default();
digest.update(x);
Blake3::finalize(&digest).as_bytes().to_vec()
})
})
.collect();
Arc::new(binary_array)
}
}
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found that digest_binary_array_impl and digest_utf8_array_impl are similar, should we integrate them into one?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found that these two array types don't share a common trait that would allow implementing a single generic function. Since the key difference is in how bytes are accessed, I'll keep these functions separate to maintain type safety.

vec![Utf8View, Utf8, LargeUtf8, Binary, LargeBinary],
signature: Signature::one_of(
vec![
TypeSignature::Coercible(vec![Coercion::new_exact(
Copy link
Contributor

@jayzhan211 jayzhan211 Mar 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the return type is utf8_or_binary_to_binary_type
which means the signtaure of sha512 should be new_implicit(binary, string|binary)

Target is binary, and allowed source types are string and binary

Copy link
Contributor Author

@Chen-Yuan-Lai Chen-Yuan-Lai Mar 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I met another error after I used new_implicit 😭

External error: query failed: DataFusion error: Arrow error: Compute error: Internal Error: Cannot cast BinaryView to BinaryArray of expected type
[SQL] select sha512(ascii_1) from test_basic_operator;
at test_files/string/./string_query.slt.part:1732
at test_files/string/dictionary_utf8.slt:58

I think the error came from arrow-rs
https://github.com/apache/arrow-rs/blob/5b04ca7bf756c5ca9edfba220060017684e2ecc6/arrow-cast/src/cast/dictionary.rs#L107-#L128

BinaryView => {
            // `unpack_dictionary` can handle Utf8View/BinaryView types, but incurs unnecessary data copy of the value buffer.
            // we handle it here to avoid the copy.
            let dict_array = array
                .as_dictionary::<K>()
                .downcast_dict::<BinaryArray>()
                .ok_or_else(|| {
                    ArrowError::ComputeError(
                        "Internal Error: Cannot cast BinaryView to BinaryArray of expected type"
                            .to_string(),
                    )
                })?;

            let binary_view = view_from_dict_values::<K, BinaryViewType, BinaryType>(
                dict_array.values(),
                dict_array.keys(),
            )?;
            Ok(Arc::new(binary_view))
        }
        _ => unpack_di

When the dictionary is coverted to BinaryView, it can't be cast to BinaryArray

@Chen-Yuan-Lai Chen-Yuan-Lai force-pushed the use_coercible_for_crypto branch 2 times, most recently from 2f5bcfe to 4f1d604 Compare March 11, 2025 03:04
@github-actions github-actions bot added the logical-expr Logical plan and expressions label Mar 11, 2025
@Chen-Yuan-Lai Chen-Yuan-Lai force-pushed the use_coercible_for_crypto branch from 0db2c5b to 38e3e41 Compare March 13, 2025 09:34
@github-actions github-actions bot removed the logical-expr Logical plan and expressions label Mar 13, 2025
@Chen-Yuan-Lai Chen-Yuan-Lai force-pushed the use_coercible_for_crypto branch from 38e3e41 to 6e6db4a Compare March 17, 2025 16:37
@Chen-Yuan-Lai Chen-Yuan-Lai force-pushed the use_coercible_for_crypto branch from 6e6db4a to 2861ada Compare March 25, 2025 09:33
@Chen-Yuan-Lai Chen-Yuan-Lai force-pushed the use_coercible_for_crypto branch from 8faafb0 to c2e0ae8 Compare March 26, 2025 08:51
@Chen-Yuan-Lai
Copy link
Contributor Author

Chen-Yuan-Lai commented Mar 26, 2025

@jayzhan211 I think the PR is ready to be reviewed : )

@jayzhan211
Copy link
Contributor

jayzhan211 commented Mar 26, 2025

you can mark it as ready for review

@jayzhan211 jayzhan211 marked this pull request as ready for review March 26, 2025 11:39
@@ -1068,7 +1068,7 @@ EXPLAIN SELECT
FROM test;
----
logical_plan
01)Projection: digest(test.column1_utf8view, Utf8View("md5")) AS c
01)Projection: digest(test.column1_utf8view, Utf8("md5")) AS c
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we avoid casting from utf8 to utf8view

select count(*) from test WHERE array_has([needle], needle);
----
100000

# TODO: this should probably be possible to completely remove the filter as always true?
query TT
explain with test AS (SELECT substr(md5(i)::text, 1, 32) as needle FROM generate_series(1, 100000) t(i))
explain with test AS (SELECT substr(md5(i::text)::text, 1, 32) as needle FROM generate_series(1, 100000) t(i))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

md5(int64) is not supported, so the test is casted to string explicitly

@jayzhan211 jayzhan211 merged commit fbf2ffc into apache:main Mar 27, 2025
27 checks passed
@jayzhan211
Copy link
Contributor

Thanks @Chen-Yuan-Lai

qstommyshu pushed a commit to qstommyshu/datafusion that referenced this pull request Mar 28, 2025
…4826)

* refactor: use TypeSignature::Coercible for crypto functions

* fix

* fix

* fix string_view.slt

* fix signatrue for sha256

* remove unsed import

* support binaryView

* modify signature if sha256 function

* remove unsed import

* clean unused codes

* rewrite function using new_implicit for sha512

* remove unused import

* rewrite signature with new_implicit for other crypto functions

* support null for md5 function

* modify sqllogictest to fit allowed input type for md5 function

---------

Co-authored-by: Cheng-Yuan-Lai <a186235@g,ail.com>
Co-authored-by: Ian Lai <[email protected]>
nirnayroy pushed a commit to nirnayroy/datafusion that referenced this pull request May 2, 2025
…4826)

* refactor: use TypeSignature::Coercible for crypto functions

* fix

* fix

* fix string_view.slt

* fix signatrue for sha256

* remove unsed import

* support binaryView

* modify signature if sha256 function

* remove unsed import

* clean unused codes

* rewrite function using new_implicit for sha512

* remove unused import

* rewrite signature with new_implicit for other crypto functions

* support null for md5 function

* modify sqllogictest to fit allowed input type for md5 function

---------

Co-authored-by: Cheng-Yuan-Lai <a186235@g,ail.com>
Co-authored-by: Ian Lai <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
functions Changes to functions implementation sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

TypeSignature::Coercible for crypto functions
3 participants