-
Notifications
You must be signed in to change notification settings - Fork 1.5k
refactor: use TypeSignature::Coercible for crypto functions #14826
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
refactor: use TypeSignature::Coercible for crypto functions #14826
Conversation
3cb1d36
to
2dcee88
Compare
Hi @jayzhan211, I encountered some CI failures in the [SQL] EXPLAIN SELECT
digest(column1_utf8view, 'md5') as c
FROM test;
[Diff] (-expected|+actual)
logical_plan
- 01)Projection: digest(test.column1_utf8view, Utf8View("md5")) AS c
+ 01)Projection: digest(test.column1_utf8view, Utf8("md5")) AS c
02)--TableScan: test projection=[column1_utf8view] Should we:
I'm not sure which one is better, could you give me some suggestions or other best practices? thanks very much!! |
This looks like an improvement, we don't need to cast it to utf8view |
CI appears to be failing, so marking this one as a draft |
d3efb81
to
03b40f4
Compare
Hi @jayzhan211, It seems all the CI checks were passed (including sqlogicaltest), but when I created and printed a table by datafusion-cli , I got empty result DataFusion CLI v45.0.0
> CREATE TABLE t_source(
column1 String,
column2 String,
column3 String,
column4 String
) AS VALUES
('one', 'one', 'one', 'one'),
('two', 'two', '', 'two'),
(NULL, NULL, NULL, NULL),
('', '', '', ''),
('three', 'three', 'three', '');
0 row(s) fetched.
Elapsed 0.005 seconds.
> select * from t_source;
+---------+---------+---------+---------+
| column1 | column2 | column3 | column4 |
+---------+---------+---------+---------+
+---------+---------+---------+---------+
5 row(s) fetched.
Elapsed 0.001 seconds. That is weird because I only modified the implementation of crypto functions. Why is the display for the basic operation is changed? Could you give me some help or suggestions? Thx! |
Oh! I found the problem has solved after I reinstalled datafusion-cli 😆 |
pub fn digest_binary_array_impl<'a, BinaryArrType>( | ||
self, | ||
input_value: BinaryArrType, | ||
) -> ArrayRef | ||
where | ||
BinaryArrType: BinaryArrayType<'a>, | ||
{ | ||
match self { | ||
Self::Md5 => digest_to_array!(Md5, input_value), | ||
Self::Sha224 => digest_to_array!(Sha224, input_value), | ||
Self::Sha256 => digest_to_array!(Sha256, input_value), | ||
Self::Sha384 => digest_to_array!(Sha384, input_value), | ||
Self::Sha512 => digest_to_array!(Sha512, input_value), | ||
Self::Blake2b => digest_to_array!(Blake2b512, input_value), | ||
Self::Blake2s => digest_to_array!(Blake2s256, input_value), | ||
Self::Blake3 => { | ||
let binary_array: BinaryArray = input_value | ||
.iter() | ||
.map(|opt| { | ||
opt.map(|x| { | ||
let mut digest = Blake3::default(); | ||
digest.update(x); | ||
Blake3::finalize(&digest).as_bytes().to_vec() | ||
}) | ||
}) | ||
.collect(); | ||
Arc::new(binary_array) | ||
} | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found that digest_binary_array_impl
and digest_utf8_array_impl
are similar, should we integrate them into one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice to
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found that these two array types don't share a common trait that would allow implementing a single generic function. Since the key difference is in how bytes are accessed, I'll keep these functions separate to maintain type safety.
vec![Utf8View, Utf8, LargeUtf8, Binary, LargeBinary], | ||
signature: Signature::one_of( | ||
vec![ | ||
TypeSignature::Coercible(vec![Coercion::new_exact( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the return type is utf8_or_binary_to_binary_type
which means the signtaure of sha512 should be new_implicit(binary, string|binary)
Target is binary, and allowed source types are string and binary
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I met another error after I used new_implicit
😭
External error: query failed: DataFusion error: Arrow error: Compute error: Internal Error: Cannot cast BinaryView to BinaryArray of expected type
[SQL] select sha512(ascii_1) from test_basic_operator;
at test_files/string/./string_query.slt.part:1732
at test_files/string/dictionary_utf8.slt:58
I think the error came from arrow-rs
https://github.com/apache/arrow-rs/blob/5b04ca7bf756c5ca9edfba220060017684e2ecc6/arrow-cast/src/cast/dictionary.rs#L107-#L128
BinaryView => {
// `unpack_dictionary` can handle Utf8View/BinaryView types, but incurs unnecessary data copy of the value buffer.
// we handle it here to avoid the copy.
let dict_array = array
.as_dictionary::<K>()
.downcast_dict::<BinaryArray>()
.ok_or_else(|| {
ArrowError::ComputeError(
"Internal Error: Cannot cast BinaryView to BinaryArray of expected type"
.to_string(),
)
})?;
let binary_view = view_from_dict_values::<K, BinaryViewType, BinaryType>(
dict_array.values(),
dict_array.keys(),
)?;
Ok(Arc::new(binary_view))
}
_ => unpack_di
When the dictionary is coverted to BinaryView
, it can't be cast to BinaryArray
2f5bcfe
to
4f1d604
Compare
0db2c5b
to
38e3e41
Compare
38e3e41
to
6e6db4a
Compare
6e6db4a
to
2861ada
Compare
8faafb0
to
c2e0ae8
Compare
@jayzhan211 I think the PR is ready to be reviewed : ) |
you can mark it as ready for review |
@@ -1068,7 +1068,7 @@ EXPLAIN SELECT | |||
FROM test; | |||
---- | |||
logical_plan | |||
01)Projection: digest(test.column1_utf8view, Utf8View("md5")) AS c | |||
01)Projection: digest(test.column1_utf8view, Utf8("md5")) AS c |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we avoid casting from utf8 to utf8view
select count(*) from test WHERE array_has([needle], needle); | ||
---- | ||
100000 | ||
|
||
# TODO: this should probably be possible to completely remove the filter as always true? | ||
query TT | ||
explain with test AS (SELECT substr(md5(i)::text, 1, 32) as needle FROM generate_series(1, 100000) t(i)) | ||
explain with test AS (SELECT substr(md5(i::text)::text, 1, 32) as needle FROM generate_series(1, 100000) t(i)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
md5(int64) is not supported, so the test is casted to string explicitly
Thanks @Chen-Yuan-Lai |
…4826) * refactor: use TypeSignature::Coercible for crypto functions * fix * fix * fix string_view.slt * fix signatrue for sha256 * remove unsed import * support binaryView * modify signature if sha256 function * remove unsed import * clean unused codes * rewrite function using new_implicit for sha512 * remove unused import * rewrite signature with new_implicit for other crypto functions * support null for md5 function * modify sqllogictest to fit allowed input type for md5 function --------- Co-authored-by: Cheng-Yuan-Lai <a186235@g,ail.com> Co-authored-by: Ian Lai <[email protected]>
…4826) * refactor: use TypeSignature::Coercible for crypto functions * fix * fix * fix string_view.slt * fix signatrue for sha256 * remove unsed import * support binaryView * modify signature if sha256 function * remove unsed import * clean unused codes * rewrite function using new_implicit for sha512 * remove unused import * rewrite signature with new_implicit for other crypto functions * support null for md5 function * modify sqllogictest to fit allowed input type for md5 function --------- Co-authored-by: Cheng-Yuan-Lai <a186235@g,ail.com> Co-authored-by: Ian Lai <[email protected]>
Which issue does this PR close?
Rationale for this change
What changes are included in this PR?
TypeSignature::Coercible
and logical type (string & binary) for crypto functionsDataType::BinaryView
forutf8_or_binary_to_binary_type
to support binaryViewdigest_binary_array
can digest binary and binaryViewdigest_binary_array_impl
, which can be used bydigest_binary_array
digest_process
to support binaryViewAre these changes tested?
Are there any user-facing changes?