-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor Interval Arithmetic Updates #8276
Refactor Interval Arithmetic Updates #8276
Conversation
…al-arithmetic-updates
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I reviewed this very carefully and collaborated with @berkaysynnada closely on this. Since this will be a foundational utility for many use cases, we are looking forward to reviews from other pairs of eyes as well -- @alamb, PTAL
I will try to review this carefully over the next day or two, but given its size I don't think I will be able to provide a super detailed review. However, since @ozankabak has already done so I feel it is in good hands. I will focus on the interaction with other components / test updates. |
There is such a diff size because we moved the main code. The key changes are in interval_arithmetic.rs and cp_solver.rs. We have carefully reviewed and thoroughly tested the functions for arithmetic operations on intervals. Reviewing from a higher-level perspective might be more time-efficient. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @berkaysynnada -- I think this API is much nicer.
I clearly wasn't able to give the whole thing a thorough review, but I did review test cases (outside the interval implementation itself) that were changed as well as the API where it interacted with the rest of the system
All in all I think it is really nicely done.
I had some additional improvement suggestions, but nothing I think is needed prior to merge
@@ -207,7 +206,7 @@ impl<S: SimplifyInfo> ExprSimplifier<S> { | |||
/// ( | |||
/// col("x"), | |||
/// NullableInterval::NotNull { | |||
/// values: Interval::make(Some(3_i64), Some(5_i64), (false, false)), | |||
/// values: Interval::make(Some(3_i64), Some(5_i64)).unwrap() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is much nicer (avoid the explicit open/closed boundaries)
@@ -3281,7 +3279,7 @@ mod tests { | |||
( | |||
col("c3"), | |||
NullableInterval::NotNull { | |||
values: Interval::make(Some(0_i64), Some(2_i64), (false, false)), | |||
values: Interval::make(Some(0_i64), Some(2_i64)).unwrap(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found the presence of Interval::make
and Interval::try_new
confusing as they both do the same thing but looking at this code I wonder "why sometimes use make and sometimes try_new?".
Would it be possible to remove all calls t Interval::make
and instead call Interval::try_new
(could be a follow on PR)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make
is a utility function for use in tests only. Using try_new
in test functions balloons the line count 🙂 I am adding a comment in the docstring to make this clear.
BTW I tried a few tricks to make this a test-only function (using config directives, moving to a separate test_utils.rs
file etc.), but it is not very straightforward since this is used in tests of multiple crates. One way I was able to make it work was via feature flags, but since that is a more involved change I decided not to put it in this PR.
@@ -262,18 +260,18 @@ mod tests { | |||
#[test] | |||
fn test_inequalities_non_null_bounded() { | |||
let guarantees = vec![ | |||
// x ∈ (1, 3] (not null) | |||
// x ∈ [1, 3] (not null) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @wjones127 (as I think you added this code originally in #7467)
@@ -34,13 +34,16 @@ path = "src/lib.rs" | |||
|
|||
[features] | |||
crypto_expressions = ["md-5", "sha2", "blake2", "blake3"] | |||
default = ["crypto_expressions", "regex_expressions", "unicode_expressions", "encoding_expressions"] | |||
default = ["crypto_expressions", "regex_expressions", "unicode_expressions", "encoding_expressions", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a strange reformatting (as is the reformatting in the other .toml files) but I don't see anything wrong with it.
If you can figure out how to avoid such changes I think it would result in smaller diffs that are easier to review
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, this kind of thing is distracting. Must be some VS code config thing, we'll share if we figure out why/how it turns up.
.unwrap_or(empty_field), | ||
), | ||
); | ||
let interval = Interval::try_new( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Much nicer
} | ||
} | ||
|
||
#[cfg(test)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💯 for half the file being tests
/// }; | ||
/// assert_eq!(interval.single_value(), None); | ||
/// ``` | ||
pub fn single_value(&self) -> Option<ScalarValue> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: You could probably return Option<&ScalarValue>
here to avoid the clone in some cases
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I took a look at this, but the constructor call in the Null
variant creates friction.
// In IEEE 754 standard for floating-point arithmetic, if we keep the sign and exponent fields same, | ||
// we can represent 4503599627370496+1 different numbers by changing the mantissa | ||
// (4503599627370496 = 2^52, since there are 52 bits in mantissa, and 2^23 = 8388608 for f32). | ||
let distinct_f64 = 4503599627370497; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code looks like all floating point cardinality estimate is basically a large constant value I think it is not very valuable for use in cost models. It seems like there non trivial amounts of code that implies the result will be important (for example
pub fn cardinality(&self) -> Option<u64> {
let data_type = self.data_type();
if data_type.is_integer() {
self.upper.distance(&self.lower).map(|diff| diff as u64)
} else if data_type.is_floating() {
// Negative numbers are sorted in the reverse order. To
// always have a positive difference after the subtraction,
// we perform following transformation:
match (&self.lower, &self.upper) {
(
ScalarValue::Float32(Some(lower)),
ScalarValue::Float32(Some(upper)),
) => {
let lower_bits = map_floating_point_order!(lower.to_bits(), u32);
let upper_bits = map_floating_point_order!(upper.to_bits(), u32);
Some((upper_bits - lower_bits) as u64)
}
(
ScalarValue::Float64(Some(lower)),
ScalarValue::Float64(Some(upper)),
) => {
let lower_bits = map_floating_point_order!(lower.to_bits(), u64);
let upper_bits = map_floating_point_order!(upper.to_bits(), u64);
let count = upper_bits - lower_bits;
(count != u64::MAX).then_some(count)
}
_ => None,
}
Given that, what do you think about simply special casing the values in this test (e.g. return 4503599627370497
) in the implementation as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That value in tests was carefully chosen to match exponents-of-two boundaries. It is just for ease of testing -- since IEEE 754 floats are logarithmically distributed in the normal zone but linearly distributed in the subnormal zone, one needs to resort to tricks like this during testing.
The actual implementation works in non-boundary-aligned cases too; it achieves generality by exploiting the bit-pattern ordering specification clause in IEEE 754.
I will leave comments both in the impl code and the test code to clarify this.
// Negative numbers are sorted in the reverse order. To | ||
// always have a positive difference after the subtraction, | ||
// we perform following transformation: | ||
match (&self.lower, &self.upper) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See my comment below -- on the surface this looks like it is doing something complex for floating point values but the output doesn't seem to be very sophisticated. I think we could simplify the code, as metioned below)
/// The `Interval` type represents a closed interval used for computing | ||
/// reliable bounds for mathematical expressions. | ||
/// | ||
/// Conventions: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❤️
I will merge this to unlock future work on statistics and other dependent work. If there is anything we missed, just drop a line here and we will address with a quick follow-on PR. |
Which issue does this PR close?
Closes #7883 .
Rationale for this change
This is a refactoring PR for the interval library
interval_arithmetic.rs
. The key points are:expr
crate to make it accessible to logical plans.cp_solver.rs
functions.What changes are included in this PR?
This PR includes the following changes:
OR
operator is supported.Are these changes tested?
Yes, both with existing tests and newly added tests for new features.
Are there any user-facing changes?