Refactor Interval Arithmetic Updates #8276

berkaysynnada · 2023-11-20T12:49:52Z

Which issue does this PR close?

Closes #7883.

Rationale for this change

This is a refactoring PR for the interval library interval_arithmetic.rs. The key points are:

Move the interval library to the expr crate to make it accessible to logical plans.
Do we really need a bound type? Adopting a convention of always using open or always closed intervals may simplify usage.
It should be impossible to create an invalid interval.
The same interval must not have multiple representations.
Support for multiplication (Mul) and division (Div).
Improved cardinality calculation.
Improved overflow handling.
Expanded test coverage.
Support for temporal types.
Propagation of boolean intervals through comparison and equality operators.
A more understandable API and documentation for the library and cp_solver.rs functions.
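
To illustrate the "no invalid intervals" goal from the list above, here is a minimal sketch of a fallible constructor. `ClosedInterval` and its `i64` bounds are hypothetical stand-ins chosen for brevity. This is not the DataFusion `Interval` type, only the general pattern of funnelling all construction through one validating function.

```rust
/// A closed integer interval `[lower, upper]`.
/// `None` means the interval is unbounded on that side.
/// Hypothetical type for illustration only.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ClosedInterval {
    lower: Option<i64>,
    upper: Option<i64>,
}

impl ClosedInterval {
    /// The single entry point for construction: rejects `lower > upper`,
    /// so an invalid interval can never be observed by callers.
    pub fn try_new(lower: Option<i64>, upper: Option<i64>) -> Result<Self, String> {
        if let (Some(lo), Some(hi)) = (lower, upper) {
            if lo > hi {
                return Err(format!("invalid interval: lower {} > upper {}", lo, hi));
            }
        }
        Ok(Self { lower, upper })
    }

    pub fn lower(&self) -> Option<i64> {
        self.lower
    }

    pub fn upper(&self) -> Option<i64> {
        self.upper
    }
}
```

Because the fields are private and `try_new` is the only constructor, every value of the type satisfies `lower <= upper` by construction.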

What changes are included in this PR?

This PR includes the following changes:

Interval bounds are now always closed. For strict inequality cases, we use one-step increase or decrease utilities.
There is a single method for creating intervals, which always returns valid and standardized intervals.
Added support for multiplication and division.
Overflows are handled more intelligently.
Cardinality calculation has been simplified and made more accurate.
Temporal types are now supported.
Boolean intervals can be propagated through logical operators.
OR operator is supported.
Documentation has been improved. The new structure makes future implementations easier.
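
As a rough sketch of two of the changes above, the code below shows how a strict bound can be expressed as a closed one by stepping to the adjacent value, and how closed intervals multiply. It assumes `i64` endpoints and finite bounds. The function names are hypothetical and do not come from the PR.

```rust
/// Closed lower bound equivalent to the strict bound `x > v`.
/// Returns `None` on overflow, i.e. when no `i64` satisfies `x > v`.
pub fn closed_lower_from_strict(v: i64) -> Option<i64> {
    v.checked_add(1)
}

/// Closed upper bound equivalent to the strict bound `x < v`.
/// Returns `None` on overflow, i.e. when no `i64` satisfies `x < v`.
pub fn closed_upper_from_strict(v: i64) -> Option<i64> {
    v.checked_sub(1)
}

/// Product of two finite closed intervals `[a, b] * [c, d]`.
/// The result is bounded by the min and max of the four endpoint products.
/// Returns `None` if any endpoint product overflows `i64`.
pub fn mul_closed(x: (i64, i64), y: (i64, i64)) -> Option<(i64, i64)> {
    let products = [
        x.0.checked_mul(y.0)?,
        x.0.checked_mul(y.1)?,
        x.1.checked_mul(y.0)?,
        x.1.checked_mul(y.1)?,
    ];
    let lo = *products.iter().min()?;
    let hi = *products.iter().max()?;
    Some((lo, hi))
}
```

For example, over the integers the condition `1 < x <= 3` becomes the closed interval `[2, 3]`, and `[-2, 3] * [4, 5]` evaluates to `[-10, 15]`. Floating-point and temporal types would need their own notion of "next value", which this sketch does not cover.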

Are these changes tested?

Yes, both with existing tests and newly added tests for new features.

Are there any user-facing changes?


ozankabak

I reviewed this very carefully and collaborated with @berkaysynnada closely on this. Since this will be a foundational utility for many use cases, we are looking forward to reviews from other pairs of eyes as well -- @alamb, PTAL

alamb · 2023-11-20T13:29:41Z

I will try to review this carefully over the next day or two, but given its size I don't think I will be able to provide a super detailed review. However, since @ozankabak has already done so I feel it is in good hands.

I will focus on the interaction with other components / test updates.

berkaysynnada · 2023-11-20T13:40:52Z

The diff is this large because we moved the main code. The key changes are in interval_arithmetic.rs and cp_solver.rs. We have carefully reviewed and thoroughly tested the functions for arithmetic operations on intervals. Reviewing from a higher-level perspective might be more time-efficient.

alamb

Thank you @berkaysynnada -- I think this API is much nicer.

I clearly wasn't able to give the whole thing a thorough review, but I did review the test cases that were changed (outside the interval implementation itself), as well as the API where it interacts with the rest of the system.

All in all I think it is really nicely done.

I had some additional improvement suggestions, but nothing I think is needed prior to merge

alamb · 2023-11-20T20:21:04Z

datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs

@@ -207,7 +206,7 @@ impl<S: SimplifyInfo> ExprSimplifier<S> {
    ///    (
    ///        col("x"),
    ///        NullableInterval::NotNull {
-    ///            values: Interval::make(Some(3_i64), Some(5_i64), (false, false)),
+    ///            values: Interval::make(Some(3_i64), Some(5_i64)).unwrap()


this is much nicer (avoid the explicit open/closed boundaries)

alamb · 2023-11-20T20:25:32Z

datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs

@@ -3281,7 +3279,7 @@ mod tests {
            (
                col("c3"),
                NullableInterval::NotNull {
-                    values: Interval::make(Some(0_i64), Some(2_i64), (false, false)),
+                    values: Interval::make(Some(0_i64), Some(2_i64)).unwrap(),


I found the presence of both Interval::make and Interval::try_new confusing, as they do the same thing. Looking at this code, I wonder "why sometimes use make and sometimes try_new?".

Would it be possible to remove all calls to Interval::make and instead call Interval::try_new? (This could be a follow-on PR.)

make is a utility function for use in tests only. Using try_new in test functions balloons the line count 🙂 I am adding a comment in the docstring to make this clear.

BTW I tried a few tricks to make this a test-only function (using config directives, moving to a separate test_utils.rs file etc.), but it is not very straightforward since this is used in tests of multiple crates. One way I was able to make it work was via feature flags, but since that is a more involved change I decided not to put it in this PR.

alamb · 2023-11-20T20:26:55Z

datafusion/optimizer/src/simplify_expressions/guarantees.rs

@@ -262,18 +260,18 @@ mod tests {
    #[test]
    fn test_inequalities_non_null_bounded() {
        let guarantees = vec![
-            // x ∈ (1, 3] (not null)
+            // x ∈ [1, 3] (not null)


cc @wjones127 (as I think you added this code originally in #7467)

alamb · 2023-11-20T20:29:10Z

datafusion/physical-expr/Cargo.toml

@@ -34,13 +34,16 @@ path = "src/lib.rs"

 [features]
 crypto_expressions = ["md-5", "sha2", "blake2", "blake3"]
-default = ["crypto_expressions", "regex_expressions", "unicode_expressions", "encoding_expressions"]
+default = ["crypto_expressions", "regex_expressions", "unicode_expressions", "encoding_expressions",


this is a strange reformatting (as is the reformatting in the other .toml files) but I don't see anything wrong with it.

If you can figure out how to avoid such changes I think it would result in smaller diffs that are easier to review

Agreed, this kind of thing is distracting. It must be some VS Code config thing; we'll share if we figure out why/how it turns up.

alamb · 2023-11-20T20:29:29Z

datafusion/physical-expr/src/analysis.rs

-                    .unwrap_or(empty_field),
-            ),
-        );
+        let interval = Interval::try_new(


alamb · 2023-11-20T20:40:06Z