Make eval_sql_where available to DefaultPredicateEvaluator #627

scovich · 2025-01-08T22:26:06Z

What changes are proposed in this pull request?

The Parquet footer skipping code includes (and uses) a helpful eval_sql_where method that handles NULL values in comparisons gracefully by automatically injecting null checks into the predicate's evaluation. That capability turns out to be useful for the other predicate evaluator implementations as well (especially now that partition pruning will likely rely on the default predicate evaluator). So we generalize the logic as the provided method PredicateEvaluator::eval_sql_where. To support that method, we also declare a new eval_scalar_is_null trait method, with appropriate implementations. This has the side effect of adding support for literal null checks -- previously, only columns could be null-checked.
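As a rough illustration of what "injecting null checking" means, here is a minimal sketch. It is not the kernel's code: `lt` and `sql_where_lt` are hypothetical helpers, `None` stands in for SQL NULL, and the sketch ignores the distinction between a NULL value and missing stats that the real evaluators have to handle.

```rust
/// Plain three-valued comparison: `None` models SQL NULL, so a NULL operand
/// makes the whole comparison NULL.
fn lt(a: Option<i64>, b: Option<i64>) -> Option<bool> {
    Some(a? < b?)
}

/// WHERE-style comparison (hypothetical helper): behaves like
/// `AND(a IS NOT NULL, b IS NOT NULL, a < b)`, so a NULL operand yields FALSE.
fn sql_where_lt(a: Option<i64>, b: Option<i64>) -> Option<bool> {
    if a.is_none() || b.is_none() {
        // One of the injected IS NOT NULL checks is FALSE, and FALSE dominates AND.
        return Some(false);
    }
    lt(a, b)
}
```

With a NULL left operand, `lt(None, Some(10))` is `None` (no skipping possible), while `sql_where_lt(None, Some(10))` is `Some(false)`, the result that data skipping needs.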

How was this change tested?

Replace the existing unit test for the parquet skipping evaluator with adapted versions for the default and data skipping predicate evaluators, which verify that the provided method works correctly for bool output and expression output, respectively. The parquet skipping module's version is removed because it is redundant -- the default evaluator exercises boolean output, and the data skipping evaluator exercises column resolution.

scovich · 2025-01-08T22:27:42Z

kernel/src/engine/parquet_row_group_skipping.rs

@@ -1,8 +1,7 @@
 //! An implementation of parquet row group skipping using data skipping predicates over footer stats.
-use crate::predicates::parquet_stats_skipping::{


aside: Not sure how this out of order import escaped cargo fmt before now?

Looks like it's just flattening it and you moved the import anyways.

Doesn't cargo fmt order imports alphabetically? If so, how did use crate::predicates end up before use crate::expressions?

kernel/src/predicates/mod.rs

codecov · 2025-01-08T23:22:36Z

Codecov Report

Attention: Patch coverage is 81.86813% with 33 lines in your changes missing coverage. Please review.

Project coverage is 83.66%. Comparing base (76c65c8) to head (56b7351).
Report is 1 commits behind head on main.

Files with missing lines	Patch %	Lines
kernel/src/predicates/tests.rs	43.13%	0 Missing and 29 partials ⚠️
kernel/src/scan/data_skipping/tests.rs	97.43%	2 Missing ⚠️
kernel/src/expressions/mod.rs	85.71%	1 Missing ⚠️
kernel/src/scan/data_skipping.rs	88.88%	0 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #627      +/-   ##
==========================================
- Coverage   83.66%   83.66%   -0.01%     
==========================================
  Files          75       75              
  Lines       16909    16949      +40     
  Branches    16909    16949      +40     
==========================================
+ Hits        14147    14180      +33     
+ Misses       2099     2085      -14     
- Partials      663      684      +21

zachschuermann

LGTM this looks great, left a few comments/questions but mostly for my understanding :)

zachschuermann · 2025-01-15T22:52:37Z

kernel/src/predicates/mod.rs

@@ -44,6 +44,9 @@ mod tests;
 pub(crate) trait PredicateEvaluator {
    type Output;

+    /// A (possibly inverted) scalar NULL test, e.g. `<value> IS [NOT] NULL`.
+    fn eval_scalar_is_null(&self, val: &Scalar, inverted: bool) -> Option<Self::Output>;


sorry slight tangent: it feels like the IS [NOT] NULL implies that output = bool? For the default case it has output = bool but then for data skipping that output is an expr?

PredicateEvaluator is generic.

The default and parquet predicate evaluators (which do "direct" evaluation) do directly return bool.

The data skipping predicate evaluator is different -- it is "indirect" and translates a normal predicate expression into a data skipping predicate expression that can later be applied multiple times to different stats rows (by engine expression handler in prod, or with the default predicate evaluator in tests). In that case eval_scalar_is_null (like all methods) returns Option<Expr>. The returned expression will return boolean values once we do evaluate it.

thanks @scovich! to ensure my understanding: for example,

default/parquet predicate eval would yield immediate evaluation to bool of col < 5

whereas data skipping predicate eval would yield some expr like minValues.col < 5. the 'evaluation' of the predicate in PredicateEvaluator is more like a transformation in that case?
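A minimal sketch of that distinction, under loud assumptions: the `Evaluator` trait, `Direct`, `Rewriter`, and `StatsExpr` below are hypothetical stand-ins, far simpler than the kernel's `PredicateEvaluator`, and the `minValues.col` rewrite just mirrors the example in the question above rather than the kernel's actual data skipping rules.

```rust
use std::collections::HashMap;

/// Hypothetical, heavily simplified stand-in for a generic predicate evaluator.
trait Evaluator {
    type Output;
    /// Evaluates `col < val`; `None` means "cannot tell".
    fn eval_lt(&self, col: &str, val: i64) -> Option<Self::Output>;
}

/// "Direct" evaluation: looks up the column value and returns a bool right away.
struct Direct(HashMap<String, i64>);

impl Evaluator for Direct {
    type Output = bool;
    fn eval_lt(&self, col: &str, val: i64) -> Option<bool> {
        Some(*self.0.get(col)? < val)
    }
}

/// Output of the "indirect" evaluator: an expression over stats columns.
#[derive(Debug, PartialEq)]
enum StatsExpr {
    Lt(String, i64),
}

/// "Indirect" evaluation: transforms the predicate instead of computing a bool.
struct Rewriter;

impl Evaluator for Rewriter {
    type Output = StatsExpr;
    fn eval_lt(&self, col: &str, val: i64) -> Option<StatsExpr> {
        Some(StatsExpr::Lt(format!("minValues.{col}"), val))
    }
}
```

Both implementations share the same trait method, but only `Direct` produces an answer immediately; `Rewriter` returns an expression that yields booleans once something evaluates it against stats rows.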

zachschuermann · 2025-01-16T04:28:16Z

kernel/src/predicates/mod.rs

@@ -229,12 +237,121 @@ pub(crate) trait PredicateEvaluator {
            Variadic(VariadicExpression { op, exprs }) => self.eval_variadic(*op, exprs, inverted),
        }
    }
+
+    /// Evaluates a predicate with SQL WHERE semantics.


EXTREMELY helpful comment + examples :)

zachschuermann · 2025-01-16T04:35:16Z

kernel/src/predicates/tests.rs

@@ -394,12 +393,12 @@ fn test_eval_is_null() {
    let expr = Expression::literal(1);
    expect_eq!(
        filter.eval_unary(UnaryOperator::IsNull, &expr, true),
-        None,
+        Some(true),


nice catches!

zachschuermann · 2025-01-16T04:41:15Z

kernel/src/predicates/mod.rs

+    /// AND(..., AND(NULL, TRUE, NULL), ...)
+    /// AND(..., NULL, ...)
+    /// ```
+    fn eval_sql_where(&self, filter: &Expr) -> Option<Self::Output> {


okay so we are moving this from ParquetStatsSkippingFilter to the more general PredicateEvaluator, and then replacing the places that needed the ParquetStatsSkippingFilter with just a (Default)PredicateEvaluator and feeding it whatever stats data it needs?

Exactly. There was nothing inherent to parquet stats skipping in the logic, that was just the place it happened to land first.

zachschuermann · 2025-01-16T04:43:49Z

kernel/src/predicates/mod.rs

+                    self.eval_unary(UnaryOperator::IsNull, left, true),
+                    self.eval_unary(UnaryOperator::IsNull, right, true),


aside: ahh yea now I really feel the usefulness of our new APIs in #646.. I found myself having to think through the inverted=true cases here and below...

Yeah this predicate eval code is a bit mind bending for sure. Inversion and generic Output type are really powerful but also a lot harder to grok than "normal" code.

I kept worrying about that while designing this code but:

I actually started with non-invertible versions everywhere, but the complexity blew up in a different way -- you had to manually enumerate all the cases in redundant code, which is way bulkier and way more error prone. I eventually gave up and introduced the inverted flag everywhere for a significant code simplification.

The generic output type lets us turn 2-3 very similar implementations into one, which vastly reduces duplication and bug surface (from getting one of the similar copies slightly wrong). Even in earlier versions of this PR, there were two implementations of the eval_sql_where logic -- one for boolean output and a different one for expression output. Eventually I realized they were logically equivalent, in spite of looking rather different, which is what allowed hoisting it all the way up to PredicateEvaluator and ditching the annoyingly separate SqlWherePredicateEvaluator trait I had been using.

So, given a choice between compact and robust but harder to understand, vs. easy to understand but redundant and error-prone... the former seemed like a net win in spite of the learning curve to new entrants.

completely agree and if I could then turn this into a little request: I think this is super useful context, any chance we could embed it into a (doc) comment somewhere?

There is already a doc comment on PredicateEvaluator about inversion, does it suffice or are there gaps?

/// # Inverted expression semantics
///
/// Because inversion (`NOT` operator) has special semantics and can often be optimized away by
/// pushing it down, most methods take an `inverted` flag. That allows operations like
/// [`UnaryOperator::Not`] to simply evaluate their operand with a flipped `inverted` flag.

I guess it doesn't directly speak to the complexity tho...
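To make the inversion idea concrete, here is a small sketch of pushing `NOT` down via an `inverted` flag. The `Expr` enum and `eval` function are hypothetical and much simpler than the kernel's types; `None` models NULL.

```rust
/// Hypothetical miniature expression type (not the kernel's `Expression`).
enum Expr {
    Lit(Option<bool>),
    Lt(Option<i64>, Option<i64>),
    Not(Box<Expr>),
    And(Vec<Expr>),
}

/// Evaluates `expr`, or `NOT expr` when `inverted` is true.
fn eval(expr: &Expr, inverted: bool) -> Option<bool> {
    match expr {
        // Inverting a literal flips it; NULL stays NULL.
        Expr::Lit(v) => (*v).map(|b| b != inverted),
        // NOT(a < b) is a >= b; a NULL operand still produces NULL.
        Expr::Lt(a, b) => {
            let (a, b) = ((*a)?, (*b)?);
            Some(if inverted { a >= b } else { a < b })
        }
        // NOT needs no logic of its own: it just flips the flag.
        Expr::Not(inner) => eval(inner, !inverted),
        // De Morgan: NOT AND(x, y) is OR(NOT x, NOT y). The "dominator" is
        // the value that decides the result: FALSE for AND, TRUE for OR.
        Expr::And(children) => {
            let dominator = inverted;
            let mut saw_null = false;
            for child in children {
                match eval(child, inverted) {
                    Some(v) if v == dominator => return Some(dominator),
                    Some(_) => {}
                    None => saw_null = true,
                }
            }
            if saw_null { None } else { Some(!dominator) }
        }
    }
}
```

The single `And` arm covers both AND and the OR that an inverted AND becomes, which is the kind of simplification the `inverted` flag buys, at the cost of having to reason about both polarities at once.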

There is also a doc comment about the parametrized Output type on DataSkippingPredicateEvaluator:

/// The types involved in these operations are parameterized and implementation-specific. For
/// example, [`crate::engine::parquet_stats_skipping::ParquetStatsProvider`] directly evaluates data
/// skipping expressions and returns boolean results, while
/// [`crate::scan::data_skipping::DataSkippingPredicateCreator`] instead converts the input
/// predicate to a data skipping predicate that can be evaluated directly later.

Maybe I should move it to PredicateEvaluator?

Did some editing, PTAL?

OussamaSaoudi

Flushing review of eval_sql_where. Will give another pass with a closer look at the tests

OussamaSaoudi · 2025-01-16T05:28:48Z

kernel/src/predicates/mod.rs

+    /// missing value and produces a NULL result. The resulting NULL does not allow data skipping,
+    /// which is looking for a FALSE result. Meanwhile, SQL WHERE semantics only keeps rows for


The old documentation helped me understand some of the concepts better.

/// By default, [`apply_expr`] can produce unwelcome behavior for comparisons involving all-NULL
/// columns (e.g. `a == 10`), because the (legitimately NULL) min/max stats are interpreted as
/// stats-missing that produces a NULL data skipping result). The resulting NULL can "poison"
/// the entire expression, causing it to return NULL instead of FALSE that would allow skipping.

What I like about this:

I think the old doc's wording is clearer about how a legitimate NULL value and a missing stats field both produce NULL. New docs instead say "the (legitimately) NULL value is interpreted the same as a missing value and produces a NULL result". I think "interpreted the same as" felt ambiguous to me.

Communicates that the NULL propagates all the way to the top: "NULL can poison the entire expression".

Communicates that data skipping only happens on a FALSE, and we don't get it in a NULL case. The newer docs say "The resulting NULL does not allow data skipping, which is looking for a FALSE result". I think the "data skipping looking for FALSE result" is what threw me off.

Thanks for the feedback. I reworded to (hopefully) incorporate the best of both texts, PTAL?

kernel/src/predicates/mod.rs

OussamaSaoudi

Couple comments. Rly cool stuff! thx Ryan

OussamaSaoudi · 2025-01-16T17:36:56Z

kernel/src/engine/parquet_row_group_skipping.rs

@@ -57,6 +55,7 @@ impl<'a> RowGroupFilter<'a> {

    /// Applies a filtering predicate to a row group. Return value false means to skip it.
    fn apply(row_group: &'a RowGroupMetaData, predicate: &Expression) -> bool {
+        use crate::predicates::PredicateEvaluator as _;


aside: I didn't know this could be used to import something as unnamed. cool stuff!
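For anyone else who hasn't run into it, a tiny sketch of an underscore import. The `shapes` module, `Area` trait, and `Square` type are made up for illustration; only the `use ... as _` form itself comes from the diff above.

```rust
mod shapes {
    pub trait Area {
        fn area(&self) -> f64;
    }

    pub struct Square(pub f64);

    impl Area for Square {
        fn area(&self) -> f64 {
            self.0 * self.0
        }
    }
}

fn square_area(side: f64) -> f64 {
    // Brings the trait's methods into scope without binding the name `Area`,
    // so it cannot clash with other items named `Area` in this scope.
    use shapes::Area as _;
    shapes::Square(side).area()
}
```

Without the `use` line, the call to `.area()` would not compile, because trait methods are only callable when the trait is in scope.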

OussamaSaoudi · 2025-01-16T17:56:40Z

kernel/src/predicates/mod.rs

+    ///
+    /// By default, [`eval_expr`] behaves badly for comparisons involving NULL columns (e.g. `a <
+    /// 10` when `a` is NULL), because NULL values are interpreted as "stats missing" (= cannot
+    /// skip). This can "poison" the entire expression, causing it to return NULL instead of FALSE


One more addition: Make it clear that we want to treat it as NULL instead of as invalid/stats-missing.
Given a = NULL
eval_expr(a < 10) = NULL
eval_sql_where(a < 10) = FALSE (desired behaviour)

Reworked the text, should be clearer now?

OussamaSaoudi · 2025-01-16T20:28:28Z

kernel/src/predicates/tests.rs

+    // Semantics are the same for comparison inside OR inside AND
+    let expr = &Expr::or(FALSE, Expr::and(NULL, Expr::lt(col.clone(), VAL)));
+    expect_eq!(null_filter.eval_expr(expr, false), None, "{expr}");
+    expect_eq!(null_filter.eval_sql_where(expr), None, "{expr}");


aha so we weren't able to push it down into the or, even though false OR x == x

It's precisely because false OR x is x that we don't bother to push the null check down through it.
It doesn't help the data skipping at all -- NULL is stronger than FALSE in OR, so OR(FALSE, NULL) is NULL.

(adding as a code comment for posterity)

Hmm... now that you mention it, maybe OR is also worth pushing down? If a is NULL and b is 100, then:

OR(a < 10, b < 20) = OR(NULL, FALSE) = NULL

vs.

OR(AND(a IS NOT NULL, a < 10), AND(b IS NOT NULL, b < 20)) = OR(AND(FALSE, NULL), AND(TRUE, FALSE)) = OR(FALSE, FALSE) = FALSE

Ah I'd initially disregarded OR push downs thinking it would break correctness somehow. I didn't think much of it. Glad to see that there's an optimization here :D

hah same, nice find!
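The OR example above can be checked mechanically with a small three-valued logic sketch. All helpers here (`and3`, `or3`, `lt`, `guarded_lt`) are hypothetical and only model the truth tables, with `None` standing in for NULL; they are not the kernel's implementation.

```rust
/// Kleene AND: FALSE dominates NULL, and NULL dominates TRUE.
fn and3(a: Option<bool>, b: Option<bool>) -> Option<bool> {
    match (a, b) {
        (Some(false), _) | (_, Some(false)) => Some(false),
        (Some(true), Some(true)) => Some(true),
        _ => None,
    }
}

/// Kleene OR: TRUE dominates NULL, and NULL dominates FALSE.
fn or3(a: Option<bool>, b: Option<bool>) -> Option<bool> {
    match (a, b) {
        (Some(true), _) | (_, Some(true)) => Some(true),
        (Some(false), Some(false)) => Some(false),
        _ => None,
    }
}

/// `col < val`, which is NULL when the column is NULL.
fn lt(col: Option<i64>, val: i64) -> Option<bool> {
    Some(col? < val)
}

/// `AND(col IS NOT NULL, col < val)`: the null check pushed into the comparison.
fn guarded_lt(col: Option<i64>, val: i64) -> Option<bool> {
    and3(Some(col.is_some()), lt(col, val))
}
```

With `a` NULL and `b` equal to 100, `or3(lt(a, 10), lt(b, 20))` evaluates to `None`, while `or3(guarded_lt(a, 10), guarded_lt(b, 20))` evaluates to `Some(false)`, matching the two derivations above.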

OussamaSaoudi · 2025-01-16T20:31:19Z

kernel/src/scan/mod.rs

-    }
-    NoStats.eval_sql_where(predicate) == Some(false)
+    use crate::predicates::PredicateEvaluator as _;
+    DefaultPredicateEvaluator::from(EmptyColumnResolver).eval_sql_where(predicate) == Some(false)


Also this idea of column resolvers for testing SQL semantics is so neat!!

kernel/src/scan/data_skipping/tests.rs

OussamaSaoudi

LGTM!

zachschuermann

restamp, LGTM :)

Make eval_sql_where available to DefaultPredicateEvaluator

e7eb784

scovich requested review from nicklan and zachschuermann January 8, 2025 22:26

github-actions bot assigned scovich Jan 8, 2025

scovich mentioned this pull request Jan 8, 2025

partition skipping filter #624

Open

2 tasks

scovich commented Jan 8, 2025

fix broken test, expand test coverage

d8df02e

scovich added 4 commits January 9, 2025 15:02

add SQL where support to data skipping eval as well

19887a5

clippy

4f677d4

generalize PredicateEvaluator::eval_sql_where

29110ad

Merge remote-tracking branch 'oss/main' into eval-sql-where

18ee5f9

scovich requested a review from roeap January 10, 2025 16:22

scovich added 4 commits January 10, 2025 08:23

fmt

5ccbd2c

improve doc comments

ae6ce08

bug fix and test fixes

0a23779

test improvements

6f75047

github-actions bot added the breaking-change Change that will require a version bump label Jan 10, 2025

scovich added 2 commits January 10, 2025 13:14

Switch data skipping to use SQL semantics

e0dc148

Merge remote-tracking branch 'oss/main' into eval-sql-where

e03705d

scovich removed the breaking-change Change that will require a version bump label Jan 10, 2025

Merge remote-tracking branch 'oss/main' into eval-sql-where

3b8a451

zachschuermann approved these changes Jan 16, 2025

OussamaSaoudi reviewed Jan 16, 2025

scovich added 2 commits January 16, 2025 05:07

Merge remote-tracking branch 'oss/main' into eval-sql-where

29b9713

address reviews

3d8e15d

scovich requested a review from OussamaSaoudi January 16, 2025 13:26

more review feedback

0d2462b

OussamaSaoudi approved these changes Jan 16, 2025

Support OR as well

56b7351

OussamaSaoudi self-requested a review January 16, 2025 22:24

OussamaSaoudi approved these changes Jan 16, 2025

zachschuermann approved these changes Jan 16, 2025

scovich merged commit 8494126 into delta-io:main Jan 16, 2025
20 of 21 checks passed
