Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make eval_sql_where available to DefaultPredicateEvaluator #627

Merged
merged 17 commits into from
Jan 16, 2025

Conversation

scovich
Copy link
Collaborator

@scovich scovich commented Jan 8, 2025

What changes are proposed in this pull request?

Parquet footer skipping code includes (and uses) a helpful eval_sql_where method that handles NULL values in comparisons gracefully, by injecting null checking automatically into the predicate's evaluation. It turns out that capability is also useful for the other predicate evaluator implementations (especially now that partition pruning will likely rely on the default predicate evaluator). So we generalize the logic as the provided method PredicateEvaluator::eval_sql_where. In order to support that method, we also declare a new eval_scalar_is_null trait method, with appropriate implementations. This has the side effect adding support for literal null checks -- previously, only columns could be null-checked.

How was this change tested?

Replace the existing unit test for the parquet skipping evaluator with adapted versions for the default and stats skipping predicate evaluator, which respectively verify that the provided method works correctly in both bool-output and expression-output cases. The parquet skipping module version is removed because it is redundant -- the default evaluator exercises boolean output, and the data skipping evaluator exercises column resolution.

@@ -1,8 +1,7 @@
//! An implementation of parquet row group skipping using data skipping predicates over footer stats.
use crate::predicates::parquet_stats_skipping::{
Copy link
Collaborator Author

@scovich scovich Jan 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

aside: Not sure how this out of order import escaped cargo fmt before now?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like it's just flattening it and you moved the import anyways.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't cargo fmt order imports alphabetically? If so, how did use crate::predicates end up before use crate::expressions?

kernel/src/predicates/mod.rs Outdated Show resolved Hide resolved
Copy link

codecov bot commented Jan 8, 2025

Codecov Report

Attention: Patch coverage is 81.86813% with 33 lines in your changes missing coverage. Please review.

Project coverage is 83.66%. Comparing base (76c65c8) to head (56b7351).
Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
kernel/src/predicates/tests.rs 43.13% 0 Missing and 29 partials ⚠️
kernel/src/scan/data_skipping/tests.rs 97.43% 2 Missing ⚠️
kernel/src/expressions/mod.rs 85.71% 1 Missing ⚠️
kernel/src/scan/data_skipping.rs 88.88% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #627      +/-   ##
==========================================
- Coverage   83.66%   83.66%   -0.01%     
==========================================
  Files          75       75              
  Lines       16909    16949      +40     
  Branches    16909    16949      +40     
==========================================
+ Hits        14147    14180      +33     
+ Misses       2099     2085      -14     
- Partials      663      684      +21     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@scovich scovich requested a review from roeap January 10, 2025 16:22
@github-actions github-actions bot added the breaking-change Change that will require a version bump label Jan 10, 2025
@scovich scovich removed the breaking-change Change that will require a version bump label Jan 10, 2025
Copy link
Collaborator

@zachschuermann zachschuermann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM this looks great, left a few comments/questions but mostly for my understanding :)

@@ -44,6 +44,9 @@ mod tests;
pub(crate) trait PredicateEvaluator {
type Output;

/// A (possibly inverted) scalar NULL test, e.g. `<value> IS [NOT] NULL`.
fn eval_scalar_is_null(&self, val: &Scalar, inverted: bool) -> Option<Self::Output>;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry slight tangent: it feels like the IS [NOT] NULL implies that output = bool? For the default case it has output = bool but then for data skiping that output is an expr?

Copy link
Collaborator Author

@scovich scovich Jan 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PredicateEvaluator is generic.

The default and parquet predicate evaluators (which do "direct" evaluation) do directly return bool.

The data skipping predicate evaluator is different -- it is "indirect" and translates a normal predicate expression into a data skipping predicate expression that can later be applied multiple times to different stats rows (by engine expression handler in prod, or with the default predicate evaluator in tests). In that case eval_scalar_is_null (like all methods) returns Option<Expr>. The returned expression will return boolean values once we do evaluate it.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @scovich! to ensure my understanding: for example,

  1. default/parquet predicate eval would yield immediate evaluation to bool of col < 5
  2. whereas data skipping predicate eval would yield some expr like minValues.col < 5. the 'evaluation' of the predicate in PredicateEvaluator is more like a transformation in that case?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

exactly!

@@ -229,12 +237,121 @@ pub(crate) trait PredicateEvaluator {
Variadic(VariadicExpression { op, exprs }) => self.eval_variadic(*op, exprs, inverted),
}
}

/// Evaluates a predicate with SQL WHERE semantics.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

EXTREMELY helpful comment + examples :)

@@ -394,12 +393,12 @@ fn test_eval_is_null() {
let expr = Expression::literal(1);
expect_eq!(
filter.eval_unary(UnaryOperator::IsNull, &expr, true),
None,
Some(true),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice catches!

/// AND(..., AND(NULL, TRUE, NULL), ...)
/// AND(..., NULL, ...)
/// ```
fn eval_sql_where(&self, filter: &Expr) -> Option<Self::Output> {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay so we are moving this from ParquetStatsSkippingFilter to more general PredicateEvaluator and then replaced instances of needing the ParquetStatsSkippingFilter with just the a (Default)PredicateEvaluator and feeding it whatever stats data it needs?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exactly. There was nothing inherent to parquet stats skipping in the logic, that was just the place it happened to land first.

Comment on lines +335 to +336
self.eval_unary(UnaryOperator::IsNull, left, true),
self.eval_unary(UnaryOperator::IsNull, right, true),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

aside: ahh yea now I really feel the usefulness of our new APIs in #646.. I found myself having to think through the inverted=true cases here and below...

Copy link
Collaborator Author

@scovich scovich Jan 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah this predicate eval code is a bit mind bending for sure. Inversion and generic Output type are really powerful but also a lot harder to grok than "normal" code.

I kept worrying about that while designing this code but:

  • I actually started with non-invertible versions everywhere, but the complexity blew up in a different way -- you had to manually enumerate all the cases in redundant code, which is way bulkier and way more error prone. I eventually gave up and introduced the inverted flag everywhere for a significant code simplification.
  • The generic output type allows to turn 2-3 very similar implementations into one, which vastly reduces duplication and bug surface (from getting one of the similar copies slightly wrong). Even in earlier versions of this PR, there were two implementations of the eval_sql_where logic -- one for boolean output and a different one for expression output. Eventually I realized they were logically equivalent, in spite of looking rather different, which is what allowed to hoist it all the way up to PredicateEvaluator and ditch the annoyingly separate SqlWherePredicateEvaluator trait I had been using.

So, given a choice between compact and robust but harder to understand, vs. easy to understand but redundant and error-prone... the former seemed like a net win in spite of the learning curve to new entrants.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

completely agree and if I could then turn this into a little request: I think this is super useful context, any change we could embed it into (doc)comment somewhere?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is already a doc comment on PredicateEvaluator about inversion, does it suffice or are there gaps?

/// # Inverted expression semantics
///
/// Because inversion (`NOT` operator) has special semantics and can often be optimized away by
/// pushing it down, most methods take an `inverted` flag. That allows operations like
/// [`UnaryOperator::Not`] to simply evaluate their operand with a flipped `inverted` flag.

I guess it doesn't directly speak to the complexity tho...

There is also a doc comment about the parametrized Output type on DataSkippingPredicateEvaluator:

/// The types involved in these operations are parameterized and implementation-specific. For
/// example, [`crate::engine::parquet_stats_skipping::ParquetStatsProvider`] directly evaluates data
/// skipping expressions and returns boolean results, while
/// [`crate::scan::data_skipping::DataSkippingPredicateCreator`] instead converts the input
/// predicate to a data skipping predicate that can be evaluated directly later.

Maybe I should move it to PredicateEvaluator?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did some editing, PTAL?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm!

Copy link
Collaborator

@OussamaSaoudi OussamaSaoudi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Flushing review of eval_sql_where. Will give another pass with a closer look at the tests

Comment on lines 245 to 246
/// missing value and produces a NULL result. The resulting NULL does not allow data skipping,
/// which is looking for a FALSE result. Meanwhile, SQL WHERE semantics only keeps rows for
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The old documentation helped me understand some of the concepts better.

/// By default, [`apply_expr`] can produce unwelcome behavior for comparisons involving all-NULL
/// columns (e.g. `a == 10`), because the (legitimately NULL) min/max stats are interpreted as
/// stats-missing that produces a NULL data skipping result). The resulting NULL can "poison"
/// the entire expression, causing it to return NULL instead of FALSE that would allow skipping.

What I like about this:

  1. I think the old doc's wording is clearer about how a legitimate NULL value and a missing stats field both produce NULL. New docs instead say "the (legitimately) NULL value is interpreted the same as a missing value and produces a NULL result". I think "interpretted the same way" felt ambiguous to me.
  2. Communicates that the NULL propagates all the way to the top. > "NULL can poison the entire expression"
  3. Communicates that data skipping only happens on a FALSE, and we don't get it in a NULL case. The newer docs say "The resulting NULL does not allow data skipping, which is looking for a FALSE result". I think the "data skipping looking for FALSE result" is what threw me off.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the feedback. I reworded to (hopefully) incorporate the best of both texts, PTAL?

kernel/src/predicates/mod.rs Outdated Show resolved Hide resolved
@scovich scovich requested a review from OussamaSaoudi January 16, 2025 13:26
Copy link
Collaborator

@OussamaSaoudi OussamaSaoudi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple comments. Rly cool stuff! thx Ryan

@@ -57,6 +55,7 @@ impl<'a> RowGroupFilter<'a> {

/// Applies a filtering predicate to a row group. Return value false means to skip it.
fn apply(row_group: &'a RowGroupMetaData, predicate: &Expression) -> bool {
use crate::predicates::PredicateEvaluator as _;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

aside: I didn't know this could be used to import something as unnamed. cool stuff!

///
/// By default, [`eval_expr`] behaves badly for comparisons involving NULL columns (e.g. `a <
/// 10` when `a` is NULL), because NULL values are interpreted as "stats missing" (= cannot
/// skip). This can "poison" the entire expression, causing it to return NULL instead of FALSE
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One more addition: Make it clear that we want to treat it as NULL instead of an invalid/stats-missing.
Given a = NULL
eval_expr(a < 10) = NULL
eval_sql_where(a < 10) = FALSE (desired behaviour)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reworked the text, should be clearer now?

// Semantics are the same for comparison inside OR inside AND
let expr = &Expr::or(FALSE, Expr::and(NULL, Expr::lt(col.clone(), VAL)));
expect_eq!(null_filter.eval_expr(expr, false), None, "{expr}");
expect_eq!(null_filter.eval_sql_where(expr), None, "{expr}");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

aha so we weren't able to push it down into the or, even though false OR x == x

Copy link
Collaborator Author

@scovich scovich Jan 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's precisely because false OR x is x that we don't bother to push the null check down through it.
It doesn't help the data skipping at all -- NULL is stronger than FALSE in OR, so OR(FALSE, NULL) is NULL.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(adding as a code comment for posterity)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm... now that you mention, maybe OR is also worth pushing down? If a is NULL and b is 100, then:

OR(a < 10, b < 20) 
= OR(NULL, FALSE) 
= NULL

vs.

OR(AND(a IS NOT NULL, a < 10), AND(b IS NOT NULL, b < 20))
= OR(AND(FALSE, NULL), AND(TRUE, FALSE))
= OR(FALSE, FALSE)
= FALSE

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added!

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I'd initially disregarded OR push downs thinking it would break correctness somehow. I didn't think much of it. Glad to see that theres an optimization here :D

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hah same, nice find!

}
NoStats.eval_sql_where(predicate) == Some(false)
use crate::predicates::PredicateEvaluator as _;
DefaultPredicateEvaluator::from(EmptyColumnResolver).eval_sql_where(predicate) == Some(false)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also this idea of column resolvers for testing SQL semantics is so neat!!

kernel/src/scan/data_skipping/tests.rs Outdated Show resolved Hide resolved
@OussamaSaoudi OussamaSaoudi self-requested a review January 16, 2025 22:24
Copy link
Collaborator

@OussamaSaoudi OussamaSaoudi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Copy link
Collaborator

@zachschuermann zachschuermann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

restamp, LGTM :)

@scovich scovich merged commit 8494126 into delta-io:main Jan 16, 2025
20 of 21 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants