-
Notifications
You must be signed in to change notification settings - Fork 58
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make eval_sql_where available to DefaultPredicateEvaluator #627
Conversation
@@ -1,8 +1,7 @@ | |||
//! An implementation of parquet row group skipping using data skipping predicates over footer stats. | |||
use crate::predicates::parquet_stats_skipping::{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
aside: Not sure how this out of order import escaped cargo fmt before now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like it's just flattening it and you moved the import anyways.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't cargo fmt order imports alphabetically? If so, how did use crate::predicates
end up before use crate::expressions
?
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #627 +/- ##
==========================================
- Coverage 83.66% 83.66% -0.01%
==========================================
Files 75 75
Lines 16909 16949 +40
Branches 16909 16949 +40
==========================================
+ Hits 14147 14180 +33
+ Misses 2099 2085 -14
- Partials 663 684 +21 ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM this looks great, left a few comments/questions but mostly for my understanding :)
@@ -44,6 +44,9 @@ mod tests; | |||
pub(crate) trait PredicateEvaluator { | |||
type Output; | |||
|
|||
/// A (possibly inverted) scalar NULL test, e.g. `<value> IS [NOT] NULL`. | |||
fn eval_scalar_is_null(&self, val: &Scalar, inverted: bool) -> Option<Self::Output>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry slight tangent: it feels like the IS [NOT] NULL
implies that output = bool? For the default case it has output = bool but then for data skiping that output is an expr?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PredicateEvaluator
is generic.
The default and parquet predicate evaluators (which do "direct" evaluation) do directly return bool
.
The data skipping predicate evaluator is different -- it is "indirect" and translates a normal predicate expression into a data skipping predicate expression that can later be applied multiple times to different stats rows (by engine expression handler in prod, or with the default predicate evaluator in tests). In that case eval_scalar_is_null
(like all methods) returns Option<Expr>
. The returned expression will return boolean values once we do evaluate it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks @scovich! to ensure my understanding: for example,
- default/parquet predicate eval would yield immediate evaluation to bool of
col < 5
- whereas data skipping predicate eval would yield some expr like
minValues.col < 5
. the 'evaluation' of the predicate inPredicateEvaluator
is more like a transformation in that case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
exactly!
@@ -229,12 +237,121 @@ pub(crate) trait PredicateEvaluator { | |||
Variadic(VariadicExpression { op, exprs }) => self.eval_variadic(*op, exprs, inverted), | |||
} | |||
} | |||
|
|||
/// Evaluates a predicate with SQL WHERE semantics. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
EXTREMELY helpful comment + examples :)
@@ -394,12 +393,12 @@ fn test_eval_is_null() { | |||
let expr = Expression::literal(1); | |||
expect_eq!( | |||
filter.eval_unary(UnaryOperator::IsNull, &expr, true), | |||
None, | |||
Some(true), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice catches!
/// AND(..., AND(NULL, TRUE, NULL), ...) | ||
/// AND(..., NULL, ...) | ||
/// ``` | ||
fn eval_sql_where(&self, filter: &Expr) -> Option<Self::Output> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
okay so we are moving this from ParquetStatsSkippingFilter
to more general PredicateEvaluator
and then replaced instances of needing the ParquetStatsSkippingFilter
with just the a (Default)PredicateEvaluator
and feeding it whatever stats data it needs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Exactly. There was nothing inherent to parquet stats skipping in the logic, that was just the place it happened to land first.
self.eval_unary(UnaryOperator::IsNull, left, true), | ||
self.eval_unary(UnaryOperator::IsNull, right, true), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
aside: ahh yea now I really feel the usefulness of our new APIs in #646.. I found myself having to think through the inverted=true cases here and below...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah this predicate eval code is a bit mind bending for sure. Inversion and generic Output
type are really powerful but also a lot harder to grok than "normal" code.
I kept worrying about that while designing this code but:
- I actually started with non-invertible versions everywhere, but the complexity blew up in a different way -- you had to manually enumerate all the cases in redundant code, which is way bulkier and way more error prone. I eventually gave up and introduced the
inverted
flag everywhere for a significant code simplification. - The generic output type allows to turn 2-3 very similar implementations into one, which vastly reduces duplication and bug surface (from getting one of the similar copies slightly wrong). Even in earlier versions of this PR, there were two implementations of the
eval_sql_where
logic -- one for boolean output and a different one for expression output. Eventually I realized they were logically equivalent, in spite of looking rather different, which is what allowed to hoist it all the way up toPredicateEvaluator
and ditch the annoyingly separateSqlWherePredicateEvaluator
trait I had been using.
So, given a choice between compact and robust but harder to understand, vs. easy to understand but redundant and error-prone... the former seemed like a net win in spite of the learning curve to new entrants.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
completely agree and if I could then turn this into a little request: I think this is super useful context, any change we could embed it into (doc)comment somewhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is already a doc comment on PredicateEvaluator
about inversion, does it suffice or are there gaps?
/// # Inverted expression semantics
///
/// Because inversion (`NOT` operator) has special semantics and can often be optimized away by
/// pushing it down, most methods take an `inverted` flag. That allows operations like
/// [`UnaryOperator::Not`] to simply evaluate their operand with a flipped `inverted` flag.
I guess it doesn't directly speak to the complexity tho...
There is also a doc comment about the parametrized Output
type on DataSkippingPredicateEvaluator
:
/// The types involved in these operations are parameterized and implementation-specific. For
/// example, [`crate::engine::parquet_stats_skipping::ParquetStatsProvider`] directly evaluates data
/// skipping expressions and returns boolean results, while
/// [`crate::scan::data_skipping::DataSkippingPredicateCreator`] instead converts the input
/// predicate to a data skipping predicate that can be evaluated directly later.
Maybe I should move it to PredicateEvaluator
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did some editing, PTAL?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Flushing review of eval_sql_where
. Will give another pass with a closer look at the tests
kernel/src/predicates/mod.rs
Outdated
/// missing value and produces a NULL result. The resulting NULL does not allow data skipping, | ||
/// which is looking for a FALSE result. Meanwhile, SQL WHERE semantics only keeps rows for |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The old documentation helped me understand some of the concepts better.
/// By default, [`apply_expr`] can produce unwelcome behavior for comparisons involving all-NULL
/// columns (e.g. `a == 10`), because the (legitimately NULL) min/max stats are interpreted as
/// stats-missing that produces a NULL data skipping result). The resulting NULL can "poison"
/// the entire expression, causing it to return NULL instead of FALSE that would allow skipping.
What I like about this:
- I think the old doc's wording is clearer about how a legitimate NULL value and a missing stats field both produce NULL. New docs instead say "the (legitimately) NULL value is interpreted the same as a missing value and produces a NULL result". I think "interpretted the same way" felt ambiguous to me.
- Communicates that the NULL propagates all the way to the top. > "NULL can poison the entire expression"
- Communicates that data skipping only happens on a FALSE, and we don't get it in a NULL case. The newer docs say "The resulting NULL does not allow data skipping, which is looking for a FALSE result". I think the "data skipping looking for FALSE result" is what threw me off.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the feedback. I reworded to (hopefully) incorporate the best of both texts, PTAL?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couple comments. Rly cool stuff! thx Ryan
@@ -57,6 +55,7 @@ impl<'a> RowGroupFilter<'a> { | |||
|
|||
/// Applies a filtering predicate to a row group. Return value false means to skip it. | |||
fn apply(row_group: &'a RowGroupMetaData, predicate: &Expression) -> bool { | |||
use crate::predicates::PredicateEvaluator as _; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
aside: I didn't know this could be used to import something as unnamed. cool stuff!
kernel/src/predicates/mod.rs
Outdated
/// | ||
/// By default, [`eval_expr`] behaves badly for comparisons involving NULL columns (e.g. `a < | ||
/// 10` when `a` is NULL), because NULL values are interpreted as "stats missing" (= cannot | ||
/// skip). This can "poison" the entire expression, causing it to return NULL instead of FALSE |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One more addition: Make it clear that we want to treat it as NULL instead of an invalid/stats-missing.
Given a = NULL
eval_expr(a < 10) = NULL
eval_sql_where(a < 10) = FALSE
(desired behaviour)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reworked the text, should be clearer now?
kernel/src/predicates/tests.rs
Outdated
// Semantics are the same for comparison inside OR inside AND | ||
let expr = &Expr::or(FALSE, Expr::and(NULL, Expr::lt(col.clone(), VAL))); | ||
expect_eq!(null_filter.eval_expr(expr, false), None, "{expr}"); | ||
expect_eq!(null_filter.eval_sql_where(expr), None, "{expr}"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
aha so we weren't able to push it down into the or, even though false OR x == x
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's precisely because false OR x
is x
that we don't bother to push the null check down through it.
It doesn't help the data skipping at all -- NULL
is stronger than FALSE
in OR
, so OR(FALSE, NULL)
is NULL
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(adding as a code comment for posterity)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm... now that you mention, maybe OR is also worth pushing down? If a
is NULL and b
is 100, then:
OR(a < 10, b < 20)
= OR(NULL, FALSE)
= NULL
vs.
OR(AND(a IS NOT NULL, a < 10), AND(b IS NOT NULL, b < 20))
= OR(AND(FALSE, NULL), AND(TRUE, FALSE))
= OR(FALSE, FALSE)
= FALSE
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I'd initially disregarded OR push downs thinking it would break correctness somehow. I didn't think much of it. Glad to see that theres an optimization here :D
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hah same, nice find!
} | ||
NoStats.eval_sql_where(predicate) == Some(false) | ||
use crate::predicates::PredicateEvaluator as _; | ||
DefaultPredicateEvaluator::from(EmptyColumnResolver).eval_sql_where(predicate) == Some(false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also this idea of column resolvers for testing SQL semantics is so neat!!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
restamp, LGTM :)
What changes are proposed in this pull request?
Parquet footer skipping code includes (and uses) a helpful
eval_sql_where
method that handles NULL values in comparisons gracefully, by injecting null checking automatically into the predicate's evaluation. It turns out that capability is also useful for the other predicate evaluator implementations (especially now that partition pruning will likely rely on the default predicate evaluator). So we generalize the logic as the provided methodPredicateEvaluator::eval_sql_where
. In order to support that method, we also declare a neweval_scalar_is_null
trait method, with appropriate implementations. This has the side effect adding support for literal null checks -- previously, only columns could be null-checked.How was this change tested?
Replace the existing unit test for the parquet skipping evaluator with adapted versions for the default and stats skipping predicate evaluator, which respectively verify that the provided method works correctly in both bool-output and expression-output cases. The parquet skipping module version is removed because it is redundant -- the default evaluator exercises boolean output, and the data skipping evaluator exercises column resolution.