-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add LiteralGuarantee
on columns to extract conditions required for PhysicalExpr
expressions to evaluate to true
#8437
Conversation
LiteralGurantee
to extract if any PhysicalExpr is know to be a literal (or a set of literals)LiteralGuarantee
to extract if any PhysicalExpr is know to be a literal (or a set of literals)
/// AND a != 2)` or `a NOT IN (1, 2, 3)` | ||
/// | ||
#[derive(Debug, Clone, PartialEq)] | ||
pub struct LiteralGuarantee { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a generalization of the code in the BloomFilterPruningPredicate
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The implementation and tests are very concrete 👍
left.downcast_ref::<crate::expressions::Column>(), | ||
right.downcast_ref::<crate::expressions::Literal>(), | ||
) { | ||
Some(Self { col, op, lit }) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we call Operator::swap()
as the operator is reversed? Although only Eq
and NotEq
will occur here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, good call -- done in af13798
// if not all terms are of the form (col <op> literal), | ||
// can't infer anything | ||
if terms.len() != disjunctions.len() { | ||
return builder; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this check necessary? We can infer the guarantee for expr like below, as dropping a part of OR-chain the condition still holds
... AND (
a != 1 OR
b > 1 OR
a != 2
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is an excellent point @waynexia. In fact I think this logic could potentially be improved even more
For example a = 1 OR a != 2
could actually provide two guarantees that could prove the predicate was false
a != 1
a = 2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought a lot about this last night and this morning -- at the core what this code is trying to do is help prove an expression can not be true
(so rows can be filtered)
To do so, it is trying to provide one of the following guarantees:
- For the expression to be
true
, a column can ONLY be one of a set of constants - For the expression to be
true
a column can NOT BE ANY of a set of constants
For the expression
a != 1 OR
b > 1 OR
a != 2
We can't prove that the expression is true ONLY if a is not in (1, 2)
-- it could be true if b was greater than 1 as well.
This subtelty was not clear from the comments so I have:
- Updated the comments
- Added more tests that illustrate this type of situation
I am pretty sure the new tests have uncovered a bug which I am working on fixing .
Thanks again for raising this point
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I plan to look at this tomorrow closely
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank k you -- I plan to have updated the code again by that time
Thank you @waynexia -- your comments are very helpful. I am going to incorporate your thoughts and suggestions and will reopen this PR for review when ready |
BTW this PR is not ready for re-review yet. I will try and get it ready as soon as possible but I fear I will run out of time today |
Ok, I have updated the documentation and tests more and I think this is ready for another review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have to admit that this is a very tricky PR because of the guarantee reasoning. Thanks @alamb for all the detailed comments and tests. I have gone over all the tests carefully. Very nice and great cover. I just have a question for one test and suggest to add one more test
/// 2. The column must NOT be one of a particular set of values for the predicate to | ||
/// be `true`. If the column takes on any of the values, the predicate can not | ||
/// evaluate to `true`. For example, | ||
/// `(a != 1)`, `(a != 1 AND a != 2)` or `a NOT IN (1, 2, 3)` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The detailed comments help to review the code 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This helps a lot 👍 The guarantee is similar to bloom filter: it might be false-positive, but negative result is reliable.
test_analyze( | ||
col("b") | ||
.not_eq(lit(1)) | ||
.and(col("b").gt(lit(2)).or(col("b").eq(lit(3)))), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you want to test one in the above comment // b != 1 AND (b > 2)
.and(col("b").gt(lit(2)).or(col("b").eq(lit(3)))), | |
.and(col("b").gt(lit(2))), |
Or // b != 1 AND (b > 2 OR b = 3)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, good catch -- I have updated it
.and(col("b").gt(lit(2)).or(col("b").eq(lit(3)))), | ||
vec![ | ||
// for the expression to be true, b can not be one | ||
not_in_guarantee("b", [1]), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have to admit that I need to think hard how Bloom filter works to review these. I stared at this for a while and this is what I think. I may misunderstand it though
If the test is b !=1 and b > 2
, what is the right result here? I think it should be an empty vector, right?
My reasoning is if b = 0
, then the not_in_guarantee("b", [1])
is true but the expression b !=1 and b > 2
is false.
I think it is the same for b != 1 AND (b > 2 OR b = 3)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea for using this for bloom filters is to try and prove that this expression is not true:
b != 1 AND (b > 2)
In order for the expression to be true, both of these have to hold
- `b != 1
- b > 2
Thus, if we can prove either does not hold, we can prove the expression is not true
So therefore, to prove the expression can't possibly be true, we need to prove that b ONLY takes the value of 1
.
But as you point out this is not what the documentation says at the moment
/// 2. The column must NOT be one of a particular set of values for the predicate to
/// be `true`. If the column takes on any of the values, the predicate can not
/// evaluate to `true`. For example,
/// `(a != 1)`, `(a != 1 AND a != 2)` or `a NOT IN (1, 2, 3)`
It should probably say
/// 2. The column must NOT be one of a particular set of values for the predicate to
/// be `true`. If the column can ONLY take one of these values, the predicate can not
/// evaluate to `true`.
🤔 I need to think about this more deeply
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is my understanding: as guarantee is allowed to be false positive, it's fine to drop arbitrary parts of AND
in a expr -- we only need to make sure the false
condition is reliable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried to clarify more in f65a1dd
For the not_in_garantee
, to prove the expression is false, we have to show that the column can only be one of the set of values.
.and(col("a").eq(lit("baz"))), | ||
// this could potentially be represented as 2 constraints with a more | ||
// sophisticated analysis | ||
vec![], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
match existing.guarantee { | ||
// knew that the column could not be a set of values | ||
// For example, if we previously had `a != 5` and now we see another `a != 6` we know | ||
// if |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfinished sentence?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good call -- fixed.
} else { | ||
// this is like (a != foo AND (a != bar OR a != baz)). | ||
// We can't combine the (a!=bar OR a!=baz) part, but it | ||
// also doesn't invalidate a != foo guarantee. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
Co-authored-by: Nga Tran <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// 2. The column must NOT be one of a particular set of values for the predicate to | ||
/// be `true`. If the column takes on any of the values, the predicate can not | ||
/// evaluate to `true`. For example, | ||
/// `(a != 1)`, `(a != 1 AND a != 2)` or `a NOT IN (1, 2, 3)` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This helps a lot 👍 The guarantee is similar to bloom filter: it might be false-positive, but negative result is reliable.
LiteralGuarantee
to extract if any PhysicalExpr is know to be a literal (or a set of literals)LiteralGuarantee
on columns to extract conditions required for PhysicalExpr
expressions to evaluate to true
Co-authored-by: Ruihang Xia <[email protected]>
…fusion into alamb/literal_guarantee
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @NGA-TRAN and @waynexia for the careful reviews. It is much appreciated.
cc @wjones127 who might be interested as this is similar to the guarantees used to simplify expressions, added in #7467
.and(col("b").gt(lit(2)).or(col("b").eq(lit(3)))), | ||
vec![ | ||
// for the expression to be true, b can not be one | ||
not_in_guarantee("b", [1]), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried to clarify more in f65a1dd
For the not_in_garantee
, to prove the expression is false, we have to show that the column can only be one of the set of values.
I plan to merge this in later today unless I hear any more comments |
Which issue does this PR close?
Part of #8376 (see the POC PR #8397 for how this all fits together)
Broken out for easier PR review. I will update the bloom filter code to actually use this as a follow on PR
Rationale for this change
I am generalizing pruning a set of files/row_groups/containers based on information known before looking at the data (aka "pruning predicates") to support bloom filters and other similar structures which can tell if a given value is present/present in a given container (see #8376).
Part of this analysis requires identifying if a given predicate ensures that a given column must take a a literal value (aka a constant) or set of values. Given this is an important building block and non trivial analysis, I wanted to make it a first class concept in DataFusion.
Example of what the code in this PR can do
This PR adds the generic analysis of what, if any, single column equality guarantees of a boolean
Expr
must hold true for theExpr
itself to hold true.For example, for the filter expression
(host = 'host1' OR host = 'host2') AND region != 'us-east-1' AND ...
it can programmatically extract the intuition that the predicate can not be true ifhost
is not one of ('host1'
,'host2'
)region
is'us-east-1'
What changes are included in this PR?
BloomFilterPruningPredicate
originally written by @haohuaijin into its own structAre these changes tested?
Yes, this code is mostly new tests
Are there any user-facing changes?
Not yet