Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add LiteralGuarantee on columns to extract conditions required for PhysicalExpr expressions to evaluate to true #8437

Merged
merged 16 commits into from
Dec 18, 2023

Conversation

alamb
Copy link
Contributor

@alamb alamb commented Dec 6, 2023

Which issue does this PR close?

Part of #8376 (see the POC PR #8397 for how this all fits together)

Broken out for easier PR review. I will update the bloom filter code to actually use this as a follow on PR

Rationale for this change

I am generalizing pruning a set of files/row_groups/containers based on information known before looking at the data (aka "pruning predicates") to support bloom filters and other similar structures which can tell if a given value is present/present in a given container (see #8376).

Part of this analysis requires identifying if a given predicate ensures that a given column must take a a literal value (aka a constant) or set of values. Given this is an important building block and non trivial analysis, I wanted to make it a first class concept in DataFusion.

Example of what the code in this PR can do

This PR adds the generic analysis of what, if any, single column equality guarantees of a boolean Expr must hold true for the Expr itself to hold true.

For example, for the filter expression (host = 'host1' OR host = 'host2') AND region != 'us-east-1' AND ... it can programmatically extract the intuition that the predicate can not be true if

  1. host is not one of ('host1', 'host2')
  2. region is 'us-east-1'

What changes are included in this PR?

  1. refactored out the code in the BloomFilterPruningPredicate originally written by @haohuaijin into its own struct
  2. Made it more general
  3. Added many tests

Are these changes tested?

Yes, this code is mostly new tests

Are there any user-facing changes?

Not yet

@alamb alamb marked this pull request as ready for review December 6, 2023 12:44
@github-actions github-actions bot added the physical-expr Physical Expressions label Dec 6, 2023
@alamb alamb changed the title Add LiteralGurantee to extract if any PhysicalExpr is know to be a literal (or a set of literals) Add LiteralGuarantee to extract if any PhysicalExpr is know to be a literal (or a set of literals) Dec 6, 2023
/// AND a != 2)` or `a NOT IN (1, 2, 3)`
///
#[derive(Debug, Clone, PartialEq)]
pub struct LiteralGuarantee {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a generalization of the code in the BloomFilterPruningPredicate

Copy link
Member

@waynexia waynexia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The implementation and tests are very concrete 👍

left.downcast_ref::<crate::expressions::Column>(),
right.downcast_ref::<crate::expressions::Literal>(),
) {
Some(Self { col, op, lit })
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we call Operator::swap() as the operator is reversed? Although only Eq and NotEq will occur here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, good call -- done in af13798

Comment on lines 112 to 116
// if not all terms are of the form (col <op> literal),
// can't infer anything
if terms.len() != disjunctions.len() {
return builder;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this check necessary? We can infer the guarantee for expr like below, as dropping a part of OR-chain the condition still holds

... AND (
  a != 1 OR
  b > 1 OR
  a != 2
)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is an excellent point @waynexia. In fact I think this logic could potentially be improved even more

For example a = 1 OR a != 2 could actually provide two guarantees that could prove the predicate was false

  1. a != 1
  2. a = 2

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought a lot about this last night and this morning -- at the core what this code is trying to do is help prove an expression can not be true (so rows can be filtered)

To do so, it is trying to provide one of the following guarantees:

  1. For the expression to be true, a column can ONLY be one of a set of constants
  2. For the expression to be true a column can NOT BE ANY of a set of constants

For the expression

  a != 1 OR
  b > 1 OR
  a != 2

We can't prove that the expression is true ONLY if a is not in (1, 2) -- it could be true if b was greater than 1 as well.

This subtelty was not clear from the comments so I have:

  1. Updated the comments
  2. Added more tests that illustrate this type of situation

I am pretty sure the new tests have uncovered a bug which I am working on fixing .

Thanks again for raising this point

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I plan to look at this tomorrow closely

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank k you -- I plan to have updated the code again by that time

@alamb
Copy link
Contributor Author

alamb commented Dec 12, 2023

Thank you @waynexia -- your comments are very helpful. I am going to incorporate your thoughts and suggestions and will reopen this PR for review when ready

@alamb alamb marked this pull request as draft December 12, 2023 20:48
@alamb
Copy link
Contributor Author

alamb commented Dec 14, 2023

BTW this PR is not ready for re-review yet. I will try and get it ready as soon as possible but I fear I will run out of time today

@alamb alamb marked this pull request as ready for review December 15, 2023 14:08
@alamb
Copy link
Contributor Author

alamb commented Dec 15, 2023

Ok, I have updated the documentation and tests more and I think this is ready for another review

Copy link
Contributor

@NGA-TRAN NGA-TRAN left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have to admit that this is a very tricky PR because of the guarantee reasoning. Thanks @alamb for all the detailed comments and tests. I have gone over all the tests carefully. Very nice and great cover. I just have a question for one test and suggest to add one more test

/// 2. The column must NOT be one of a particular set of values for the predicate to
/// be `true`. If the column takes on any of the values, the predicate can not
/// evaluate to `true`. For example,
/// `(a != 1)`, `(a != 1 AND a != 2)` or `a NOT IN (1, 2, 3)`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The detailed comments help to review the code 👍

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This helps a lot 👍 The guarantee is similar to bloom filter: it might be false-positive, but negative result is reliable.

datafusion/physical-expr/src/utils/guarantee.rs Outdated Show resolved Hide resolved
datafusion/physical-expr/src/utils/guarantee.rs Outdated Show resolved Hide resolved
datafusion/physical-expr/src/utils/guarantee.rs Outdated Show resolved Hide resolved
datafusion/physical-expr/src/utils/guarantee.rs Outdated Show resolved Hide resolved
test_analyze(
col("b")
.not_eq(lit(1))
.and(col("b").gt(lit(2)).or(col("b").eq(lit(3)))),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want to test one in the above comment // b != 1 AND (b > 2)

Suggested change
.and(col("b").gt(lit(2)).or(col("b").eq(lit(3)))),
.and(col("b").gt(lit(2))),

Or // b != 1 AND (b > 2 OR b = 3)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, good catch -- I have updated it

.and(col("b").gt(lit(2)).or(col("b").eq(lit(3)))),
vec![
// for the expression to be true, b can not be one
not_in_guarantee("b", [1]),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have to admit that I need to think hard how Bloom filter works to review these. I stared at this for a while and this is what I think. I may misunderstand it though

If the test is b !=1 and b > 2, what is the right result here? I think it should be an empty vector, right?
My reasoning is if b = 0, then the not_in_guarantee("b", [1]) is true but the expression b !=1 and b > 2 is false.

I think it is the same for b != 1 AND (b > 2 OR b = 3)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea for using this for bloom filters is to try and prove that this expression is not true:

b != 1 AND (b > 2)

In order for the expression to be true, both of these have to hold

  1. `b != 1
  2. b > 2

Thus, if we can prove either does not hold, we can prove the expression is not true

So therefore, to prove the expression can't possibly be true, we need to prove that b ONLY takes the value of 1.

But as you point out this is not what the documentation says at the moment

/// 2. The column must NOT be one of a particular set of values for the predicate to
/// be `true`. If the column takes on any of the values, the predicate can not
/// evaluate to `true`. For example,
/// `(a != 1)`, `(a != 1 AND a != 2)` or `a NOT IN (1, 2, 3)`

It should probably say

/// 2. The column must NOT be one of a particular set of values for the predicate to
/// be `true`. If the column can ONLY take one of these values, the predicate can not
/// evaluate to `true`. 

🤔 I need to think about this more deeply

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is my understanding: as guarantee is allowed to be false positive, it's fine to drop arbitrary parts of AND in a expr -- we only need to make sure the false condition is reliable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to clarify more in f65a1dd

For the not_in_garantee, to prove the expression is false, we have to show that the column can only be one of the set of values.

.and(col("a").eq(lit("baz"))),
// this could potentially be represented as 2 constraints with a more
// sophisticated analysis
vec![],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

match existing.guarantee {
// knew that the column could not be a set of values
// For example, if we previously had `a != 5` and now we see another `a != 6` we know
// if
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfinished sentence?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call -- fixed.

} else {
// this is like (a != foo AND (a != bar OR a != baz)).
// We can't combine the (a!=bar OR a!=baz) part, but it
// also doesn't invalidate a != foo guarantee.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Member

@waynexia waynexia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @alamb and @NGA-TRAN, I think this PR is excellent 💯. The guarantee logic makes fully sense to me

/// 2. The column must NOT be one of a particular set of values for the predicate to
/// be `true`. If the column takes on any of the values, the predicate can not
/// evaluate to `true`. For example,
/// `(a != 1)`, `(a != 1 AND a != 2)` or `a NOT IN (1, 2, 3)`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This helps a lot 👍 The guarantee is similar to bloom filter: it might be false-positive, but negative result is reliable.

datafusion/physical-expr/src/utils/guarantee.rs Outdated Show resolved Hide resolved
datafusion/physical-expr/src/utils/guarantee.rs Outdated Show resolved Hide resolved
@alamb alamb changed the title Add LiteralGuarantee to extract if any PhysicalExpr is know to be a literal (or a set of literals) Add LiteralGuarantee on columns to extract conditions required for PhysicalExpr expressions to evaluate to true Dec 16, 2023
Copy link
Contributor Author

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @NGA-TRAN and @waynexia for the careful reviews. It is much appreciated.

cc @wjones127 who might be interested as this is similar to the guarantees used to simplify expressions, added in #7467

.and(col("b").gt(lit(2)).or(col("b").eq(lit(3)))),
vec![
// for the expression to be true, b can not be one
not_in_guarantee("b", [1]),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to clarify more in f65a1dd

For the not_in_garantee, to prove the expression is false, we have to show that the column can only be one of the set of values.

@alamb
Copy link
Contributor Author

alamb commented Dec 18, 2023

I plan to merge this in later today unless I hear any more comments

@alamb alamb merged commit b5e94a6 into apache:main Dec 18, 2023
22 checks passed
@alamb alamb deleted the alamb/literal_guarantee branch December 18, 2023 19:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
physical-expr Physical Expressions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants