feat(frontend): Supports cut OR condition and push down to storage #19812

Merged
merged 14 commits into from
Dec 24, 2024

@Li0k Li0k commented Dec 16, 2024

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

related to #19525

This PR implements the optimizations mentioned in the issue:

  1. Try to merge scan ranges that share the same equality conditions.
  2. Try to push down multiple scan ranges instead of falling back to a full table scan.

Checklist

  • I have written necessary rustdoc comments.
  • I have added necessary unit tests and integration tests.
  • I have added test labels as necessary.
  • I have added fuzzing tests or opened an issue to track them.
  • My PR contains breaking changes.
  • My PR changes performance-critical code, so I will run (micro) benchmarks and present the results.
  • My PR contains critical fixes that are necessary to be merged into the latest release.

Documentation

  • My PR needs documentation updates.
Release note

@Li0k Li0k requested review from chenzl25 and xxhZs December 16, 2024 12:05
@Li0k Li0k changed the title feat(batch): Supports cut OR condition and push down to storage WIP: feat(batch): Supports cut OR condition and push down to storage Dec 16, 2024
@Li0k Li0k marked this pull request as ready for review December 16, 2024 12:08
@graphite-app graphite-app bot requested a review from a team December 16, 2024 12:28
@Li0k Li0k changed the title WIP: feat(batch): Supports cut OR condition and push down to storage WIP: feat(frontend): Supports cut OR condition and push down to storage Dec 17, 2024
@Li0k Li0k changed the title WIP: feat(frontend): Supports cut OR condition and push down to storage feat(frontend): Supports cut OR condition and push down to storage Dec 18, 2024
@Li0k Li0k requested a review from st1page December 18, 2024 03:40
@@ -418,7 +418,7 @@
   batch_plan: |-
     BatchExchange { order: [], dist: Single }
     └─BatchFilter { predicate: (((orders_count_by_user.user_id = 1:Int32) OR ((orders_count_by_user.user_id = 2:Int32) AND In(orders_count_by_user.date, 1111:Int32, 2222:Int32))) OR (orders_count_by_user.user_id <> 3:Int32)) }
-      └─BatchScan { table: orders_count_by_user, columns: [orders_count_by_user.user_id, orders_count_by_user.date, orders_count_by_user.orders_count], distribution: UpstreamHashShard(orders_count_by_user.user_id, orders_count_by_user.date) }
+      └─BatchScan { table: orders_count_by_user, columns: [orders_count_by_user.user_id, orders_count_by_user.date, orders_count_by_user.orders_count], scan_ranges: [orders_count_by_user.user_id = Int64(1), orders_count_by_user.user_id = Int64(2) AND orders_count_by_user.date = Int32(1111), orders_count_by_user.user_id = Int64(2) AND orders_count_by_user.date = Int32(2222)], distribution: UpstreamHashShard(orders_count_by_user.user_id, orders_count_by_user.date) }

Contributor

I think this cannot be pushed down because of the condition orders_count_by_user.user_id <> 3:Int32?

Contributor Author

Good catch. I didn't realize that a full table scan range is converted to an empty vec; I fixed it with a special-case check.
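
The pitfall discussed in this thread can be sketched as follows. This is a minimal illustration under stated assumptions, not the PR's actual code: `Disjunct`, `ScanPlan`, `to_range`, and `plan_scan` are hypothetical names, and ranges are simplified to inclusive `i64` intervals on a single key column. The point is that an OR branch which cannot be expressed as a bounded range (such as `user_id <> 3`) has to force a full table scan for the whole predicate, rather than being silently dropped from the list of scan ranges.

```rust
/// Hypothetical stand-in for one branch of an OR predicate on a single key column.
#[derive(Debug, Clone, PartialEq)]
enum Disjunct {
    Eq(i64),
    Between(i64, i64), // inclusive on both ends
    NotEq(i64),        // cannot be expressed as one bounded range
}

/// Hypothetical planning outcome.
#[derive(Debug, PartialEq)]
enum ScanPlan {
    FullScan,
    Ranges(Vec<(i64, i64)>),
}

/// Convert a branch to an inclusive range, or `None` if it is not range-like.
fn to_range(d: &Disjunct) -> Option<(i64, i64)> {
    match d {
        Disjunct::Eq(v) => Some((*v, *v)),
        Disjunct::Between(lo, hi) => Some((*lo, *hi)),
        Disjunct::NotEq(_) => None,
    }
}

/// Push ranges down only if every OR branch yields a range.
fn plan_scan(disjuncts: &[Disjunct]) -> ScanPlan {
    let mut ranges = Vec::with_capacity(disjuncts.len());
    for d in disjuncts {
        match to_range(d) {
            Some(r) => ranges.push(r),
            // One unrestricted branch means any row may match: scan everything.
            None => return ScanPlan::FullScan,
        }
    }
    ScanPlan::Ranges(ranges)
}
```

With this shape, `plan_scan(&[Disjunct::Eq(1), Disjunct::NotEq(3)])` yields `ScanPlan::FullScan`, which corresponds to the plan without `scan_ranges` for the predicate above.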

Li0k commented Dec 18, 2024

Test:

CREATE TABLE t1(v1 int primary key);

INSERT INTO t1 (v1) SELECT v FROM generate_series(1, 10000000) AS v;

SELECT COUNT(*) FROM t1;

SELECT * FROM t1 WHERE (v1 > 10 AND v1 < 20) OR (v1 > 5000000 AND v1 < 5000100);

before:
image

after:
image

@Li0k Li0k requested a review from ZENOTME December 23, 2024 04:00

scan_ranges.extend(scan_ranges_chunk);
}
scan_ranges.sort_by(|a, b| a.eq_conds.len().cmp(&b.eq_conds.len()));

Contributor

Should we also sort the ranges by lower_bound here? 🤔

For example, SELECT * FROM orders_count_by_user WHERE (user_id < 10) or (user_id > 30) or (user_id > 5 and user_id < 15);
produces the following plan:

batch_plan: |-
    BatchExchange { order: [], dist: Single }
    └─BatchFilter { predicate: (((orders_count_by_user.user_id < 10:Int32) OR (orders_count_by_user.user_id > 30:Int32)) OR ((orders_count_by_user.user_id > 5:Int32) AND (orders_count_by_user.user_id < 15:Int32))) }
      └─BatchScan { table: orders_count_by_user, columns: [orders_count_by_user.user_id, orders_count_by_user.date, orders_count_by_user.orders_count], scan_ranges: [orders_count_by_user.user_id < Int64(10), orders_count_by_user.user_id > Int64(30), orders_count_by_user.user_id > Int64(5) AND orders_count_by_user.user_id < Int64(15)], distribution: UpstreamHashShard(orders_count_by_user.user_id, orders_count_by_user.date) }

I think what we expect is:

batch_plan: |-
    BatchExchange { order: [], dist: Single }
    └─BatchFilter { predicate: (((orders_count_by_user.user_id < 10:Int32) OR (orders_count_by_user.user_id > 30:Int32)) OR ((orders_count_by_user.user_id > 5:Int32) AND (orders_count_by_user.user_id < 15:Int32))) }
      └─BatchScan { table: orders_count_by_user, columns: [orders_count_by_user.user_id, orders_count_by_user.date, orders_count_by_user.orders_count], scan_ranges: [orders_count_by_user.user_id < Int64(15), orders_count_by_user.user_id > Int64(30)], distribution: UpstreamHashShard(orders_count_by_user.user_id, orders_count_by_user.date) }

Contributor

I tried reordering the predicate: SELECT * FROM orders_count_by_user WHERE (user_id < 10) or (user_id > 5 and user_id < 15) or (user_id > 30);

  batch_plan: |-
    BatchExchange { order: [], dist: Single }
    └─BatchFilter { predicate: (((orders_count_by_user.user_id < 10:Int32) OR ((orders_count_by_user.user_id > 5:Int32) AND (orders_count_by_user.user_id < 15:Int32))) OR (orders_count_by_user.user_id > 30:Int32)) }
      └─BatchScan { table: orders_count_by_user, columns: [orders_count_by_user.user_id, orders_count_by_user.date, orders_count_by_user.orders_count], scan_ranges: [orders_count_by_user.user_id <= Int64(15), orders_count_by_user.user_id > Int64(30)], distribution: UpstreamHashShard(orders_count_by_user.user_id, orders_count_by_user.date) }

Why do we get orders_count_by_user.user_id <= Int64(15) rather than orders_count_by_user.user_id < Int64(15) here? 🤔

Contributor

Indeed, we need to sort by the lower bound here rather than by the number of equality conditions.
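
The sort-then-merge idea from this thread can be sketched like this. It is a simplified illustration, not the PR's implementation: it uses single-column `i64` ranges built on `std::ops::Bound` instead of the `Vec<Datum>` bounds and `eq_conds` of `ScanRange`, and the names `cmp_lower`, `max_upper`, `connects`, and `merge_ranges` are hypothetical.

```rust
use std::cmp::Ordering;
use std::ops::Bound::{self, Excluded, Included, Unbounded};

type Range = (Bound<i64>, Bound<i64>);

/// Order lower bounds: unbounded first, then by value; at equal values an
/// inclusive bound starts earlier than an exclusive one.
fn cmp_lower(a: &Bound<i64>, b: &Bound<i64>) -> Ordering {
    match (a, b) {
        (Unbounded, Unbounded) => Ordering::Equal,
        (Unbounded, _) => Ordering::Less,
        (_, Unbounded) => Ordering::Greater,
        (Included(x), Included(y)) | (Excluded(x), Excluded(y)) => x.cmp(y),
        (Included(x), Excluded(y)) => x.cmp(y).then(Ordering::Less),
        (Excluded(x), Included(y)) => x.cmp(y).then(Ordering::Greater),
    }
}

/// Pick the upper bound that reaches further right.
fn max_upper(a: Bound<i64>, b: Bound<i64>) -> Bound<i64> {
    match (a, b) {
        (Unbounded, _) | (_, Unbounded) => Unbounded,
        (Included(x), Included(y)) => Included(x.max(y)),
        (Excluded(x), Excluded(y)) => Excluded(x.max(y)),
        (Included(x), Excluded(y)) | (Excluded(y), Included(x)) => {
            if y > x { Excluded(y) } else { Included(x) }
        }
    }
}

/// True if a range ending at `upper` and a later-starting range beginning at
/// `lower` overlap or touch without leaving a gap between them.
fn connects(upper: &Bound<i64>, lower: &Bound<i64>) -> bool {
    match (upper, lower) {
        (Unbounded, _) | (_, Unbounded) => true,
        // Both ends exclude the same value: that value would be missing.
        (Excluded(u), Excluded(l)) => l < u,
        (Included(u), Included(l))
        | (Included(u), Excluded(l))
        | (Excluded(u), Included(l)) => l <= u,
    }
}

/// Sort by lower bound, then fold each range into its predecessor when they connect.
fn merge_ranges(mut ranges: Vec<Range>) -> Vec<Range> {
    ranges.sort_by(|a, b| cmp_lower(&a.0, &b.0));
    let mut merged: Vec<Range> = Vec::new();
    for (lo, hi) in ranges {
        if let Some(last) = merged.last_mut() {
            if connects(&last.1, &lo) {
                last.1 = max_upper(last.1, hi);
                continue;
            }
        }
        merged.push((lo, hi));
    }
    merged
}
```

For the predicate quoted above, the input ranges `(-inf, 10)`, `(30, +inf)`, `(5, 15)` come out as `(-inf, 15)` and `(30, +inf)` regardless of the order in which the OR branches were written, because the merge only ever looks at the previous range after sorting.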

@chenzl25 chenzl25 left a comment

Rest LGTM


Li0k commented Dec 23, 2024

Fixed, PTAL @chenzl25 @ZENOTME

@@ -98,6 +100,138 @@ impl ScanRange {
range: full_range(),
}
}

pub fn covert_to_range(&self) -> (Bound<Vec<Datum>>, Bound<Vec<Datum>>) {

The function name covert_to_range appears to be a typo and should be convert_to_range. This typo is propagated in the function's usage on lines 133 and 371. Consider fixing all instances to maintain consistency and readability.

Spotted by Graphite Reviewer


Comment on lines +155 to +157
if left_start_vec.is_empty() && right_start_vec.is_empty() {
return true;
}

When both start vectors are empty, the code returns true without checking the end bounds. This could incorrectly mark ranges as overlapping when they don't actually overlap. Consider this case:

Range 1: (-∞, 5]
Range 2: (-∞, 3]

These ranges overlap, but:

Range 1: (-∞, 3]
Range 2: (5, ∞)

These ranges don't overlap despite both having empty start vectors. The end bounds need to be checked in all cases.

Spotted by Graphite Reviewer
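
To make the overlap condition concrete, here is a minimal sketch over single-column `i64` ranges using `std::ops::Bound`. The names `starts_before_end` and `ranges_overlap` are hypothetical, and the bounds are far simpler than the `Vec<Datum>` bounds in the PR. In this formulation both ends are always compared: two non-empty ranges overlap when each one starts before the other ends.

```rust
use std::ops::Bound::{self, Excluded, Included, Unbounded};

type Range = (Bound<i64>, Bound<i64>);

/// True if a range starting at `lower` begins before a range ending at `upper`
/// stops. The key space is treated as continuous here, so `(5, 6)` counts as
/// non-empty even though it contains no integer.
fn starts_before_end(lower: &Bound<i64>, upper: &Bound<i64>) -> bool {
    match (lower, upper) {
        (Unbounded, _) | (_, Unbounded) => true,
        (Included(l), Included(u)) => l <= u,
        // An exclusive end at the same value leaves no shared point.
        (Included(l), Excluded(u))
        | (Excluded(l), Included(u))
        | (Excluded(l), Excluded(u)) => l < u,
    }
}

/// Assumes each input range is itself non-empty.
fn ranges_overlap(a: &Range, b: &Range) -> bool {
    starts_before_end(&a.0, &b.1) && starts_before_end(&b.0, &a.1)
}
```

Under this sketch, `(-inf, 5]` and `(-inf, 3]` overlap, while `(-inf, 3]` and `(5, +inf)` do not, because the second range's start is compared against the first range's end.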


@@ -428,3 +428,54 @@
expected_outputs:
- logical_plan
- batch_plan
- name: When OR clauses contain non-overlapping conditions, we can push down several scan_ranges.
before:
- create_table_and_mv

Contributor

I suggest adding some tests with descending order.

Contributor Author

Good catch. We don't need to consider the ordering direction in the range comparison in this PR, so we default everything to ascending to simplify the comparison logic.

@chenzl25 chenzl25 left a comment

LGTM! Thanks for your PR!

@Li0k Li0k enabled auto-merge December 24, 2024 11:59
@Li0k Li0k added this pull request to the merge queue Dec 24, 2024
Merged via the queue into main with commit 3431eab Dec 24, 2024
28 of 29 checks passed
@Li0k Li0k deleted the li0k/batch_predicate_pushdown branch December 24, 2024 13:04