-
Notifications
You must be signed in to change notification settings - Fork 1.5k
[Epic] A collection of dynamic filtering related items #15512
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Adding #15534 |
I've started breaking out the work:
|
I briefly looked at the descriptions of these optimizations. For example, the method of dynamically handling the "order by limit" process using statistics is really cool! @alamb IdeaBut I have some new ideas that seem to be more universally applicable to ┌──────────────────────────────────────┐
│ [Step 1] Filter RowGroups │
│ - Use Parquet metadata to skip RGs │
│ WHERE RG.min(EventTime) > cutoff │
└───────────────────┬──────────────────┘
↓
┌──────────────────────────────────────┐
│ [Step 2] Read EventTime + Filter │
│ - Scan EventTime column in valid RGs │
│ - Sort values → Track top 10 rows │
└───────────────────┬──────────────────┘
↓
┌──────────────────────────────────────┐
│ [Step 3] Record Row Locations │
│ - Map top 10 EventTime to physical │
│ positions (RG_ID + Row_Offset) │
└───────────────────┬──────────────────┘
↓
┌──────────────────────────────────────┐
│ [Step 4] Fetch 10 Rows │
│ - Directly read rows from Parquet │
│ via recorded positions (non-seq) │
└───────────────────┬──────────────────┘
↓
Final Result ExploreCurrently, q23 takes approximately 6 seconds to execute. I have confirmed that DataFusion does not have the aforementioned optimizations and still scans a very large number of rows and columns. By the way, is there a convenient way in Some concernsParquet is composed of RowGroups. Is it difficult to read an individual page? In my previous work, I’ve seen optimizations for this scenario (based on DataFusion), but it used a custom columnar storage format, which was easier to implement. At that time, when working with very large datasets (similar to the hits dataset), the query time for "order by limit" was reduced to around 2 seconds. SummaryThe reading process of the entire data can be delayed by using "order by" on columns, which is very effective for the "order by limit" scenario. I'm not sure if DataFusion is currently doing this. |
Thank you @acking-you for the idea. Does it similar to parquet filter pushdown? We are already trying to make it default. With parquet filter pushdown, we will only scan the filtered pages, but the topk filter pushdown is still in progress. |
This optimization targets the The approach is quite easy to understand. It involves completing the sorting by only reading the Take
|
I also noticed that recently, a PR was merged in ClickHouse that does exactly what was described above: ClickHouse/ClickHouse#55518, with the corresponding issue: ClickHouse/ClickHouse#45868. @zhuqi-lucas |
@acking-you have you seen #15301? |
Thank you for your hint. I haven't looked into this part of the code yet. I only reviewed the optimization method you described in #15037. It seems that there is no mention of deferred materialization for "order by limit" (perhaps I missed it since the content of the issue is quite long). So, have we considered optimizing "order by limit" in two phases? I'm planning to review and study the PR you mentioned over the weekend. Thanks again for your reply. |
DataFusion already does late materialization: if orders filters by least to most expensive, then scans only the columns that the filter needs. Once it applies all filters it then materializes the projection. It's not the default yet but it will be soon. Please see #3463 which @zhuqi-lucas linked to above. |
I'm sorry. I thought the late materialization was for filter column without considering order by column. This is my fault. I should have looked deeper. |
Well the point of this recent work is to create filters from the order by clause :) |
@acking-you I agree with your analysis and arrived at a similar conclusion in this ticket:
I also agree The deferred materialization is key to improving performance massively. I believe this is the effect of #3463 though it does not use that term
So TLDR is by combing the following two items I think DataFusion will have the equivalent of materialized filter |
Is your feature request related to a problem or challenge?
This is a collection of various items related to "dynamic filtering"
Roughly speaking dynamic filters are filters who values are known during runtime but not planning (for example from the values of topk)
Here is some more background:
Related items
ORDER BY LIMIT
queries) #15037ParquetSource::pruning_predicate
#15534The text was updated successfully, but these errors were encountered: