Perf: Support automatically concat_batches for sort which will improve performance #15375
take

cc @alamb

┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query ┃ main ┃ concat_batches_for_sort ┃ Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ Q1 │ 2241.04ms │ 1816.69ms │ +1.23x faster │
│ Q2 │ 1841.01ms │ 1496.73ms │ +1.23x faster │
│ Q3 │ 12755.85ms │ 12770.18ms │ no change │
│ Q4 │ 4433.49ms │ 3278.70ms │ +1.35x faster │
│ Q5 │ 4414.15ms │ 4409.04ms │ no change │
│ Q6 │ 4543.09ms │ 4597.32ms │ no change │
│ Q7 │ 8012.85ms │ 9026.30ms │ 1.13x slower │
│ Q8 │ 6572.37ms │ 6049.51ms │ +1.09x faster │
│ Q9 │ 6734.63ms │ 6345.69ms │ +1.06x faster │
│ Q10 │ 9896.16ms │ 9564.17ms │ no change │
└──────────────┴────────────┴─────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary ┃ ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (main) │ 61444.64ms │
│ Total Time (concat_batches_for_sort) │ 59354.33ms │
│ Average Time (main) │ 6144.46ms │
│ Average Time (concat_batches_for_sort) │ 5935.43ms │
│ Queries Faster │ 5 │
│ Queries Slower │ 1 │
│ Queries with No Change │ 4 │
└────────────────────────────────────────┴────────────┘
This is a great observation, and the POC optimization has a high ROI. Here are some additional thoughts:

This is just intuition, but I think the sorting phase should be faster than the sort-preserving merge phase: the sorting implementation is much simpler, since it can rely on the existing optimized arrow row format and the quicksort implementation from the standard library. The merging phase, on the contrary, uses an in-house implementation of a loser tree heap; it's a bit complex, so it may also be hard to optimize manually.

Example: there is a sort query to run in 4 partitions, and each partition will process 100 input batches.

Implementation: the POC will copy the batches with
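To make the intuition above concrete, here is a minimal, self-contained sketch (not DataFusion's actual code) contrasting the two strategies on plain `Vec<i32>` "batches": concatenating everything and sorting once with the standard library's quicksort, versus sorting each batch and then k-way merging the sorted runs with a min-heap standing in for the loser tree. Both must produce identical output; the claim in the comment is only about which is faster.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Concat-then-sort: copy all batches into one buffer and sort once.
/// Analogous to concatenating record batches and sorting the row format.
fn concat_and_sort(batches: &[Vec<i32>]) -> Vec<i32> {
    let mut all: Vec<i32> = batches.iter().flatten().copied().collect();
    all.sort_unstable(); // std's pattern-defeating quicksort
    all
}

/// Sort each batch, then k-way merge the sorted runs with a min-heap
/// (a simple stand-in for the loser tree used in the merge phase).
fn sort_and_merge(batches: &[Vec<i32>]) -> Vec<i32> {
    let sorted: Vec<Vec<i32>> = batches
        .iter()
        .map(|b| {
            let mut s = b.clone();
            s.sort_unstable();
            s
        })
        .collect();

    // Heap entries are (value, run index, position in run); Reverse makes
    // the max-heap behave as a min-heap.
    let mut heap = BinaryHeap::new();
    for (run, values) in sorted.iter().enumerate() {
        if let Some(&first) = values.first() {
            heap.push(Reverse((first, run, 0usize)));
        }
    }

    let mut out = Vec::new();
    while let Some(Reverse((value, run, idx))) = heap.pop() {
        out.push(value);
        if idx + 1 < sorted[run].len() {
            heap.push(Reverse((sorted[run][idx + 1], run, idx + 1)));
        }
    }
    out
}

fn main() {
    // One partition with 100 small input batches, as in the example above.
    let batches: Vec<Vec<i32>> = (0..100)
        .map(|i| (0..64).map(|j| (i * 31 + j * 17) % 1000).collect())
        .collect();
    assert_eq!(concat_and_sort(&batches), sort_and_merge(&batches));
    println!("both strategies produce identical sorted output");
}
```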
Thank you @2010YOUY01 for the review and good suggestions; I will improve my POC code and add more testing.
Really nice observation! I think we should drive this further. Some further observations I had when looking at the current implementation on master for the in-memory merging part:
Thank you @Dandandan, I addressed your comments. We can make this the first version, and in the future we may improve it as described by @2010YOUY01. I also think the current implementation is reasonable because `sort_in_place_threshold_bytes` is an already-used config; we can first reuse it to decide when to concat batches, which is safe.
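The decision described above can be sketched as a simple byte-size gate. This is a hedged illustration, not DataFusion's actual code: the `Batch` struct and `should_concat` helper are hypothetical, and the 1 MiB threshold is only illustrative; the real config key `sort_in_place_threshold_bytes` comes from the discussion above.

```rust
/// Hypothetical stand-in for a record batch; only its memory size matters here.
struct Batch {
    bytes: usize,
}

/// Returns true when the total buffered size is small enough that
/// concatenating into a single batch and sorting in place is preferable
/// to a multi-way sort-preserving merge.
fn should_concat(batches: &[Batch], sort_in_place_threshold_bytes: usize) -> bool {
    let total: usize = batches.iter().map(|b| b.bytes).sum();
    total <= sort_in_place_threshold_bytes
}

fn main() {
    let threshold = 1024 * 1024; // illustrative 1 MiB threshold

    let small: Vec<Batch> = (0..10).map(|_| Batch { bytes: 4 * 1024 }).collect();
    let large: Vec<Batch> = (0..10).map(|_| Batch { bytes: 4 * 1024 * 1024 }).collect();

    // ~40 KiB buffered: concat and sort in place.
    assert!(should_concat(&small, threshold));
    // ~40 MiB buffered: fall back to sort-preserving merge.
    assert!(!should_concat(&large, threshold));
    println!("threshold gate behaves as expected");
}
```

Reusing an existing config this way keeps the first version safe: no new knob is introduced, and the behavior change is bounded by a limit operators already tune.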
@2010YOUY01 that sounds like a very promising future direction. I might try experimenting with this soon if no one beats me to it.
Is your feature request related to a problem or challenge?
We should investigate and improve the sort code to support `concat_batches` for more cases beyond the case above.
See details about the performance improvement:
#15348 (comment)
Describe the solution you'd like
No response
Describe alternatives you've considered
No response
Additional context
No response