-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[DISCUSSION] Make DataFusion the fastest engine for querying parquet data in ClickBench #12821
Comments
Changes I think will make these queries significantly faster:
These optimizations are general purpose, not specific to Clickhouse I don't think |
Nice! I think one bigger future interesting direction would be further vectorization of core hash aggregate algorithm (i.e. treating matches as candidates and doing e.g. equality checks in a vectorized way to allow for more specialization / more efficient code). |
🤔 As reviewing #12697 , seems we can still continue to improve partial skipping? And I think maybe we can make clearer about when partial can help, and when partial will even get slower? |
In my mind the challenge with tweaking the "switch to partial mode" threshold setting is that some queries will likely get faster and some will likely get slower. If we can justify changing the default setting to some different constant I think it will be fine. However, if we are going to add more complex logic to decide when to switch modes in my opinion it needs to be significantly better than a static threshold (where significantly means "always better" or close to it) |
Got it, @jayzhan211 have tried some other values of And I have some thoughs like removing the Now, we take |
#12697 (comment) Only Q0 slows down, but given it has nothing to do with grouping, I think we can ignore it. This number is run on another branch that only change the configuration value, so I think another approach is to remove |
Can we use nightly rust that enable std::simd for vectorization? Although in arrow-rs, the simd code is rewritten with auto-vectorization, but when I check the generated asm, I didn't see vector instruction for all the function (some exists, some doesn't). I think it would be nice to have explicitly simd to ensure the code is always vectorized and not disappear because of the code change or the llvm change. |
@jayzhan211 Yeah, this sounds like a good idea. We could start stepping into a direction to make the execution engine as performant as Velox. Especially having arrow be the format should allow us to maximize our use of vectorized execution. |
I think @tustvold found that using manually written simd kernels is quite hard to get faster than the auto vectorized code (aka using the vector instructions) made by LLVM and also harder to maintain If possible I would suggest we instead focus on improving the code so that LLVM is better able to auto vectorize code. This is some combination of looking at the resulting assembly code, and then making the inner loops simpler (e.g. via |
I found that LLVM is relatively good at vectorizing vertical operations provided:
This last point is likely why you aren't seeing anything, the default x86 ISA is over a decade old at this point and doesn't support pretty much any SIMD instructions. See the Performance Tips section at the end of - https://crates.io/crates/arrow My 2 cents is to get as far as you can without reaching for std::simd, there is a massive maintainance overhead and with care LLVM can produce code that performs better than naively written manual SIMD. We used to have a fair bit of manual SIMD in arrow-rs, and over time we've removed it as the auto-vectorized code was faster. I'd recommend getting familiar with tools like https://rust.godbolt.org/ (again being sure to set RUSTFLAGS) and only once you've exhausted that avenue think of reaching for SIMD. Generally the hard part is getting the algorithm structured in such a way that it can be vectorised, regardless of what goes and generates those instructions. |
Thank you @tustvold -- that content is so good I made a PR to propose putting it in the readme of arrow-rs: apache/arrow-rs#6554 |
After a few more PRs for StringView I think we are pretty close: #12092 (comment) I'll try and run the numbers at some point to compare to duckdb, but DataFusion is certainly quite a bit faster than 40.0.0 now and will be even more so once we complete the StringView work |
StringView by default is finally merged into DataFusion: #13101 |
@Rachelint has another non trivial group by performance improvement that is very close: #12996 |
Update here: the results from @pmcgleenon are looking really nice: #13099 (comment) Also, BTW, 43.0.0 doesn't include the work from @Rachelint that will likely improve things a few more percent overall (substantially for some queries):
|
Love this 🚀 🚀 🚀 |
Is your feature request related to a problem or challenge?
I am mostly writing this up to record what I think is an ongoing work with @jayzhan211 @Rachelint @korowa and myself
TLDR, we are working on (and getting pretty close) to having DataFusion be the fastest single node engine for querying parquet files in ClickBench
Background:
https://benchmark.clickhouse.com/ shows the results of ClickBench
ClickBench the benchmark and is described here https://github.com/ClickHouse/ClickBench. I am not personally interested in proprietary file formats that require special loading
Here is the current leaderboard for partitioned parquet reflecting DataFusion 40.0.0:
Describe the solution you'd like
I would like DataFusion to be the fastest
Describe alternatives you've considered
No response
Additional context
This is also inspired by @ozankabak 's call to action on #11442
The scripts to run with datafusion are here: https://github.com/ClickHouse/ClickBench/tree/main/datafusion
Last update is here: ClickHouse/ClickBench#210
The text was updated successfully, but these errors were encountered: