
[DISCUSSION] Make DataFusion the fastest engine for querying parquet data in ClickBench #12821

Open
alamb opened this issue Oct 8, 2024 · 18 comments
Labels
enhancement New feature or request

Comments


alamb commented Oct 8, 2024

Is your feature request related to a problem or challenge?

I am mostly writing this up to record the ongoing work by @jayzhan211 @Rachelint @korowa and myself.

TL;DR: we are working on (and getting pretty close to) making DataFusion the fastest single-node engine for querying parquet files in ClickBench.

Background:

https://benchmark.clickhouse.com/ shows the results of ClickBench.

ClickBench is a benchmark described at https://github.com/ClickHouse/ClickBench. I am not personally interested in proprietary file formats that require special loading.

Here is the current leaderboard for partitioned parquet reflecting DataFusion 40.0.0:

[Screenshot: ClickBench partitioned parquet leaderboard, 2024-10-08]

Describe the solution you'd like

I would like DataFusion to be the fastest

Describe alternatives you've considered

No response

Additional context

This is also inspired by @ozankabak's call to action on #11442

The scripts to run with datafusion are here: https://github.com/ClickHouse/ClickBench/tree/main/datafusion

Last update is here: ClickHouse/ClickBench#210

@alamb added the enhancement label Oct 8, 2024

alamb commented Oct 8, 2024

Changes I think will make these queries significantly faster:

I don't think these optimizations are specific to ClickHouse; they are general purpose.

@alamb changed the title from "[DISCUSSION] Make DataFusion is the fastest engine for querying parquet data in ClickBench" to "[DISCUSSION] Make DataFusion the fastest engine for querying parquet data in ClickBench" Oct 8, 2024
@jayzhan211 (Contributor)

Reusing hashes for repartition (#12526) and avoiding copies in coalesce (#7957) could probably also provide some improvement.


Dandandan commented Oct 9, 2024

Nice!

I think one bigger interesting future direction would be further vectorization of the core hash aggregate algorithm (i.e. treating matches as candidates and doing e.g. equality checks in a vectorized way to allow for more specialization / more efficient code).
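The vectorized-candidates idea can be sketched roughly like this (hypothetical Rust, not DataFusion's actual hash-aggregate code; `verify_candidates` and the flat `u64` keys are illustrative assumptions):

```rust
// Hypothetical sketch of the vectorized direction described above: instead of
// verifying each hash-table match with a per-row branch inside the probe loop,
// collect all candidate (input_row, group_idx) pairs first, then verify key
// equality over the whole batch in a tight second pass. The branch-free inner
// loop is much friendlier to the compiler's vectorizer.
fn verify_candidates(
    input_keys: &[u64],            // key column of the incoming batch
    group_keys: &[u64],            // keys already stored in the hash table
    candidates: &[(usize, usize)], // (input_row, group_idx) pairs from hash probe
) -> Vec<bool> {
    candidates
        .iter()
        .map(|&(row, group)| input_keys[row] == group_keys[group])
        .collect()
}

fn main() {
    let input_keys = [10, 20, 30];
    let group_keys = [10, 99];
    // Rows 0 and 2 hashed to existing groups; only row 0 is a true key match.
    let matches = verify_candidates(&input_keys, &group_keys, &[(0, 0), (2, 1)]);
    assert_eq!(matches, vec![true, false]);
    println!("ok");
}
```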


Rachelint commented Oct 9, 2024

🤔 While reviewing #12697, it seems we can still continue to improve partial aggregation skipping. Right now we can modify the threshold to get a performance improvement, but that may be a bit tricky.

And I think maybe we can make it clearer when partial aggregation helps, and when it will actually be slower?

@alamb mentioned this issue Oct 9, 2024

alamb commented Oct 9, 2024

> And I think maybe we can make it clearer when partial aggregation helps, and when it will actually be slower?

In my mind, the challenge with tweaking the "switch to partial mode" threshold is that some queries will likely get faster and some will likely get slower. If we can justify changing the default setting to some different constant, I think that is fine. However, if we are going to add more complex logic to decide when to switch modes, in my opinion it needs to be significantly better than a static threshold (where "significantly" means "always better", or close to it).
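For context, the static-threshold decision being discussed can be sketched like this (a simplified illustration, not DataFusion's actual code; the function and parameter names are hypothetical stand-ins for the two config settings):

```rust
// Hypothetical sketch of a static-threshold probe: after sampling
// `rows_threshold` input rows, compare the ratio of distinct groups to input
// rows against a fixed ratio threshold. A high ratio means the partial
// aggregate barely reduces row count, so skipping it avoids wasted hashing.
fn should_skip_partial(
    num_groups: usize,
    num_input_rows: usize,
    rows_threshold: usize,
    ratio_threshold: f64,
) -> bool {
    // Only decide once we have seen enough rows to trust the sample.
    if num_input_rows < rows_threshold {
        return false;
    }
    (num_groups as f64) / (num_input_rows as f64) > ratio_threshold
}

fn main() {
    // Nearly every row is its own group: skip the partial stage.
    assert!(should_skip_partial(95_000, 100_000, 100_000, 0.8));
    // Heavy reduction: keep the partial stage.
    assert!(!should_skip_partial(10, 100_000, 100_000, 0.8));
    // Not enough rows sampled yet: keep going.
    assert!(!should_skip_partial(95_000, 99_999, 100_000, 0.8));
    println!("ok");
}
```

The tension described above is visible here: any fixed `ratio_threshold` will be wrong for some query in the suite, which is why a smarter policy needs to dominate the constant, not just beat it on average.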


Rachelint commented Oct 11, 2024

> And I think maybe we can make it clearer when partial aggregation helps, and when it will actually be slower?

> In my mind, the challenge with tweaking the "switch to partial mode" threshold is that some queries will likely get faster and some will likely get slower. If we can justify changing the default setting to some different constant, I think that is fine. However, if we are going to add more complex logic to decide when to switch modes, in my opinion it needs to be significantly better than a static threshold (where "significantly" means "always better", or close to it).

Got it. @jayzhan211 has tried some other values of skip_partial_aggregation_probe_ratio_threshold and skip_partial_aggregation_probe_rows_threshold, and some queries seem to improve noticeably in #12697.

I also have some thoughts, like removing the is_locked field.

Currently we use skip_partial_aggregation_probe_rows_threshold as a sample size to decide whether to skip: once it is exceeded, we never check again. But I found that some partial operators could benefit from skipping, yet never get the chance to switch because of is_locked.
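The "locked" behavior being described could be sketched like this (a simplified illustration, not the actual aggregation stream code; the struct and field names are hypothetical):

```rust
// Simplified illustration of a "locked" skip probe: once `rows_threshold`
// input rows have been sampled, the decision is made exactly once and then
// frozen, so a stream whose group cardinality grows later never gets a
// second chance to switch into skipping mode.
struct SkipProbe {
    rows_seen: usize,
    rows_threshold: usize,
    ratio_threshold: f64,
    locked: bool,
    skip: bool,
}

impl SkipProbe {
    fn update(&mut self, batch_rows: usize, groups_so_far: usize) {
        if self.locked {
            return; // decision frozen; removing this early-return would re-enable switching
        }
        self.rows_seen += batch_rows;
        if self.rows_seen >= self.rows_threshold {
            let ratio = groups_so_far as f64 / self.rows_seen as f64;
            self.skip = ratio > self.ratio_threshold;
            self.locked = true;
        }
    }
}

fn main() {
    let mut probe = SkipProbe {
        rows_seen: 0,
        rows_threshold: 100,
        ratio_threshold: 0.8,
        locked: false,
        skip: false,
    };
    probe.update(100, 10); // low-cardinality sample: decides "don't skip" and locks
    assert!(probe.locked && !probe.skip);
    probe.update(100, 190); // cardinality exploded later, but the lock holds
    assert!(!probe.skip);
    println!("ok");
}
```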


jayzhan211 commented Oct 11, 2024

#12697 (comment) Only Q0 slows down, but given it has nothing to do with grouping, I think we can ignore it.

These numbers were run on another branch that only changes the configuration values, so I think another approach is to remove skip_partial_aggregation_probe_rows_threshold and its related logic entirely and set skip_partial_aggregation_probe_ratio_threshold to 0.1.

@jayzhan211 (Contributor)

> I think one bigger interesting future direction would be further vectorization of the core hash aggregate algorithm

Can we use nightly Rust to enable std::simd for vectorization? In arrow-rs the SIMD code was rewritten to rely on auto-vectorization, but when I check the generated asm I don't see vector instructions for all of the functions (some have them, some don't). I think it would be nice to have explicit SIMD so the vectorization is guaranteed and doesn't silently disappear because of a code change or an LLVM change.

@jonathanc-n (Contributor)

@jayzhan211 Yeah, this sounds like a good idea. We could start moving in a direction that makes the execution engine as performant as Velox. Especially with Arrow as the format, we should be able to maximize our use of vectorized execution.
Should I open an issue for this?


alamb commented Oct 13, 2024

> Can we use nightly Rust to enable std::simd for vectorization? In arrow-rs the SIMD code was rewritten to rely on auto-vectorization, but when I check the generated asm I don't see vector instructions for all of the functions (some have them, some don't). I think it would be nice to have explicit SIMD so the vectorization is guaranteed and doesn't silently disappear because of a code change or an LLVM change.

I think @tustvold found that manually written SIMD kernels are quite hard to make faster than the auto-vectorized code (i.e. code using vector instructions) that LLVM produces, and they are also harder to maintain.

If possible, I would suggest we instead focus on improving the code so that LLVM is better able to auto-vectorize it. This is some combination of looking at the resulting assembly and then making the inner loops simpler (e.g. via #[inline], removing bounds checks with get_unchecked, special-casing to avoid checking Option, etc.)
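As an illustration of the "make the inner loops simpler" suggestion (a generic example, not an actual DataFusion kernel): iterator-based loops over slices let LLVM elide bounds checks entirely, which is often all the vectorizer needs.

```rust
// A loop shape LLVM auto-vectorizes well: slice iterators (so no bounds
// checks survive into the loop body), no conditionals, a plain reduction.
fn dot(a: &[i64], b: &[i64]) -> i64 {
    a.iter().zip(b.iter()).map(|(x, y)| x * y).sum()
}

fn main() {
    let a = [1, 2, 3, 4];
    let b = [10, 20, 30, 40];
    assert_eq!(dot(&a, &b), 300);
    println!("ok");
}
```

Writing the same loop with `a[i]` / `b[i]` indexing over a shared length can leave bounds checks in the body, and a single panic path is often enough to make LLVM give up on vectorizing.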

@tustvold (Contributor)

I found that LLVM is relatively good at vectorizing vertical operations provided:

  • There are no conditionals within the loop body
  • You've been careful to avoid inlining too much, as the vectorizer gives up if the code is too complex
  • You aren't doing bitwise horizontal reductions or masking (although FWIW std::simd struggles with this as well)
  • You've enabled SIMD instructions in the target ISA

This last point is likely why you aren't seeing anything: the default x86 ISA is over a decade old at this point and supports almost none of the modern SIMD instructions. See the Performance Tips section at the end of https://crates.io/crates/arrow.

My 2 cents is to get as far as you can without reaching for std::simd; there is a massive maintenance overhead, and with care LLVM can produce code that performs better than naively written manual SIMD. We used to have a fair bit of manual SIMD in arrow-rs, and over time we've removed it as the auto-vectorized code was faster.

I'd recommend getting familiar with tools like https://rust.godbolt.org/ (again being sure to set RUSTFLAGS) and only once you've exhausted that avenue think about reaching for SIMD. Generally the hard part is structuring the algorithm so that it can be vectorized, regardless of what ends up generating those instructions.
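One common trick that follows the first bullet above is replacing a branch in the loop body with branch-free arithmetic (a generic sketch; compile with something like `RUSTFLAGS="-C target-cpu=native"` to actually enable a modern ISA, per the Performance Tips mentioned above):

```rust
// Branchy version: the `if` in the loop body tends to defeat the vectorizer.
fn count_positive_branchy(v: &[i32]) -> usize {
    let mut n = 0;
    for &x in v {
        if x > 0 {
            n += 1;
        }
    }
    n
}

// Branch-free version: the comparison becomes a 0-or-1 value that is summed,
// a pattern LLVM readily turns into SIMD compares and adds.
fn count_positive_branchless(v: &[i32]) -> usize {
    v.iter().map(|&x| (x > 0) as usize).sum()
}

fn main() {
    let v = [3, -1, 0, 7, -5, 2];
    assert_eq!(count_positive_branchy(&v), 3);
    assert_eq!(count_positive_branchless(&v), 3);
    println!("ok");
}
```

Both functions are correct; the point is that only the second shape reliably compiles to vector instructions, and tools like godbolt make the difference visible.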


alamb commented Oct 13, 2024

Thank you @tustvold -- that content is so good I made a PR to propose putting it in the readme of arrow-rs: apache/arrow-rs#6554


alamb commented Oct 15, 2024

After a few more PRs for StringView I think we are pretty close: #12092 (comment)

I'll try to run the numbers at some point to compare to DuckDB, but DataFusion is certainly quite a bit faster than 40.0.0 now, and will be even more so once we complete the StringView work.


alamb commented Oct 30, 2024

StringView by default is finally merged into DataFusion: #13101


alamb commented Nov 4, 2024

@Rachelint has another non-trivial group by performance improvement that is very close: #12996


alamb commented Nov 15, 2024

Update here: the results from @pmcgleenon are looking really nice: #13099 (comment)

[Image: benchmark results chart from #13099]

Also, BTW, 43.0.0 doesn't include the work from @Rachelint that will likely improve things by a few more percent overall (substantially for some queries):

> @Rachelint has another non-trivial group by performance improvement that is very close: #12996

@ozankabak (Contributor)

Love this 🚀 🚀 🚀


alamb commented Nov 16, 2024
