-
Some ideas:
-
I also think @Ted-Jiang added code (not yet released) in #7570 that caches Parquet data statistics. Maybe this could help the use case described in this PR as well. The use case there was a little different (reusing the statistics within a session, rather than within a query).
-
Hello!
We've been experimenting with DataFusion (31.0) over the last couple of days and comparing its performance with our existing ClickHouse setup. To do so, we exported a 5 GB Parquet dataset and, after building some UDFs, managed to replicate some of our queries.
In the end, we are running a single, fairly simple query over the same Parquet dataset on the same Mac with both DataFusion and ClickHouse. ClickHouse consistently answers in about 700 ms, while DataFusion takes 1.2 s.
I've tried multiple settings, verified that our UDF was not the cause, and checked that ClickHouse was not using a cache, but I couldn't make DataFusion any faster. According to EXPLAIN ANALYZE, the poor performance comes from the Parquet phase.
I have to confess that we are beginners in Rust and we might have missed something, hence this message.
Here is the EXPLAIN ANALYZE output for our query, in case it helps:
We also noticed in some other queries that, when reading multiple Parquet files rather than a single one, performance was much worse relative to ClickHouse.
So, is there anything we might have missed that is general knowledge and could explain this performance?
Thanks for your help!