Cache Parquet Metadata #15582
Comments
I am working on this for I would be happy to share / upstream any work I do on this if there is interest.
Thanks @matthewmturner -- what I think would be really valuable is if you could prove or disprove the theory that caching parsed metadata improves local TPCH performance. I think it is already pretty well understood that avoiding requests to remote object stores helps, and many downstream systems already do this (influx, pydantic, etc.). However, we haven't built in caching for metadata locally because I think we assumed the overhead wasn't very big compared to the complexity of a built-in metadata cache. Maybe we should re-evaluate that assessment.
I'll mention that we now avoid reading metadata entirely for a lot of queries using an approach along the lines of #15585
Is your feature request related to a problem or challenge?
When looking at some Samply profiles of ClickBench queries on my laptop, it appears there are several times where processing stalls due to parsing parquet metadata:
To reproduce, profile using Samply
To reproduce, get the ClickBench dataset
cd benchmarks
./bench.sh data clickbench_1
Then run
datafusion-cli -c "SELECT \"WatchID\", \"ClientIP\", COUNT(*) AS c, SUM(\"IsRefresh\"), AVG(\"ResolutionWidth\") FROM 'data/hits.parquet' WHERE \"SearchPhrase\" <> '' GROUP BY \"WatchID\", \"ClientIP\" ORDER BY c DESC LIMIT 10;"
Profile with samply (you must build datafusion-cli with `--profile profiling`):
Describe the solution you'd like
I think we should look into caching this metadata.
There is a bunch of prior art like
Also in theory this API should allow metadata caching:
But I don't think there is a default implementation and it isn't hooked up
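To make the idea concrete, here is a minimal sketch of what an in-memory cache for parsed metadata could look like. It is not DataFusion's API: `ParsedMetadata`, `FileKey`, and `MetadataCache` are hypothetical names, and the `parse` closure stands in for whatever actually reads and decodes the Parquet footer. The key includes size and last-modified time (an assumption about how invalidation might work) so that a rewritten file does not return stale metadata.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for decoded Parquet footer metadata.
#[derive(Debug)]
pub struct ParsedMetadata {
    pub num_rows: u64,
}

/// Hypothetical cache key: the path plus size and last-modified time,
/// so a file that has been rewritten in place misses the cache.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub struct FileKey {
    pub path: String,
    pub size: u64,
    pub last_modified: u64,
}

/// Unbounded in-memory cache of parsed metadata, shared across queries.
#[derive(Default)]
pub struct MetadataCache {
    entries: Mutex<HashMap<FileKey, Arc<ParsedMetadata>>>,
}

impl MetadataCache {
    pub fn new() -> Self {
        Self::default()
    }

    /// Return the cached metadata for `key`, calling `parse` only on a miss.
    /// The lock is held while parsing, which keeps the sketch simple but
    /// serializes concurrent misses; a real cache would likely avoid that.
    pub fn get_or_parse<F>(&self, key: &FileKey, parse: F) -> Arc<ParsedMetadata>
    where
        F: FnOnce() -> ParsedMetadata,
    {
        let mut entries = self.entries.lock().unwrap();
        if let Some(hit) = entries.get(key) {
            return Arc::clone(hit);
        }
        let parsed = Arc::new(parse());
        entries.insert(key.clone(), Arc::clone(&parsed));
        parsed
    }

    /// Number of files whose metadata is currently cached.
    pub fn len(&self) -> usize {
        self.entries.lock().unwrap().len()
    }
}
```

Callers would hold one `MetadataCache` per session and call `get_or_parse` wherever the footer is decoded today. Returning an `Arc` means repeated scans of the same file share one parsed copy instead of re-decoding it. A real version would also need a size limit and an eviction policy, which this sketch leaves out.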
Describe alternatives you've considered
What I would suggest doing first is
If you see significant performance improvements with 2, then we can figure out how to get it in for real
Additional context
No response