
Cache Parquet Metadata #15582


Open
alamb opened this issue Apr 4, 2025 · 3 comments
Labels
enhancement New feature or request

Comments

alamb commented Apr 4, 2025

Is your feature request related to a problem or challenge?

When looking at some Samply profiles of ClickBench queries on my laptop, it appears there are several points where processing stalls while parsing parquet metadata:

[screenshot: Samply profile showing the stall while parsing parquet metadata]

To reproduce, first get the ClickBench dataset:

cd benchmarks
./bench.sh data clickbench_1

Then run

datafusion-cli -c "SELECT \"WatchID\", \"ClientIP\", COUNT(*) AS c, SUM(\"IsRefresh\"), AVG(\"ResolutionWidth\") FROM 'data/hits.parquet' WHERE \"SearchPhrase\" <> '' GROUP BY \"WatchID\", \"ClientIP\" ORDER BY c DESC LIMIT 10;"

Profile with Samply (you must build datafusion-cli with `--profile profiling`):

samply record datafusion-cli -c "SELECT \"WatchID\", \"ClientIP\", COUNT(*) AS c, SUM(\"IsRefresh\"), AVG(\"ResolutionWidth\") FROM 'data/hits.parquet' WHERE \"SearchPhrase\" <> '' GROUP BY \"WatchID\", \"ClientIP\" ORDER BY c DESC LIMIT 10;"

Describe the solution you'd like

I think we should look into caching this metadata.

There is a bunch of prior art like

Also in theory this API should allow metadata caching:

But I don't think there is a default implementation, and it isn't hooked up.

Describe alternatives you've considered

What I would suggest doing first is:

  1. Do profiling / confirm you see the same thing
  2. Make a quick and dirty global parquet metadata cache (just put it into some global variable keyed on filename)

If you see significant performance improvements with 2, then we can figure out how to get it in for real
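A minimal sketch of what step 2 could look like, using only the standard library. The `get_or_parse` helper and `Metadata` stand-in type are invented for illustration; a real experiment would store `Arc<ParquetMetaData>` from the parquet crate instead:

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex, OnceLock};

// Stand-in for parquet::file::metadata::ParquetMetaData so the sketch
// is self-contained; a real cache would hold Arc<ParquetMetaData>.
type Metadata = Vec<u8>;

// Process-wide cache keyed on filename (hypothetical name).
static METADATA_CACHE: OnceLock<Mutex<HashMap<String, Arc<Metadata>>>> = OnceLock::new();

/// Return cached metadata for `path`, invoking `parse` only on a cache miss.
fn get_or_parse(path: &str, parse: impl FnOnce() -> Metadata) -> Arc<Metadata> {
    let cache = METADATA_CACHE.get_or_init(|| Mutex::new(HashMap::new()));
    let mut map = cache.lock().unwrap();
    map.entry(path.to_string())
        .or_insert_with(|| Arc::new(parse()))
        .clone()
}

fn main() {
    let first = get_or_parse("data/hits.parquet", || vec![1, 2, 3]);
    // Second lookup hits the cache; the parse closure is never invoked.
    let second = get_or_parse("data/hits.parquet", || unreachable!());
    assert!(Arc::ptr_eq(&first, &second));
}
```

The cache is intentionally unbounded and never invalidated, which is fine for a quick benchmark but not for real use (files can change on disk).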

Additional context

No response

matthewmturner (Contributor) commented:
I am working on this for dft right now actually and I plan on integrating it into the observability feature that I have been working on (where different observability metrics are exposed as tables). Specifically, I made a new MapTable that allows querying and updating different types of map data structures (for now just an IndexMap but i plan to try others like BTreeMap as well). The idea is that the same map that is used for caching both file and parquet metadata can be queried to get details such as when the entry was created, last update, # of hits, etc.

I would be happy to share / upstream any work I do on this if there is interest.
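As a rough illustration of the idea above (all names here are hypothetical, not dft's actual code): each cache entry can carry its own stats alongside the value, so a table view over the map can report creation time, last update, and hit counts. A std `HashMap` stands in for the `IndexMap` mentioned:

```rust
use std::collections::HashMap;
use std::time::SystemTime;

// Hypothetical entry shape: the cached value plus observability stats.
struct CacheEntry<V> {
    value: V,
    created: SystemTime,
    last_updated: SystemTime,
    hits: u64,
}

struct StatsCache<V> {
    map: HashMap<String, CacheEntry<V>>,
}

impl<V> StatsCache<V> {
    fn new() -> Self {
        Self { map: HashMap::new() }
    }

    /// Insert or replace, recording creation and update times.
    fn put(&mut self, key: &str, value: V) {
        let now = SystemTime::now();
        match self.map.get_mut(key) {
            Some(e) => {
                e.value = value;
                e.last_updated = now;
            }
            None => {
                self.map.insert(
                    key.to_string(),
                    CacheEntry { value, created: now, last_updated: now, hits: 0 },
                );
            }
        }
    }

    /// Look up a value, bumping the hit counter on success.
    fn get(&mut self, key: &str) -> Option<&V> {
        self.map.get_mut(key).map(|e| {
            e.hits += 1;
            &e.value
        })
    }

    fn hits(&self, key: &str) -> Option<u64> {
        self.map.get(key).map(|e| e.hits)
    }
}

fn main() {
    let mut cache = StatsCache::new();
    cache.put("data/hits.parquet", "parsed metadata");
    let _ = cache.get("data/hits.parquet");
    let _ = cache.get("data/hits.parquet");
    assert_eq!(cache.hits("data/hits.parquet"), Some(2));
}
```

Exposing `map` as a queryable table would then surface these per-entry stats directly.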


alamb commented Apr 6, 2025

> I would be happy to share / upstream any work I do on this if there is interest.

Thanks @matthewmturner -- what I think would be really valuable is if you could prove/disprove the theory that caching parsed metadata improves local TPCH performance

I think it is already pretty well understood that avoiding requests to remote object stores helps, and many downstream systems already do this (influx, pydantic, etc.).

However, we haven't built in caching for metadata locally because I think we assumed the overhead wasn't very big compared to the complexity of a built-in metadata cache -- maybe we should re-evaluate that assessment.


adriangb commented May 7, 2025

I'll mention that we now avoid reading metadata entirely for a lot of queries using an approach along the lines of #15585
