
Cache Parquet Metadata #15582


Open
alamb opened this issue Apr 4, 2025 · 3 comments
Labels
enhancement New feature or request

Comments

alamb commented Apr 4, 2025

Is your feature request related to a problem or challenge?

When looking at some Samply profiles of ClickBench queries on my laptop, it appears there are several points where processing stalls while parsing parquet metadata:

[screenshot: Samply profile showing the stall while parsing parquet metadata]

To reproduce, first get the ClickBench dataset:

cd benchmarks
./bench.sh data clickbench_1

Then run

datafusion-cli -c "SELECT \"WatchID\", \"ClientIP\", COUNT(*) AS c, SUM(\"IsRefresh\"), AVG(\"ResolutionWidth\") FROM 'data/hits.parquet' WHERE \"SearchPhrase\" <> '' GROUP BY \"WatchID\", \"ClientIP\" ORDER BY c DESC LIMIT 10;"

Profile with Samply (you must build datafusion-cli with `--profile profiling`):

samply record datafusion-cli -c "SELECT \"WatchID\", \"ClientIP\", COUNT(*) AS c, SUM(\"IsRefresh\"), AVG(\"ResolutionWidth\") FROM 'data/hits.parquet' WHERE \"SearchPhrase\" <> '' GROUP BY \"WatchID\", \"ClientIP\" ORDER BY c DESC LIMIT 10;"

Describe the solution you'd like

I think we should look into caching this metadata.

There is a bunch of prior art like

Also in theory this API should allow metadata caching:

But I don't think there is a default implementation, and it isn't hooked up.

Describe alternatives you've considered

What I would suggest doing first is:

  1. Do profiling / confirm you see the same thing
  2. Make a quick and dirty global parquet metadata cache (just put it into some global variable keyed on filename)

If you see significant performance improvements with 2, then we can figure out how to get it in for real
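A minimal sketch of what step 2 could look like, using only the standard library. The `get_or_parse` helper and `Metadata` stand-in type are invented for illustration; a real experiment would store `Arc<ParquetMetaData>` from the parquet crate instead:

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex, OnceLock};

// Stand-in for parquet::file::metadata::ParquetMetaData so the sketch
// is self-contained; a real cache would hold Arc<ParquetMetaData>.
type Metadata = Vec<u8>;

// Process-wide cache keyed on filename (hypothetical name).
static METADATA_CACHE: OnceLock<Mutex<HashMap<String, Arc<Metadata>>>> = OnceLock::new();

/// Return cached metadata for `path`, invoking `parse` only on a cache miss.
fn get_or_parse(path: &str, parse: impl FnOnce() -> Metadata) -> Arc<Metadata> {
    let cache = METADATA_CACHE.get_or_init(|| Mutex::new(HashMap::new()));
    let mut map = cache.lock().unwrap();
    map.entry(path.to_string())
        .or_insert_with(|| Arc::new(parse()))
        .clone()
}

fn main() {
    let first = get_or_parse("data/hits.parquet", || vec![1, 2, 3]);
    // Second lookup hits the cache; the parse closure is never invoked.
    let second = get_or_parse("data/hits.parquet", || unreachable!());
    assert!(Arc::ptr_eq(&first, &second));
}
```

The cache is intentionally unbounded and never invalidated, which is fine for a quick benchmark but not for real use (files can change on disk).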

Additional context

No response

matthewmturner (Contributor) commented:
I am working on this for dft right now actually and I plan on integrating it into the observability feature that I have been working on (where different observability metrics are exposed as tables). Specifically, I made a new MapTable that allows querying and updating different types of map data structures (for now just an IndexMap but i plan to try others like BTreeMap as well). The idea is that the same map that is used for caching both file and parquet metadata can be queried to get details such as when the entry was created, last update, # of hits, etc.

I would be happy to share / upstream any work I do on this if there is interest.
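As a rough illustration of the idea above (all names here are hypothetical, not dft's actual code): each cache entry can carry its own stats alongside the value, so a table view over the map can report creation time, last update, and hit counts. A std `HashMap` stands in for the `IndexMap` mentioned:

```rust
use std::collections::HashMap;
use std::time::SystemTime;

// Hypothetical entry shape: the cached value plus observability stats.
struct CacheEntry<V> {
    value: V,
    created: SystemTime,
    last_updated: SystemTime,
    hits: u64,
}

struct StatsCache<V> {
    map: HashMap<String, CacheEntry<V>>,
}

impl<V> StatsCache<V> {
    fn new() -> Self {
        Self { map: HashMap::new() }
    }

    /// Insert or replace, recording creation and update times.
    fn put(&mut self, key: &str, value: V) {
        let now = SystemTime::now();
        match self.map.get_mut(key) {
            Some(e) => {
                e.value = value;
                e.last_updated = now;
            }
            None => {
                self.map.insert(
                    key.to_string(),
                    CacheEntry { value, created: now, last_updated: now, hits: 0 },
                );
            }
        }
    }

    /// Look up a value, bumping the hit counter on success.
    fn get(&mut self, key: &str) -> Option<&V> {
        self.map.get_mut(key).map(|e| {
            e.hits += 1;
            &e.value
        })
    }

    fn hits(&self, key: &str) -> Option<u64> {
        self.map.get(key).map(|e| e.hits)
    }
}

fn main() {
    let mut cache = StatsCache::new();
    cache.put("data/hits.parquet", "parsed metadata");
    let _ = cache.get("data/hits.parquet");
    let _ = cache.get("data/hits.parquet");
    assert_eq!(cache.hits("data/hits.parquet"), Some(2));
}
```

Exposing `map` as a queryable table would then surface these per-entry stats directly.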


alamb commented Apr 6, 2025

> I would be happy to share / upstream any work I do on this if there is interest.

Thanks @matthewmturner -- what I think would be really valuable is if you could prove/disprove the theory that caching parsed metadata improves local TPCH performance

I think it is already pretty well understood that avoiding requests to remote object stores helps, and many downstream systems already do this (influx, pydantic, etc.).

However, we haven't built in caching for metadata locally because I think we assumed the overhead wasn't very big compared to the complexity of a built-in metadata cache -- maybe we should re-evaluate that assessment.


adriangb commented May 7, 2025

I'll mention that we now avoid reading metadata entirely for a lot of queries using an approach along the lines of #15585
