parquet: support predicate pushdown for partial reads / exclusions #47231

Open
derekperkins opened this issue Jun 19, 2024 · 0 comments
Feature request

Is your feature request related to a problem? Please describe.

When accessing a remote parquet file using FILES, the entire file is fetched across the network before the query executes. This can mean waiting for hundreds of megabytes to download, only to then see an error like #37169 because an encoding isn't supported.

As of StarRocks v3.3.0-rc02, an unsupported encoding in a parquet file makes the entire file unqueryable, even if the affected column isn't referenced by the query. Only the columns named in the SELECT should be fetched, which would save network bandwidth and let StarRocks read the supported columns even when other columns in the file aren't supported.

This is listed in the 2024 roadmap, but I couldn't find a tracking issue for it.

Describe the solution you'd like

Support parquet predicate pushdown, so that only specific metadata and/or columns are read.

  1. By looking at the metadata first, StarRocks could report unsupported encodings without reading the entire file.
  2. By utilizing object store range reads, StarRocks could fetch only the column data requested by the query, rather than the whole file.
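
To make the two points above concrete, here is a rough Python sketch of the read pattern I have in mind. It is not StarRocks code. It relies only on the parquet file layout (a plaintext-footer file ends with a 4-byte little-endian footer length followed by the `PAR1` magic). Everything else is an assumption for illustration: `read_range` is a hypothetical callback standing in for an object store range GET, and the `encodings` / `chunks` dictionaries stand in for what a real reader would decode from the Thrift footer.

```python
import struct
from typing import Callable, Dict, List, Set, Tuple

MAGIC = b"PAR1"
TAIL_LEN = 8  # 4-byte little-endian footer length + 4-byte magic

# Hypothetical callback: (offset, length) -> bytes, e.g. an HTTP Range GET.
RangeReader = Callable[[int, int], bytes]
ByteRange = Tuple[int, int]  # (offset, length)


def footer_range(read_range: RangeReader, file_size: int) -> ByteRange:
    """Locate the footer metadata by reading only the last 8 bytes."""
    tail = read_range(file_size - TAIL_LEN, TAIL_LEN)
    if len(tail) != TAIL_LEN or tail[4:] != MAGIC:
        raise ValueError("missing trailing PAR1 magic")
    (footer_len,) = struct.unpack("<I", tail[:4])
    offset = file_size - TAIL_LEN - footer_len
    if offset < len(MAGIC):
        raise ValueError("footer length larger than file")
    return offset, footer_len


def unsupported_encodings(
    encodings: Dict[str, Set[str]], wanted: List[str], supported: Set[str]
) -> Dict[str, Set[str]]:
    """Point 1: check only the queried columns, using metadata alone.

    `encodings` is assumed to be decoded from the footer already.
    """
    bad = {name: encodings[name] - supported for name in wanted}
    return {name: encs for name, encs in bad.items() if encs}


def plan_column_reads(
    chunks: Dict[str, List[ByteRange]], wanted: List[str]
) -> List[ByteRange]:
    """Point 2: byte ranges for the queried columns, merging adjacent ones.

    `chunks` is assumed to map column name -> column chunk byte ranges.
    """
    merged: List[ByteRange] = []
    for off, length in sorted(r for name in wanted for r in chunks[name]):
        if merged and off <= merged[-1][0] + merged[-1][1]:
            prev_off, prev_len = merged[-1]
            merged[-1] = (prev_off, max(prev_len, off + length - prev_off))
        else:
            merged.append((off, length))
    return merged
```

With this shape, a query that touches one column costs one small tail read, one footer read, and one range read per (merged) column chunk, and an unsupported encoding in an unreferenced column never comes into play.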

Describe alternatives you've considered

DuckDB, ClickHouse, etc.

Additional context
