
API in ParquetExec to pass in RowSelections to ParquetExec (enable custom indexes, finer grained pushdown) #9929

Closed
@alamb

Description


Is your feature request related to a problem or challenge?

We are building and testing a specialized index for data stored in Parquet that can tell us which row offsets are needed from a Parquet file, based on additional information.

Currently, the parquet-rs Parquet reader allows specifying this type of information via ArrowReaderBuilder::with_row_selection.
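To make the skip/select idea concrete, here is a minimal standalone sketch of how a list of alternating runs maps to row ranges. The `Selector` enum and the `selected_ranges` function are hypothetical stand-ins written for illustration; they mimic the shape of a parquet-rs `RowSelector` list but are not the crate's types.

```rust
use std::ops::Range;

/// Hypothetical stand-in for a parquet-rs `RowSelector`:
/// a run of consecutive rows to either skip or select.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Selector {
    Skip(usize),
    Select(usize),
}

/// Convert a sequence of skip/select runs into the half-open
/// row ranges (relative to the start of the row group) that they select.
fn selected_ranges(selectors: &[Selector]) -> Vec<Range<usize>> {
    let mut offset = 0; // index of the next row not yet covered by a run
    let mut ranges = Vec::new();
    for s in selectors {
        match *s {
            Selector::Skip(n) => offset += n,
            Selector::Select(n) => {
                if n > 0 {
                    ranges.push(offset..offset + n);
                }
                offset += n;
            }
        }
    }
    ranges
}
```

Under this model, skipping 100 rows and then selecting 150 yields the single range 100..250.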

However, the DataFusion ParquetExec has no way to pass this information down. It does build its own

Describe the solution you'd like

What I would like is a way to provide something like a RowSelection for each row group.

Describe alternatives you've considered

Here is one possible API:

use parquet::arrow::arrow_reader::{RowSelection, RowSelector};

// `ParquetSelection` and `with_selection` are the proposed (not yet existing) API
let parquet_selection = ParquetSelection::new()
  // * rows 100-250 from row group 1
  .select(1, RowSelection::from(vec![
    RowSelector::skip(100),
    RowSelector::select(150),
  ]))
  // * rows 50-100 and 200-300 in row group 2
  .select(2, RowSelection::from(vec![
    RowSelector::skip(50),
    RowSelector::select(50),
    RowSelector::skip(100),
    RowSelector::select(100),
  ]));

let parquet_exec = ParquetExec::new(...)
  .with_selection(parquet_selection);
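One way the proposed container could be shaped, as a rough sketch only: a map from row group index to that row group's selection, filled in builder style. Everything here is an assumption for illustration. `ParquetSelection`, `select`, `selection_for`, and `selected_rows` are hypothetical names, and `Selector` is a local stand-in for the parquet-rs `RowSelector` so that the snippet compiles without external crates.

```rust
use std::collections::BTreeMap;

/// Local stand-in for a parquet-rs `RowSelector` (assumption, not the real type).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Selector {
    Skip(usize),
    Select(usize),
}

/// Hypothetical container: per-row-group selections keyed by row group index.
#[derive(Debug, Default)]
struct ParquetSelection {
    row_groups: BTreeMap<usize, Vec<Selector>>,
}

impl ParquetSelection {
    fn new() -> Self {
        Self::default()
    }

    /// Builder-style: record the selection for one row group,
    /// replacing any selection registered earlier for the same index.
    fn select(mut self, row_group: usize, selectors: Vec<Selector>) -> Self {
        self.row_groups.insert(row_group, selectors);
        self
    }

    /// The selection registered for a row group, or `None` if there is none.
    fn selection_for(&self, row_group: usize) -> Option<&[Selector]> {
        self.row_groups.get(&row_group).map(|v| v.as_slice())
    }

    /// Number of rows selected in a row group, if a selection was registered.
    fn selected_rows(&self, row_group: usize) -> Option<usize> {
        self.selection_for(row_group).map(|sel| {
            sel.iter()
                .map(|s| match s {
                    Selector::Select(n) => *n,
                    Selector::Skip(_) => 0,
                })
                .sum::<usize>()
        })
    }
}
```

Whether a row group with no entry should be scanned in full or skipped entirely is a design decision the proposal above leaves open; the sketch only reports `None` for it.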

Additional context

No response

Labels: enhancement