Closed
Description
Is your feature request related to a problem or challenge?
We are building and testing a specialized index for data stored in Parquet that can tell us which row offsets are needed from a Parquet file, based on additional information.
Currently, the parquet-rs Parquet reader allows specifying this type of information via ArrowReaderBuilder::with_row_selection.
However, the DataFusion ParquetExec has no way to pass this information down. It does build its own
Describe the solution you'd like
What I would like is a way to provide something like a RowSelection for each row group.
Describe alternatives you've considered
Here is one possible API:
let parquet_selection = ParquetSelection::new()
    // * rows 100-250 from row group 1
    .select(1, RowSelection::from(vec![
        RowSelector::skip(100),
        RowSelector::select(150),
    ]))
    // * rows 50-100 and 200-300 in row group 2
    .select(2, RowSelection::from(vec![
        RowSelector::skip(50),
        RowSelector::select(50),
        RowSelector::skip(100),
        RowSelector::select(100),
    ]));

let parquet_exec = ParquetExec::new(...)
    .with_selection(parquet_selection);
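To make the intended semantics concrete, here is a minimal, standard-library-only sketch of how such a per-row-group selection might be modeled. Everything in it is an assumption for illustration: ParquetSelection does not exist in DataFusion, the RowSelector here is a local stand-in that only mimics the skip/select runs of the parquet-rs type, and selected_ranges and example_selection are hypothetical helpers. For brevity the stand-in select takes a plain Vec of selectors rather than a RowSelection.

```rust
use std::collections::BTreeMap;
use std::ops::Range;

/// Local stand-in for parquet-rs's `RowSelector` (NOT the real type):
/// a run of `row_count` rows that is either skipped or read.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct RowSelector {
    pub row_count: usize,
    pub skip: bool,
}

impl RowSelector {
    pub fn select(row_count: usize) -> Self {
        Self { row_count, skip: false }
    }

    pub fn skip(row_count: usize) -> Self {
        Self { row_count, skip: true }
    }
}

/// Hypothetical container: one list of selectors per row group index.
#[derive(Debug, Default)]
pub struct ParquetSelection {
    row_groups: BTreeMap<usize, Vec<RowSelector>>,
}

impl ParquetSelection {
    pub fn new() -> Self {
        Self::default()
    }

    /// Builder-style: record the selection for `row_group`,
    /// replacing any selection registered earlier for the same group.
    pub fn select(mut self, row_group: usize, selectors: Vec<RowSelector>) -> Self {
        self.row_groups.insert(row_group, selectors);
        self
    }

    /// Expand the selection for `row_group` into half-open row ranges
    /// relative to the start of that row group.
    /// Returns `None` when no selection was registered for the group.
    pub fn selected_ranges(&self, row_group: usize) -> Option<Vec<Range<usize>>> {
        let selectors = self.row_groups.get(&row_group)?;
        let mut ranges = Vec::new();
        let mut offset = 0;
        for s in selectors {
            if !s.skip && s.row_count > 0 {
                ranges.push(offset..offset + s.row_count);
            }
            // Both skipped and selected runs advance the position.
            offset += s.row_count;
        }
        Some(ranges)
    }
}

/// The selection from the proposal above, rebuilt with the stand-in types.
pub fn example_selection() -> ParquetSelection {
    ParquetSelection::new()
        // rows 100-250 from row group 1
        .select(1, vec![RowSelector::skip(100), RowSelector::select(150)])
        // rows 50-100 and 200-300 in row group 2
        .select(2, vec![
            RowSelector::skip(50),
            RowSelector::select(50),
            RowSelector::skip(100),
            RowSelector::select(100),
        ])
}
```

With this sketch, example_selection().selected_ranges(1) yields the single range 100..250, and row group 2 yields 50..100 and 200..300. What a missing entry should mean (scan the whole row group, or skip it entirely) is left open here and would need to be decided in a real API.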
Additional context
No response