Custom Predicates for ParquetExec and Parquet row indexes #9341
Hello, I've been looking at adding indexing on Parquet files by creating auxiliary files and using those to build predicates for the ParquetExec struct to use. The short of it is that I have an optimizer in place that replaces existing ParquetExec instances with new instances using a custom PhysicalExpr as the predicate:

    let new_exec = ParquetExec::new(
        exec.base_config().clone(),
        Some(Arc::new(MyCustomPhysicalExpr::new(/* args */))),
        None,
    ).with_pushdown_filters(true);

My custom PhysicalExpr then has an evaluate method:

    fn evaluate(&self, batch: &RecordBatch) -> datafusion::common::Result<ColumnarValue> {
        let my_index: HashSet<usize> = self.run_my_index(/* args */);
        // `i` is only the row's position within this batch, not its index in the file.
        let indexes = (0..batch.num_rows()).map(|i| my_index.contains(&i)).collect::<Vec<_>>();
        Ok(ColumnarValue::Array(Arc::new(BooleanArray::from(indexes))))
    }

So my question is whether there is a way, given a RecordBatch, to recover the original indexes of its rows in the Parquet file. I haven't seen anything yet, but I imagine I am looking in the wrong places. Alternatively, if the approach is just wrong and there's a better way to use indexes like this in DataFusion, I'd be very interested. Thank you!
Replies: 3 comments 4 replies
Just as some additional context on places I'd looked but without luck so far:
Following up on (1), my current fallback plan is to use a column in the Parquet file as the "lookup" and use that in the index instead of a row number, but it would add constraints on the index usage (e.g. you need a unique column to index on).
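To make the fallback concrete, here is a minimal sketch of a key-based lookup, using only the standard library. The function name `mask_from_keys`, the `i64` key type, and the plain slice standing in for a key column are assumptions for illustration; they are not DataFusion or Arrow APIs.

```rust
use std::collections::HashSet;

/// Hypothetical helper: build a boolean filter mask for one batch by checking
/// each row's unique key against the set of keys the auxiliary index returned.
/// `batch_keys` stands in for the values of the key column in that batch.
pub fn mask_from_keys(batch_keys: &[i64], index_hits: &HashSet<i64>) -> Vec<bool> {
    batch_keys
        .iter()
        .map(|key| index_hits.contains(key))
        .collect()
}
```

Because the mask depends only on values carried inside the batch, it does not matter which partition produced the batch or where the rows sit in the file. The cost is the constraint already mentioned: the indexed column must be unique.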
By row numbers, do you mean physical row numbers? A pushed-down predicate is evaluated at the partition level, which means RecordBatches produced by different partitions are filtered individually:
So I think you need to ensure that the generated
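The per-batch filtering described above is why a position inside a batch cannot be used directly as a file row number. A small sketch of the translation that would be needed, assuming a hypothetical `batch_start` (the file-level index of the batch's first row) were available; as discussed in this thread, nothing currently hands that value to the predicate, so this only illustrates the mismatch:

```rust
use std::collections::HashSet;

/// Hypothetical helper: build a filter mask for one batch from file-level row
/// indexes. `batch_start` is an assumed input, not something the predicate
/// receives today.
pub fn mask_for_batch(
    batch_start: usize,
    num_rows: usize,
    file_hits: &HashSet<usize>,
) -> Vec<bool> {
    (0..num_rows)
        // Shift the batch-local position to a file-level row index first.
        .map(|i| file_hits.contains(&(batch_start + i)))
        .collect()
}
```

With `file_hits = {1, 5}` and batches of four rows, file row 5 is local position 1 of the second batch (`batch_start = 4`), so testing the local position alone would select the wrong rows in every batch after the first.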
I think maybe you can use your
I don't think this is possible today in DataFusion (or in the Rust parquet reader).
As @Ted-Jiang notes, RowSelection can describe this concept, but I believe the row selections are per row group (not for the file as a whole). Also, I don't think this is exposed in a way that lets you provide a row selection to pass into the underlying reader.
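For readers unfamiliar with the concept, a row selection is essentially a run-length list of "skip n rows" and "select n rows" steps. The sketch below shows how a set of row indexes could be turned into such runs using only the standard library. The `Run` enum and `runs_from_indexes` are hypothetical stand-ins written for this illustration, loosely modeled on the idea behind the parquet crate's selectors. They are not that crate's API, and the sketch does not address the per-row-group caveat above.

```rust
use std::collections::BTreeSet;

/// Hypothetical stand-in for one step of a row selection.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Run {
    Skip(usize),
    Select(usize),
}

/// Convert a sorted set of row indexes into alternating skip/select runs that
/// together cover exactly `total_rows` rows. Indexes >= `total_rows` are ignored.
pub fn runs_from_indexes(hits: &BTreeSet<usize>, total_rows: usize) -> Vec<Run> {
    let mut runs = Vec::new();
    let mut next = 0usize; // first row not yet covered by any run
    let mut selected = 0usize; // length of the Select run being built

    for &row in hits.range(..total_rows) {
        if row > next {
            // A gap ends the current Select run and starts a Skip run.
            if selected > 0 {
                runs.push(Run::Select(selected));
                selected = 0;
            }
            runs.push(Run::Skip(row - next));
        }
        selected += 1;
        next = row + 1;
    }
    if selected > 0 {
        runs.push(Run::Select(selected));
    }
    if next < total_rows {
        // Skip everything after the last selected row.
        runs.push(Run::Skip(total_rows - next));
    }
    runs
}
```

For example, selecting rows 1, 2 and 5 out of 8 gives skip 1, select 2, skip 2, select 1, skip 2. A reader that accepted such a list could avoid decoding the skipped rows entirely, which is the appeal over evaluating a boolean mask on every batch.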