Description
Is your feature request related to a problem or challenge?
There are at least three places where parquet statistics are extracted into `ArrayRef`s today:
- `ParquetExec` (pruning Row Groups): https://github.com/apache/datafusion/blob/465c89f7f16d48b030d4a384733567b91dab88fa/datafusion/core/src/datasource/physical_plan/parquet/statistics.rs#L18-L17
- `ParquetExec` (pruning pages): https://github.com/apache/datafusion/blob/671cef85c550969ab2c86d644968a048cb181c0c/datafusion/core/src/datasource/physical_plan/parquet/page_filter.rs#L393-L392
- `ListingTable` (pruning files): https://github.com/apache/datafusion/blob/97148bd105fc2102b0444f2d67ef535937da5dfe/datafusion/core/src/datasource/file_format/parquet.rs#L295-L294
Not only are there three copies of this code, but they are all subtly different (e.g. #8295) and have varying degrees of test coverage.
Describe the solution you'd like
I would like one API with the following properties:
- Extracts statistics from one or more parquet files as `ArrayRef`s suitable to pass to `PruningPredicate` (see the sketch after this list)
- Does so correctly (applies the appropriate schema coercion / conversion rules)
- Does so quickly and efficiently (e.g. does not do the extraction once per row group), and is suitable for use with 1000s of parquet files
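To make the target concrete, here is a rough sketch of how such per-column arrays could back DataFusion's `PruningStatistics` trait. The `ExtractedStats` container and its fields are hypothetical (not part of this proposal), and the trait shape is assumed to match the current `datafusion::physical_optimizer::pruning` definition:

```rust
use std::collections::{HashMap, HashSet};
use arrow::array::{ArrayRef, BooleanArray};
use datafusion::common::{Column, ScalarValue};
use datafusion::physical_optimizer::pruning::PruningStatistics;

/// Hypothetical container: one min/max/null-count array per column, with
/// one entry per "container" (file, row group, or page) being pruned
struct ExtractedStats {
    num_containers: usize,
    min: HashMap<String, ArrayRef>,
    max: HashMap<String, ArrayRef>,
    null_counts: HashMap<String, ArrayRef>,
}

impl PruningStatistics for ExtractedStats {
    fn min_values(&self, column: &Column) -> Option<ArrayRef> {
        self.min.get(&column.name).cloned()
    }
    fn max_values(&self, column: &Column) -> Option<ArrayRef> {
        self.max.get(&column.name).cloned()
    }
    fn num_containers(&self) -> usize {
        self.num_containers
    }
    fn null_counts(&self, column: &Column) -> Option<ArrayRef> {
        self.null_counts.get(&column.name).cloned()
    }
    fn row_counts(&self, _column: &Column) -> Option<ArrayRef> {
        None // could also be derived from row group metadata
    }
    fn contained(
        &self,
        _column: &Column,
        _values: &HashSet<ScalarValue>,
    ) -> Option<BooleanArray> {
        None
    }
}
```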
Describe alternatives you've considered
Some ideas from apache/arrow-rs#4328
Subtasks
- feat: API for collecting statistics/index for metadata of a parquet file + tests #10537
- Implement a benchmark for extracting arrow statistics from parquet #10606
- Incorrect statistics read for `i8` and `i16` columns in parquet #10585
- DataFusion ignores "column order" parquet statistics specification #10586
- DataFusion reads Date32 and Date64 parquet statistics in as Int32Array #10587
- Complete porting tests from `datafusion/core/src/datasource/physical_plan/parquet/statistics.rs` to `datafusion/core/tests/parquet/arrow_statistics.rs`
- Convert ParquetExec to use the new API for RowGroup pruning
- Convert ParquetExec to use the new API for PagePruning
- Convert `ListingTable` to use the new API for file pruning
- Incorrect statistics read for unsigned integer columns in parquet #10604
- Incorrect statistics read for binary columns in parquet #10605
- Incorrect statistics read for struct array in parquet #10609
- Improve performance of extracting statistics from parquet files #10626
- Minor: Add tests for extracting dictionary parquet statistics #10729
- Extract parquet statistics from timestamps with timezones #10758
- Extract parquet statistics from `f16` columns #10757
- Extract parquet statistics from `Time32` and `Time64` columns #10751
- Extract parquet statistics from `Interval` columns #10752
- Extract parquet statistics from `Duration` columns #10754
- Extract parquet statistics from `LargeBinary` columns #10753
- Extract parquet statistics from `LargeUtf8` columns #10756
- Extract parquet statistics from `Decimal256` columns #10755
Follow-on projects:
Here is a proposed API:
```rust
/// Statistics extracted from `Statistics` as Arrow `ArrayRef`s
///
/// # Note:
/// If the corresponding `Statistics` is not present, or has no information for
/// a column, a NULL is present in the corresponding array entry
pub struct ArrowStatistics {
    /// min values
    min: ArrayRef,
    /// max values
    max: ArrayRef,
    /// Row counts (UInt64Array)
    row_count: ArrayRef,
    /// Null Counts (UInt64Array)
    null_count: ArrayRef,
}

// (TODO accessors for min/max/row_count/null_count)

/// Extract `ArrowStatistics` from the parquet [`Statistics`]
pub fn parquet_stats_to_arrow(
    arrow_datatype: &DataType,
    statistics: impl IntoIterator<Item = Option<&Statistics>>,
) -> Result<ArrowStatistics> {
    todo!()
}
```
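For example, given the footer metadata for a file, a caller could pass one `Option<&Statistics>` per row group and convert them all in a single call. This is a hedged sketch: `parquet_stats_to_arrow` is the proposed function above, and resolving `col_idx` / the arrow type for a column is elided:

```rust
use arrow::datatypes::DataType;
use parquet::file::metadata::ParquetMetaData;

/// Build one statistics entry per row group for the column at `col_idx`,
/// then convert them all to arrow arrays in one call
fn row_group_stats_for_column(
    metadata: &ParquetMetaData,
    col_idx: usize,
    arrow_type: &DataType,
) -> Result<ArrowStatistics> {
    let stats = metadata
        .row_groups()
        .iter()
        .map(|rg| rg.column(col_idx).statistics());
    parquet_stats_to_arrow(arrow_type, stats)
}
```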
Maybe it would make sense to have something more builder-style:

```rust
struct ParquetStatisticsExtractor {
    ...
}

// create an extractor that can extract data from parquet files
let extractor = ParquetStatisticsExtractor::new(arrow_schema, parquet_schema);

// get parquet statistics (one for each row group) somehow:
let parquet_stats: Vec<&Statistics> = ...;

// extract min/max values for columns "a" and "b"
let col_a_stats = extractor.extract("a", parquet_stats.iter());
let col_b_stats = extractor.extract("b", parquet_stats.iter());
```

(This is similar to the existing API `parquet::arrow::parquet_to_arrow_schema`.)
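For reference, that existing schema-conversion API in the parquet crate has approximately this signature (quoted from memory, so treat as approximate):

```rust
pub fn parquet_to_arrow_schema(
    parquet_schema: &SchemaDescriptor,
    key_value_metadata: Option<&Vec<KeyValue>>,
) -> Result<Schema>
```

The proposed extractor follows the same pattern: pair a parquet schema with an arrow schema once, then reuse that mapping across many row groups or files.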
Note: `Statistics` above is the parquet crate's `parquet::file::statistics::Statistics`.
There is a version of this code in DataFusion that could perhaps be adapted: `datafusion/core/src/datasource/physical_plan/parquet/statistics.rs` (lines 179 to 186 at commit accce97).
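That existing code is essentially a large match over the physical `Statistics` variants. A simplified sketch of the shape (not the actual DataFusion implementation), extracting an `i64` min:

```rust
use parquet::file::statistics::Statistics;

// Simplified: pull the min out of int32/int64 statistics as an i64, if set
fn min_as_i64(s: &Statistics) -> Option<i64> {
    match s {
        Statistics::Int32(s) if s.has_min_max_set() => Some(*s.min() as i64),
        Statistics::Int64(s) if s.has_min_max_set() => Some(*s.min()),
        // other physical types each need their own (subtly different)
        // handling, which is exactly the duplication this issue consolidates
        _ => None,
    }
}
```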
Testing
I suggest we add a new module to the existing parquet test in https://github.com/apache/datafusion/blob/main/datafusion/core/tests/parquet_exec.rs
The tests should look like:
```rust
let record_batch = make_batch_with_relevant_datatype();
// write batch/batches to file
// open file / extract stats from metadata
// compare stats
```
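Fleshed out slightly, one such test might look like the following. This is a hedged sketch: `parquet_stats_to_arrow` is the proposed API above, and the `min()` accessor on the result is assumed (per the TODO in the proposal):

```rust
use std::sync::Arc;
use arrow::array::{Array, ArrayRef, Int64Array};
use arrow::datatypes::DataType;
use arrow::record_batch::RecordBatch;
use bytes::Bytes;
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
use parquet::arrow::ArrowWriter;

let batch = RecordBatch::try_from_iter([(
    "a",
    Arc::new(Int64Array::from(vec![Some(1), None, Some(3)])) as ArrayRef,
)])?;

// write the batch to an in-memory parquet file
let mut buf = vec![];
let mut writer = ArrowWriter::try_new(&mut buf, batch.schema(), None)?;
writer.write(&batch)?;
writer.close()?;

// open the file and pull the statistics out of the footer metadata
let builder = ParquetRecordBatchReaderBuilder::try_new(Bytes::from(buf))?;
let stats = builder
    .metadata()
    .row_groups()
    .iter()
    .map(|rg| rg.column(0).statistics());

// convert to arrow arrays (one entry per row group) and compare
let arrow_stats = parquet_stats_to_arrow(&DataType::Int64, stats)?;
let min = arrow_stats.min(); // assumed accessor (see TODO in the proposal)
let min = min.as_any().downcast_ref::<Int64Array>().unwrap();
assert_eq!(min, &Int64Array::from(vec![Some(1)]));
```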
I can help write these tests.
I personally suggest:
- Make a PR with the basic API and a single basic type (like Int/UInt or String) and figure out the test pattern (I can definitely help here)
- Then we can fill out support for the rest of the types in a follow-on PR
cc @tustvold in case you have other ideas
Additional context
This code would likely eventually be good to have in the parquet crate -- see apache/arrow-rs#4328. However, I think we should initially do it in DataFusion so we can iterate faster and figure out the API before moving it upstream.
There are a bunch of related improvements that I think become much simpler with this feature: