Skip to content

[EPIC] Efficiently and correctly extract parquet statistics into ArrayRefs #10453

Closed
@alamb

Description

@alamb

Is your feature request related to a problem or challenge?

There are at least three places that parquet statistics are extracted into ArrayRefs today

  1. ParquetExec (pruning Row Groups): https://github.com/apache/datafusion/blob/465c89f7f16d48b030d4a384733567b91dab88fa/datafusion/core/src/datasource/physical_plan/parquet/statistics.rs#L18-L17

Not only are there three copies of the code, they are all subtly different (e.g. #8295) and have varying degrees of testing

Describe the solution you'd like

I would like one API with the following properties:

  1. Extracts statistics from one or more parquet files as ArrayRefs suitable to pass to PruningPredicate
  2. Does so correctly (applies the appropriate schema coercion / conversion rules)
  3. Does so quickly and efficiently (e.g. does not do this once per row group), is suitable for 1000s of parquet files

Describe alternatives you've considered

Some ideas from apache/arrow-rs#4328

Subtasks

Follow on projects:

Here is a proposed API:

/// statistics extracted from `Statistics` as Arrow `ArrayRef`s
///
/// # Note:
/// If the corresponding `Statistics` is not present, or has no information for 
/// a column, a NULL is present in the  corresponding array entry
pub struct ArrowStatistics {
  /// min values
  min: ArrayRef,
  /// max values
  max: ArrayRef,
  /// Row counts (UInt64Array)
  row_count: ArrayRef,
  /// Null Counts (UInt64Array)
  null_count: ArrayRef,
}

// (TODO accessors for min/max/row_count/null_count)

/// Extract `ArrowStatistics` from the  parquet [`Statistics`]
pub fn parquet_stats_to_arrow(
    arrow_datatype: &DataType,
    statistics: impl IntoIterator<Item = Option<&Statistics>>
) -> Result<ArrowStatisics> {
  todo!()
}

Maybe it would make sense to have something more builder style:

struct ParquetStatisticsExtractor {
...
}

// create an extractor that can extract data from parquet files 
let extractor = ParquetStatisticsExtractor::new(arrow_schema, parquet_schema)

// get parquet statistics (one for each row group) somehow:
let parquet_stats: Vec<&Statistics> = ...;

// extract min/max values for column "a" and "b";
let col_a stats = extractor.extract("a", parquet_stats.iter());
let col_b stats = extractor.extract("b", parquet_stats.iter());

(This is similar to the existing API parquet::arrow::parquet_to_arrow_schema)

Note Statistics above is Statistics

There is a version of this code here in DataFusion that could perhaps be adapted:

pub(crate) fn min_statistics<'a, I: Iterator<Item = Option<&'a ParquetStatistics>>>(
data_type: &DataType,
iterator: I,
) -> Result<ArrayRef> {
let scalars = iterator
.map(|x| x.and_then(|s| get_statistic!(s, min, min_bytes, Some(data_type))));
collect_scalars(data_type, scalars)
}

Testing

I suggest we add a new module to the existing parquet test in https://github.com/apache/datafusion/blob/main/datafusion/core/tests/parquet_exec.rs

The tests should look like:

let record_batch = make_batch_with_relevant_datatype();
// write batch/batches to file
// open file / extract stats from metadata
// compare stats

I can help writing these tests

I personally suggest:

  1. Make a PR with the basic API and a single basic types (like Int/UInt or String) and figure out the test pattern (I can definitely help here)
  2. Then we can fill out support for the rest of the types in a follow on PR

cc @tustvold in case you have other ideas

Additional context

This code likely eventually would be good to have in the parquet crate -- see apache/arrow-rs#4328. However, I think initially we should do it in DataFusion to iterate faster and figure out the API before moving it up there

There are a bunch of related improvements that I think become much simpler with this feature:

  1. Consolidate statistics aggregation #8229

Metadata

Metadata

Assignees

Labels

enhancementNew feature or request

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions