Extract parquet statistics from `Interval` columns #10752

alamb · 2024-06-01T20:28:10Z

Is your feature request related to a problem or challenge?

Part of #10453, where we are filling out support for extracting statistics for all data types from parquet files

At the moment, even if statistics are extracted for a different type (like Int32) the PruningPredicate will attempt to cast these values to the correct type:

datafusion/datafusion/core/src/physical_optimizer/pruning.rs

Lines 909 to 911 in acd7106

    
           // cast statistics array to required data type (e.g. parquet 
        
           // provides timestamp statistics as "Int64") 
        
           let array = arrow::compute::cast(&array, data_type)?;

However, in order to be efficient and ensure the cast kernel doesn't add anything incorrectly, we should be extracting the parquet statistics as the correct Array type directly. It turns out we do not do this yet for several types and those types do not have good (or any) test coverage. We almost missed this in #10711 in @xinlifoobar

Thus, we need to add support and tests for other types

Describe the solution you'd like

Add a new test to arrow_stastics.rs (run this with cargo test --test parquet_exec) with the relevant type

Potentially add a new case to the match here

datafusion/datafusion/core/src/datasource/physical_plan/parquet/statistics.rs

Lines 61 to 182 in acd7106

    
           macro_rules! get_statistic { 
        
               ($column_statistics:expr, $func:ident, $bytes_func:ident, $target_arrow_type:expr) => {{ 
        
                   if !$column_statistics.has_min_max_set() { 
        
                       return None; 
        
                   } 
        
                   match $column_statistics { 
        
                       ParquetStatistics::Boolean(s) => Some(ScalarValue::Boolean(Some(*s.$func()))), 
        
                       ParquetStatistics::Int32(s) => { 
        
                           match $target_arrow_type { 
        
                               // int32 to decimal with the precision and scale 
        
                               Some(DataType::Decimal128(precision, scale)) => { 
        
                                   Some(ScalarValue::Decimal128( 
        
                                       Some(*s.$func() as i128), 
        
                                       *precision, 
        
                                       *scale, 
        
                                   )) 
        
                               } 
        
                               Some(DataType::Int8) => { 
        
                                   Some(ScalarValue::Int8(Some((*s.$func()).try_into().unwrap()))) 
        
                               } 
        
                               Some(DataType::Int16) => { 
        
                                   Some(ScalarValue::Int16(Some((*s.$func()).try_into().unwrap()))) 
        
                               } 
        
                               Some(DataType::UInt8) => { 
        
                                   Some(ScalarValue::UInt8(Some((*s.$func()).try_into().unwrap()))) 
        
                               } 
        
                               Some(DataType::UInt16) => { 
        
                                   Some(ScalarValue::UInt16(Some((*s.$func()).try_into().unwrap()))) 
        
                               } 
        
                               Some(DataType::UInt32) => { 
        
                                   Some(ScalarValue::UInt32(Some((*s.$func()) as u32))) 
        
                               } 
        
                               Some(DataType::Date32) => { 
        
                                   Some(ScalarValue::Date32(Some(*s.$func()))) 
        
                               } 
        
                               Some(DataType::Date64) => { 
        
                                   Some(ScalarValue::Date64(Some(i64::from(*s.$func()) * 24 * 60 * 60 * 1000))) 
        
                               } 
        
                               _ => Some(ScalarValue::Int32(Some(*s.$func()))), 
        
                           } 
        
                       } 
        
                       ParquetStatistics::Int64(s) => { 
        
                           match $target_arrow_type { 
        
                               // int64 to decimal with the precision and scale 
        
                               Some(DataType::Decimal128(precision, scale)) => { 
        
                                   Some(ScalarValue::Decimal128( 
        
                                       Some(*s.$func() as i128), 
        
                                       *precision, 
        
                                       *scale, 
        
                                   )) 
        
                               } 
        
                               Some(DataType::UInt64) => { 
        
                                   Some(ScalarValue::UInt64(Some((*s.$func()) as u64))) 
        
                               } 
        
                               _ => Some(ScalarValue::Int64(Some(*s.$func()))), 
        
                           } 
        
                       } 
        
                       // 96 bit ints not supported 
        
                       ParquetStatistics::Int96(_) => None, 
        
                       ParquetStatistics::Float(s) => Some(ScalarValue::Float32(Some(*s.$func()))), 
        
                       ParquetStatistics::Double(s) => Some(ScalarValue::Float64(Some(*s.$func()))), 
        
                       ParquetStatistics::ByteArray(s) => { 
        
                           match $target_arrow_type { 
        
                               // decimal data type 
        
                               Some(DataType::Decimal128(precision, scale)) => { 
        
                                   Some(ScalarValue::Decimal128( 
        
                                       Some(from_bytes_to_i128(s.$bytes_func())), 
        
                                       *precision, 
        
                                       *scale, 
        
                                   )) 
        
                               } 
        
                               Some(DataType::Binary) => { 
        
                                   Some(ScalarValue::Binary(Some(s.$bytes_func().to_vec()))) 
        
                               } 
        
                               _ => { 
        
                                   let s = std::str::from_utf8(s.$bytes_func()) 
        
                                       .map(|s| s.to_string()) 
        
                                       .ok(); 
        
                                   if s.is_none() { 
        
                                       log::debug!( 
        
                                           "Utf8 statistics is a non-UTF8 value, ignoring it." 
        
                                       ); 
        
                                   } 
        
                                   Some(ScalarValue::Utf8(s)) 
        
                               } 
        
                           } 
        
                       } 
        
                       // type not fully supported yet 
        
                       ParquetStatistics::FixedLenByteArray(s) => { 
        
                           match $target_arrow_type { 
        
                               // just support specific logical data types, there are others each 
        
                               // with their own ordering 
        
                               Some(DataType::Decimal128(precision, scale)) => { 
        
                                   Some(ScalarValue::Decimal128( 
        
                                       Some(from_bytes_to_i128(s.$bytes_func())), 
        
                                       *precision, 
        
                                       *scale, 
        
                                   )) 
        
                               } 
        
                               Some(DataType::FixedSizeBinary(size)) => { 
        
                                   let value = s.$bytes_func().to_vec(); 
        
                                   let value = if value.len().try_into() == Ok(*size) { 
        
                                       Some(value) 
        
                                   } else { 
        
                                       log::debug!( 
        
                                           "FixedSizeBinary({}) statistics is a binary of size {}, ignoring it.", 
        
                                           size, 
        
                                           value.len(), 
        
                                       ); 
        
                                       None 
        
                                   }; 
        
                                   Some(ScalarValue::FixedSizeBinary( 
        
                                       *size, 
        
                                       value, 
        
                                   )) 
        
                               } 
        
                               _ => None, 
        
                           } 
        
                       } 
        
                   } 
        
               }}; 
        
           }

to get the correct types

Here are some example PRs:

Describe alternatives you've considered

No response

Additional context

No response

The text was updated successfully, but these errors were encountered:

marvinlanhenke · 2024-06-02T18:31:31Z

take

marvinlanhenke · 2024-06-03T06:13:42Z

...while looking into this I noticed, that there are no statistics written for an Interval, which is also described here.

@alamb I guess we can't extract any statistics here? And writing any tests that check we have no statistics written, does not seem to be very helpful?

alamb · 2024-06-04T15:43:15Z

@alamb I guess we can't extract any statistics here? And writing any tests that check we have no statistics written, does not seem to be very helpful?

I actually think these would be helpful then as soon as there are statistics we can hook them up to the tests. If you had time to write the tests that would be great. We can then perhaps file a ticket in parquet-rs for supporting writing statistics to interval types.

marvinlanhenke · 2024-06-04T18:58:03Z

sure I can do that; from the top of my mind - the fn run() from the struct Test panics if we can't extract any statistics, which is the case here. So I'd prepare as much as possible (creating record batches, adding a Scenario, writing those tests) but for now would assert should panic - does this make any sense to you @alamb?

marvinlanhenke · 2024-06-05T13:51:44Z

I did some digging in order to find out why / or where the writing of those statistics is not supported (yet).
Since I'm not familiar with the parquet impl, here are my findings, which might be useful in a follow-up ticket in arrow-rs.

When trying to fn write_slice() the min, max values are never updated due to a filter-condition; that checks if the type is INTERVAL
In order to support updating the min max values, we need to handle the comparison of INTERVAL here

I think this should be possible, or put differently, I don't see the reason yet, why this is not supported?
Somethin similar (comparing FixedLenByteArrays) is already done for DECIMAL here?

Perhaps, you have some more information on this @alamb - otherwise this might be enough information to file a ticket in arrow-rs?

alamb mentioned this issue Jun 1, 2024

[EPIC] Efficiently and correctly extract parquet statistics into ArrayRefs #10453

Closed

23 tasks

github-actions bot assigned marvinlanhenke Jun 2, 2024

marvinlanhenke mentioned this issue Jun 5, 2024

Extract Parquet statistics from Interval column #10801

Merged

marvinlanhenke mentioned this issue Jun 5, 2024

Write parquet statistics for IntervalDayTimeArray , IntervalMonthDayNanoArray and IntervalYearMonthArray apache/arrow-rs#5847

Closed

comphead closed this as completed in #10801 Jun 6, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Extract parquet statistics from `Interval` columns #10752

Extract parquet statistics from `Interval` columns #10752

alamb commented Jun 1, 2024 •

edited

Loading

marvinlanhenke commented Jun 2, 2024

marvinlanhenke commented Jun 3, 2024

alamb commented Jun 4, 2024

marvinlanhenke commented Jun 4, 2024

marvinlanhenke commented Jun 5, 2024

Extract parquet statistics from Interval columns #10752

Extract parquet statistics from Interval columns #10752

Comments

alamb commented Jun 1, 2024 • edited Loading

Is your feature request related to a problem or challenge?

Describe the solution you'd like

Describe alternatives you've considered

Additional context

marvinlanhenke commented Jun 2, 2024

marvinlanhenke commented Jun 3, 2024

alamb commented Jun 4, 2024

marvinlanhenke commented Jun 4, 2024

marvinlanhenke commented Jun 5, 2024

Extract parquet statistics from `Interval` columns #10752

Extract parquet statistics from `Interval` columns #10752

alamb commented Jun 1, 2024 •

edited

Loading