Add bench for data page statistics parquet extraction #10950

marvinlanhenke · 2024-06-17T11:22:09Z

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

marvinlanhenke · 2024-06-17T11:24:40Z

@alamb
Since not all DataTypes are supported yet, the bench panics for some types.
Should we simply wait until the missing data types are supported and then merge this PR?

alamb · 2024-06-17T13:18:58Z

> Should we simply wait until the missing data types are supported and then merge this PR?

That makes sense to me.

The other thing we could do is somehow ignore those columns until the support has been added
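
For illustration only, the "ignore those columns" option could look roughly like the sketch below. It assumes the bench iterates over a list of column types. `BenchType`, `is_supported`, and `benchable_types` are hypothetical names, not code from this PR, and the choice of which type counts as unsupported is made up for the example.

```rust
// Hypothetical stand-in for the column types exercised by the bench;
// these names are illustrative and are not taken from the PR.
#[derive(Debug, Clone, Copy, PartialEq)]
enum BenchType {
    Int64,
    UInt64,
    F64,
    String,
    Dictionary,
}

// Assumption: which types lack support is invented for this example.
// In a real bench this list would shrink as support is added.
fn is_supported(ty: BenchType) -> bool {
    !matches!(ty, BenchType::Dictionary)
}

// Keeps only the types the bench could run without panicking,
// preserving the original order.
fn benchable_types(all: &[BenchType]) -> Vec<BenchType> {
    all.iter().copied().filter(|ty| is_supported(*ty)).collect()
}

fn main() {
    let all = [
        BenchType::Int64,
        BenchType::UInt64,
        BenchType::F64,
        BenchType::String,
        BenchType::Dictionary,
    ];
    // Only supported types reach the benchmark loop.
    for ty in benchable_types(&all) {
        println!("would benchmark {ty:?}");
    }
}
```

The drawback of this approach is that the filter has to be removed again once all types are supported.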

alamb

Looks good to me -- thank you @marvinlanhenke

datafusion/core/benches/parquet_statistic.rs

marvinlanhenke · 2024-06-17T14:23:21Z

> The other thing we could do is somehow ignore those columns until the support has been added

Yes, but this would require another PR to revert those changes. I think adding the bench is not so urgent, so merging once the other data types are ready might be the easiest thing to do.

I've also addressed your other comments; should be fine now. Thanks for the review.

alamb · 2024-06-17T20:24:21Z

Thanks again @marvinlanhenke -- this PR looks good to me. Per your suggestion, let's wait until the required type support has been added

efredine

I checked this out and merged main and it runs without errors now, so it should be safe to merge.

alamb · 2024-07-03T21:10:01Z

Thanks for checking this out @efredine -- I merged up from main and updated this PR to get it moving. Once it passes CI I think it is good to go

Thanks again @marvinlanhenke

Like @efredine I verified the benchmark now works without error:

cargo bench --bench parquet_statistic
   Compiling bigdecimal v0.4.1
   Compiling datafusion v39.0.0 (/Users/andrewlamb/Software/datafusion2/datafusion/core)
    Finished `bench` profile [optimized] target(s) in 1m 34s
     Running benches/parquet_statistic.rs (target/release/deps/parquet_statistic-c6fce472dea5abe8)
Gnuplot not found, using plotters backend
Extract row group statistics for Int64/extract_statistics/Int64
                        time:   [594.98 ns 596.23 ns 597.66 ns]
Found 10 outliers among 100 measurements (10.00%)
  4 (4.00%) high mild
  6 (6.00%) high severe

Extract data page statistics for Int64/extract_statistics/Int64
                        time:   [6.5665 µs 6.5848 µs 6.6047 µs]
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

Extract row group statistics for UInt64/extract_statistics/UInt64
                        time:   [576.78 ns 578.78 ns 581.09 ns]
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

Extract data page statistics for UInt64/extract_statistics/UInt64
                        time:   [6.8120 µs 6.8332 µs 6.8559 µs]
Found 7 outliers among 100 measurements (7.00%)
  6 (6.00%) high mild
  1 (1.00%) high severe

Extract row group statistics for F64/extract_statistics/F64
                        time:   [588.96 ns 592.68 ns 596.62 ns]

Extract data page statistics for F64/extract_statistics/F64
                        time:   [7.5959 µs 7.6334 µs 7.6650 µs]

Extract row group statistics for String/extract_statistics/String
                        time:   [897.07 ns 901.70 ns 907.19 ns]
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe

Extract data page statistics for String/extract_statistics/String
                        time:   [25.507 µs 25.555 µs 25.609 µs]
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

Benchmarking Extract row group statistics for Dictionary(Int32, String)/extract_statistics/Dictionary(Int32, Stri...: Collecting 100 samples in estimated 5.00
Extract row group statistics for Dictionary(Int32, String)/extract_statistics/Dictionary(Int32, Stri...
                        time:   [947.78 ns 954.30 ns 960.82 ns]
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) low severe
  7 (7.00%) low mild

Benchmarking Extract data page statistics for Dictionary(Int32, String)/extract_statistics/Dictionary(Int32, Stri...: Collecting 100 samples in estimated 5.04
Extract data page statistics for Dictionary(Int32, String)/extract_statistics/Dictionary(Int32, Stri...
                        time:   [25.602 µs 25.812 µs 26.109 µs]
Found 8 outliers among 100 measurements (8.00%)
  7 (7.00%) high mild
  1 (1.00%) high severe

* feat: add data page bench
* chore: add comment
* fix: row_groups + shorten row_group_indices

Co-authored-by: Andrew Lamb

feat: add data page bench

78f4bcc

github-actions bot added the core Core DataFusion crate label Jun 17, 2024

alamb approved these changes Jun 17, 2024

chore: add comment

bcd1407

fix: row_groups + shorten row_group_indices

1f8f2a4

alamb marked this pull request as draft June 17, 2024 20:24

efredine approved these changes Jul 3, 2024

Merge remote-tracking branch 'apache/main' into add_benchmark_pq_stats

6f92be1

alamb marked this pull request as ready for review July 3, 2024 21:06

alamb merged commit b4afa18 into apache:main Jul 3, 2024
23 checks passed
