test: add more tests for statistics reading #10592

NGA-TRAN · 2024-05-20T21:38:39Z

Which issue does this PR close?

More tests for #10453

Rationale for this change

Add tests for parquet statistics reading

What changes are included in this PR?

New tests, with bug tickets linked to the relevant tests

Are these changes tested?

Yes

Are there any user-facing changes?

No

NGA-TRAN · 2024-05-21T16:18:38Z

@alamb : This PR is ready for review

comphead

Thanks @NGA-TRAN, good PR.
Nice test description, this is important.

What I still cannot understand: is this a regression test for the bug we missed earlier?

NGA-TRAN · 2024-05-21T16:40:07Z

@comphead

What I still cannot understand: is this a regression test for the bug we missed earlier?

I am working on the new arrow statistics API (#10453), and @alamb suggested that I add more coverage tests and also move the tests for private functions in datafusion/core/src/datasource/physical_plan/parquet/statistics.rs to this file. Sorry, I am not aware of any existing related bug tickets.

alamb · 2024-05-21T18:10:02Z

What I still cannot understand: is this a regression test for the bug we missed earlier?

My strong suspicion is that the bugs @NGA-TRAN is finding would manifest themselves as potentially incorrect results when reading parquet files with predicates on these types of columns.

It may also simply manifest as not being able to prune row groups based on such predicates.

I haven't worked to make a reproducer, as it would likely require creating parquet files with multiple row groups with carefully chosen data patterns.
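
To illustrate how a wrong min/max statistic could turn into wrong pruning, here is a minimal standalone sketch. It uses no DataFusion or parquet APIs: all function names are hypothetical, and the signed misread of unsigned values is only an assumed failure mode chosen for illustration, not a confirmed description of any specific bug.

```rust
/// Correct min/max statistics for the values of a u32 column in one row group.
/// (Hypothetical helper, not a DataFusion or parquet API.)
fn stats_unsigned(values: &[u32]) -> Option<(u32, u32)> {
    let min = values.iter().copied().min()?;
    let max = values.iter().copied().max()?;
    Some((min, max))
}

/// Assumed failure mode: the same bits are compared as i32, so values above
/// i32::MAX wrap around to negative numbers and min/max come out wrong.
fn stats_misread_signed(values: &[u32]) -> Option<(i32, i32)> {
    let min = values.iter().map(|&v| v as i32).min()?;
    let max = values.iter().map(|&v| v as i32).max()?;
    Some((min, max))
}

/// For a predicate `col > threshold`, a row group can be skipped only if
/// its maximum is not greater than the threshold.
fn can_prune_gt<T: PartialOrd>(max: T, threshold: T) -> bool {
    max <= threshold
}
```

For a row group holding 1, 5 and 4_000_000_000, the correct maximum keeps the group for `col > 1000`, while the misread maximum of 5 would prune it and silently drop a matching row. The opposite error (a maximum that is too large) would only prevent pruning, which matches the second, milder symptom described above.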

alamb

Thank you @NGA-TRAN -- I think this is excellent test coverage

alamb · 2024-05-21T19:12:50Z

datafusion/core/tests/parquet/arrow_statistics.rs

+    let row_per_group = 5;
+    // This creates a parquet file of 1 column "decimal_col" with decimal data type and precision 9, scale 2
+    // file has 3 record batches, each has 5 rows. They will be saved into 3 row groups
+    let reader = parquet_file_many_columns(Scenario::Decimal, row_per_group).await;


Another important thing to test with decimals is different precision / scales -- maybe we can do this as a different PR

@alamb : can you be more specific? In the test below, I had to make sure they have the same precision & scale. What else do we have to test here?

I was thinking of smaller precisions -- I can't remember exactly, but I vaguely recall that Spark stores different scale decimals with different underlying data types, or something like that
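
For context, the Parquet format allows the DECIMAL annotation on INT32 (precision up to 9), INT64 (precision up to 18), or byte arrays, so a writer may use different physical types depending on precision. Below is a minimal sketch of such a choice, assuming a hypothetical writer that always picks the smallest permitted type. The enum and function names are illustrative and do not come from parquet, Spark, or DataFusion.

```rust
/// Physical storage a writer might choose for a decimal column.
/// (Illustrative enum, not the parquet crate's type.)
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DecimalStorage {
    Int32,
    Int64,
    /// Fixed-length byte array with the given length in bytes.
    FixedLenByteArray(usize),
}

/// Smallest number of bytes whose signed two's-complement range can hold
/// every unscaled value with `precision` decimal digits (1..=38).
fn min_bytes_for_precision(precision: u32) -> Option<usize> {
    if precision == 0 || precision > 38 {
        return None;
    }
    let limit = 10u128.pow(precision); // all values satisfy |v| < limit
    (1u32..=16)
        .find(|&n| (1u128 << (8 * n - 1)) >= limit)
        .map(|n| n as usize)
}

/// Assumed policy: use the smallest physical type the format permits.
fn storage_for_precision(precision: u32) -> Option<DecimalStorage> {
    match precision {
        1..=9 => Some(DecimalStorage::Int32),
        10..=18 => Some(DecimalStorage::Int64),
        19..=38 => min_bytes_for_precision(precision).map(DecimalStorage::FixedLenByteArray),
        _ => None,
    }
}
```

Under this assumed policy, the precision 9 used in the test above maps to `Int32`, so additional cases with precision above 9 and above 18 would be needed to exercise the other physical types.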

alamb · 2024-05-21T19:13:37Z

datafusion/core/tests/parquet/arrow_statistics.rs

+            "frontend five",
+            "backend one",
+            "backend eight",
+        ])), // Should be BinaryArray


alamb · 2024-05-21T19:14:52Z

Since this PR is just tests, merging it in

* test: add more tests for statistics reading
* Link bug tickets to the tests and run fmt

test: add more tests for statistics reading

c922ba1

NGA-TRAN marked this pull request as draft May 20, 2024 21:38

github-actions bot added the core Core DataFusion crate label May 20, 2024

This was referenced May 21, 2024

Incorrect statistics read for unsigned integer columns in parquet #10604

Closed

Incorrect statistics read for binary columns in parquet #10605

Closed

Link bug tickets to the tests and run fmt

d89c8c6

NGA-TRAN marked this pull request as ready for review May 21, 2024 16:15

comphead reviewed May 21, 2024


alamb approved these changes May 21, 2024


alamb merged commit 96e0ee6 into apache:main May 21, 2024
24 checks passed

alamb mentioned this pull request May 28, 2024

DataFusion weekly project plan (Andrew Lamb) - May 27, 2024 #10699

Closed

9 tasks

findepi pushed a commit to findepi/datafusion that referenced this pull request Jul 16, 2024

test: add more tests for statistics reading (apache#10592)

d6ab64b

* test: add more tests for statistics reading
* Link bug tickets to the tests and run fmt
