-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
test: add more tests for statistics reading #10592
Conversation
@alamb : This PR is ready for review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks @NGA-TRAN good PR
Nice test description, this is important
What I still cannot understand is this a regression test for the bug we missed earlier?
I am working on new arrow statistics API #10453 and @alamb suggested me to add more coverage tests as well as moving tests for private functions in datafusion/core/src/datasource/physical_plan/parquet/statistics.rs to this file. Sorry that I am not aware of available/related bug tickets |
My strong suspicion is that the bugs @NGA-TRAN is finding would manifest themselves as potentially incorrect results when reading parquet files predicates on these types of columns. It may also simply manifest as not being able to prune row groups based on such predicates I haven't worked to make a reproducer as it would likely require creating parquet files with multiple row groups with carefully chosen data patterns |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @NGA-TRAN -- I think this is excellent test coverage
let row_per_group = 5; | ||
// This creates a parquet file of 1 column "decimal_col" with decimal data type and precicion 9, scale 2 | ||
// file has 3 record batches, each has 5 rows. They will be saved into 3 row groups | ||
let reader = parquet_file_many_columns(Scenario::Decimal, row_per_group).await; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another important thing to test with decimals is different precision / scales -- maybe we can do this as a different PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@alamb : can you be more specific? In the test below, I had to make sure they have the same precision & scale. What else do we have to test here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking smaller precisions -- I can't remember but I vaguely remember that spark stores different scale decimals with different underlying datatypes or something
"frontend five", | ||
"backend one", | ||
"backend eight", | ||
])), // Shuld be BinaryArray |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
Since this PR is just tests, merging it in |
* test: add more tests for statistics reading * Link bug tickets to the tests and run fmt
Which issue does this PR close?
More tests for #10453
Rationale for this change
Add tests for parquet statistics reading
What changes are included in this PR?
Tests and bug tickets linked to appropriate tests
Are these changes tested?
Yes
Are there any user-facing changes?
No