-
Notifications
You must be signed in to change notification settings - Fork 1.5k
Fix: StatisticsConverter counts
for missing columns
#10946
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you very much @marvinlanhenke -- I think this looks great to me
I see the complication related to data_page_row_counts. I don't have a great answer for that at the moment other than potentially to return an error.
I believe this PR could be merged in as is and we can update the behavior in a follow on PR (or leave it as is for a while). Let me know what you think
// thus we cannot extract page_locations in order to determine | ||
// the row count on a per DataPage basis. | ||
// We use `row_group_row_counts` instead. | ||
return Self::row_group_row_counts(row_group_metadatas); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see -- this is a tricky situation where there is no column and thus no information on data pages.
Another potential behavior that might make sense here would be to return an error because unlike other functions in StatisticsConverter
there is no way to "gracefully" ignore missing information
Or we could possible return an array with zero rows 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
...perhaps looking at the behavior of row_group_row_count
might help.
My main question here is:
Why do we return row counts for a non existing column in row_group_row_counts
?
A column that does not exists cannot have valid rows, or a row count in that regard? So returning a null_array would indicate those missing information? Suppose we had just a single column, and for whatever reason, it does not exist in the parquet file (or does not match). With the current implementation we would still return a 'valid' row count in this scenario, accessing a somewhat invalid / useless parquet file?
I'm not sure one can follow my train of thought here, however I think we should return a null_array in both cases for row_groups
and data_pages
as well.
But, I might be missing something. Perhaps you can explain the reasoning why we don't return a null_array in the current row_group_row_counts
impl @alamb? That would be nice and might help guide our decision here.
(If we cannot decide, I think returning an Error would be the best option)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we return row counts for a non existing column in row_group_row_counts?
I think the (not great) reason is that it is the API needed for PruningStatistics here:
datafusion/datafusion/core/src/datasource/physical_plan/parquet/row_groups.rs
Lines 386 to 390 in a923c65
fn row_counts(&self, _column: &Column) -> Option<ArrayRef> { | |
// row counts are the same for all columns in a row group | |
StatisticsConverter::row_group_row_counts(self.metadata_iter()) | |
.ok() | |
.map(|counts| Arc::new(counts) as ArrayRef) |
Also, since the ParquetMetadata
knows how many row groups there are even when there are no row group statistics, it is possible
For data pages, it is different as if the "page index" is not present then I don't think there is any way to know how many data pages there are other than reading the file
The more we explore this, the more you have convinced me that we shouldn't return row counts for non existent columns in row_group_row_counts
either
So I guess my new proposal would be to return Option
like:
impl<'a> StatisticsConverter<'a> {
...
/// return OK(None) if the column does not exist
pub(crate) fn null_counts_page_statistics<'a, I>(iterator: I) -> Result<Option<UInt64Array>> {
...
}
/// return OK(None) if the column does not exist
pub fn data_page_row_counts<I>(
&self,
column_offset_index: &ParquetOffsetIndex,
row_group_metadatas: &[RowGroupMetaData],
row_group_indices: I,
) -> Result<Option<UInt64Array>>
where
I: IntoIterator<Item = &'a usize>,
{
...
}
}
The rationale to return an Option rather than an error is that creating and ignoring DataFusionError
via ok()
still requires a string allocation, which is wasteful
I realize this is done many places already in the statistics extraction code, but I think for those cases it is meant to make the code resilent to incorrectly encoded parquet files rather than something that is "expected" to happen
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The more we explore this, the more you have convinced me that we shouldn't return row counts for non existent columns in
row_group_row_counts
either
But I'll guess this is a topic (or PR) for another day?
So for now, we only change data_page_row_counts
and merge this? I'd be fine with that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So for now, we only change data_page_row_counts and merge this? I'd be fine with that.
I think that is a good idea and I will file a ticket to fix row_group_row_counts
in a follow on
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed it accordingly. Thanks again for your input here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Filed #10965 to track the changes to row_group_row_counts
@alamb Thanks for the review. I left one question regarding the current impl of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks again @marvinlanhenke 🚀
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @marvinlanhenke
* feat: add run_with_schema + add test_case * fix: null_counts * fix: row_counts * refactor: change return type of data_page_row_counts * refactor: shorten row_group_indices
Which issue does this PR close?
Closes #10926.
Rationale for this change
return an ArrayRef instead of UInt64Array.
What changes are included in this PR?
arrow_statistics
Are these changes tested?
Yes.
Are there any user-facing changes?