Fix: StatisticsConverter `counts` for missing columns #10946

marvinlanhenke · 2024-06-17T06:40:10Z

Which issue does this PR close?

Closes #10926.

Rationale for this change

The methods row_group_null_counts, data_page_null_counts, and data_page_row_counts return an ArrayRef instead of a UInt64Array.

What changes are included in this PR?

fixed methods to return correct array type
changed data_page_row_counts to fall back on row_group_row_counts if the column is missing
added a test case in arrow_statistics

Are these changes tested?

Yes.

Are there any user-facing changes?

marvinlanhenke · 2024-06-17T06:41:40Z

@alamb PTAL.

I'm not sure about the assumptions I made in data_page_row_counts here.

alamb

Thank you very much @marvinlanhenke -- I think this looks great to me

I see the complication related to data_page_row_counts. I don't have a great answer for that at the moment other than potentially to return an error.

I believe this PR could be merged in as is and we can update the behavior in a follow on PR (or leave it as is for a while). Let me know what you think

alamb · 2024-06-17T10:48:39Z

datafusion/core/src/datasource/physical_plan/parquet/statistics.rs

+            // thus we cannot extract page_locations in order to determine
+            // the row count on a per DataPage basis.
+            // We use `row_group_row_counts` instead.
+            return Self::row_group_row_counts(row_group_metadatas);


I see -- this is a tricky situation where there is no column and thus no information on data pages.

Another potential behavior that might make sense here would be to return an error, because unlike other functions in StatisticsConverter, there is no way to "gracefully" ignore missing information.

Or we could possibly return an array with zero rows 🤔

...perhaps looking at the behavior of row_group_row_counts might help.

My main question here is:
Why do we return row counts for a non existing column in row_group_row_counts?

A column that does not exist cannot have valid rows, or a row count for that matter, can it? So returning a null_array would indicate that the information is missing. Suppose we had just a single column and, for whatever reason, it does not exist in the parquet file (or does not match). With the current implementation we would still return a 'valid' row count in this scenario, while accessing a somewhat invalid or useless parquet file.

I'm not sure my train of thought is easy to follow here. However, I think we should return a null_array in both cases, for row groups as well as data pages.

But, I might be missing something. Perhaps you can explain the reasoning why we don't return a null_array in the current row_group_row_counts impl @alamb? That would be nice and might help guide our decision here.

(If we cannot decide, I think returning an Error would be the best option)
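To make the null-array idea concrete, here is a minimal sketch using only the Rust standard library. It is not the DataFusion implementation: `Vec<Option<u64>>` stands in for a nullable `UInt64Array`, and the function name, the `column_exists` flag, and the signature are assumptions made purely for illustration.

```rust
/// Hypothetical helper: one entry per row group.
/// `None` models an Arrow null, i.e. "no information available".
fn row_counts_or_nulls(column_exists: bool, rows_per_group: &[u64]) -> Vec<Option<u64>> {
    rows_per_group
        .iter()
        // For a missing column every entry becomes null instead of a "valid" count.
        .map(|&n| if column_exists { Some(n) } else { None })
        .collect()
}
```

With this shape the result always has one entry per row group, so a caller can still line it up with the row groups while seeing that the counts themselves are unknown.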

Why do we return row counts for a non existing column in row_group_row_counts?

I think the (not great) reason is that it is the API needed for PruningStatistics here:

datafusion/datafusion/core/src/datasource/physical_plan/parquet/row_groups.rs

Lines 386 to 390 in a923c65

    fn row_counts(&self, _column: &Column) -> Option<ArrayRef> {
        // row counts are the same for all columns in a row group
        StatisticsConverter::row_group_row_counts(self.metadata_iter())
            .ok()
            .map(|counts| Arc::new(counts) as ArrayRef)

Also, since the ParquetMetadata knows how many row groups there are even when there are no row group statistics, it is possible to return row counts for row groups.

For data pages it is different: if the "page index" is not present, I don't think there is any way to know how many data pages there are other than reading the file.
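The difference between the two levels can be sketched as follows. This is a simplified model using only the standard library, not the parquet crate API: `RowGroupMeta`, `PageLocation`, and both function signatures are hypothetical stand-ins. It assumes each page location records the index of its first row within the row group, so a page's row count is the gap to the next page's first row (or to the end of the row group for the last page).

```rust
/// Hypothetical stand-in for one page entry of a column's offset index.
struct PageLocation {
    /// Index of the first row of this page within its row group.
    first_row_index: u64,
}

/// Hypothetical stand-in for row group metadata: the row count is always known.
struct RowGroupMeta {
    num_rows: u64,
}

/// Row counts per row group need only the row group metadata.
fn row_group_row_counts(row_groups: &[RowGroupMeta]) -> Vec<u64> {
    row_groups.iter().map(|rg| rg.num_rows).collect()
}

/// Row counts per data page need the page locations of a column.
/// Returns `None` when no page locations are available.
fn data_page_row_counts(
    row_group: &RowGroupMeta,
    page_locations: Option<&[PageLocation]>,
) -> Option<Vec<u64>> {
    let pages = page_locations?;
    let mut counts = Vec::with_capacity(pages.len());
    for (i, page) in pages.iter().enumerate() {
        // A page ends where the next one starts; the last page ends with the row group.
        let end = pages
            .get(i + 1)
            .map_or(row_group.num_rows, |next| next.first_row_index);
        // Saturate rather than underflow if the (assumed) ordering is violated.
        counts.push(end.saturating_sub(page.first_row_index));
    }
    Some(counts)
}
```

In this model the row group function can always answer, while the page function has nothing to compute from once the page locations are absent, which is the asymmetry described above.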

The more we explore this, the more you have convinced me that we shouldn't return row counts for non existent columns in row_group_row_counts either

So I guess my new proposal would be to return Option like:

    impl<'a> StatisticsConverter<'a> {
        ...
        /// return OK(None) if the column does not exist
        pub(crate) fn null_counts_page_statistics<'a, I>(iterator: I) -> Result<Option<UInt64Array>> {
            ...
        }

        /// return OK(None) if the column does not exist
        pub fn data_page_row_counts<I>(
            &self,
            column_offset_index: &ParquetOffsetIndex,
            row_group_metadatas: &[RowGroupMetaData],
            row_group_indices: I,
        ) -> Result<Option<UInt64Array>>
        where
            I: IntoIterator<Item = &'a usize>,
        {
            ...
        }
    }

The rationale to return an Option rather than an error is that creating and ignoring DataFusionError via ok() still requires a string allocation, which is wasteful

I realize this is done in many places already in the statistics extraction code, but I think for those cases it is meant to make the code resilient to incorrectly encoded parquet files rather than something that is "expected" to happen.
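A minimal, standard-library-only sketch of that trade-off is shown below. It does not use DataFusion types: `StatsError`, both function names, and the `columns`/`counts` parameters are hypothetical, and `Vec<u64>` stands in for `UInt64Array`. The first variant treats a missing column as an error whose message is formatted and allocated even when the caller immediately drops it with `.ok()`; the second treats it as an expected outcome and returns `Ok(None)` without building any message.

```rust
/// Hypothetical error type; constructing it allocates a `String`.
#[derive(Debug, PartialEq)]
struct StatsError(String);

/// Variant 1: a missing column is reported as an error.
fn row_counts_or_err(
    columns: &[&str],
    name: &str,
    counts: &[u64],
) -> Result<Vec<u64>, StatsError> {
    if columns.contains(&name) {
        Ok(counts.to_vec())
    } else {
        // The message is built here even if the caller discards it via `.ok()`.
        Err(StatsError(format!("column {name} not found")))
    }
}

/// Variant 2: a missing column is an expected case, reported as `Ok(None)`.
/// `Err` stays reserved for genuinely unexpected failures.
fn row_counts_opt(
    columns: &[&str],
    name: &str,
    counts: &[u64],
) -> Result<Option<Vec<u64>>, StatsError> {
    if columns.contains(&name) {
        Ok(Some(counts.to_vec()))
    } else {
        Ok(None)
    }
}
```

Both variants give the caller the same information after `.ok()` on the first one, but only the second avoids the throwaway allocation on the missing-column path.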

The more we explore this, the more you have convinced me that we shouldn't return row counts for non existent columns in row_group_row_counts either

But I guess this is a topic (or PR) for another day?

So for now, we only change data_page_row_counts and merge this? I'd be fine with that.

So for now, we only change data_page_row_counts and merge this? I'd be fine with that.

I think that is a good idea and I will file a ticket to fix row_group_row_counts in a follow on

I changed it accordingly. Thanks again for your input here

Filed #10965 to track the changes to row_group_row_counts

datafusion/core/tests/parquet/arrow_statistics.rs

marvinlanhenke · 2024-06-17T11:40:12Z

Thank you very much @marvinlanhenke -- I think this looks great to me

I see the complication related to data_page_row_counts. I don't have a great answer for that at the moment other than potentially to return an error.

I believe this PR could be merged in as is and we can update the behavior in a follow on PR (or leave it as is for a while). Let me know what you think

@alamb Thanks for the review. I left one question regarding the current impl of row_group_row_counts ... if this question does not help regarding data_page_row_counts, I'm a +1 for returning an error instead.

alamb

Thanks again @marvinlanhenke 🚀

alamb

Thanks @marvinlanhenke

* feat: add run_with_schema + add test_case
* fix: null_counts
* fix: row_counts
* refactor: change return type of data_page_row_counts
* refactor: shorten row_group_indices

marvinlanhenke added 3 commits June 17, 2024 07:42

feat: add run_with_schema + add test_case

2736ed6

fix: null_counts

4a77c08

fix: row_counts

585e78f

github-actions bot added the core Core DataFusion crate label Jun 17, 2024

alamb approved these changes Jun 17, 2024


marvinlanhenke added 2 commits June 17, 2024 16:03

refactor: change return type of data_page_row_counts

dd5519d

refactor: shorten row_group_indices

b10da44

alamb approved these changes Jun 17, 2024


alamb mentioned this pull request Jun 17, 2024

Change StatisticsConverter::row_group_counts to return None for non existent columns in parquet files #10965

Closed

alamb approved these changes Jun 17, 2024


alamb merged commit 1cb0057 into apache:main Jun 17, 2024
23 checks passed
