Skip to content

Fix: StatisticsConverter counts for missing columns #10946

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Jun 17, 2024

Conversation

marvinlanhenke
Copy link
Contributor

Which issue does this PR close?

Closes #10926.

Rationale for this change

  • row_group_null_counts, data_page_null_counts
  • data_page_row_counts

return an ArrayRef instead of UInt64Array.

What changes are included in this PR?

  • fixed methods to return correct array type
  • changed data_page_row_counts to fall back on row_page_row_counts if column is missing
  • added a test case in arrow_statistics

Are these changes tested?

Yes.

Are there any user-facing changes?

@github-actions github-actions bot added the core Core DataFusion crate label Jun 17, 2024
@marvinlanhenke
Copy link
Contributor Author

@alamb PTAL.

I'm not sure about the assumptions I made in data_page_row_counts here.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much @marvinlanhenke -- I think this looks great to me

I see the complication related to data_page_row_counts. I don't have a great answer for that at the moment other than potentially to return an error.

I believe this PR could be merged in as is and we can update the behavior in a follow on PR (or leave it as is for a while). Let me know what you think

// thus we cannot extract page_locations in order to determine
// the row count on a per DataPage basis.
// We use `row_group_row_counts` instead.
return Self::row_group_row_counts(row_group_metadatas);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see -- this is a tricky situation where there is no column and thus no information on data pages.

Another potential behavior that might make sense here would be to return an error because unlike other functions in StatisticsConverter there is no way to "gracefully" ignore missing information

Or we could possible return an array with zero rows 🤔

Copy link
Contributor Author

@marvinlanhenke marvinlanhenke Jun 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

...perhaps looking at the behavior of row_group_row_count might help.

My main question here is:
Why do we return row counts for a non existing column in row_group_row_counts?

A column that does not exists cannot have valid rows, or a row count in that regard? So returning a null_array would indicate those missing information? Suppose we had just a single column, and for whatever reason, it does not exist in the parquet file (or does not match). With the current implementation we would still return a 'valid' row count in this scenario, accessing a somewhat invalid / useless parquet file?

I'm not sure one can follow my train of thought here, however I think we should return a null_array in both cases for row_groups and data_pages as well.

But, I might be missing something. Perhaps you can explain the reasoning why we don't return a null_array in the current row_group_row_counts impl @alamb? That would be nice and might help guide our decision here.

(If we cannot decide, I think returning an Error would be the best option)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we return row counts for a non existing column in row_group_row_counts?

I think the (not great) reason is that it is the API needed for PruningStatistics here:

fn row_counts(&self, _column: &Column) -> Option<ArrayRef> {
// row counts are the same for all columns in a row group
StatisticsConverter::row_group_row_counts(self.metadata_iter())
.ok()
.map(|counts| Arc::new(counts) as ArrayRef)

Also, since the ParquetMetadata knows how many row groups there are even when there are no row group statistics, it is possible

For data pages, it is different as if the "page index" is not present then I don't think there is any way to know how many data pages there are other than reading the file

The more we explore this, the more you have convinced me that we shouldn't return row counts for non existent columns in row_group_row_counts either

So I guess my new proposal would be to return Option like:

impl<'a> StatisticsConverter<'a> {
...
  /// return OK(None) if the column does not exist
  pub(crate) fn null_counts_page_statistics<'a, I>(iterator: I) -> Result<Option<UInt64Array>> {
  ...
  }

  /// return OK(None) if the column does not exist
    pub fn data_page_row_counts<I>(
        &self,
        column_offset_index: &ParquetOffsetIndex,
        row_group_metadatas: &[RowGroupMetaData],
        row_group_indices: I,
    ) -> Result<Option<UInt64Array>>
    where
        I: IntoIterator<Item = &'a usize>,
    {
    ...
    }

}

The rationale to return an Option rather than an error is that creating and ignoring DataFusionError via ok() still requires a string allocation, which is wasteful

I realize this is done many places already in the statistics extraction code, but I think for those cases it is meant to make the code resilent to incorrectly encoded parquet files rather than something that is "expected" to happen

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The more we explore this, the more you have convinced me that we shouldn't return row counts for non existent columns in row_group_row_counts either

But I'll guess this is a topic (or PR) for another day?

So for now, we only change data_page_row_counts and merge this? I'd be fine with that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So for now, we only change data_page_row_counts and merge this? I'd be fine with that.

I think that is a good idea and I will file a ticket to fix row_group_row_counts in a follow on

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed it accordingly. Thanks again for your input here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Filed #10965 to track the changes to row_group_row_counts

@marvinlanhenke
Copy link
Contributor Author

Thank you very much @marvinlanhenke -- I think this looks great to me

I see the complication related to data_page_row_counts. I don't have a great answer for that at the moment other than potentially to return an error.

I believe this PR could be merged in as is and we can update the behavior in a follow on PR (or leave it as is for a while). Let me know what you think

@alamb Thanks for the review. I left one question regarding the current impl of row_group_row_counts ... if this question does not help regarding data_pages_row_count, I'm a +1 for returning an error instead.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks again @marvinlanhenke 🚀

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alamb alamb merged commit 1cb0057 into apache:main Jun 17, 2024
23 checks passed
findepi pushed a commit to findepi/datafusion that referenced this pull request Jul 16, 2024
* feat: add run_with_schema + add test_case

* fix: null_counts

* fix: row_counts

* refactor: change return type of data_page_row_counts

* refactor: shorten row_group_indices
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

StatisticsConverter::row_group_null_counts incorrect for missing column
2 participants