GH-45185: [C++][Parquet] Raise an error for invalid repetition levels when delimiting records #45186
Conversation
Note for others taking a look: tests won't pass until we merge the parquet-testing PR.
be347da to 22794e2
The failing tests all look to be caused by #45305 rather than this change.
I've checked the code, and checking the rep levels here is fine, since checking them anywhere else is nearly impossible 😂
TEST(TestArrowReaderAdHoc, InvalidRepetitionLevels) {
// GH-45185 - Repetition levels start with 1 instead of 0
auto path = test::get_data_file("ARROW-GH-45185.parquet", /*is_good=*/false);
TryReadDataFile(path, ::arrow::StatusCode::IOError);
Would you mind also checking that the status message is "The repetition level at the start of a record must be 0 but got ..."?
👍 done
cpp/src/parquet/column_reader.cc
Outdated
@@ -1611,7 +1611,12 @@ class TypedRecordReader : public TypedColumnReaderImpl<DType>,
// another record start or exhausting the ColumnChunk
int64_t level = levels_position_;
if (at_record_start_) {
ARROW_DCHECK_EQ(0, rep_levels[levels_position_]);
if (rep_levels[levels_position_] != 0) {
Can we use `ARROW_PREDICT_FALSE` to check it?
Yes, fixed
Oh my, I just found that this was never merged... I'll rebase and try to merge it.
After merging your PR, Conbench analyzed the 0 benchmarking runs that have been run so far on merge-commit 85779a4. None of the specified runs were found on the Conbench server. The full Conbench report has more details.
Thanks!
…levels when delimiting records (apache#45186)

### Rationale for this change

See apache#45185. Invalid repetition levels would previously only cause a fatal error in debug builds.

### What changes are included in this PR?

Replaces an existing `ARROW_DCHECK_EQ` of the repetition level with a check that will raise an exception in release builds too.

### Are these changes tested?

Yes, using a new example file (apache/parquet-testing#67)

### Are there any user-facing changes?

Yes, reading columns with invalid repetition levels as Arrow arrays will now raise an exception.

* GitHub Issue: apache#45185

Lead-authored-by: Adam Reeve <[email protected]>
Co-authored-by: mwish <[email protected]>
Signed-off-by: mwish <[email protected]>
Rationale for this change
See #45185. Invalid repetition levels would previously only cause a fatal error in debug builds.
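To illustrate why a debug-only check is not enough, here is a minimal sketch contrasting the two styles. `SKETCH_DCHECK_EQ`, `CheckStartDebugOnly`, and `CheckStartAlways` are hypothetical names used for illustration, not Arrow's implementation. The sketch assumes the debug check is compiled out whenever `NDEBUG` is defined, as is typical for release builds.

```cpp
#include <cstdint>
#include <cstdlib>
#include <stdexcept>

// Hypothetical debug-only check (an assumption, standing in for the role
// ARROW_DCHECK_EQ plays here): active without NDEBUG, a no-op with it.
#ifdef NDEBUG
#define SKETCH_DCHECK_EQ(expected, actual) ((void)0)
#else
#define SKETCH_DCHECK_EQ(expected, actual)    \
  do {                                        \
    if ((expected) != (actual)) std::abort(); \
  } while (false)
#endif

// Debug-only validation: aborts in a debug build, but in a build with
// NDEBUG defined a non-zero level passes through unnoticed.
inline void CheckStartDebugOnly(int16_t rep_level) {
  SKETCH_DCHECK_EQ(0, rep_level);
  (void)rep_level;  // silences the unused-parameter warning under NDEBUG
}

// Always-on validation: throws in every build type, so the caller can
// report the malformed input instead of continuing with bad state.
inline void CheckStartAlways(int16_t rep_level) {
  if (rep_level != 0) {
    throw std::invalid_argument("record must start at repetition level 0");
  }
}
```

With a valid level of 0 both functions return normally. With a level of 1, `CheckStartAlways` throws regardless of build type, while `CheckStartDebugOnly` only reacts when `NDEBUG` is not defined.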
What changes are included in this PR?
Replaces an existing `ARROW_DCHECK_EQ` of the repetition level with a check that will raise an exception in release builds too.
Are these changes tested?
Yes, using a new example file (apache/parquet-testing#67)
Are there any user-facing changes?
Yes, reading columns with invalid repetition levels as Arrow arrays will now raise an exception.
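For readers unfamiliar with how repetition levels delimit records, the following is a rough sketch of the idea, not the code in `column_reader.cc`. `CountRecords` and `SKETCH_PREDICT_FALSE` are hypothetical names (the latter stands in for a branch hint like `ARROW_PREDICT_FALSE`). The error text after "but got" is elided in the review thread, so appending the offending level is an assumption, as is the use of `std::runtime_error` for the exception type.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical branch hint (an assumption, standing in for
// ARROW_PREDICT_FALSE): tells the compiler the condition is rarely true.
#if defined(__GNUC__) || defined(__clang__)
#define SKETCH_PREDICT_FALSE(x) (__builtin_expect(!!(x), 0))
#else
#define SKETCH_PREDICT_FALSE(x) (x)
#endif

// Counts the records described by a run of repetition levels.
// A level of 0 starts a new record; a non-zero level continues the current one.
inline int64_t CountRecords(const std::vector<int16_t>& rep_levels) {
  int64_t records = 0;
  std::size_t pos = 0;
  while (pos < rep_levels.size()) {
    // At a record start the level must be 0; this is checked in every
    // build type rather than only in debug builds.
    if (SKETCH_PREDICT_FALSE(rep_levels[pos] != 0)) {
      throw std::runtime_error(
          "The repetition level at the start of a record must be 0 but got " +
          std::to_string(rep_levels[pos]));
    }
    ++records;
    ++pos;
    // Consume the rest of this record: all following non-zero levels.
    while (pos < rep_levels.size() && rep_levels[pos] != 0) {
      ++pos;
    }
  }
  return records;
}
```

In this sketch, the levels `{0, 1, 1, 0, 2, 0}` describe three records, while `{1, 0}` throws because the first level is 1 instead of 0, which is the situation described in GH-45185.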