Arrow parquet reader produces incomplete nested NULL masks #691

Closed
scovich opened this issue Feb 12, 2025 · 0 comments · Fixed by #692

scovich commented Feb 12, 2025

Describe the bug

See also: apache/arrow-rs#7119

Starting in arrow-rs v53.3, the parquet reader stops populating NULL masks for non-nullable leaf columns -- even if an ancestor is nullable.

It turns out arrow-rs never guaranteed that NULL masks would be complete or trustworthy in this way, which breaks a pretty basic assumption of the kernel RowVisitor API.

The immediate symptom is reading uninitialized garbage values from non-nullable columns nested inside nullable columns, in rows where the parent is NULL -- a very common scenario when consuming Delta log actions.
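The symptom can be illustrated with a small plain-Rust model. This is only a sketch: `NestedColumn` and its methods are hypothetical stand-ins invented for illustration, not arrow-rs or kernel types, and the "garbage" value is just an arbitrary placeholder.

```rust
/// A simplified stand-in for a nullable struct column with one
/// non-nullable leaf child. Hypothetical model, not an arrow-rs type.
struct NestedColumn {
    /// One entry per row: `true` if the parent struct is non-NULL.
    parent_valid: Vec<bool>,
    /// Leaf NULL mask; `None` means the reader did not produce one.
    leaf_valid: Option<Vec<bool>>,
    /// Leaf values; slots under a NULL parent hold arbitrary data.
    leaf_values: Vec<i64>,
}

impl NestedColumn {
    /// Trusts only the leaf's own mask, as a visitor that assumes
    /// complete leaf masks might do.
    fn get_trusting_leaf_mask(&self, row: usize) -> Option<i64> {
        let leaf_ok = self.leaf_valid.as_ref().map_or(true, |m| m[row]);
        leaf_ok.then(|| self.leaf_values[row])
    }

    /// Also consults the parent's mask before reading the leaf.
    fn get_checking_parent(&self, row: usize) -> Option<i64> {
        let leaf_ok = self.leaf_valid.as_ref().map_or(true, |m| m[row]);
        (self.parent_valid[row] && leaf_ok).then(|| self.leaf_values[row])
    }
}
```

With `parent_valid = [true, false, true]`, no leaf mask, and an arbitrary value stored in row 1, the first accessor returns that arbitrary value for row 1 while the second returns `None`.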

To Reproduce

See the arrow issue above for a minimal repro.

Expected behavior

A nested column projected out of a struct should have accurate NULL masks, accounting for NULL parents.

Additional context

No response

scovich added the bug label Feb 12, 2025
scovich changed the title from "Arrow parquet reader produces incomplete NULL masks" to "Arrow parquet reader produces incomplete nested NULL masks" Feb 12, 2025
sebastiantia pushed a commit to sebastiantia/delta-kernel-rs that referenced this issue Feb 19, 2025
## What changes are proposed in this pull request?

Starting in arrow-53.3, the parquet reader no longer computes NULL masks
for non-nullable leaf columns -- even if they have nullable ancestors.
This breaks row visitors, which rely on each leaf column having a fully
accurate NULL mask.

The quick-fix solution is to manually fix up the NULL masks of every
`RecordBatch` that comes from the parquet reader.
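As a rough sketch of what such a fix-up involves, the following plain-Rust model pushes ancestor NULL masks down a column tree. It is an assumption-laden illustration: `Column`, `combine`, and `fix_nested_null_masks` are hypothetical names, masks are modeled as `Vec<bool>` rather than Arrow null buffers, and the actual change operates on arrow `RecordBatch` data, which this sketch does not attempt to reproduce.

```rust
/// Hypothetical model of a column tree; not an arrow-rs type.
#[derive(Debug, Clone, PartialEq)]
struct Column {
    /// Per-row validity (`true` = non-NULL); `None` = no mask present.
    validity: Option<Vec<bool>>,
    /// Child columns for struct columns; empty for leaves.
    children: Vec<Column>,
}

/// Combines an inherited ancestor mask with a column's own mask.
/// A row is valid only if it is valid in every mask that is present.
fn combine(parent: Option<&[bool]>, own: Option<&[bool]>) -> Option<Vec<bool>> {
    match (parent, own) {
        (None, None) => None,
        (Some(m), None) | (None, Some(m)) => Some(m.to_vec()),
        (Some(p), Some(o)) => Some(p.iter().zip(o).map(|(a, b)| *a && *b).collect()),
    }
}

/// Pushes ancestor NULL masks down so every descendant's mask is complete.
fn fix_nested_null_masks(col: &mut Column, inherited: Option<&[bool]>) {
    // First make this column's own mask account for its ancestors.
    col.validity = combine(inherited, col.validity.as_deref());
    // Then hand the now-complete mask to each child.
    let effective = col.validity.as_deref();
    for child in &mut col.children {
        fix_nested_null_masks(child, effective);
    }
}
```

Calling `fix_nested_null_masks(&mut root, None)` on a root whose mask is `[true, false, true]` gives a mask-less child the mask `[true, false, true]`, and turns a child mask of `[false, true, true]` into `[false, false, true]`.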

Fixes delta-io#691

## How was this change tested?

New unit test that checks whether parquet reads produce properly nested
NULL masks. The test also leverages (and verifies) the JSON parser, so
we can reliably detect any unwelcome behavior changes to JSON parsing
that might land in the future.