## What changes are proposed in this pull request?
Starting in arrow-53.3, the parquet reader no longer computes NULL masks
for non-nullable leaf columns -- even if they have nullable ancestors.
This breaks row visitors, which rely on each leaf column having a fully
accurate NULL mask.
The quick-fix solution is to manually fix up the NULL masks of every
`RecordBatch` that comes from the parquet reader.
Fixes delta-io#691
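As a rough illustration of what such a fix-up involves, the sketch below pushes each parent's validity down into its children. It does not use the arrow-rs API: `Column`, `combine`, and `fix_nested_null_masks` are hypothetical names, and plain `Vec<bool>` masks (`true` = valid) stand in for real null buffers.

```rust
/// Hypothetical, simplified stand-in for a columnar array (not an arrow-rs type).
/// A validity of `None` means "no mask present", i.e. every row is claimed valid.
#[derive(Debug, Clone, PartialEq)]
enum Column {
    Leaf { values: Vec<i64>, validity: Option<Vec<bool>> },
    Struct { children: Vec<Column>, validity: Option<Vec<bool>> },
}

/// Combine a parent's effective validity with a column's own validity:
/// a row is valid only if both say it is valid.
fn combine(parent: Option<&[bool]>, own: Option<&[bool]>) -> Option<Vec<bool>> {
    match (parent, own) {
        (None, None) => None,
        (Some(p), None) => Some(p.to_vec()),
        (None, Some(o)) => Some(o.to_vec()),
        (Some(p), Some(o)) => Some(p.iter().zip(o).map(|(a, b)| *a && *b).collect()),
    }
}

/// Recursively rewrite masks so every nested column accounts for NULL ancestors.
fn fix_nested_null_masks(col: Column, parent: Option<&[bool]>) -> Column {
    match col {
        Column::Leaf { values, validity } => {
            let validity = combine(parent, validity.as_deref());
            Column::Leaf { values, validity }
        }
        Column::Struct { children, validity } => {
            // The struct's effective mask becomes the parent mask for its children.
            let validity = combine(parent, validity.as_deref());
            let children = children
                .into_iter()
                .map(|child| fix_nested_null_masks(child, validity.as_deref()))
                .collect();
            Column::Struct { children, validity }
        }
    }
}
```

Passing `None` as the parent mask for a top-level column leaves that column's own mask unchanged while still correcting everything nested beneath it.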
## How was this change tested?
New unit test that checks whether parquet reads produce properly nested
NULL masks. The test also leverages (and verifies) the JSON parser, so
we can reliably detect any unwelcome behavior changes to JSON parsing
that might land in the future.
Describe the bug
See also: apache/arrow-rs#7119
Starting in arrow-rs v53.3, the parquet reader stops populating NULL masks for non-nullable leaf columns -- even if an ancestor is nullable.
It turns out arrow-rs never guaranteed that NULL masks would be complete or trustworthy in this way, which breaks a pretty basic assumption of the kernel `RowVisitor` API.
The immediate symptom is reading uninitialized garbage values from non-nullable columns nested inside nullable columns, in rows where the parent is NULL -- a very common scenario when consuming Delta log actions.
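To make the symptom concrete, here is a minimal sketch of a reader that trusts only the leaf's own mask, next to one that also consults the parent. `NestedLeaf`, `read_naive`, and `read_checked` are hypothetical names for illustration and are not part of the kernel or arrow-rs; the value `-999` stands in for whatever garbage happens to sit in the slot of a NULL parent row.

```rust
/// Hypothetical flattened view of a nullable struct column with one
/// non-nullable leaf (not a kernel or arrow-rs type). `true` = valid.
struct NestedLeaf<'a> {
    parent_validity: &'a [bool],
    /// `None` models a reader that did not populate the leaf's mask.
    leaf_validity: Option<&'a [bool]>,
    leaf_values: &'a [i64],
}

impl NestedLeaf<'_> {
    /// Trusts only the leaf's own mask, so a missing mask means "always valid".
    fn read_naive(&self, row: usize) -> Option<i64> {
        match self.leaf_validity {
            Some(mask) if !mask[row] => None,
            _ => Some(self.leaf_values[row]),
        }
    }

    /// Also checks the parent, so rows with a NULL parent are reported as NULL.
    fn read_checked(&self, row: usize) -> Option<i64> {
        if !self.parent_validity[row] {
            return None;
        }
        self.read_naive(row)
    }
}
```

With `parent_validity = [true, false, true]`, no leaf mask, and `leaf_values = [10, -999, 30]`, `read_naive(1)` hands back the placeholder value while `read_checked(1)` returns `None`.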
To Reproduce
See the arrow issue above for a minimal repro.
Expected behavior
A nested column projected out of a struct should have accurate NULL masks, accounting for NULL parents.
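One way to state this expectation as a checkable invariant is sketched below: wherever the parent is NULL, the child must report NULL too. The function name `null_mask_is_nested` and the `&[bool]` mask representation (`true` = valid) are assumptions made for illustration, not an existing API.

```rust
/// Returns true if the child reports NULL in every row where the parent is NULL.
/// Hypothetical helper; masks are plain bool slices with `true` = valid.
fn null_mask_is_nested(parent_validity: &[bool], child_validity: Option<&[bool]>) -> bool {
    match child_validity {
        // A missing mask claims every row is valid, which is only accurate
        // when no parent row is NULL.
        None => parent_validity.iter().all(|&valid| valid),
        Some(child) => {
            child.len() == parent_validity.len()
                && parent_validity
                    .iter()
                    .zip(child)
                    // A valid child row requires a valid parent row.
                    .all(|(&parent_ok, &child_ok)| parent_ok || !child_ok)
        }
    }
}
```

A check along these lines could back a regression test: a child with no mask under a parent that has NULL rows fails the invariant.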
Additional context
No response