-
Notifications
You must be signed in to change notification settings - Fork 2.8k
Implement capability to restore non-nullability in Features #7482
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Interestingly, this does not close #7479. The Features are not correctly maintained when calling |
Unfortunately this PR does not fix the reported issue. After more digging:
Interestingly, passing custom Features does not immediately load the underlying data with the right arrow_schema. Instead, the workflow is like this:
So I figured, since many/all of the pyarrow datasets/src/datasets/arrow_dataset.py Line 940 in 5f8d2ad
to include the arrow_schema, if provided: pa_table = InMemoryTable.from_pydict(mapping=mapping, schema=features.arrow_schema if features is not None else None) But that leads to:
and I am not too familiar with pyarrow to solve this. So ultimately I'm a bit at a loss here. I think, if we'd want to do this right, the automatic casting in init should be removed in favor of handling the logic inside |
It's indeed a bit more work to support nullable since in addition to your comments, there are unclear behavior when it comes to concatenating nullable with non-nullable, and maybe how to handle non-nullable lists and nested data. But yup I agree having the Just one comment about this error:
This happens because |
This PR attempts to keep track of non_nullable pyarrow fields when converting a
pa.Schema
toFeatures
. At the same time, when outputting thearrow_schema
, the original non-nullable fields are restored. This allows for more consistent behavior and avoids breaking behavior as illustrated in #7479.I am by no means a pyarrow expert so some logic in
find_non_nullable_fields
may not perfect. Not sure if more logic (type checks) are needed for deep-checking a given schema. Maybe there are other pyarrow structures that need to be covered?Tests are added, but again, these may not have sufficient coverage in terms of pyarrow structure types.
closes #7479