
Implement capability to restore non-nullability in Features #7482


Open · wants to merge 2 commits into main

Conversation

BramVanroy
Contributor

This PR attempts to keep track of non-nullable pyarrow fields when converting a pa.Schema to Features, and to restore the original non-nullable fields when outputting the arrow_schema. This allows for more consistent behavior and avoids the breaking behavior illustrated in #7479.

I am by no means a pyarrow expert, so some logic in find_non_nullable_fields may not be perfect. I'm not sure whether more logic (type checks) is needed for deep-checking a given schema. Maybe there are other pyarrow structures that need to be covered?
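For illustration, a minimal sketch of the kind of traversal I have in mind (an assumption about the helper's shape, not the exact PR code; the real find_non_nullable_fields may need to cover more nested types):

import pyarrow as pa

def find_non_nullable_fields(schema: pa.Schema) -> list[str]:
    # Collect dotted paths of every non-nullable field, recursing into structs.
    paths = []

    def visit(field: pa.Field, prefix: str) -> None:
        path = f"{prefix}.{field.name}" if prefix else field.name
        if not field.nullable:
            paths.append(path)
        if pa.types.is_struct(field.type):
            for i in range(field.type.num_fields):
                visit(field.type.field(i), path)

    for field in schema:
        visit(field, "")
    return paths

schema = pa.schema([
    pa.field("id", pa.int64(), nullable=False),
    pa.field("meta", pa.struct([pa.field("name", pa.string(), nullable=False)])),
])
print(find_non_nullable_fields(schema))  # ['id', 'meta.name']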

Tests are added, but again, these may not have sufficient coverage in terms of pyarrow structure types.

closes #7479

@BramVanroy
Contributor Author

BramVanroy commented Mar 27, 2025

Interestingly, this does not close #7479. The Features are not correctly maintained when calling from_dict with the custom Features.
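A minimal repro of what I mean (a sketch; the exact behavior may vary across datasets versions):

import pyarrow as pa
from datasets import Dataset, Features

schema = pa.schema([pa.field("id", pa.int64(), nullable=False)])
features = Features.from_arrow_schema(schema)
ds = Dataset.from_dict({"id": [1, 2, 3]}, features=features)
# The non-nullability is not preserved in the underlying table:
print(ds.data.schema.field("id").nullable)  # True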

@BramVanroy
Contributor Author

Unfortunately this PR does not fix the reported issue. After more digging:

  • when the dataset is created, nullability information is lost in Features;
  • even with this PR, that information eventually gets lost anyway, because the Features object is internally copied/recreated without accounting for the non-nullable fields;
  • even if that is also fixed, and Features.arrow_schema correctly holds the nullability info, casting the arrow Table from a less strict schema to a stricter one (with non-nullable fields) will fail, but only on deeper structs, not on flat fields (see the sketch below).
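To illustrate the last point, a small sketch of the cast behavior I'm seeing (the exact exception may differ across pyarrow versions):

import pyarrow as pa

# Flat field: casting to a non-nullable schema works.
flat = pa.table({"a": [1, 2]})
flat = flat.cast(pa.schema([pa.field("a", pa.int64(), nullable=False)]))

# Nested struct: the equivalent cast fails.
nested = pa.table({"s": [{"a": 1}, {"a": 2}]})
strict = pa.schema(
    [pa.field("s", pa.struct([pa.field("a", pa.int64(), nullable=False)]))]
)
nested = nested.cast(strict)  # raises an ArrowNotImplementedError for me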

Interestingly, passing custom Features does not immediately load the underlying data with the right arrow_schema. Instead, the workflow is like this:

  • load the pyarrow table with any of the methods (from_dict, from_pandas, etc.), which will always AUTO INFER the schema rather than use a provided one;
  • the table with the auto-inferred schema is then used to initialize the Dataset class, which will only CAST the table to the user-provided schema during construction, if it differs from the auto-inferred one (see the sketch below).
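In other words, roughly (a simplified sketch of the flow as I understand it, not the actual datasets source):

import pyarrow as pa
from typing import Optional
from datasets import Features

def build_table(mapping: dict, features: Optional[Features] = None) -> pa.Table:
    # Step 1: the schema is always auto-inferred here, even if features is given.
    pa_table = pa.Table.from_pydict(mapping)
    # Step 2: only afterwards (in datasets, during Dataset construction) is the
    # table cast to the user-provided schema, if it differs.
    if features is not None and pa_table.schema != features.arrow_schema:
        pa_table = pa_table.cast(features.arrow_schema)
    return pa_table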

So I figured, since many/all of the pyarrow Table.from_* methods have a schema= argument, we should already load the Table with the correct schema to begin with. As an example, I tried changing this line:

pa_table = InMemoryTable.from_pydict(mapping=mapping)

to include the arrow_schema, if provided:

pa_table = InMemoryTable.from_pydict(mapping=mapping, schema=features.arrow_schema if features is not None else None)

But that leads to:

Traceback (most recent call last):
  File "/home/ampere/vanroy/datasets/scratch.py", line 33, in <module>
    ds = Dataset.from_dict(
         ^^^^^^^^^^^^^^^^^^
  File "/home/local/vanroy/datasets/src/datasets/arrow_dataset.py", line 957, in from_dict
    pa_table = InMemoryTable.from_pydict(mapping=mapping, schema=features.arrow_schema if features is not None else None)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/local/vanroy/datasets/src/datasets/table.py", line 758, in from_pydict
    return cls(pa.Table.from_pydict(*args, **kwargs))
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/table.pxi", line 1968, in pyarrow.lib._Tabular.from_pydict
  File "pyarrow/table.pxi", line 6354, in pyarrow.lib._from_pydict
  File "pyarrow/array.pxi", line 402, in pyarrow.lib.asarray
  File "pyarrow/array.pxi", line 252, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 114, in pyarrow.lib._handle_arrow_array_protocol
  File "/home/local/vanroy/datasets/src/datasets/arrow_writer.py", line 201, in __arrow_array__
    raise ValueError("TypedSequence is supposed to be used with pa.array(typed_sequence, type=None)")
ValueError: TypedSequence is supposed to be used with pa.array(typed_sequence, type=None)

and I am not familiar enough with pyarrow to solve this.

So ultimately I'm a bit at a loss here. I think that if we want to do this right, the automatic casting in __init__ should be removed in favor of handling the logic inside Dataset.from_*, by passing the schema explicitly to pa.Table.from_*(..., schema=schema). But I lack the pyarrow knowledge to go further than what I've written above.

@lhoestq
Member

lhoestq commented Mar 27, 2025

It's indeed a bit more work to support nullable fields since, in addition to your comments, the behavior is unclear when it comes to concatenating nullable with non-nullable data, and maybe also when handling non-nullable lists and nested data.

But yup, I agree that having the Dataset.from_* functions pass the schema to pa.Table.from_* would be the way to go.

Just one comment about this error:

ValueError: TypedSequence is supposed to be used with pa.array(typed_sequence, type=None)

This happens because Dataset.from_dict uses OptimizedTypedSequence by default, which should only be used if the user doesn't specify a schema (see the sketch below).
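A sketch of that direction (an assumption about the fix, not actual datasets code; the OptimizedTypedSequence usage is simplified):

from typing import Optional
from datasets import Features
from datasets.arrow_writer import OptimizedTypedSequence
from datasets.table import InMemoryTable

def from_pydict_with_schema(mapping: dict, features: Optional[Features] = None) -> InMemoryTable:
    if features is not None:
        # A schema was provided: build the table directly against it and skip
        # the OptimizedTypedSequence wrapping entirely.
        return InMemoryTable.from_pydict(mapping=mapping, schema=features.arrow_schema)
    # No schema: keep the existing optimized type inference.
    wrapped = {col: OptimizedTypedSequence(data, col=col) for col, data in mapping.items()}
    return InMemoryTable.from_pydict(mapping=wrapped)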

Linked issue: #7479 (Features.from_arrow_schema is destructive)