
Implement capability to restore non-nullability in Features #7482


Open · wants to merge 2 commits into main

Conversation

BramVanroy
Contributor

This PR attempts to keep track of non-nullable pyarrow fields when converting a pa.Schema to Features, and to restore the original non-nullable fields when outputting the arrow_schema. This allows for more consistent behavior and avoids the breaking behavior illustrated in #7479.

I am by no means a pyarrow expert, so some logic in find_non_nullable_fields may not be perfect. I'm not sure whether more logic (type checks) is needed for deep-checking a given schema. Maybe there are other pyarrow structures that need to be covered?
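For illustration, a minimal sketch of the kind of traversal I have in mind (an assumption about the helper's shape, not the exact PR code; the real find_non_nullable_fields may need to cover more nested types):

import pyarrow as pa

def find_non_nullable_fields(schema: pa.Schema) -> list[str]:
    # Collect dotted paths of every non-nullable field, recursing into structs.
    paths = []

    def visit(field: pa.Field, prefix: str) -> None:
        path = f"{prefix}.{field.name}" if prefix else field.name
        if not field.nullable:
            paths.append(path)
        if pa.types.is_struct(field.type):
            for i in range(field.type.num_fields):
                visit(field.type.field(i), path)

    for field in schema:
        visit(field, "")
    return paths

schema = pa.schema([
    pa.field("id", pa.int64(), nullable=False),
    pa.field("meta", pa.struct([pa.field("name", pa.string(), nullable=False)])),
])
print(find_non_nullable_fields(schema))  # ['id', 'meta.name']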

Tests are added, but again, these may not have sufficient coverage in terms of pyarrow structure types.

closes #7479

@BramVanroy
Contributor Author

BramVanroy commented Mar 27, 2025

Interestingly, this does not close #7479. The Features are not correctly maintained when calling from_dict with the custom Features.
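A minimal repro of what I mean (a sketch; the exact behavior may vary across datasets versions):

import pyarrow as pa
from datasets import Dataset, Features

schema = pa.schema([pa.field("id", pa.int64(), nullable=False)])
features = Features.from_arrow_schema(schema)
ds = Dataset.from_dict({"id": [1, 2, 3]}, features=features)
# The non-nullability is not preserved in the underlying table:
print(ds.data.schema.field("id").nullable)  # True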

@BramVanroy
Contributor Author

Unfortunately this PR does not fix the reported issue. After more digging:

  • when the dataset is created, nullability information is lost in Features;
  • even with this PR, that information eventually gets lost anyway, because the Features object is internally copied/recreated without accounting for the non-nullable fields;
  • even if that is also fixed, and Features.arrow_schema correctly holds the nullability info, casting the arrow Table from a less strict schema to a stricter one (with non-nullable fields) will fail, but only on deeper structs, not on flat fields (see the sketch below).
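To illustrate the last point, a small sketch of the cast behavior I'm seeing (the exact exception may differ across pyarrow versions):

import pyarrow as pa

# Flat field: casting to a non-nullable schema works.
flat = pa.table({"a": [1, 2]})
flat = flat.cast(pa.schema([pa.field("a", pa.int64(), nullable=False)]))

# Nested struct: the equivalent cast fails.
nested = pa.table({"s": [{"a": 1}, {"a": 2}]})
strict = pa.schema(
    [pa.field("s", pa.struct([pa.field("a", pa.int64(), nullable=False)]))]
)
nested = nested.cast(strict)  # raises an ArrowNotImplementedError for me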

Interestingly, passing custom Features does not immediately load the underlying data with the right arrow_schema. Instead, the workflow is like this:

  • load the pyarrow table with any of the methods (from_dict, from_pandas, etc.), which will always AUTO INFER the schema rather than use a provided one;
  • the table with the auto-inferred schema is then used to initialize the Dataset class, which will only CAST the table to the user-provided schema during construction, if it differs from the auto-inferred one (see the sketch below).
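In other words, roughly (a simplified sketch of the flow as I understand it, not the actual datasets source):

import pyarrow as pa
from typing import Optional
from datasets import Features

def build_table(mapping: dict, features: Optional[Features] = None) -> pa.Table:
    # Step 1: the schema is always auto-inferred here, even if features is given.
    pa_table = pa.Table.from_pydict(mapping)
    # Step 2: only afterwards (in datasets, during Dataset construction) is the
    # table cast to the user-provided schema, if it differs.
    if features is not None and pa_table.schema != features.arrow_schema:
        pa_table = pa_table.cast(features.arrow_schema)
    return pa_table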

So I figured, since many/all of the pyarrow Table.from_* methods have a schema= argument, we should already load the Table with the correct schema to begin with. As an example, I tried changing this line:

pa_table = InMemoryTable.from_pydict(mapping=mapping)

to include the arrow_schema, if provided:

pa_table = InMemoryTable.from_pydict(mapping=mapping, schema=features.arrow_schema if features is not None else None)

But that leads to:

Traceback (most recent call last):
  File "/home/ampere/vanroy/datasets/scratch.py", line 33, in <module>
    ds = Dataset.from_dict(
         ^^^^^^^^^^^^^^^^^^
  File "/home/local/vanroy/datasets/src/datasets/arrow_dataset.py", line 957, in from_dict
    pa_table = InMemoryTable.from_pydict(mapping=mapping, schema=features.arrow_schema if features is not None else None)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/local/vanroy/datasets/src/datasets/table.py", line 758, in from_pydict
    return cls(pa.Table.from_pydict(*args, **kwargs))
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/table.pxi", line 1968, in pyarrow.lib._Tabular.from_pydict
  File "pyarrow/table.pxi", line 6354, in pyarrow.lib._from_pydict
  File "pyarrow/array.pxi", line 402, in pyarrow.lib.asarray
  File "pyarrow/array.pxi", line 252, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 114, in pyarrow.lib._handle_arrow_array_protocol
  File "/home/local/vanroy/datasets/src/datasets/arrow_writer.py", line 201, in __arrow_array__
    raise ValueError("TypedSequence is supposed to be used with pa.array(typed_sequence, type=None)")
ValueError: TypedSequence is supposed to be used with pa.array(typed_sequence, type=None)

and I am not familiar enough with pyarrow to solve this.

So ultimately I'm a bit at a loss here. I think that if we want to do this right, the automatic casting in __init__ should be removed in favor of handling the logic inside Dataset.from_*, by passing the schema explicitly to pa.Table.from_*(..., schema=schema). But I lack the pyarrow knowledge to go further than what I've written above.

@lhoestq
Member

lhoestq commented Mar 27, 2025

It's indeed a bit more work to support nullable fields since, in addition to your comments, the behavior is unclear when it comes to concatenating nullable with non-nullable data, and maybe also when handling non-nullable lists and nested data.

But yup, I agree that having the Dataset.from_* functions pass the schema to pa.Table.from_* would be the way to go.

Just one comment about this error:

ValueError: TypedSequence is supposed to be used with pa.array(typed_sequence, type=None)

This happens because Dataset.from_dict uses OptimizedTypedSequence by default, which should only be used if the user doesn't specify a schema (see the sketch below).
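A sketch of that direction (an assumption about the fix, not actual datasets code; the OptimizedTypedSequence usage is simplified):

from typing import Optional
from datasets import Features
from datasets.arrow_writer import OptimizedTypedSequence
from datasets.table import InMemoryTable

def from_pydict_with_schema(mapping: dict, features: Optional[Features] = None) -> InMemoryTable:
    if features is not None:
        # A schema was provided: build the table directly against it and skip
        # the OptimizedTypedSequence wrapping entirely.
        return InMemoryTable.from_pydict(mapping=mapping, schema=features.arrow_schema)
    # No schema: keep the existing optimized type inference.
    wrapped = {col: OptimizedTypedSequence(data, col=col) for col, data in mapping.items()}
    return InMemoryTable.from_pydict(mapping=wrapped)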

Linked issue: #7479 (Features.from_arrow_schema is destructive)