Add `null` object, and update top-level API specification #157

rgommers · 2023-04-27T21:57:51Z

Also includes __dataframe_api_version__, which was already present, into the API specification.

The table of top-level objects/functions will remain pretty sparse, but there's probably a few more things that need adding. concat for one, see gh-137.

I'll note that I briefly considered whether this should be spelled NULL or Null, but went with null because that seemed most natural (e.g., DataFrame.from_sequence([1, 2, null, np.nan]).

kkraus14 · 2023-04-28T01:20:44Z

spec/API_specification/dataframe_api/__init__.py

+class null:
+    """
+    A `null` singleton object to represent missing data.
+
+    ``null`` may be used when constructing a `Column` from a Python sequence.
+    It supports ``is``, and does not support ``==`` and ``bool``.


I think this could use some design discussion. Some initial questions that come to mind:

We don't currently have a NULL type, would this object then be untyped?

PyArrow's singleton for example has a NULL type

Pandas's singleton does not have a type as far as I can tell

Are there any APIs that take a scalar where not having a type here would be ambiguous?

If someone does something like calculate the max of a column with nulls in it with skip_nulls=False, would it be expected to return this singleton?

That would lose the typing information on that null scalar

That would also require synchronizing in order to introspect if the result is null

We don't currently have a NULL type, would this object then be untyped?

I'm not sure that this is any different from other classes, like DataFrame. They have a name, but no type beyond that. To type this sensible in a way that helps downstream libraries interact with multiple implementors, it'd need a typing.Protocol, an ABC, or some such thing.

Personally I am happy to punt on typing, and only require here that null exists under that name and behaves as specified.

If someone does something like calculate the max of a column with nulls in it with skip_nulls=False, would it be expected to return this singleton?

Yes, I think so. I don't think there's anything else we could return, right?

That would also require synchronizing in order to introspect if the result is null

Perhaps not - I think this is the same as methods that return float. You can duck type scalars to keep them on the GPU.

Personally I am happy to punt on typing, and only require here that null exists under that name and behaves as specified.

I think Keith was not talking about "typing" in the sense of python type annotations, but rather of data types (dtype). For example, if you do Column.from_sequence([null]), what would be the type of that Column?
(the answer here might also be: you should specify the dtype instead of letting it infer)

Yes, I think so. I don't think there's anything else we could return, right?

A specific implementation could return their own Scalar object of the specific type (which can be a null scalar), instead of this generic null object

the answer here might also be: you should specify the dtype instead of letting it infer

yes, I think that's the answer. and dtype is mandatory in from_sequence for that reason. we can document "no dtype inference anywhere" as a design rule separately?

A specific implementation could return their own Scalar object of the specific type (which can be a null scalar), instead of this generic null object

Agreed, and that's always allowed for any scalar type (see https://data-apis.org/dataframe-api/draft/design_topics/python_builtin_types.html). I meant semantically there's nothing else one can return. A duck-typed null is still null.

The docstring here implied using is to check if a scalar is null which would imply that this is a singleton and if we wanted a ducktyped null scalar to be a singleton it would require synchronization.

That's a good point, let me think about how to rephrase this so that duck typing works, and explicitly reference the docs page on that topic.

I rephrased this, forbidding is null and explicitly addressing that null may be duck typed so it can live on device.

MarcoGorelli

looks fine to me

kkraus14 · 2023-05-10T21:45:12Z

spec/API_specification/dataframe_api/__init__.py

+        For ``_eq__``: a missing value must not be compared for equality
+        directly. Instead, use `DataFrame.isnull` or `Column.isnull` to check
+        for presence of missing values.


How can one check if a scalar is null reliably then? Do we need a namespace isnull method?

I.E. someone runs my_column.sum(skip_null=False) and it yields null as the result.

Good question, I see a few options:

Use is. Runs into the "that does not allow duck typing" problem

Use ==. This may also not work when duck typing, or at least not desirable - null_duck_on device == null is not going to do what you want if null lives on host, and also it may cause folks to do column == null which probably works if one implements __eq__

Add a free isnull function. Better perhaps, only cost is a new object in the namespace.

Add a comparison method, e.g. null.is_equal(val). Doesn't read all that well.

I don't think the first two options are viable: is doesn't work for type specific nulls (and I think we shouldn't disallow that if an implementation uses that?), and == shouldn't give the behaviour you want here (it shouldn't necessarily return True, just like nan == nan also doesn't give true)

A free is_null function might be the easiest? (a method on the generic namespace.null method probably works as well, but a top-level function feels a bit more natural to me)

Added a free isnull function which checks for a scalar null. That should address this issue.

MarcoGorelli

couple of minor (non-blocking) comments, rest looks good

MarcoGorelli · 2023-05-18T13:24:37Z

spec/API_specification/dataframe_api/__init__.py

+    """
+    ...
+
+def isnull(value: Any, /) -> bool:


should we use : object instead? Any turns off the typechecker, and if we're saying that any input type is valid, then we probably want to refrain from using any type-specific methods on value?

can always address this later though - I'd say let's just get this in now

sure, let's do that now. and you made that review comment on another PR before, sorry I forgot about it. will update in a minute.

and if we're saying that any input type is valid, then we probably want to refrain from using any type-specific methods on value?

Sure. There's nothing saying that those exist or must be used, right? This seems like a detail that an implementer has to get right.

yeah it's just easier for the implementer to get it right if the type checker flags to them if they're using something which all python objects don't have

spec/API_specification/dataframe_api/__init__.py

rgommers added the API design label Apr 27, 2023

rgommers mentioned this pull request Apr 27, 2023

Add design topic page on use of Python builtin types #153

Merged

kkraus14 reviewed Apr 28, 2023

View reviewed changes

rgommers mentioned this pull request Apr 28, 2023

Add a way to retrieve the standard-compliant namespace, fill "how to adopt" and "future evolution" #156

Merged

MarcoGorelli approved these changes May 2, 2023

View reviewed changes

rgommers force-pushed the add-null branch from e8c0f2f to 9b71c27 Compare May 10, 2023 18:04

rgommers mentioned this pull request May 10, 2023

Add a fill_nan method to dataframe and column #167

Merged

kkraus14 reviewed May 10, 2023

View reviewed changes

rgommers added 3 commits May 17, 2023 13:06

Add null object, and update top-level API specification

7676a90

Address review comments on singleton and duck typing null

3489e31

Add a free isnull function to the top-level namespace

55d53b6

rgommers force-pushed the add-null branch from 9b71c27 to 55d53b6 Compare May 17, 2023 11:14

kkraus14 approved these changes May 18, 2023

View reviewed changes

MarcoGorelli approved these changes May 18, 2023

View reviewed changes

MarcoGorelli and others added 2 commits May 18, 2023 14:25

fix _eq__ typo

ea276ca

Change isnull type annotation to object

884c67b

MarcoGorelli merged commit 15eda61 into data-apis:main May 18, 2023

rgommers deleted the add-null branch May 18, 2023 13:51

Add null object, and update top-level API specification #157

Add null object, and update top-level API specification #157

Uh oh!

Conversation

rgommers commented Apr 27, 2023

Uh oh!

kkraus14 Apr 28, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

MarcoGorelli left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

MarcoGorelli left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Add `null` object, and update top-level API specification #157

Add `null` object, and update top-level API specification #157

kkraus14 Apr 28, 2023 •

edited

Loading