Skip to content

Add null object, and update top-level API specification #157

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
May 18, 2023

Conversation

rgommers
Copy link
Member

Also includes __dataframe_api_version__, which was already present, into the API specification.

The table of top-level objects/functions will remain pretty sparse, but there's probably a few more things that need adding. concat for one, see gh-137.

I'll note that I briefly considered whether this should be spelled NULL or Null, but went with null because that seemed most natural (e.g., DataFrame.from_sequence([1, 2, null, np.nan]).

Comment on lines 18 to 23
class null:
"""
A `null` singleton object to represent missing data.

``null`` may be used when constructing a `Column` from a Python sequence.
It supports ``is``, and does not support ``==`` and ``bool``.
Copy link
Collaborator

@kkraus14 kkraus14 Apr 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this could use some design discussion. Some initial questions that come to mind:

  • We don't currently have a NULL type, would this object then be untyped?
    • PyArrow's singleton for example has a NULL type
    • Pandas's singleton does not have a type as far as I can tell
    • Are there any APIs that take a scalar where not having a type here would be ambiguous?
  • If someone does something like calculate the max of a column with nulls in it with skip_nulls=False, would it be expected to return this singleton?
    • That would lose the typing information on that null scalar
    • That would also require synchronizing in order to introspect if the result is null

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't currently have a NULL type, would this object then be untyped?

I'm not sure that this is any different from other classes, like DataFrame. They have a name, but no type beyond that. To type this sensible in a way that helps downstream libraries interact with multiple implementors, it'd need a typing.Protocol, an ABC, or some such thing.

Personally I am happy to punt on typing, and only require here that null exists under that name and behaves as specified.

If someone does something like calculate the max of a column with nulls in it with skip_nulls=False, would it be expected to return this singleton?

Yes, I think so. I don't think there's anything else we could return, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would also require synchronizing in order to introspect if the result is null

Perhaps not - I think this is the same as methods that return float. You can duck type scalars to keep them on the GPU.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally I am happy to punt on typing, and only require here that null exists under that name and behaves as specified.

I think Keith was not talking about "typing" in the sense of python type annotations, but rather of data types (dtype). For example, if you do Column.from_sequence([null]), what would be the type of that Column?
(the answer here might also be: you should specify the dtype instead of letting it infer)

Yes, I think so. I don't think there's anything else we could return, right?

A specific implementation could return their own Scalar object of the specific type (which can be a null scalar), instead of this generic null object

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the answer here might also be: you should specify the dtype instead of letting it infer

yes, I think that's the answer. and dtype is mandatory in from_sequence for that reason. we can document "no dtype inference anywhere" as a design rule separately?

A specific implementation could return their own Scalar object of the specific type (which can be a null scalar), instead of this generic null object

Agreed, and that's always allowed for any scalar type (see https://data-apis.org/dataframe-api/draft/design_topics/python_builtin_types.html). I meant semantically there's nothing else one can return. A duck-typed null is still null.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docstring here implied using is to check if a scalar is null which would imply that this is a singleton and if we wanted a ducktyped null scalar to be a singleton it would require synchronization.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point, let me think about how to rephrase this so that duck typing works, and explicitly reference the docs page on that topic.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I rephrased this, forbidding is null and explicitly addressing that null may be duck typed so it can live on device.

Copy link
Contributor

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks fine to me

Comment on lines 31 to 93
For ``_eq__``: a missing value must not be compared for equality
directly. Instead, use `DataFrame.isnull` or `Column.isnull` to check
for presence of missing values.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How can one check if a scalar is null reliably then? Do we need a namespace isnull method?

I.E. someone runs my_column.sum(skip_null=False) and it yields null as the result.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question, I see a few options:

  • Use is. Runs into the "that does not allow duck typing" problem
  • Use ==. This may also not work when duck typing, or at least not desirable - null_duck_on device == null is not going to do what you want if null lives on host, and also it may cause folks to do column == null which probably works if one implements __eq__
  • Add a free isnull function. Better perhaps, only cost is a new object in the namespace.
  • Add a comparison method, e.g. null.is_equal(val). Doesn't read all that well.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the first two options are viable: is doesn't work for type specific nulls (and I think we shouldn't disallow that if an implementation uses that?), and == shouldn't give the behaviour you want here (it shouldn't necessarily return True, just like nan == nan also doesn't give true)

A free is_null function might be the easiest? (a method on the generic namespace.null method probably works as well, but a top-level function feels a bit more natural to me)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a free isnull function which checks for a scalar null. That should address this issue.

Copy link
Contributor

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

couple of minor (non-blocking) comments, rest looks good

"""
...

def isnull(value: Any, /) -> bool:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we use : object instead? Any turns off the typechecker, and if we're saying that any input type is valid, then we probably want to refrain from using any type-specific methods on value?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can always address this later though - I'd say let's just get this in now

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, let's do that now. and you made that review comment on another PR before, sorry I forgot about it. will update in a minute.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and if we're saying that any input type is valid, then we probably want to refrain from using any type-specific methods on value?

Sure. There's nothing saying that those exist or must be used, right? This seems like a detail that an implementer has to get right.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah it's just easier for the implementer to get it right if the type checker flags to them if they're using something which all python objects don't have

@MarcoGorelli MarcoGorelli merged commit 15eda61 into data-apis:main May 18, 2023
@rgommers rgommers deleted the add-null branch May 18, 2023 13:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants