-
Notifications
You must be signed in to change notification settings - Fork 21
Add null
object, and update top-level API specification
#157
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
class null: | ||
""" | ||
A `null` singleton object to represent missing data. | ||
|
||
``null`` may be used when constructing a `Column` from a Python sequence. | ||
It supports ``is``, and does not support ``==`` and ``bool``. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this could use some design discussion. Some initial questions that come to mind:
- We don't currently have a
NULL
type, would this object then be untyped?- PyArrow's singleton for example has a
NULL
type - Pandas's singleton does not have a type as far as I can tell
- Are there any APIs that take a scalar where not having a type here would be ambiguous?
- PyArrow's singleton for example has a
- If someone does something like calculate the max of a column with nulls in it with
skip_nulls=False
, would it be expected to return this singleton?- That would lose the typing information on that null scalar
- That would also require synchronizing in order to introspect if the result is null
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't currently have a
NULL
type, would this object then be untyped?
I'm not sure that this is any different from other classes, like DataFrame
. They have a name, but no type beyond that. To type this sensible in a way that helps downstream libraries interact with multiple implementors, it'd need a typing.Protocol
, an ABC, or some such thing.
Personally I am happy to punt on typing, and only require here that null
exists under that name and behaves as specified.
If someone does something like calculate the max of a column with nulls in it with
skip_nulls=False
, would it be expected to return this singleton?
Yes, I think so. I don't think there's anything else we could return, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would also require synchronizing in order to introspect if the result is null
Perhaps not - I think this is the same as methods that return float
. You can duck type scalars to keep them on the GPU.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Personally I am happy to punt on typing, and only require here that null exists under that name and behaves as specified.
I think Keith was not talking about "typing" in the sense of python type annotations, but rather of data types (dtype). For example, if you do Column.from_sequence([null])
, what would be the type of that Column?
(the answer here might also be: you should specify the dtype
instead of letting it infer)
Yes, I think so. I don't think there's anything else we could return, right?
A specific implementation could return their own Scalar object of the specific type (which can be a null scalar), instead of this generic null object
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the answer here might also be: you should specify the dtype instead of letting it infer
yes, I think that's the answer. and dtype
is mandatory in from_sequence
for that reason. we can document "no dtype inference anywhere" as a design rule separately?
A specific implementation could return their own Scalar object of the specific type (which can be a null scalar), instead of this generic null object
Agreed, and that's always allowed for any scalar type (see https://data-apis.org/dataframe-api/draft/design_topics/python_builtin_types.html). I meant semantically there's nothing else one can return. A duck-typed null
is still null
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The docstring here implied using is
to check if a scalar is null which would imply that this is a singleton and if we wanted a ducktyped null scalar to be a singleton it would require synchronization.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good point, let me think about how to rephrase this so that duck typing works, and explicitly reference the docs page on that topic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I rephrased this, forbidding is null
and explicitly addressing that null
may be duck typed so it can live on device.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks fine to me
For ``_eq__``: a missing value must not be compared for equality | ||
directly. Instead, use `DataFrame.isnull` or `Column.isnull` to check | ||
for presence of missing values. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How can one check if a scalar is null reliably then? Do we need a namespace isnull
method?
I.E. someone runs my_column.sum(skip_null=False)
and it yields null
as the result.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question, I see a few options:
- Use
is
. Runs into the "that does not allow duck typing" problem - Use
==
. This may also not work when duck typing, or at least not desirable -null_duck_on device == null
is not going to do what you want ifnull
lives on host, and also it may cause folks to docolumn == null
which probably works if one implements__eq__
- Add a free
isnull
function. Better perhaps, only cost is a new object in the namespace. - Add a comparison method, e.g.
null.is_equal(val)
. Doesn't read all that well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think the first two options are viable: is
doesn't work for type specific nulls (and I think we shouldn't disallow that if an implementation uses that?), and ==
shouldn't give the behaviour you want here (it shouldn't necessarily return True, just like nan == nan also doesn't give true)
A free is_null
function might be the easiest? (a method on the generic namespace.null method probably works as well, but a top-level function feels a bit more natural to me)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a free isnull
function which checks for a scalar null
. That should address this issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
couple of minor (non-blocking) comments, rest looks good
""" | ||
... | ||
|
||
def isnull(value: Any, /) -> bool: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we use : object
instead? Any
turns off the typechecker, and if we're saying that any input type is valid, then we probably want to refrain from using any type-specific methods on value
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can always address this later though - I'd say let's just get this in now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure, let's do that now. and you made that review comment on another PR before, sorry I forgot about it. will update in a minute.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and if we're saying that any input type is valid, then we probably want to refrain from using any type-specific methods on
value
?
Sure. There's nothing saying that those exist or must be used, right? This seems like a detail that an implementer has to get right.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah it's just easier for the implementer to get it right if the type checker flags to them if they're using something which all python objects don't have
Also includes
__dataframe_api_version__
, which was already present, into the API specification.The table of top-level objects/functions will remain pretty sparse, but there's probably a few more things that need adding.
concat
for one, see gh-137.I'll note that I briefly considered whether this should be spelled
NULL
orNull
, but went withnull
because that seemed most natural (e.g.,DataFrame.from_sequence([1, 2, null, np.nan]
).