Add a `fill_nan` method to dataframe and column #167
Conversation
looks good (barring docs build error)
Addresses half of data-apisgh-142 (`fill_null` is more complex, and not included here).
It's green now. I had to do the […]
looks good to me
```diff
@@ -456,3 +456,17 @@ def unique_indices(self, *, skip_nulls: bool = True) -> Column[int]:
         To get the unique values, you can do ``col.get_rows(col.unique_indices())``.
         """
         ...
+
+    def fill_nan(self, value: float | 'null', /):
```
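For orientation, a rough sketch of how a consumer might call this, assuming the draft API's `column_from_sequence` constructor and the `null` sentinel from gh-157 (exact entry points are not fixed by this PR):

```python
# Hypothetical usage sketch -- names follow the draft dataframe API standard.
col = column_from_sequence([1.0, float('nan'), 3.0], dtype='float64')

filled = col.fill_nan(0.0)   # NaN replaced by the float 0.0
nulled = col.fill_nan(null)  # NaN replaced by the missing-value sentinel
```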
A bit unrelated to this PR, but having `null` be typed differently feels like an anti-pattern here. It differentiates between a float scalar (which is implicitly nullable based on our current scalar definition) and a null scalar.
> float scalar (which is implicitly nullable based on our current scalar definition)

We don't have numpy-style scalars (i.e., instances of a dtype) though? That's why we need a separate `null` object, so that one can construct a column containing nulls with `column_from_sequence([1.5, 2.5, null, 4.5])`.

We could add dtype instances and specify that `null` derives from `float`, but that seems like a huge can of worms for no gain at all. And I think the consensus from what we learned from numpy is that array scalars were a major design mistake.
nitpick: we can construct a column with `column_from_sequence([1.5, 2.5, null, 4.5], dtype='float64')` (just pointing this out because it's early days, and I wouldn't want someone to see this and get confused)
> And I think the consensus from what we learned from numpy is that array scalars were a major design mistake.

Yes, but I think the thing that was agreed as the correct path forward was 0d arrays, which we don't have on the DataFrame side. Those 0d arrays are strongly typed and don't have to deal with nulls.

The issue that I see is that someone could do something like:

```python
my_int_column = column_from_sequence([1, 2, None, 4], dtype='int32')

# Yields a ducktyped `null` scalar that is int32 typed. Is this `int` type
# or `null` type from a typing perspective?
max_my_int_column = my_int_column.max(skip_nulls=False)

# Does this work if the max is `null`?
my_float_column = column_from_sequence([1.5, 2.5, max_my_int_column, 4.5], dtype='float64')
```

For example, PyArrow handles this by having an explicit `NULL` type (https://arrow.apache.org/docs/python/generated/pyarrow.null.html#pyarrow.null) and presumably has its underlying APIs and compute explicitly handle mixing `NULL`-typed scalars / columns with other typed scalars / columns.

Maybe we just need an explicit `NULL` type and then `'null'` here refers to a scalar of type `NULL`?
> Maybe we just need an explicit `NULL` type

We do have exactly that already: docs for `null`. The only reason the type annotation is `'null'` rather than `null` is to avoid some circular import and Sphinx weirdness.
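For context, quoting a name in an annotation is the standard forward-reference trick; a minimal sketch of the pattern being described, with a hypothetical import path:

```python
from __future__ import annotations
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Only imported while type checking, so there is no circular
    # import at runtime and Sphinx never has to resolve it eagerly.
    from dataframe_api import null  # hypothetical module path

class Column:
    def fill_nan(self, value: float | 'null', /) -> 'Column':
        ...
```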
That is an object for a `null` scalar as opposed to a `NULL` data type, i.e. allowing a column to be typed `NULL`, where extracting a null-valued scalar from that column has type `NULL`, versus extracting a null-valued scalar from a float64 column, which has type `float64`.

It feels counter-intuitive that Columns are type-erased (i.e. just a `Column` class and no `Int32Column`, `Float32Column`, etc.) but the scalars that are contained within Columns are not.
Either way, this should go into a new issue instead of this PR. Just the typing felt a bit funky to me here.
I'll open a new issue for discussion and approve this.
Thanks, a new issue sounds good for this. I had not thought before about a need for a null dtype; if there is one we should indeed consider it.
> For example, PyArrow handles this by having an explicit `NULL` type

Small clarification here: while pyarrow indeed has a "null" data type, we also have type-specific null scalars for each data type. And so in your specific example, the `max_my_int_column` would actually be an int32 scalar (with the value of "null"), and not a scalar of the null data type.
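A minimal demonstration of that distinction in pyarrow itself (not part of the original thread; assumes a reasonably recent pyarrow):

```python
import pyarrow as pa
import pyarrow.compute as pc

# A column with a missing value keeps its concrete data type.
arr = pa.array([1, 2, None, 4], type=pa.int32())
print(arr.type)  # int32

# Aggregating without skipping nulls yields a *typed* null scalar:
# an Int32Scalar whose value is None, not a null-typed scalar.
print(pc.max(arr, skip_nulls=False))  # <pyarrow.Int32Scalar: None>

# The dedicated null data type is a separate thing entirely.
print(pa.array([None, None]).type)  # null
```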
all good, thanks - @kkraus14 any further comments or good to go?
This now has three approvals, so I'll get it in. Thanks all!
Follow-up to data-apisgh-167, which added `fill_nan`, and closes data-apisgh-142.
Addresses half of gh-142 (`fill_null` is more complex, and not included here).

Note: this is reviewable now, but should be merged after gh-157, which introduces the `null` object.
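For readers skimming the thread: `fill_nan` targets the floating-point NaN value, while the follow-up `fill_null` targets the `null` missing-value sentinel. A hedged sketch of the intended difference, reusing the hypothetical constructor from the discussion above:

```python
# Hypothetical -- `fill_null` is specified in the follow-up, not this PR.
col = column_from_sequence([1.0, float('nan'), null, 4.0], dtype='float64')

col.fill_nan(0.0)   # replaces only the NaN at position 1; the null remains
col.fill_null(0.0)  # replaces only the null at position 2; the NaN remains
```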