-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
Astype keeps nan when converting into string #28176
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
makbigc
commented
Aug 27, 2019
•
edited
Loading
edited
- closes astype(str) / astype_unicode: np.nan converted to "nan" (checknull, skipna) #25353
- 1 test added
- passes black pandas
- whatsnew entry
doc/source/whatsnew/v1.0.0.rst
Outdated
@@ -145,7 +145,7 @@ Indexing | |||
Missing | |||
^^^^^^^ | |||
|
|||
- | |||
- When converting into string, :meth:`Series.astype` will keep ``np.nan`` as missing value (:issue:`25353`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might want to move this to API breaking - I could foresee some people doing a "if x == 'nan'" after something like this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah I think this is ok to actually make this change, but let's make it more prominent
pandas/tests/series/test_dtypes.py
Outdated
@@ -147,7 +147,7 @@ def test_astype_datetime64tz(self): | |||
) | |||
def test_astype_str_map(self, dtype, series): | |||
# see gh-4405 | |||
result = series.astype(dtype) | |||
result = series.astype(dtype, skipna=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure about skipna
here as it has a slightly different meaning than the usual context (i.e. ignoring it as part of an aggregation).
Which method is this actually getting passed through to?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
astype
method doesn't have skipna
keyword. skipna
is passed in kwargs
.
Line 5764 in e0c63b4
def astype(self, dtype, copy=True, errors="raise", **kwargs): |
skipna
is pop out from kwargs
if any. If not, skipna=True
is set and passed to astype_nansafe
. (In astype_nansafe
, skipna=False
is default.)
pandas/pandas/core/dtypes/cast.py
Line 663 in e0c63b4
def astype_nansafe(arr, dtype, copy=True, skipna=False): |
So, the default behaviour after this change is to skip converting nan
into string.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we shouldn't be passing non-accepted keywords in kwargs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This likely needs a deprecation cycle.
We also need to consider whether skipna
is a valid keyword to .astype
and add it to the function signature, document it, etc.
For now, I think a specific keyword for controlling how np.nan
is converted with .astype(str)
make the most sense. I'm not sure what the name would be.
result = DataFrame([np.NaN]).astype(str) | ||
expected = DataFrame(["nan"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback do you recall, was the intent of this test that np.nan
be converted to the string 'nan'
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIRC wanted to match numpy
So just some thoughts - I wouldn't consider the astype -> "nan" a feature as much as a bug, so while potentially breaking I somewhat feel like a deprecation cycle for buggy behavior is overkill (though others may differ) w.r.t. the skipna argument I don't see why users would ever really want "nan" when |
Right, the skipna argument would only be necessary if we're deprecating.
…On Tue, Aug 27, 2019 at 10:54 AM William Ayd ***@***.***> wrote:
So just some thoughts - I wouldn't consider the astype -> "nan" a feature
as much as a bug, so while potentially breaking I somewhat feel like a
deprecation cycle for buggy behavior is overkill (though others may differ)
w.r.t. the skipna argument I don't see why users would ever really want
"nan" when np.nan is available so maybe not needed?
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
<#28176?email_source=notifications&email_token=AAKAOIVGVMT73RPUWCAU2JLQGVEZXA5CNFSM4IQFI3MKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD5IHBYA#issuecomment-525365472>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAKAOIVVNTOGMBZSQQBDPMLQGVEZXANCNFSM4IQFI3MA>
.
|
I'm +1 on having a deprecation cycle. I agree that this behavior feels buggy but it's actually consistent with numpy's behavior, and silently breaking consistency with numpy could be surprising to users. For example, using In [1]: import numpy as np; np.__version__
Out[1]: '1.16.4'
In [2]: a = np.array(['foo', 1, np.nan, 3.14], dtype=object)
In [3]: a
Out[3]: array(['foo', 1, nan, 3.14], dtype=object)
In [4]: a.astype(str)
Out[4]: array(['foo', '1', 'nan', '3.14'], dtype='<U4') Also note that when constructing In [5]: np.array(['foo', 1, np.nan, 3.14])
Out[5]: array(['foo', '1', 'nan', '3.14'], dtype='<U4') |
pandas/core/internals/blocks.py
Outdated
# _astype_nansafe works fine with 1-d only | ||
vals1d = values.ravel() | ||
values = astype_nansafe(vals1d, dtype, copy=True, **kwargs) | ||
values = astype_nansafe( | ||
vals1d, dtype, copy=True, skipna=skipna, **kwargs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure I understand why you changed this code, is this not already passed thru?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
astype
method doesn't have skipna
keyword. skipna
is passed in kwargs
.
Line 5764 in e0c63b4
def astype(self, dtype, copy=True, errors="raise", **kwargs): |
skipna
is pop out from kwargs
if any. If not, skipna=True
is set and passed to astype_nansafe
. (In astype_nansafe
, skipna=False
is default.)
pandas/pandas/core/dtypes/cast.py
Line 663 in e0c63b4
def astype_nansafe(arr, dtype, copy=True, skipna=False): |
So, the default behaviour after this change is to skip converting nan
into string.
result = DataFrame([np.NaN]).astype(str) | ||
expected = DataFrame(["nan"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIRC wanted to match numpy
doc/source/whatsnew/v1.0.0.rst
Outdated
@@ -145,7 +145,7 @@ Indexing | |||
Missing | |||
^^^^^^^ | |||
|
|||
- | |||
- When converting into string, :meth:`Series.astype` will keep ``np.nan`` as missing value (:issue:`25353`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah I think this is ok to actually make this change, but let's make it more prominent
63a8872
to
2cdd5b1
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need to revise the test where skipna
is passed to .astype()
which is not allowed. Can you create an issue about this (disallow kwargs in .astype).
pandas/tests/series/test_dtypes.py
Outdated
@@ -147,7 +147,7 @@ def test_astype_datetime64tz(self): | |||
) | |||
def test_astype_str_map(self, dtype, series): | |||
# see gh-4405 | |||
result = series.astype(dtype) | |||
result = series.astype(dtype, skipna=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we shouldn't be passing non-accepted keywords in kwargs.
doc/source/whatsnew/v1.0.0.rst
Outdated
|
||
When converting into string, :meth:`Series.astype` will not convert ``np.nan`` into string and keep it as missing value (:issue:`25353`) | ||
|
||
*pandas 0.25.x* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use Previous behavior
doc/source/whatsnew/v1.0.0.rst
Outdated
1 nan | ||
dtype: object | ||
|
||
*pandas 1.0.0* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use New behavior
doc/source/whatsnew/v1.0.0.rst
Outdated
String conversion of Series with nan | ||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
|
||
When converting into string, :meth:`Series.astype` will not convert ``np.nan`` into string and keep it as missing value (:issue:`25353`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
:meth:Series.astype(str)
previously would coerce a np.nan
to the string nan
. Now pandas will preserve the missing value indicator.
String conversion of Series with nan | ||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
|
||
:meth:Series.astype(str) previously would coerce a np.nan to the string nan. Now pandas will preserve the missing value indicator (:issue:`25353`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIUC, @jschendel and I both prefer a deprecation cycle. It looks like you're making a breaking change. Is that correct?
Just so you're all aware, this non-feature also appears to impact NoneType: >>> na = np.array(['foo','bar',5,np.nan,None])
>>> [type(x) for x in na]
[<class 'str'>, <class 'str'>, <class 'int'>, <class 'float'>, <class 'NoneType'>]
>>> da = pd.Series(na)
>>> [type(x) for x in da.astype('str')]
[<class 'str'>, <class 'str'>, <class 'str'>, <class 'str'>, <class 'str'>]
>>> [type(x) for x in da.astype('str',skipna=True)]
[<class 'str'>, <class 'str'>, <class 'str'>, <class 'str'>, <class 'str'>] |
1820165
to
55d1cf7
Compare
Loosely thought about this after reading through #27949 but do we care to differentiate between the following like numpy? >>> np.array([1, np.nan, 3]).astype(object)
array([1.0, nan, 3.0], dtype=object)
>>> np.array([1, np.nan, 3]).astype(str)
array(['1.0', 'nan', '3.0'], dtype='<U32') Coercing to an object in NumPy would maintain the actual NA value, but to a string would write as |
I think that the |
Linting test failed (i.e., black formatting). But I can pass the black formatting check in my own machine. Where did I do wrong? |
@makbigc might want to give a pre-commit hook a try: https://pandas.pydata.org/pandas-docs/stable/development/contributing.html#python-pep8-black In any case can you merge master and re-push? |
@jreback Anything else? Please tell me. |
can you merge master and will look again |
The Lines 55 to 59 in 8806ed7
A lot of tests were failed by the FutureWarning. What should I do? |
@makbigc I think need to address #28176 (review) first; the problem now is that the warning is getting thrown all over the place because If you can adjust design per comment can re-evaluate issues from there |
@WillAyd Thanks for your reply. The skipna parameter has been removed. The skipna is set True by default in The tests are failed by the FutureWarning. #28176 (comment) |
@jreback The tests were failed by the issue of FutureWarning. #28176 (comment) Any resolution? |
@jreback Any feedback? Thanks. |
It seems this didn't make it into pandas 1.0 but also |
If we are going to deprecate this, I think we need a keyword to opt in to the future behaviour, so you have the possibility to silence the warning. And if we are adding a keyword, it seems to make sense to use |
Where are we at here? Seems like the preference is for
Is that right? If so, @makbigc are you interested in implementing that? |
some time has passed since this PR was opened and pandas 1.0 has since been released.
So, for me, a deprecation cycle on old (incorrect) behaviour and keep the skipna keyword (but deprecate it) is probably now a requirement to keep a stable api through to pandas 2.0. This approach addresses both the regression in #31708 and the unexpected behaviour of #25353 |
@makbigc can you merge master to resolve conflicts, fix-up failing tests and address #28176 (comment) |
@makbigc closing as stale. ping if you want to continue. |
Hello, @simonjayhawkins, @TomAugspurger I would like to continue this PR as it seems it would solve multiple issues, I just want to declare everything that must be done before and to ensure that this new behaviour is still preferred.
Open Issues related
Unmerged PRs related |