Skip to content

Astype keeps nan when converting into string #28176

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 29 commits into from

Conversation

makbigc
Copy link
Contributor

@makbigc makbigc commented Aug 27, 2019

@@ -145,7 +145,7 @@ Indexing
Missing
^^^^^^^

-
- When converting into string, :meth:`Series.astype` will keep ``np.nan`` as missing value (:issue:`25353`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might want to move this to API breaking - I could foresee some people doing a "if x == 'nan'" after something like this

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah I think this is ok to actually make this change, but let's make it more prominent

@@ -147,7 +147,7 @@ def test_astype_datetime64tz(self):
)
def test_astype_str_map(self, dtype, series):
# see gh-4405
result = series.astype(dtype)
result = series.astype(dtype, skipna=False)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure about skipna here as it has a slightly different meaning than the usual context (i.e. ignoring it as part of an aggregation).

Which method is this actually getting passed through to?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

astype method doesn't have skipna keyword. skipna is passed in kwargs.

def astype(self, dtype, copy=True, errors="raise", **kwargs):

skipna is pop out from kwargs if any. If not, skipna=True is set and passed to astype_nansafe. (In astype_nansafe, skipna=False is default.)

def astype_nansafe(arr, dtype, copy=True, skipna=False):

So, the default behaviour after this change is to skip converting nan into string.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we shouldn't be passing non-accepted keywords in kwargs.

@WillAyd WillAyd added the Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate label Aug 27, 2019
Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This likely needs a deprecation cycle.

We also need to consider whether skipna is a valid keyword to .astype and add it to the function signature, document it, etc.

For now, I think a specific keyword for controlling how np.nan is converted with .astype(str) make the most sense. I'm not sure what the name would be.

result = DataFrame([np.NaN]).astype(str)
expected = DataFrame(["nan"])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback do you recall, was the intent of this test that np.nan be converted to the string 'nan'?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC wanted to match numpy

@WillAyd
Copy link
Member

WillAyd commented Aug 27, 2019

So just some thoughts - I wouldn't consider the astype -> "nan" a feature as much as a bug, so while potentially breaking I somewhat feel like a deprecation cycle for buggy behavior is overkill (though others may differ)

w.r.t. the skipna argument I don't see why users would ever really want "nan" when np.nan is available so maybe not needed?

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Aug 27, 2019 via email

@jschendel
Copy link
Member

I wouldn't consider the astype -> "nan" a feature as much as a bug, so while potentially breaking I somewhat feel like a deprecation cycle for buggy behavior is overkill

I'm +1 on having a deprecation cycle. I agree that this behavior feels buggy but it's actually consistent with numpy's behavior, and silently breaking consistency with numpy could be surprising to users.

For example, using astype(str) on an object dtype array converts np.nan -->'nan':

In [1]: import numpy as np; np.__version__
Out[1]: '1.16.4'

In [2]: a = np.array(['foo', 1, np.nan, 3.14], dtype=object)

In [3]: a
Out[3]: array(['foo', 1, nan, 3.14], dtype=object)

In [4]: a.astype(str)
Out[4]: array(['foo', '1', 'nan', '3.14'], dtype='<U4')

Also note that when constructing a as above, if dtype=object isn't explicitly specified the string conversion will silently occur:

In [5]: np.array(['foo', 1, np.nan, 3.14])
Out[5]: array(['foo', '1', 'nan', '3.14'], dtype='<U4')

# _astype_nansafe works fine with 1-d only
vals1d = values.ravel()
values = astype_nansafe(vals1d, dtype, copy=True, **kwargs)
values = astype_nansafe(
vals1d, dtype, copy=True, skipna=skipna, **kwargs
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure I understand why you changed this code, is this not already passed thru?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

astype method doesn't have skipna keyword. skipna is passed in kwargs.

def astype(self, dtype, copy=True, errors="raise", **kwargs):

skipna is pop out from kwargs if any. If not, skipna=True is set and passed to astype_nansafe. (In astype_nansafe, skipna=False is default.)

def astype_nansafe(arr, dtype, copy=True, skipna=False):

So, the default behaviour after this change is to skip converting nan into string.

result = DataFrame([np.NaN]).astype(str)
expected = DataFrame(["nan"])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC wanted to match numpy

@@ -145,7 +145,7 @@ Indexing
Missing
^^^^^^^

-
- When converting into string, :meth:`Series.astype` will keep ``np.nan`` as missing value (:issue:`25353`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah I think this is ok to actually make this change, but let's make it more prominent

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need to revise the test where skipna is passed to .astype() which is not allowed. Can you create an issue about this (disallow kwargs in .astype).

@@ -147,7 +147,7 @@ def test_astype_datetime64tz(self):
)
def test_astype_str_map(self, dtype, series):
# see gh-4405
result = series.astype(dtype)
result = series.astype(dtype, skipna=False)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we shouldn't be passing non-accepted keywords in kwargs.


When converting into string, :meth:`Series.astype` will not convert ``np.nan`` into string and keep it as missing value (:issue:`25353`)

*pandas 0.25.x*
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use Previous behavior

1 nan
dtype: object

*pandas 1.0.0*
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use New behavior

String conversion of Series with nan
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When converting into string, :meth:`Series.astype` will not convert ``np.nan`` into string and keep it as missing value (:issue:`25353`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:meth:Series.astype(str) previously would coerce a np.nan to the string nan. Now pandas will preserve the missing value indicator.

String conversion of Series with nan
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

:meth:Series.astype(str) previously would coerce a np.nan to the string nan. Now pandas will preserve the missing value indicator (:issue:`25353`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC, @jschendel and I both prefer a deprecation cycle. It looks like you're making a breaking change. Is that correct?

@rendorHaevyn
Copy link

rendorHaevyn commented Sep 17, 2019

Just so you're all aware, this non-feature also appears to impact NoneType:

>>> na = np.array(['foo','bar',5,np.nan,None])
>>> [type(x) for x in na]
[<class 'str'>, <class 'str'>, <class 'int'>, <class 'float'>, <class 'NoneType'>]

>>> da = pd.Series(na)
>>> [type(x) for x in da.astype('str')]
[<class 'str'>, <class 'str'>, <class 'str'>, <class 'str'>, <class 'str'>]
>>> [type(x) for x in da.astype('str',skipna=True)]
[<class 'str'>, <class 'str'>, <class 'str'>, <class 'str'>, <class 'str'>]

@WillAyd
Copy link
Member

WillAyd commented Sep 19, 2019

Loosely thought about this after reading through #27949 but do we care to differentiate between the following like numpy?

>>> np.array([1, np.nan, 3]).astype(object)
array([1.0, nan, 3.0], dtype=object)

>>> np.array([1, np.nan, 3]).astype(str)
array(['1.0', 'nan', '3.0'], dtype='<U32')

Coercing to an object in NumPy would maintain the actual NA value, but to a string would write as 'nan'. I suppose this makes more sense in that world where object and string types are different, but I'm just wondering what kind of churn we would be causing if we actually did want astype(str) to stringify nan with a proper StringArray in place

@TomAugspurger
Copy link
Contributor

I think that the np.nan -> 'nan' behavior is surprising to most people the first time they see it. So by default, I would want to exclude NA values from being astyped (after a deprecation cycle).

@pep8speaks
Copy link

pep8speaks commented Oct 6, 2019

Hello @makbigc! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-01-04 06:25:32 UTC

@makbigc
Copy link
Contributor Author

makbigc commented Oct 7, 2019

Linting test failed (i.e., black formatting). But I can pass the black formatting check in my own machine. Where did I do wrong?

@WillAyd
Copy link
Member

WillAyd commented Nov 7, 2019

@makbigc might want to give a pre-commit hook a try:

https://pandas.pydata.org/pandas-docs/stable/development/contributing.html#python-pep8-black

In any case can you merge master and re-push?

@makbigc
Copy link
Contributor Author

makbigc commented Dec 16, 2019

@jreback Anything else? Please tell me.

@jreback
Copy link
Contributor

jreback commented Dec 27, 2019

can you merge master and will look again

@makbigc
Copy link
Contributor Author

makbigc commented Jan 2, 2020

The -W error is set for pytest command in azure-37-numpydev.

py37_np_dev:
ENV_FILE: ci/deps/azure-37-numpydev.yaml
CONDA_PY: "37"
PATTERN: "not slow and not network"
TEST_ARGS: "-W error"

A lot of tests were failed by the FutureWarning. What should I do?

@WillAyd
Copy link
Member

WillAyd commented Jan 3, 2020

@makbigc I think need to address #28176 (review) first; the problem now is that the warning is getting thrown all over the place because skipna was added to astype

If you can adjust design per comment can re-evaluate issues from there

@makbigc
Copy link
Contributor Author

makbigc commented Jan 4, 2020

@WillAyd Thanks for your reply. The skipna parameter has been removed. The skipna is set True by default in astype_nansafe.

The tests are failed by the FutureWarning. #28176 (comment)

@makbigc
Copy link
Contributor Author

makbigc commented Jan 11, 2020

@jreback The tests were failed by the issue of FutureWarning. #28176 (comment)

Any resolution?

@makbigc
Copy link
Contributor Author

makbigc commented Jan 18, 2020

@jreback Any feedback? Thanks.

@languitar
Copy link

It seems this didn't make it into pandas 1.0 but also skipna has been removed from astype. What is the intended way of casting to string with NaNs in pandas 1.0?

@jorisvandenbossche
Copy link
Member

If we are going to deprecate this, I think we need a keyword to opt in to the future behaviour, so you have the possibility to silence the warning.

And if we are adding a keyword, it seems to make sense to use skipna for that, as that could (at least in 0.25) already be used for this purpose (although undocumented)

@TomAugspurger
Copy link
Contributor

Where are we at here? Seems like the preference is for

>>> s = pd.Series([1, np.nan])
>>> type(s.astype(str)[1])
UserWarning("Converting NaN to string 'nan'. In the future NaN will remain as NaN. Specify skipna=True to silence this warning and adopt the new behavior, or skipna=False to restore the old behavior").
str

>>> type(s.astype(str, skipna=False)[1])  # no warning
str

>>> type(s.astype(str, skipna=True)[1])  # no warning
float

Is that right? If so, @makbigc are you interested in implementing that?

@simonjayhawkins
Copy link
Member

some time has passed since this PR was opened and pandas 1.0 has since been released.

If we are going to deprecate this, I think we need a keyword to opt in to the future behaviour, so you have the possibility to silence the warning.

And if we are adding a keyword, it seems to make sense to use skipna for that, as that could (at least in 0.25) already be used for this purpose (although undocumented)

So, for me, a deprecation cycle on old (incorrect) behaviour and keep the skipna keyword (but deprecate it) is probably now a requirement to keep a stable api through to pandas 2.0.

This approach addresses both the regression in #31708 and the unexpected behaviour of #25353

@simonjayhawkins
Copy link
Member

@makbigc can you merge master to resolve conflicts, fix-up failing tests and address #28176 (comment)

@simonjayhawkins
Copy link
Member

@makbigc closing as stale. ping if you want to continue.

@NumberPiOso
Copy link
Contributor

NumberPiOso commented Feb 1, 2022

Hello, @simonjayhawkins, @TomAugspurger

I would like to continue this PR as it seems it would solve multiple issues, I just want to declare everything that must be done before and to ensure that this new behaviour is still preferred.

Open Issues related

Unmerged PRs related

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

astype(str) / astype_unicode: np.nan converted to "nan" (checknull, skipna)