astype(str) / astype_unicode: np.nan converted to "nan" (checknull, skipna) #25353

Open
ThibTrip opened this issue Feb 17, 2019 · 17 comments
Labels
Astype Bug Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Strings String extension data type and string data
Milestone

Comments

@ThibTrip
Contributor

ThibTrip commented Feb 17, 2019

Code Sample

>>> import pandas as pd
>>> import numpy as np
>>> pd.Series(["foo",np.nan]).astype(str)

Output

0 foo
1 nan # nan string
dtype: object

Expected output

0 foo
1 NaN # np.nan
dtype: object

Problem description

When converting this Series, I would expect np.nan to remain np.nan, but instead it is cast to the string "nan". Maybe I'm alone in this and you'd actually expect that (I can't see a realistic use case for these "nan" strings, but well...).
I was able to figure out that, when running the code sample, the Series' values are processed by astype_unicode in pandas._libs.lib.
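
Until the behaviour changes, one possible workaround is to cast to str and then restore the missing entries. This is only a sketch: the sample Series with an extra float value is an illustrative assumption, and it relies on `Series.mask` replacing masked entries with a missing value by default.

```python
import numpy as np
import pandas as pd

# Illustrative data (the 1.5 is an assumption added to show a real conversion).
s = pd.Series(["foo", 1.5, np.nan])

# Cast everything to str, then put a missing value back wherever the
# original entry was missing.
converted = s.astype(str).mask(s.isna())

print(converted.tolist())
print(converted.isna().tolist())
```

The non-missing entries become strings, while the last entry is still detected by `isna()`.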

There is a skipna argument in astype_unicode, and I thought it would get passed along when calling pd.Series.astype(str, skipna=True), but it does not. The docstring of pd.Series.astype does not mention skipna explicitly but does mention kwargs, so I tried the following while printing skipna inside astype_unicode:

pd.Series(["foo",np.nan]).astype(str, skipna = True)

skipna kept its default value of False in astype_unicode, so it does not get passed along.

However, when calling astype_unicode directly, setting skipna to True does not change the output of the code sample anyway, because checknull does not seem to work properly.
You can test that by printing the result of checknull in lib.pyx, as I did here:
https://github.com/ThibTrip/pandas/commit/4a5c8397304e3026456d864fd5aeb7b8b9adca5f

Input

>>> from pandas._libs.lib import astype_unicode
>>> import numpy as np
>>> astype_unicode(np.array(["foo",np.nan]))

Output

foo is null? False
nan is null? False
array(['foo', 'nan'], dtype=object)

Expected output

foo is null? False
nan is null? True
array(['foo', NaN], dtype=object)
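
One factor that may contribute to the checknull result above is NumPy itself rather than pandas. This is an observation about NumPy's type coercion, not something confirmed in this thread. Building an array from a mixed list without an explicit dtype produces a unicode array, so the missing value is already a string before astype_unicode sees it. A minimal sketch:

```python
import numpy as np

# Without an explicit dtype, NumPy coerces the mixed list to a unicode array,
# so the missing value is already the string 'nan'.
plain = np.array(["foo", np.nan])
print(plain.dtype.kind, plain[1] == "nan")

# With dtype=object, the float NaN survives as a float.
as_object = np.array(["foo", np.nan], dtype=object)
print(type(as_object[1]).__name__, np.isnan(as_object[1]))
```

Passing `dtype=object` when constructing the test array avoids this coercion and gives a cleaner test of the null check.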

I tried patching it (so not a proper fix) using the code below, but it didn't work.
It would also be nice to be able to do pd.Series.astype(str, skipna=True). Whether skipna should then default to True or False is another matter.

# Attempted patch inside astype_unicode in lib.pyx (Cython fragment, not standalone code)
if not (skipna and checknull(arr_i)):
    if arr_i is not np.nan:
        arr_i = unicode(arr_i)

All of this was done in a development version I installed today (see details below). The only alteration is the code in my commit linked above.
Sorry if this has been reported before. I searched in various ways and could not find anything except a similar issue with pd.read_excel and dtype str:

nikoskaragiannakis@694849d

Also, very sorry for the mess with the commits; I got a bit confused during my investigation (and I did not get enough sleep). Is it possible to delete all but my last commit? The other ones are irrelevant.

Cheers Thibault

INSTALLED VERSIONS

commit: 4a5c839
python: 3.7.2.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 69 Stepping 1, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.25.0.dev0+132.g4a5c83973.dirty
pytest: 4.2.0
pip: 19.0.1
setuptools: 40.7.3
Cython: 0.29.5
numpy: 1.15.4
scipy: 1.2.0
pyarrow: 0.11.1
xarray: 0.11.0
IPython: 7.2.0
sphinx: 1.8.4
patsy: 0.5.1
dateutil: 2.7.5
pytz: 2018.9
blosc: None
bottleneck: 1.2.1
tables: 3.4.4
numexpr: 2.6.9
feather: None
matplotlib: 3.0.2
openpyxl: 2.6.0
xlrd: 1.2.0
xlwt: 1.3.0
xlsxwriter: 1.1.2
lxml.etree: 4.3.1
bs4: 4.7.1
html5lib: 1.0.1
sqlalchemy: 1.2.17
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: 0.2.0
fastparquet: 0.2.1
pandas_gbq: None
pandas_datareader: None
gcsfs: None

@gfyoung gfyoung added Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Strings String extension data type and string data labels Feb 18, 2019
@gfyoung
Member

gfyoung commented Feb 18, 2019

cc @jreback : Besides the Excel issue listed in the description, this seems to ring a bell...

@makbigc
Contributor

makbigc commented Aug 26, 2019

I ran the code recently. Passing skipna=True as kwargs works. Should we add skipna=True to the intermediate function call?

In [9]: pd.__version__
Out[9]: '0.25.0+179.gc3d5f227f'

In [10]: ser = pd.Series(['foo', np.nan])

In [11]: ser
Out[11]: 
0    foo
1    NaN
dtype: object

In [12]: ser.astype(str, skipna=True)
Out[12]: 
0    foo
1    NaN
dtype: object

In [14]: np.isnan(ser.astype(str, skipna=True)[1])
Out[14]: True

@rendorHaevyn

rendorHaevyn commented Sep 4, 2019

This treatment of None/np.nan is quite unexpected.

I have found that this issue propagates into output of dataframes to csv / excel / clipboard, and boolean evaluation.

The logical function for astype(str) here would be to skip all NA types by default (None, np.nan, etc).

As an example:

pd.__version__
'0.24.2'
  • input as np.nan:
srs = pd.Series([np.nan,np.nan,5])
srs
srs.any()
pd.isna(srs)
srs.to_clipboard()
  • output as nan:
0    NaN
1    NaN
2    5.0
dtype: float64
True
0     True
1     True
2    False
dtype: bool
  • clipboard output as nan:
    | Index | Series_Out |
    |------- |------------ |
    | 0 | |
    | 1 | |
    | 2 | 5 |

  • input after astype(str) cast:

srs = srs.astype(str,skipna=True)
srs
srs.any()
pd.isna(srs)
srs.to_clipboard()
  • output after astype(str) cast:
0    nan
1    nan
2    5.0
dtype: object
'nan'
0    False
1    False
2    False
dtype: bool
  • clipboard output after astype(str) cast:
    | Index | Series_Out |
    |------- |------------ |
    | 0 | nan |
    | 1 | nan |
    | 2 | 5 |

@simonjayhawkins
Member

The logical function for astype(str) here would be to skip all NA types by default (None, np.nan, etc).

pandas matches the behaviour of numpy

>>> import numpy as np
>>>
>>> np.__version__
'1.18.1'
>>>
>>> import pandas as pd
>>>
>>> pd.__version__
'1.1.0.dev0+1008.g60b0e9fbc'
>>>
>>> arr = np.array([None, np.nan])
>>> arr
array([None, nan], dtype=object)
>>>
>>> arr.astype(str)
array(['None', 'nan'], dtype='<U4')
>>>
>>> pd.Series(arr).astype(str).apply(type)
0    <class 'str'>
1    <class 'str'>
dtype: object
>>>
>>> arr = np.array([1, 2, np.nan], dtype="float")
>>> arr
array([ 1.,  2., nan])
>>>
>>> arr.astype(str)
array(['1.0', '2.0', 'nan'], dtype='<U32')
>>>
>>> pd.Series(arr).astype(str).apply(type)
0    <class 'str'>
1    <class 'str'>
2    <class 'str'>
dtype: object
>>>

@mroeschke mroeschke added the Bug label May 3, 2020
@rosstex

rosstex commented May 17, 2020

So, is this considered a bug or not? I now have to work around it in my current code. It is annoying because when concatenating many columns together into a string, a la

(
    ipv4_banners_pd["ssh_banner"]
    .astype(str).str.cat(ipv4_banners_pd["telnet_banner"], sep='_', na_rep="")
    .astype(str).str.cat(ipv4_banners_pd["snmp_banner"], sep='_', na_rep="")
)

the NaNs are converted to "" in the latter two columns by the na_rep parameter, but they can't be converted in the first, which seems unintuitive.
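
One way to work around this for the first column is to fill its missing values before the string cast. The following is a rough sketch only: the DataFrame contents are made up for illustration, and only two of the three banner columns are used.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the banner columns discussed above.
banners = pd.DataFrame({
    "ssh_banner": ["x", np.nan],
    "telnet_banner": [np.nan, "y"],
})

combined = (
    banners["ssh_banner"]
    .fillna("")      # replace missing values before the string cast
    .astype(str)
    .str.cat(banners["telnet_banner"], sep="_", na_rep="")
)
print(combined.tolist())
```

With the fill applied first, the missing entry in the leading column contributes an empty string, just as na_rep does for the other column.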

@jreback
Contributor

jreback commented May 17, 2020

yes this is an open bug
there were several attempts to patch it - see the PR refs

@rcrowell

Is a similar thing happening with an apply that changes the dtype from numeric to object?

>>> def pretty_num(num):
...     return np.nan if num is np.nan else '%0.04f' % (num,)
... 
>>> nums = pd.Series([1.0, 0.1, np.nan])
>>> nums.apply(pretty_num).tolist()
['1.0000', '0.1000', 'nan']
>>> type(nums.apply(pretty_num)[2])
<class 'str'>
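
A likely explanation for the apply case, offered as an assumption rather than something confirmed in this thread: the float NaN handed to the function is not the same object as np.nan, so the identity test `num is np.nan` is False and the NaN gets formatted as the string 'nan'. A sketch that tests the value with pd.isna instead:

```python
import numpy as np
import pandas as pd

def pretty_num(num):
    # pd.isna tests the value; `num is np.nan` only tests object identity.
    return np.nan if pd.isna(num) else "%0.04f" % (num,)

nums = pd.Series([1.0, 0.1, np.nan])
formatted = nums.apply(pretty_num)
print(formatted.tolist())
print(formatted.isna().tolist())
```

Here the last element stays missing instead of becoming a string.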

@abekfenn

abekfenn commented May 6, 2021

Update on this?

@jreback
Contributor

jreback commented May 6, 2021

@abekfennessy pandas is purely volunteer - you are welcome to propose solutions and push PRs

there are 3500 issues and very few volunteers

@abekfenn

abekfenn commented May 7, 2021

Point taken. If I understand your comment in #35060 correctly, astype_nansafe is deprecated, and since the underlying cause originates in numpy, that is where the issue should be resolved?
Want to make sure I direct my efforts in the right place. Thanks!

@jreback
Contributor

jreback commented Feb 13, 2022

should be closed by #37034

@hdaly0

hdaly0 commented May 25, 2022

I don't think #37034 closes this ticket because the change defines the expected behaviour to be:

None -> "None"
np.nan -> "nan"
NA -> "<NA>"

where all three values get transformed into string types with various values, not NA types, which is what the ticket is about.

@hdaly0

hdaly0 commented May 27, 2022

It seems if you do:

pd.Series(["foo",np.nan]).astype("string")

instead of

pd.Series(["foo",np.nan]).astype(str)

pandas uses a String datatype with nulls and this fixed the issue for me.

Specifically:

s = pd.Series(["foo",np.nan]).astype("string")
type(s[1])
>> <class 'pandas._libs.missing.NAType'>

s = pd.Series(["foo",np.nan]).astype("str")
type(s[1])
>> <class 'str'>

Running versions:
pandas 1.3.5
numpy 1.21.6
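
A slightly expanded check of the same approach, as a sketch rather than a definitive recipe: after casting to "string", the missing entry is still detected by isna() and can be filled explicitly if an empty string is wanted.

```python
import numpy as np
import pandas as pd

s = pd.Series(["foo", np.nan]).astype("string")

# The missing entry is preserved by the cast...
print(s.isna().tolist())

# ...and can be replaced explicitly when a plain string is needed.
filled = s.fillna("")
print(filled.tolist())
```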

@rodrigoformi-zh

rodrigoformi-zh commented Jun 2, 2022

ˆ ˆ ˆ
THIS WORKS!

@jorisvandenbossche jorisvandenbossche added this to the 2.0 milestone Dec 9, 2022
@jorisvandenbossche
Member

Is this something we might want to do as breaking change in 2.0?

(finally change astype to preserve missing values, instead of converting them to their string repr)

It seems that people were generally in favor of that behaviour change, but it didn't make it into pandas 1.0 (#28176 (comment)). At that time, there was some discussion about whether it should first be deprecated or not (e.g. in the linked PR). And of course for 2.0 it is again too late for a deprecation, so in the short term it is only possible as a breaking change.

@mroeschke mroeschke modified the milestones: 2.0, 3.0 Feb 8, 2023
@dgutiero97

It seems if you do:

pd.Series(["foo",np.nan]).astype("string")

instead of

pd.Series(["foo",np.nan]).astype(str)

pandas uses a String datatype with nulls and this fixed the issue for me.

Specifically:

s = pd.Series(["foo",np.nan]).astype("string")
type(s[1])
>> <class 'pandas._libs.missing.NAType'>

s = pd.Series(["foo",np.nan]).astype("str")
type(s[1])
>> <class 'str'>

Running versions: pandas 1.3.5 numpy 1.21.6

Solution tested, actually works

@jorisvandenbossche
Member

With the new string dtype for pandas 3.0 (PDEP-14, #54792), we have a good opportunity to fix this issue.

In #59685, I am making it so that astype(str) becomes an alias of casting to the future default string dtype, and at that point it will take the code path of StringDtype which preserves missing values when coercing or casting.
