BUG: select_dtypes does not work properly with numpy.ndarray #60092

Closed · fjossandon opened this issue Oct 23, 2024 · 8 comments

fjossandon commented Oct 23, 2024

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
import numpy as np

table_df = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "dest_id": ["var-foo-1", "var-foo-2", "baz-taz-1"],
        "arr_col": [np.array([1, 2]), np.array([3, 4]), np.array([4, 5])],
    }
)

print(type(table_df["dest_id"][0]), type(table_df["arr_col"][0]))

print(table_df.select_dtypes(include=[np.ndarray]).columns)

Issue Description

I work with several dataframes that occasionally have array columns. I was using select_dtypes to find the columns containing arrays so I could manipulate them, but the function also returns the string columns, and my code crashes when it tries to apply an array function to a string column.

I was on pandas 2.2.2 / numpy 1.26.2 when this first happened, but I made a new environment with the latest versions and the bug is still present.

Expected Behavior

This is the current output:

>>> print(type(table_df["dest_id"][0]), type(table_df["arr_col"][0]))
<class 'str'> <class 'numpy.ndarray'>
>>> 
>>> print(table_df.select_dtypes(include=[np.ndarray]).columns)
Index(['dest_id', 'arr_col'], dtype='object')

I would expect that only arr_col is returned with select_dtypes when using include=[np.ndarray].

Installed Versions

INSTALLED VERSIONS

commit : 68d9dca
python : 3.11.6
python-bits : 64
OS : Linux
OS-release : 6.5.0-44-generic
Version : #44-Ubuntu SMP PREEMPT_DYNAMIC Fri Jun 7 15:10:09 UTC 2024
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 3.0.0.dev0+1580.g68d9dcab5b
numpy : 2.1.2
dateutil : 2.9.0.post0
pip : 23.2
Cython : None
sphinx : None
IPython : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : None
blosc : None
bottleneck : None
fastparquet : None
fsspec : None
html5lib : None
hypothesis : None
gcsfs : None
jinja2 : None
lxml.etree : 5.2.2
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
psycopg2 : None
pymysql : None
pyarrow : None
pyreadstat : None
pytest : None
python-calamine : None
pytz : 2024.1
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlsxwriter : None
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None

@fjossandon added the Bug and Needs Triage labels on Oct 23, 2024
@rhshadrach (Member)

Thanks for the report. Is there any documentation suggesting this should work? np.array is not a dtype, it is a type of storage container.
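
To illustrate the point with a minimal sketch (hypothetical column names, pandas 2.x defaults): both columns report the same catch-all object dtype, so there is no dtype-level difference for select_dtypes to act on.

import numpy as np
import pandas as pd

# Hypothetical two-column frame: a string column and an ndarray column.
df = pd.DataFrame({"s": ["a", "b"], "arr": [np.array([1]), np.array([2])]})
print(df.dtypes)
# s      object
# arr    object
# dtype: object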

@rhshadrach added the Needs Info label and removed the Needs Triage label on Oct 25, 2024
@fjossandon (Author)

Hello @rhshadrach,
To be honest, using select_dtypes to recognize the Numpy array columns was suggested to me in another discussion regarding DuckDB (see here: duckdb/duckdb#14451 (comment)), and I thought it was enough until I got a crash when applying .map(np.ndarray.tolist) to the text columns that select_dtypes also returned.

Before submitting this ticket, I also tried keeping the same include while adding an exclusion for string (because "object" is not an option). My thinking was: array and text columns are both object dtype, but maybe it can still recognize the string one. Instead I got an error saying string dtypes are not allowed:

>>> print(table_df.select_dtypes(include=[np.ndarray], exclude=str).columns)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/fossandon/pyenvs/pandas/lib/python3.11/site-packages/pandas/core/frame.py", line 4889, in select_dtypes
    invalidate_string_dtypes(dtypes)
  File "/home/fossandon/pyenvs/pandas/lib/python3.11/site-packages/pandas/core/dtypes/cast.py", line 969, in invalidate_string_dtypes
    raise TypeError("string dtypes are not allowed, use 'object' instead")
TypeError: string dtypes are not allowed, use 'object' instead

Using regular Python lists instead of NumPy arrays does not work either:

>>> table_df = pd.DataFrame({"id": [1, 2, 3], "dest_id": ["var-foo-1", "var-foo-2", "baz-taz-1"], "arr_col": [[1, 2], [3, 4], [4, 5]]})
>>> table_df.dtypes
id          int64
dest_id    object
arr_col    object
dtype: object

>>> print(table_df.select_dtypes(include=list).columns)
Index(['dest_id', 'arr_col'], dtype='object')

Also, there is something that seems to contradict this part of the documentation (https://pandas.pydata.org/docs/user_guide/basics.html#basics-dtypes):

pandas has two ways to store strings.
1. object dtype, which can hold any Python object, including strings.
2. [StringDtype](https://pandas.pydata.org/docs/reference/api/pandas.StringDtype.html#pandas.StringDtype), which is dedicated to strings.

The documentation mentions a string type, but any simple dataframe created with text still uses "object", so using select_dtypes does not work as expected either:

>>> string_df = pd.DataFrame({"text": ["X", "Y", "Z"]})
>>> print(string_df.dtypes)
text    object
dtype: object

>>> print(string_df.select_dtypes(pd.StringDtype).columns)
<stdin>:1: UserWarning: Instantiating StringDtype without any arguments.Pass a StringDtype instance to silence this warning.
Index([], dtype='object')

It would seem that there just isn't a way to select text columns either (to separate them from the array ones)...
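
(A sketch of what the later comments converge on: if the column is constructed with the dedicated string dtype instead of being inferred as object, select_dtypes(include="string") does find it.)

import pandas as pd

# Build the column with an explicit StringDtype rather than relying on inference.
string_df = pd.DataFrame({"text": pd.array(["X", "Y", "Z"], dtype="string")})
print(string_df.dtypes)
# text    string[python]
# dtype: object
print(string_df.select_dtypes(include="string").columns)
# Index(['text'], dtype='object')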

Out of the pandas methods I've seen, select_dtypes seemed like the most appropriate for this task, but it does not work the way I thought, and I only grew more confused... In the meantime, it looks like checking values with type() is the only option for my particular problem.
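
A minimal sketch of that type()-based workaround (the helper array_columns is made up for illustration, not a pandas API); it inspects the first non-null value of each column and assumes every column has at least one:

import numpy as np
import pandas as pd

def array_columns(df):
    # Collect column names whose first non-null value is a numpy array.
    found = []
    for name in df.columns:
        non_null = df[name].dropna()
        if len(non_null) and isinstance(non_null.iloc[0], np.ndarray):
            found.append(name)
    return found

table_df = pd.DataFrame({
    "dest_id": ["var-foo-1", "var-foo-2"],
    "arr_col": [np.array([1, 2]), np.array([3, 4])],
})
print(array_columns(table_df))
# ['arr_col']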

@rhshadrach (Member)

The documentation mentions a string type, but any simple dataframe created with text still uses "object", so using select_dtypes does not work as expected either:

See https://pandas.pydata.org/docs/whatsnew/v2.2.0.html#dedicated-string-data-type-backed-by-arrow-by-default

@rhshadrach (Member) commented Oct 25, 2024

To be honest, using select_dtypes to recognize the Numpy array columns was suggested to me in another discussion regarding DuckDB (see here: duckdb/duckdb#14451 (comment))

I am not aware of support for this.

Does this resolve your issue?

pd.set_option("future.infer_string", True)
table_df = pd.DataFrame({"id": [1, 2, 3], "dest_id": ["var-foo-1", "var-foo-2", "baz-taz-1"], "arr_col": [np.array([1, 2]), np.array([3, 4]), np.array([4, 5])]})
print(table_df.select_dtypes(include=object).columns)
# Index(['arr_col'], dtype='str')

You can also use include="string" or exclude="string".

@fjossandon (Author) commented Oct 30, 2024

Hi @rhshadrach,
Sorry for the late reply. That new string data type sounds interesting!
I tried that set_option, thanks for the suggestion.
This is what I got with pandas 2.2.2 (my regular environment version):

>>> table_df = pd.DataFrame({"id": [1, 2, 3], "dest_id": ["var-foo-1", "var-foo-2", "baz-taz-1"], "arr_col": [np.array([1, 2]), np.array([3, 4]), np.array([4, 5])]})
>>> print(table_df.dtypes)
id          int64
dest_id    object
arr_col    object
dtype: object

>>> pd.set_option("future.infer_string", True)

>>> table_df2 = pd.DataFrame({"id": [1, 2, 3], "dest_id": ["var-foo-1", "var-foo-2", "baz-taz-1"], "arr_col": [np.array([1, 2]), np.array([3, 4]), np.array([4, 5])]})
>>> print(table_df2.dtypes)
id                         int64
dest_id    string[pyarrow_numpy]
arr_col                   object
dtype: object

>>> print(table_df.select_dtypes(include=object).columns)
Index(['dest_id', 'arr_col'], dtype='object')
>>> print(table_df2.select_dtypes(include=object).columns)
Index(['arr_col'], dtype='string')

>>> print(table_df.select_dtypes(include=object, exclude=str).columns)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/fossandon/pyenvs/base/lib/python3.11/site-packages/pandas/core/frame.py", line 5064, in select_dtypes
    invalidate_string_dtypes(dtypes)
  File "/home/fossandon/pyenvs/base/lib/python3.11/site-packages/pandas/core/dtypes/cast.py", line 970, in invalidate_string_dtypes
    raise TypeError("string dtypes are not allowed, use 'object' instead")
TypeError: string dtypes are not allowed, use 'object' instead

>>> print(table_df.select_dtypes(include=object, exclude="string").columns)
Index(['dest_id', 'arr_col'], dtype='object')
>>> print(table_df2.select_dtypes(include=object, exclude="string").columns)
Index(['arr_col'], dtype='string')

>>> print(table_df.select_dtypes(exclude="string").columns)
Index(['id', 'dest_id', 'arr_col'], dtype='object')
>>> print(table_df2.select_dtypes(exclude="string").columns)
Index(['id', 'arr_col'], dtype='string')

So in pandas 2.2.2, with future.infer_string the string column changes from object to string[pyarrow_numpy], and using str as dtype crashes, but using 'string' works.

And this is what I got with pandas 3.0.0.dev0+1580.g68d9dcab5b:

>>> table_df = pd.DataFrame({"id": [1, 2, 3], "dest_id": ["var-foo-1", "var-foo-2", "baz-taz-1"], "arr_col": [np.array([1, 2]), np.array([3, 4]), np.array([4, 5])]})
>>> print(table_df.dtypes)
id          int64
dest_id    object
arr_col    object
dtype: object

>>> pd.set_option("future.infer_string", True)

>>> table_df2 = pd.DataFrame({"id": [1, 2, 3], "dest_id": ["var-foo-1", "var-foo-2", "baz-taz-1"], "arr_col": [np.array([1, 2]), np.array([3, 4]), np.array([4, 5])]})
>>> print(table_df2.dtypes)
id          int64
dest_id       str
arr_col    object
dtype: object

>>> print(table_df.select_dtypes(include=object).columns)
Index(['dest_id', 'arr_col'], dtype='object')
>>> print(table_df2.select_dtypes(include=object).columns)
Index(['arr_col'], dtype='str')

>>> print(table_df.select_dtypes(include=object, exclude=str).columns)
Index(['dest_id', 'arr_col'], dtype='object')
>>> print(table_df2.select_dtypes(include=object, exclude=str).columns)
Index(['arr_col'], dtype='str')

>>> print(table_df.select_dtypes(include=object, exclude="string").columns)
Index(['dest_id', 'arr_col'], dtype='object')
>>> print(table_df2.select_dtypes(include=object, exclude="string").columns)
Index(['arr_col'], dtype='str')

>>> print(table_df.select_dtypes(exclude="string").columns)
Index(['id', 'dest_id', 'arr_col'], dtype='object')
>>> print(table_df2.select_dtypes(exclude="string").columns)
Index(['id', 'arr_col'], dtype='str')

So in pandas 3.0.0, with future.infer_string the string column changes from object to str, and this time using str as the dtype works. But without the option the string column is still object... is that because this is still a development branch, and str will become the default in the first v3 release?

In summary, in both versions future.infer_string must be active before the dataframe is built, or else "str" / "string" will have no effect on select_dtypes.
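
As an aside, a possible alternative when the dataframe already exists (a sketch, not something suggested in this thread): convert the known text columns explicitly instead of relying on construction-time inference, after which include=object matches only the array column.

import numpy as np
import pandas as pd

table_df = pd.DataFrame({
    "id": [1, 2, 3],
    "dest_id": ["var-foo-1", "var-foo-2", "baz-taz-1"],
    "arr_col": [np.array([1, 2]), np.array([3, 4]), np.array([4, 5])],
})

# Convert the known text column after the fact; the array column stays object.
table_df["dest_id"] = table_df["dest_id"].astype("string")
print(table_df.select_dtypes(include=object).columns)
# Index(['arr_col'], dtype='object')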


Ok, so this looked promising, and I tried it in my actual code (pandas 2.2.2), activating infer_string before the dataframe is created. It worked partially: some text or numeric columns had dtype "object" when they should have been text or numeric. After looking into it, I found the reason: those columns had NULL values, and since I am using .replace(to_replace={np.nan: None}) to avoid NaN (psycopg2 does not convert NaN to NULL), the replacement switched those column dtypes to "object". Is this expected?
This is a demo code:

>>> table_df_nan = pd.DataFrame({"id": [np.nan, 2, 3], "dest_id": ["var-foo-1", np.nan, "baz-taz-1"], "arr_col": [np.array([1, 2]), np.array([3, 4]), np.nan], "nan_col": [np.nan, np.nan, np.nan]})
>>> table_df_none = pd.DataFrame({"id": [None, 2, 3], "dest_id": ["var-foo-1", None, "baz-taz-1"], "arr_col": [np.array([1, 2]), np.array([3, 4]), None], "nan_col": [None, None, None]})

>>> table_df_nan
    id    dest_id arr_col  nan_col
0  NaN  var-foo-1  [1, 2]      NaN
1  2.0        NaN  [3, 4]      NaN
2  3.0  baz-taz-1     NaN      NaN
>>> table_df_none
    id    dest_id arr_col nan_col
0  NaN  var-foo-1  [1, 2]    None
1  2.0        NaN  [3, 4]    None
2  3.0  baz-taz-1    None    None

>>> print(table_df_nan.dtypes)
id                       float64
dest_id    string[pyarrow_numpy]
arr_col                   object
nan_col                  float64
dtype: object
>>> print(table_df_none.dtypes)
id                       float64
dest_id    string[pyarrow_numpy]
arr_col                   object
nan_col                   object
dtype: object

>>> print(table_df_nan.replace(to_replace={np.nan: None}).dtypes)
id         object
dest_id    object
arr_col    object
nan_col    object
dtype: object
>>> print(table_df_none.replace(to_replace={np.nan: None}).dtypes)
id         object
dest_id    object
arr_col    object
nan_col    object
dtype: object

Interestingly, in table_df_none the None I used in the "dest_id" string column at construction time was still converted to NaN, while it was respected in the other columns. Apart from that, it looks like the Python None in the replace changes every dtype to "object" instead of retaining the original one (float64 or string)... I understand that if every value in a column is None/NaN/NULL/NA it would not be possible to guess a particular type, but since the column had a non-object dtype before the replacement, wouldn't it make sense to keep the original dtype, or to recalculate it from the remaining non-null values?
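
(One possible way around this, a sketch assuming the NaN-to-None conversion is only needed when handing rows to psycopg2: keep the typed dataframe for all the dtype-based logic, and build an all-object copy with None only at export time.)

import numpy as np
import pandas as pd

table_df_nan = pd.DataFrame({
    "id": [np.nan, 2, 3],
    "dest_id": ["var-foo-1", np.nan, "baz-taz-1"],
})

# Keep table_df_nan typed for select_dtypes logic; the object-dtype copy
# below substitutes None only where values are missing.
export_df = table_df_nan.astype(object).where(table_df_nan.notna(), None)
print(export_df)
#      id    dest_id
# 0  None  var-foo-1
# 1   2.0       None
# 2   3.0  baz-taz-1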

Best regards and thanks for your patience.

@rhshadrach (Member)

But without the option the string column is still object... is that because this is still a development branch, and str will become the default in the first v3 release?

The default will be infer_string=True in pandas 3.0. We are waiting for the release of 2.3 to update the main branch with the new default.

Some text or numeric columns had dtype "object" when they should have been text or numeric. After looking into it, I found the reason: those columns had NULL values, and since I am using .replace(to_replace={np.nan: None}) to avoid NaN (psycopg2 does not convert NaN to NULL), the replacement switched those column dtypes to "object". Is this expected?

Can you simplify this example as much as possible, showing only the columns that defy your expectations, and show the dtypes that you get along with stating the dtypes you expect.

@fjossandon (Author)

Hello @rhshadrach

The default will be infer_string=True in pandas 3.0. We are waiting for the release of 2.3 to update the main branch with the new default.

Ok!


Can you simplify this example as much as possible, showing only the columns that defy your expectations, and show the dtypes that you get along with stating the dtypes you expect.

Ok, I will simplify.
For the following dataframe:

import pandas as pd
import numpy as np

pd.set_option("future.infer_string", True)

table_df_nan = pd.DataFrame({"integers": [np.nan, 2, 3], "strings": ["var", np.nan, "baz"], "floats": [0.1, 0.2, np.nan]})

print(table_df_nan)
#    integers strings  floats
# 0       NaN     var     0.1
# 1       2.0     NaN     0.2
# 2       3.0     baz     NaN

print(table_df_nan.dtypes)
# integers                  float64
# strings     string[pyarrow_numpy]
# floats                    float64
# dtype: object

First, shouldn't the "integers" column be dtype int64?
Then, after replacing NaN with None, besides the value replacement itself, all the affected columns changed their dtype to object:

print(table_df_nan.replace(to_replace={np.nan: None}))
#   integers strings floats
# 0     None     var    0.1
# 1      2.0    None    0.2
# 2      3.0     baz   None

print(table_df_nan.replace(to_replace={np.nan: None}).dtypes)
# integers    object
# strings     object
# floats      object
# dtype: object

But what I would expect is that the dtype remains the same after replacement, since I'm not changing the nature of the non-null values, so something like this:

print(table_df_nan.replace(to_replace={np.nan: None}).dtypes)
# integers                  float64
# strings     string[pyarrow_numpy]
# floats                    float64
# dtype: object

I hope I explained it better this time.

@rhshadrach (Member)

pandas Series / columns can have one and only one dtype.

First, shouldn't the "integers" column be dtype int64?

No, int64 is not able to store NaN values. If you want a nullable integer column, I'd suggest specifying the dtype as Int64 or int64[pyarrow].

df = pd.DataFrame(
    {
        "a": pd.array([pd.NA, 0, 1], dtype="Int64"),
        "b": pd.array([pd.NA, 0, 1], dtype="int64[pyarrow]"),
    }
)
print(df)
#       a     b
# 0  <NA>  <NA>
# 1     0     0
# 2     1     1

But what I would expect is that the dtype remains the same after replacement, since I'm not changing the nature of the non-null values, so something like this:

This is not possible: None is a Python object; it cannot be stored in an int64 or float64 dtype. The only dtype that can store general Python objects like None is object.
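
A short illustration of this point (a sketch; note that with the nullable dtype the missing-value marker is pd.NA, not None):

import pandas as pd

# Nullable Int64 keeps its dtype while holding a missing value.
s = pd.Series([None, 2, 3], dtype="Int64")
print(s.dtype)
# Int64

# With default inference, None becomes NaN and the column is upcast to float64.
t = pd.Series([None, 2, 3])
print(t.dtype)
# float64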

@rhshadrach removed the Needs Info label on Nov 2, 2024