BUG: select_dtypes does not work properly with numpy.ndarray #60092
Comments
Thanks for the report. Is there any documentation suggesting this should work?
Hello @rhshadrach, before submitting this ticket, I also tried using the same include while excluding strings (because "object" is not an option). My thinking was that array and text columns are both object type, but maybe the string one could be recognized. Instead I got an error saying that string is not allowed:
Using a regular list instead of Numpy array does not work either:
Also, there is something that seems to contradict this part of the documentation (https://pandas.pydata.org/docs/user_guide/basics.html#basics-dtypes):
The documentation mentions a string type, but any simple dataframe created with text still uses "object", so using select_dtypes does not work as expected either:
It would seem that there just isn't a way to select text columns either (to separate them from the array ones)... Out of the Pandas methods I've seen, select_dtypes seemed like the most appropriate for this task, but it looks like it does not work as I thought and I only grew more confused... In the meantime, it looks like using
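For readers who need to separate array columns from text columns while both are stored as object dtype, one option is to inspect the values themselves rather than the column dtype. The following is a minimal sketch under that assumption, not a pandas feature: `columns_holding` is a hypothetical helper name, and the sample frame mirrors the one used elsewhere in this thread.

```python
import numpy as np
import pandas as pd


def columns_holding(df, value_type):
    """Hypothetical helper: names of columns whose every element is a value_type."""
    return [
        name
        for name in df.columns
        # Skip empty frames so that all() over no elements does not match everything.
        if len(df) > 0 and all(isinstance(v, value_type) for v in df[name])
    ]


table_df = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "dest_id": ["var-foo-1", "var-foo-2", "baz-taz-1"],
        "arr_col": [np.array([1, 2]), np.array([3, 4]), np.array([4, 5])],
    }
)

print(columns_holding(table_df, np.ndarray))  # ['arr_col']
print(columns_holding(table_df, str))  # ['dest_id']
```

This check is element-wise, so it is slower than a dtype lookup on large frames, and a column that mixes arrays with missing values would not match unless the missing values are dropped first.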
I am not aware of support for this. Does this resolve your issue?
pd.set_option("future.infer_string", True)
table_df = pd.DataFrame({"id": [1, 2, 3], "dest_id": ["var-foo-1", "var-foo-2", "baz-taz-1"], "arr_col": [np.array([1, 2]), np.array([3, 4]), np.array([4, 5])]})
print(table_df.select_dtypes(include=object).columns)
# Index(['arr_col'], dtype='str')
You can also use
Hi @rhshadrach,
So in pandas 2.2.2, with
And this is what I got with pandas 3.0.0.dev0+1580.g68d9dcab5b:
So in pandas 3.0.0, with
In summary, in both versions
Ok, so it looked promising and I tried it in my actual code (with pandas version 2.2.2), activating infer before the dataframe is created, and it worked partially. Some text or numeric columns had dtype "object" when they should have been text or numeric. After looking, I found that the reason was that some columns had NULL values, and since I am using
Interestingly, in
Best regards and thanks for your patience.
The default will be
Can you simplify this example as much as possible, showing only the columns that defy your expectations, and show the dtypes that you get along with the dtypes you expect?
Hello @rhshadrach
Ok!
Ok, I will simplify.
First, shouldn't the "integers" column be dtype int64?
But what I would expect is that the dtype remains the same after replacement, since I'm not changing the nature of the non-null values, so something like this:
I hope I explained it better this time.
pandas Series / columns can have one and only one dtype.
No,
df = pd.DataFrame(
{
"a": pd.array([pd.NA, 0, 1], dtype="Int64"),
"b": pd.array([pd.NA, 0, 1], dtype="int64[pyarrow]"),
}
)
print(df)
# a b
# 0 <NA> <NA>
# 1 0 0
# 2 1 1
This is not possible,
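To make the point about a single dtype per column concrete, here is a small sketch, assuming a column of integers with one missing value. It shows the default inference to float64 and an explicit conversion to the nullable "Int64" extension dtype used in the example above. The variable names are illustrative only.

```python
import pandas as pd

# A missing value alongside plain integers: pandas infers float64,
# because the default int64 dtype has no way to represent a missing value.
inferred = pd.Series([None, 0, 1])
print(inferred.dtype)  # float64

# Converting to the nullable extension dtype keeps the values as integers
# and represents the missing entry as <NA>.
nullable = inferred.astype("Int64")
print(nullable.dtype)  # Int64
print(nullable.isna().tolist())  # [True, False, False]
```

Whether the nullable dtype is appropriate depends on the downstream code, since some operations treat <NA> differently from NaN.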
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
Issue Description
I work with several dataframes, which occasionally have array columns. I was using select_dtypes to search for the columns containing arrays so I could manipulate them, but the function also returns the string columns, and my code crashes when it tries to apply the array function to a string column.
I was working with pandas 2.2.2 / numpy 1.26.2 when this happened, but I made a new environment, upgraded to the latest versions, and it still happened.
Expected Behavior
This is the current output:
I would expect that only arr_col is returned with select_dtypes when using include=[np.ndarray].
Installed Versions
INSTALLED VERSIONS
commit : 68d9dca
python : 3.11.6
python-bits : 64
OS : Linux
OS-release : 6.5.0-44-generic
Version : #44-Ubuntu SMP PREEMPT_DYNAMIC Fri Jun 7 15:10:09 UTC 2024
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 3.0.0.dev0+1580.g68d9dcab5b
numpy : 2.1.2
dateutil : 2.9.0.post0
pip : 23.2
Cython : None
sphinx : None
IPython : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : None
blosc : None
bottleneck : None
fastparquet : None
fsspec : None
html5lib : None
hypothesis : None
gcsfs : None
jinja2 : None
lxml.etree : 5.2.2
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
psycopg2 : None
pymysql : None
pyarrow : None
pyreadstat : None
pytest : None
python-calamine : None
pytz : 2024.1
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlsxwriter : None
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None