-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
String dtype: map builtin str alias to StringDtype #59685
Changes from all commits
f444105
630d41c
d127770
38d011a
c4ed9d3
cad7d8f
189e26d
1089eb3
8f7e968
51900f1
15f45d2
4464fb1
650f694
49297f0
9164dbb
cf9f855
bd79fc9
4c775d1
3b0b779
b0276b2
d413fc6
d634da2
e791330
db8900c
e6aad17
4e6cf04
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -6262,7 +6262,11 @@ def _should_compare(self, other: Index) -> bool: | |
return False | ||
|
||
dtype = _unpack_nested_dtype(other) | ||
return self._is_comparable_dtype(dtype) or is_object_dtype(dtype) | ||
return ( | ||
self._is_comparable_dtype(dtype) | ||
or is_object_dtype(dtype) | ||
or is_string_dtype(dtype) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. why is the is_string_dtype case needed here? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Because we allow to compare non-strings with strings for certain cases, I guess?
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. hmm looks like two tests fail if this is disabled, test_map_missing_mixed, test_get_indexer_datetime. The map test I'd have to look into, but the get_indexer one make sense:
I suspect there are dtype combinations for which we don't need the is_string_dtype check here, might be able to move it to _is_comparable_dtype and restore fast-paths in some cases. |
||
) | ||
|
||
def _is_comparable_dtype(self, dtype: DtypeObj) -> bool: | ||
""" | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -51,6 +51,7 @@ | |
is_number, | ||
is_object_dtype, | ||
is_scalar, | ||
is_string_dtype, | ||
pandas_dtype, | ||
) | ||
from pandas.core.dtypes.dtypes import ( | ||
|
@@ -712,7 +713,7 @@ def _get_indexer( | |
# left/right get_indexer, compare elementwise, equality -> match | ||
indexer = self._get_indexer_unique_sides(target) | ||
|
||
elif not is_object_dtype(target.dtype): | ||
elif not (is_object_dtype(target.dtype) or is_string_dtype(target.dtype)): | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. why do we exclude string dtype here? id expect _get_indexer to fastpath to all-minus-ones |
||
# homogeneous scalar index: use IntervalTree | ||
# we should always have self._should_partial_index(target) here | ||
target = self._maybe_convert_i8(target) | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -184,7 +184,7 @@ def test_construct_from_string_fill_value_raises(string): | |
[ | ||
(SparseDtype(int, 0), float, SparseDtype(float, 0.0)), | ||
(SparseDtype(int, 1), float, SparseDtype(float, 1.0)), | ||
(SparseDtype(int, 1), str, SparseDtype(object, "1")), | ||
(SparseDtype(int, 1), np.str_, SparseDtype(object, "1")), | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Do we even need this test case? I'm not sure what expectations we have around the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yeah, can also leave out this test case entirely. I am not entirely sure yet what There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'd be fine to remove; I feel like this opens up a pandora's box of issues that aren't worth tracking down at this point in time |
||
(SparseDtype(float, 1.5), int, SparseDtype(int, 1)), | ||
], | ||
) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a bit ugly, but essentially this is avoiding that
astype(str)
for ArrowExtensionArray (if that happens to have a datetimelike dtype) is called here, because that will useensure_string_array
for that implementation, causing a recursion error.An alternative would be a better check than
arr.dtype.kind in "mM"
that restricts itself toDatetimeLikeArrayMixin
(essentially I need aisinstance(arr, DatetimeLikeArrayMixin)
, I think). Could potentially do that with some ABC check.Another alternative is to handle
astype(str)
specifically in ArrowExtensionArray (currentlyastype
is not implemented there, and it falls back entirely on the base class implementation)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think eventually it would be good to implement
ArrowExtensionArray.astype
so I would support eventually handingstr
there. Could you add aTODO
comment here describing when this could be removed whenastype
is implemented?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a TODO comment