Backport of string changes for 2.3 release - part 2 #60013

Merged

34 commits
faed77b
String dtype: fix pyarrow-based IO + update tests (#59478)
jorisvandenbossche Aug 22, 2024
ac9bf9c
REF (string): avoid copy in StringArray factorize (#59551)
jbrockmendel Aug 22, 2024
8386795
String dtype: avoid surfacing pyarrow exception in binary operations …
jorisvandenbossche Aug 27, 2024
5783aa4
DOC: Add whatsnew for 2.3.0 (#59625)
jorisvandenbossche Aug 27, 2024
1833ccb
BUG (string): str.replace with negative n (#59628)
jbrockmendel Aug 27, 2024
62b474b
TST (string): fix xfailed groupby value_counts tests (#59632)
jbrockmendel Aug 28, 2024
972369f
REF (string): rename result converter methods (#59626)
jbrockmendel Aug 28, 2024
b350a97
TST (string) fix xfailed groupby tests (3) (#59642)
jbrockmendel Aug 28, 2024
3121121
REF (string): de-duplicate str_endswith, startswith (#59568)
jbrockmendel Aug 29, 2024
866a7f6
DEPR (string): non-bool na for obj.str.contains (#59615)
jbrockmendel Aug 31, 2024
b313cf5
TST (string dtype): fix and clean up arrow roundtrip tests (#59678)
jorisvandenbossche Sep 2, 2024
449a094
API (string): str.center with pyarrow-backed string dtype (#59624)
jbrockmendel Sep 2, 2024
63dbe97
REF (string): de-duplicate str_isfoo methods (#59705)
jbrockmendel Sep 4, 2024
2f4af6b
TST (string): copy/view tests (#59702)
jbrockmendel Sep 4, 2024
c807def
TST (string): more targeted xfails in test_string.py (#59703)
jbrockmendel Sep 4, 2024
553780a
REF (string): de-duplicate _str_contains (#59709)
jbrockmendel Sep 5, 2024
44325c1
BUG (string): ArrowStringArray.find corner cases (#59562)
jbrockmendel Sep 6, 2024
ccb90e3
String dtype: implement _get_common_dtype (#59682)
jorisvandenbossche Sep 6, 2024
79dd74d
TST/BUG (string dtype): Fix and adjust indexes string tests (#59544)
phofl Sep 9, 2024
743c682
TST (string dtype): Adjust indexing string tests (#59541)
phofl Sep 9, 2024
bf47ce6
TST (string dtype): adjust pandas/tests/reshape tests (#59762)
jorisvandenbossche Sep 9, 2024
74c6fac
BUG (string dtype): fix inplace mutation with copy=False in ensure_st…
jorisvandenbossche Sep 9, 2024
ca24b42
TST (string dtype): remove usage of 'string[pyarrow_numpy]' alias (#5…
jorisvandenbossche Sep 9, 2024
418f890
BUG (string): Series.str.slice with negative step (#59724)
jbrockmendel Sep 10, 2024
26a0d56
String dtype: remove fallback Perfomance warnings for string methods …
jorisvandenbossche Sep 10, 2024
c8eadfd
REF (string): de-duplicate ArrowStringArray methods (#59555)
jbrockmendel Sep 11, 2024
37886a6
BUG/API (string dtype): return float dtype for series[str].rank() (#5…
jorisvandenbossche Sep 12, 2024
532b9a1
String dtype: fix isin() values handling for python storage (#59759)
jorisvandenbossche Sep 12, 2024
4ff2c68
String dtype: allow string dtype in query/eval with default numexpr e…
jorisvandenbossche Sep 16, 2024
2789338
String dtype: map builtin str alias to StringDtype (#59685)
jorisvandenbossche Sep 25, 2024
53ac224
String dtype: allow string dtype for non-raw apply with numba engine …
jorisvandenbossche Sep 25, 2024
ed78032
fixup rank test
jorisvandenbossche Oct 10, 2024
581582b
update tests
jorisvandenbossche Oct 10, 2024
1d1e3da
fix linting
jorisvandenbossche Oct 10, 2024
8 changes: 8 additions & 0 deletions doc/source/whatsnew/index.rst
@@ -10,6 +10,14 @@ This is the list of changes to pandas between each release. For full details,
see the `commit logs <https://github.com/pandas-dev/pandas/commits/>`_. For install and
upgrade instructions, see :ref:`install`.

Version 2.3
-----------

.. toctree::
:maxdepth: 2

v2.3.0

Version 2.2
-----------

180 changes: 180 additions & 0 deletions doc/source/whatsnew/v2.3.0.rst
@@ -0,0 +1,180 @@
.. _whatsnew_230:

What's new in 2.3.0 (Month XX, 2024)
------------------------------------

These are the changes in pandas 2.3.0. See :ref:`release` for a full changelog
including other versions of pandas.

{{ header }}

.. ---------------------------------------------------------------------------

.. _whatsnew_230.upcoming_changes:

Upcoming changes in pandas 3.0
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


.. _whatsnew_230.enhancements:

Enhancements
~~~~~~~~~~~~

.. _whatsnew_230.enhancements.enhancement1:

enhancement1
^^^^^^^^^^^^


.. _whatsnew_230.enhancements.other:

Other enhancements
^^^^^^^^^^^^^^^^^^

-
-

.. ---------------------------------------------------------------------------
.. _whatsnew_230.notable_bug_fixes:

Notable bug fixes
~~~~~~~~~~~~~~~~~

These are bug fixes that might have notable behavior changes.

.. _whatsnew_230.notable_bug_fixes.notable_bug_fix1:

notable_bug_fix1
^^^^^^^^^^^^^^^^

.. ---------------------------------------------------------------------------
.. _whatsnew_230.deprecations:

Deprecations
~~~~~~~~~~~~
- Deprecated allowing non-``bool`` values for ``na`` in :meth:`.str.contains`, :meth:`.str.startswith`, and :meth:`.str.endswith` for dtypes that do not already disallow these (:issue:`59615`)
-

.. ---------------------------------------------------------------------------
.. _whatsnew_230.performance:

Performance improvements
~~~~~~~~~~~~~~~~~~~~~~~~
-
-

.. ---------------------------------------------------------------------------
.. _whatsnew_230.bug_fixes:

Bug fixes
~~~~~~~~~

Categorical
^^^^^^^^^^^
-
-

Datetimelike
^^^^^^^^^^^^
-
-

Timedelta
^^^^^^^^^
-
-

Timezones
^^^^^^^^^
-
-

Numeric
^^^^^^^
-
-

Conversion
^^^^^^^^^^
-
-

Strings
^^^^^^^
- Bug in :meth:`Series.rank` for :class:`StringDtype` with ``storage="pyarrow"`` incorrectly returning integer results in case of ``method="average"`` and raising an error if it would truncate results (:issue:`59768`)
- Bug in :meth:`Series.str.replace` when ``n < 0`` for :class:`StringDtype` with ``storage="pyarrow"`` (:issue:`59628`)
- Bug in ``ser.str.slice`` with negative ``step`` with :class:`ArrowDtype` and :class:`StringDtype` with ``storage="pyarrow"`` giving incorrect results (:issue:`59710`)
- Bug in the ``center`` method on :class:`Series` and :class:`Index` object ``str`` accessors with pyarrow-backed dtype not matching the python behavior in corner cases with an odd number of fill characters (:issue:`54792`)
-

Interval
^^^^^^^^
-
-

Indexing
^^^^^^^^
-
-

Missing
^^^^^^^
-
-

MultiIndex
^^^^^^^^^^
-
-

I/O
^^^
-
-

Period
^^^^^^
-
-

Plotting
^^^^^^^^
-
-

Groupby/resample/rolling
^^^^^^^^^^^^^^^^^^^^^^^^
-
-

Reshaping
^^^^^^^^^
-
-

Sparse
^^^^^^
-
-

ExtensionArray
^^^^^^^^^^^^^^
-
-

Styler
^^^^^^
-
-

Other
^^^^^
-
-

.. ---------------------------------------------------------------------------
.. _whatsnew_230.contributors:

Contributors
~~~~~~~~~~~~
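
The string-related bug fixes listed above align pyarrow-backed behavior with Python's own string semantics. As a hedged reference, the sketch below shows those semantics using only Python builtins — it does not exercise pandas itself, only the behavior the fixes are matching:

```python
# Pure-Python reference semantics for the fixes listed under "Strings".

# str.replace with an explicit count replaces only the first n matches;
# a negative count behaves like an unlimited replace.
assert "aaaa".replace("a", "b", 2) == "bbaa"
assert "aaaa".replace("a", "b", -1) == "bbbb"

# Slicing with a negative step walks the string backwards.
assert "abcdef"[::-2] == "fdb"

# str.center with an odd amount of padding is asymmetric; corner cases
# like this are where the pyarrow-backed `center` previously diverged.
assert "ab".center(5) == "  ab "
```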
4 changes: 4 additions & 0 deletions pandas/_libs/arrays.pyx
@@ -67,6 +67,10 @@ cdef class NDArrayBacked:
"""
Construct a new ExtensionArray `new_array` with `arr` as its _ndarray.

The returned array has the same dtype as self.

Caller is responsible for ensuring `values.dtype == self._ndarray.dtype`.

This should round-trip:
self == self._from_backing_data(self._ndarray)
"""
5 changes: 4 additions & 1 deletion pandas/_libs/hashtable.pyx
@@ -33,7 +33,10 @@ from pandas._libs.khash cimport (
kh_python_hash_func,
khiter_t,
)
from pandas._libs.missing cimport checknull
from pandas._libs.missing cimport (
checknull,
is_matching_na,
)


def get_hashtable_trace_domain():
18 changes: 15 additions & 3 deletions pandas/_libs/hashtable_class_helper.pxi.in
@@ -1121,11 +1121,13 @@ cdef class StringHashTable(HashTable):
const char **vecs
khiter_t k
bint use_na_value
bint non_null_na_value

if return_inverse:
labels = np.zeros(n, dtype=np.intp)
uindexer = np.empty(n, dtype=np.int64)
use_na_value = na_value is not None
non_null_na_value = not checknull(na_value)

# assign pointers and pre-filter out missing (if ignore_na)
vecs = <const char **>malloc(n * sizeof(char *))
@@ -1134,7 +1136,12 @@ cdef class StringHashTable(HashTable):

if (ignore_na
and (not isinstance(val, str)
or (use_na_value and val == na_value))):
or (use_na_value and (
(non_null_na_value and val == na_value) or
(not non_null_na_value and is_matching_na(val, na_value)))
)
)
):
# if missing values do not count as unique values (i.e. if
# ignore_na is True), we can skip the actual value, and
# replace the label with na_sentinel directly
@@ -1400,18 +1407,23 @@ cdef class PyObjectHashTable(HashTable):
object val
khiter_t k
bint use_na_value

bint non_null_na_value
if return_inverse:
labels = np.empty(n, dtype=np.intp)
use_na_value = na_value is not None
non_null_na_value = not checknull(na_value)

for i in range(n):
val = values[i]
hash(val)

if ignore_na and (
checknull(val)
or (use_na_value and val == na_value)
or (use_na_value and (
(non_null_na_value and val == na_value) or
(not non_null_na_value and is_matching_na(val, na_value))
)
)
):
# if missing values do not count as unique values (i.e. if
# ignore_na is True), skip the hashtable entry for them, and
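
The hashtable hunks above replace a plain `val == na_value` comparison with an `is_matching_na` check whenever the `na_value` is itself a null. A minimal Python sketch of why that is needed — `matches_na_value` here is a hypothetical, simplified stand-in for the Cython helper:

```python
import math

def matches_na_value(val, na_value):
    # Hypothetical simplified stand-in for is_matching_na: plain
    # `val == na_value` can never match NaN, because nan != nan.
    if isinstance(na_value, float) and math.isnan(na_value):
        return isinstance(val, float) and math.isnan(val)
    return val == na_value

nan = float("nan")
assert nan != nan                   # why the old equality check missed NaN
assert matches_na_value(nan, nan)   # the NA-aware check does match it
assert not matches_na_value("x", nan)
```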
28 changes: 22 additions & 6 deletions pandas/_libs/lib.pyx
@@ -736,7 +736,9 @@ cpdef ndarray[object] ensure_string_array(
convert_na_value : bool, default True
If False, existing na values will be used unchanged in the new array.
copy : bool, default True
Whether to ensure that a new array is returned.
Whether to ensure that a new array is returned. When True, a new array
is always returned. When False, a new array is only returned when needed
to avoid mutating the input array.
skipna : bool, default True
Whether or not to coerce nulls to their stringified form
(e.g. if False, NaN becomes 'nan').
@@ -753,7 +755,14 @@

if hasattr(arr, "to_numpy"):

if hasattr(arr, "dtype") and arr.dtype.kind in "mM":
if (
hasattr(arr, "dtype")
and arr.dtype.kind in "mM"
# TODO: we should add a custom ArrowExtensionArray.astype implementation
# that handles astype(str) specifically, avoiding ending up here and
# then we can remove the below check for `_pa_array` (for ArrowEA)
and not hasattr(arr, "_pa_array")
):
# dtype check to exclude DataFrame
# GH#41409 TODO: not a great place for this
out = arr.astype(str).astype(object)
@@ -765,10 +774,17 @@

result = np.asarray(arr, dtype="object")

if copy and (result is arr or np.shares_memory(arr, result)):
# GH#54654
result = result.copy()
elif not copy and result is arr:
if result is arr or np.may_share_memory(arr, result):
# if np.asarray(..) did not make a copy of the input arr, we still need
# to do that to avoid mutating the input array
# GH#54654: share_memory check is needed for rare cases where np.asarray
# returns a new object without making a copy of the actual data
if copy:
result = result.copy()
else:
already_copied = False
elif not copy and not result.flags.writeable:
# Weird edge case where result is a view
already_copied = False

if issubclass(arr.dtype.type, np.str_):
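
The `ensure_string_array` hunk above guards against `np.asarray` handing back the caller's own array, which would make the in-place stringification mutate the input. A small NumPy-only sketch of that aliasing hazard and the defensive copy (this illustrates the general mechanism, not pandas' exact code path):

```python
import numpy as np

arr = np.array(["a", "b"], dtype=object)

# np.asarray with a matching dtype returns the input itself, not a copy.
view = np.asarray(arr, dtype="object")
assert view is arr

# Writing through the result would mutate the caller's array, so a
# defensive copy is taken before values are modified in place.
result = view.copy() if (view is arr or np.may_share_memory(arr, view)) else view
result[0] = "z"
assert arr[0] == "a"  # the original input is left untouched
```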
2 changes: 1 addition & 1 deletion pandas/_testing/__init__.py
@@ -112,7 +112,7 @@

COMPLEX_DTYPES: list[Dtype] = [complex, "complex64", "complex128"]
if using_string_dtype():
STRING_DTYPES: list[Dtype] = [str, "U"]
STRING_DTYPES: list[Dtype] = ["U"]
else:
STRING_DTYPES: list[Dtype] = [str, "str", "U"] # type: ignore[no-redef]
COMPLEX_FLOAT_DTYPES: list[Dtype] = [*COMPLEX_DTYPES, *FLOAT_NUMPY_DTYPES]
37 changes: 36 additions & 1 deletion pandas/conftest.py
@@ -1228,6 +1228,34 @@ def string_dtype(request):
return request.param


@pytest.fixture(
params=[
("python", pd.NA),
pytest.param(("pyarrow", pd.NA), marks=td.skip_if_no("pyarrow")),
pytest.param(("pyarrow", np.nan), marks=td.skip_if_no("pyarrow")),
("python", np.nan),
],
ids=[
"string=string[python]",
"string=string[pyarrow]",
"string=str[pyarrow]",
"string=str[python]",
],
)
def string_dtype_no_object(request):
"""
Parametrized fixture for string dtypes.
* 'string[python]' (NA variant)
* 'string[pyarrow]' (NA variant)
* 'str' (NaN variant, with pyarrow)
* 'str' (NaN variant, without pyarrow)
"""
# need to instantiate the StringDtype here instead of in the params
# to avoid importing pyarrow during test collection
storage, na_value = request.param
return pd.StringDtype(storage, na_value)


@pytest.fixture(
params=[
"string[python]",
@@ -1266,7 +1294,13 @@ def string_storage(request):
pytest.param(("pyarrow", pd.NA), marks=td.skip_if_no("pyarrow")),
pytest.param(("pyarrow", np.nan), marks=td.skip_if_no("pyarrow")),
("python", np.nan),
]
],
ids=[
"string=string[python]",
"string=string[pyarrow]",
"string=str[pyarrow]",
"string=str[python]",
],
)
def string_dtype_arguments(request):
"""
@@ -1297,6 +1331,7 @@ def dtype_backend(request):

# Alias so we can test with cartesian product of string_storage
string_storage2 = string_storage
string_dtype_arguments2 = string_dtype_arguments


@pytest.fixture(params=tm.BYTES_DTYPES)
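
The conftest hunks above pair each fixture param with a readable `ids` entry and defer dtype construction into the fixture body. A generic sketch of that pattern — the names below are illustrative only, and the real fixture builds `pd.StringDtype` objects lazily to avoid importing pyarrow at collection time:

```python
import pytest

# Parallel params/ids lists: each (storage, na_value) pair gets a readable
# test id instead of the default tuple repr. Names are illustrative only.
PARAMS = [
    ("python", None),
    ("pyarrow", None),
    ("pyarrow", float("nan")),
    ("python", float("nan")),
]
IDS = [
    "string=string[python]",
    "string=string[pyarrow]",
    "string=str[pyarrow]",
    "string=str[python]",
]

@pytest.fixture(params=PARAMS, ids=IDS)
def string_dtype_args(request):
    # Construct heavyweight objects here, not in `params`, so optional
    # dependencies are only imported when a test actually runs.
    storage, na_value = request.param
    return storage, na_value
```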
3 changes: 2 additions & 1 deletion pandas/core/_numba/extensions.py
@@ -49,7 +49,8 @@
@contextmanager
def set_numba_data(index: Index):
numba_data = index._data
if numba_data.dtype == object:
if numba_data.dtype in (object, "string"):
numba_data = np.asarray(numba_data)
if not lib.is_string_array(numba_data):
raise ValueError(
"The numba engine only supports using string or numeric column names"
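
The `extensions.py` change above materializes string-dtype index values to a NumPy array before the column-name check, so non-object backings can still be validated. A hedged sketch of that guard — `check_column_names` is a hypothetical simplification, not pandas' actual helper:

```python
import numpy as np

def check_column_names(values):
    # Hypothetical simplification of the guard above: convert to a plain
    # object ndarray first so a non-object backing (e.g. string dtype)
    # can still be validated element by element.
    arr = np.asarray(values, dtype=object)
    if not all(isinstance(v, str) for v in arr):
        raise ValueError(
            "The numba engine only supports using string or numeric column names"
        )
    return arr

check_column_names(np.array(["a", "b"], dtype=object))  # passes the check
```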