Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API (string): str.center with pyarrow-backed string dtype #59624

Merged
merged 5 commits into from
Sep 2, 2024

Conversation

jbrockmendel
Copy link
Member

Two bad options for older-pyarrow: different behavior or object fallback. I guessed that @mroeschke would prefer different behavior here, would be happy to be proved wrong.

@jorisvandenbossche
Copy link
Member

Two bad options for older-pyarrow: different behavior or object fallback. I guessed that @mroeschke would prefer different behavior here, would be happy to be proved wrong.

I won't argue for ArrowDtype if there is a preference to not do object fallback there, but for StringDtype, we definitely want to fallback to object dtype, I think (as I proposed in #54807 (comment)). Otherwise this gives a breaking change for people with older pyarrow, and inconsistent behaviour depending on the pyarrow version you have installed. While we can very easily avoid this breaking change, so then we should just do that IMO

@jbrockmendel
Copy link
Member Author

jbrockmendel commented Aug 27, 2024

Using the lean_left_on_odd_padding=False breaks test_string_array:

import numpy as np
import pandas as pd

data = ["a", "bb", np.nan, "ccc"]
ser = pd.Series(data, dtype=pd.StringDtype(storage="pyarrow"))
alt = pd.Series(data, dtype=object)

res = ser.str.center(10)
exp = alt.str.center(10).astype(ser.dtype)

>>> res[0]
'     a    '
>>> exp[0]
'    a     '

The first explanation that comes to mind is that I misunderstood something and should be passing lean_left_on_odd_padding=True, but that in turn breaks test_center_ljust_rjust_fillchar:

import pandas as pd

ser = pd.Series(["a", "bb", "cccc", "ddddd", "eeeeee"], dtype=pd.StringDtype(storage="pyarrow"))
res = ser.str.center(5, fillchar="X")
exp = ser.astype(object).str.center(5, fillchar="X").astype(ser.dtype)

>>> res[2]
'ccccX'
>>> exp[2]
'Xcccc'

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Added a few small comments

@@ -277,6 +277,7 @@ Other API changes
- Made ``dtype`` a required argument in :meth:`ExtensionArray._from_sequence_of_strings` (:issue:`56519`)
- Passing a :class:`Series` input to :func:`json_normalize` will now retain the :class:`Series` :class:`Index`, previously output had a new :class:`RangeIndex` (:issue:`51452`)
- Removed :meth:`Index.sort` which always raised a ``TypeError``. This attribute is not defined and will raise an ``AttributeError`` (:issue:`59283`)
- The ``center`` method on :class:`Series` and :class:`Index` object ``str`` accessors with pyarrow-backed dtype now matches the python behavior in corner cases with an odd number of fill characters when using pyarrow versions 17.0 and above (:issue:`54792`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move to 2.3.0 file?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

from pandas import array

obj_arr = self.astype(object, copy=False) # type: ignore[attr-defined]
obj = array(obj_arr, dtype=object)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could also do obj_arr.array._str_pad(..) and then don't need this explicit construction?

obj_arr = self.astype(object, copy=False) # type: ignore[attr-defined]
obj = array(obj_arr, dtype=object)
result = obj._str_pad(width, side, fillchar) # type: ignore[attr-defined]
return type(self)._from_sequence(result, dtype=self.dtype) # type: ignore[attr-defined]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure how significant this is, but one small performance disadvantage of this instead of type(self)(pa.array(result)) is that _from_sequence has an (in this case) unnecessary call to lib.ensure_string_array.

The problem is that we don't know here which pyarrow type to use? But then could do something like pa.array(result, type=self.dtype.pyarrow_dtype) to preserve the type?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

im not inclined to worry about this too much since its in a branch that will be going away eventually

@jbrockmendel jbrockmendel changed the title API: str.center with pyarrow-backed string dtype API (string): str.center with pyarrow-backed string dtype Sep 1, 2024
@jorisvandenbossche jorisvandenbossche merged commit db1b8ab into pandas-dev:main Sep 2, 2024
47 checks passed
@jbrockmendel jbrockmendel deleted the ref-str_pad branch September 3, 2024 15:15
jorisvandenbossche pushed a commit to jorisvandenbossche/pandas that referenced this pull request Oct 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backported Strings String extension data type and string data
Projects
None yet
Development

Successfully merging this pull request may close these issues.

API: behaviour for the "str.center()" string method for the pyarrow-backed string dtype
3 participants