BUG: Impossible creation of array with dtype=string #61263

Manju080 · 2025-04-09T19:22:02Z

Tests added and passed if fixing a bug or adding a new feature
All code checks passed.

I’ve created a fix that raises a ValueError when trying to create a StringArray from a list of lists with inconsistent lengths or non-character elements. This aligns the behavior for consistent and inconsistent input formats, and it is covered by tests.

I would like to hear opinions on raising an error when a list of lists is passed for dtype=StringDtype, to avoid ambiguous behavior. If preferred, we could instead join the inner lists into strings automatically; happy to adjust based on guidance.
Example case: pd.array([["t", "e", "s", "t"], ["w", "o", "r", "d"]], dtype="string")
Output: <StringArray> ['test', 'word'] Length: 2, dtype: string
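
To make the two options concrete, here is a minimal plain-Python sketch of both policies. The helpers `reject_nested` and `join_nested` are hypothetical names used only for illustration; they are not pandas functions and do not reflect how the fix is implemented in `string_.py`.

```python
from typing import Iterable, List


def reject_nested(values: Iterable[object]) -> List[object]:
    """Hypothetical policy A: refuse nested sequences outright."""
    out = []
    for item in values:
        if isinstance(item, (list, tuple)):
            raise ValueError("nested sequences are not valid string scalars")
        out.append(item)
    return out


def join_nested(values: Iterable[object]) -> List[object]:
    """Hypothetical policy B: join inner sequences of strings into one string."""
    out = []
    for item in values:
        if isinstance(item, (list, tuple)):
            # Only sequences made up entirely of strings can be joined.
            if not all(isinstance(ch, str) for ch in item):
                raise ValueError("inner sequences must contain only strings")
            out.append("".join(item))
        else:
            out.append(item)
    return out


data = [["t", "e", "s", "t"], ["w", "o", "r", "d"]]
print(join_nested(data))  # ['test', 'word']
```

Policy A keeps the constructor strict and unambiguous, while policy B is more permissive and works the same way for inner lists of unequal length.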

Thanks

pandas/core/arrays/string_.py

rhshadrach · 2025-04-13T18:00:15Z

Also, please add a test for this.

Manju080 · 2025-04-16T17:56:32Z

pre-commit.ci autofix

rhshadrach

We use pytest for testing, you'll need to add a test using that format. See here:

https://pandas.pydata.org/pandas-docs/dev/development/contributing_codebase.html#using-pytest

The general pytest introduction may also be useful:

https://docs.pytest.org/en/7.1.x/getting-started.html

Manju080 · 2025-04-20T16:36:27Z

Thank you for the details, will work on it

Manju080 · 2025-04-23T19:00:56Z

@rhshadrach I’ve been testing the following case in test_lib.py:

from pandas._libs import lib


def test_ensure_string_array_list_of_lists():
    # GH#61155: ensure list of lists doesn't get converted to string
    arr = [['t', 'e', 's', 't'], ['w', 'o', 'r', 'd']]
    result = lib.ensure_string_array(arr)

    # Each item in result should still be a list, not a stringified version
    assert isinstance(result[0], list)
    assert isinstance(result[1], list)
    assert result[0] == ['t', 'e', 's', 't']
    assert result[1] == ['w', 'o', 'r', 'd']

However, the test fails with:
FAILED pandas/tests/libs/test_lib.py::test_ensure_string_array_list_of_lists - AssertionError
DEBUG RESULT: ["['t', 'e', 's', 't']" "['w', 'o', 'r', 'd']"] <class 'numpy.ndarray'> <class 'str'>

So currently the list of lists ends up as a 1D NumPy array of strings.
With the current implementation, arr first becomes a 1D object array of lists (as intended), but downstream processing then appears to stringify each list.
Do you want me to guard against this case inside ensure_string_array to preserve the list structure? Or is the stringification expected behavior in this context?
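
The failing assertions can be reproduced without pandas. The sketch below is a plain-Python approximation of the observed result, assuming each inner list is simply passed through `str()`; it is not the Cython implementation of `ensure_string_array`.

```python
# Plain-Python approximation of the observed behavior; not pandas' own code.
rows = [list("test"), list("word")]

# str() on a list returns its printed form, brackets and quotes included.
stringified = [str(row) for row in rows]

print(stringified[0])                    # ['t', 'e', 's', 't']
print(isinstance(stringified[0], list))  # False
print(isinstance(stringified[0], str))   # True
```

Each element is now a `str` whose text merely looks like a list, which is why `isinstance(result[0], list)` fails.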

Thanks!

rhshadrach · 2025-04-25T20:02:47Z

I believe converting to a 1-dimensional ndarray of strings is the expected behavior of ensure_string_array. Perhaps I'm misunderstanding; what is the alternative?

Manju080 · 2025-05-05T18:38:17Z

Thanks for the clarification!

You're right — the behavior of ensure_string_array producing a 1D ndarray of stringified inner lists (when given a list of lists like [list("test"), list("word")]) is consistent with the current expectations of the function.

import numpy as np

from pandas._libs import lib


def test_ensure_string_array_list_of_lists():
    arr = [list("test"), list("word")]
    result = lib.ensure_string_array(arr)
    # The result is a 1D object ndarray holding the stringified inner lists
    assert isinstance(result, np.ndarray)
    assert result.dtype == object
    assert result[0] == "['t', 'e', 's', 't']"
    assert result[1] == "['w', 'o', 'r', 'd']"
    print("DEBUG RESULT:", result)

My initial assumption was that it should preserve the list structure instead of converting to strings, but after re-evaluating and running the test, I see that the 1D array of strings is indeed the intended behavior. The test has been updated and now passes, with the output below:
[1/1] Generating write_version_file with a custom command
================================================= test session starts ==================================================
platform linux -- Python 3.12.3, pytest-8.3.5, pluggy-1.5.0
rootdir: /mnt/c/Users/HP/Documents/Python_pandas_op/pandas
configfile: pyproject.toml
plugins: hypothesis-6.131.6
collected 84 items
pandas/tests/libs/test_lib.py ...................................................................................DEBUG RESULT: ["['t', 'e', 's', 't']" "['w', 'o', 'r', 'd']"]

----------------- generated xml file: /mnt/c/Users/HP/Documents/Python_pandas_op/pandas/test-data.xml ------------------
================================================= slowest 30 durations =================================================
0.09s setup pandas/tests/libs/test_lib.py::TestMisc::test_max_len_string_array

(29 durations < 0.005s hidden. Use -vv to show these durations.)

Please let me know if I need to change anything.

rhshadrach · 2025-05-06T02:20:56Z

@Manju080 - the last change I'm seeing is from 3 weeks ago. Perhaps you need to push some commits?

Manju080 · 2025-05-06T03:37:24Z

That's right, I just wanted to make sure before committing the changes.

rhshadrach

Looking good!

pandas/tests/libs/test_lib.py

pandas/tests/arrays/test_string_array.py

pandas/core/indexes/base.py

pandas/core/arrays/string_.py

Manju080 · 2025-05-08T16:00:25Z

Apologies for causing the confusion; I will work on fixing this.

Manju080 · 2025-05-08T17:08:59Z

@rhshadrach Thank you very much, the required changes are done.
Let me know if there is anything else.

Manju080 · 2025-05-10T07:28:25Z

pre-commit.ci autofix

rhshadrach

lgtm

mroeschke · 2025-05-15T16:13:28Z

Thanks @Manju080

* DOC: Update warning in Index.values docstring to clarify index modification issues (pandas-dev#60954)
* DOC: Update warning in Index.values docstring to clarify index modification issues (pandas-dev#60954) with changes
* Update pandas/core/indexes/base.py Co-authored-by: Matthew Roeschke <[email protected]>
* DOC : Fixing the whitespace which was causing error
* Fixed docstring validation and formatting issues
* BUG: Fix array creation for string dtype with inconsistent list lengths (pandas-dev#61155)
* BUG: Fix array creation for string dtype with inconsistent list lengths (pandas-dev#61155)
* BUG fix GH#61155 v2
* BUG fix GH#61155 with test case for list of lists handling
* Fix formatting in test_string_array.py (pre-commit autofix)
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* Add test for list of lists handling in ensure_string_array (GH#61155)
* fixing checks
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* Update pandas/tests/libs/test_lib.py Co-authored-by: Richard Shadrach <[email protected]>
* Remove pandas/tests/arrays/test_string_array.py as requested
* wrong fiel base.py
* Remove check for nested lists in scalars in string_.py first try
* Revert unintended changes to base.py

---------

Co-authored-by: Matthew Roeschke <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Richard Shadrach <[email protected]>

Manju080 and others added 7 commits March 6, 2025 17:16

DOC: Update warning in Index.values docstring to clarify index modification issues (pandas-dev#60954)

3474043

DOC: Update warning in Index.values docstring to clarify index modification issues (pandas-dev#60954) with changes

d070b06

Update pandas/core/indexes/base.py

b00ba12

Co-authored-by: Matthew Roeschke <[email protected]>

DOC : Fixing the whitespace which was causing error

390c8be

Fixed docstring validation and formatting issues

e58f383

BUG: Fix array creation for string dtype with inconsistent list lengths (pandas-dev#61155)

a505d35

BUG: Fix array creation for string dtype with inconsistent list lengths (pandas-dev#61155)

6f5c4d4

rhshadrach requested changes Apr 13, 2025

pandas/core/arrays/string_.py

Manju080 added 2 commits April 15, 2025 16:30

BUG fix GH#61155 v2

fc4653d

BUG fix GH#61155 with test case for list of lists handling

ae36cf7

Manju080 requested a review from WillAyd as a code owner April 15, 2025 17:05

simonjayhawkins changed the title ~~Bugfix 61155~~ BUG: Impossible creation of array with dtype=string Apr 16, 2025

simonjayhawkins added Bug Strings String extension data type and string data labels Apr 16, 2025

Fix formatting in test_string_array.py (pre-commit autofix)

fb965e7

[pre-commit.ci] auto fixes from pre-commit.com hooks

d3bbeaf

for more information, see https://pre-commit.ci

rhshadrach reviewed Apr 19, 2025

Manju080 and others added 4 commits May 6, 2025 16:52

Add test for list of lists handling in ensure_string_array (GH#61155)

4bf8a07

Merge branch 'bugfix-61155' of https://github.com/Manju080/pandas into bugfix-61155

e81e1da

fixing checks

0ca4a18

[pre-commit.ci] auto fixes from pre-commit.com hooks

8a4a54d

for more information, see https://pre-commit.ci

rhshadrach requested changes May 8, 2025

pandas/tests/libs/test_lib.py

pandas/tests/arrays/test_string_array.py

pandas/core/indexes/base.py

pandas/core/arrays/string_.py

Update pandas/tests/libs/test_lib.py

90a74ef

Co-authored-by: Richard Shadrach <[email protected]>

Manju080 added 3 commits May 8, 2025 16:39

Remove pandas/tests/arrays/test_string_array.py as requested

4db751d

wrong fiel base.py

9979a8d

Remove check for nested lists in scalars in string_.py first try

71f7adc

Revert unintended changes to base.py

c2f7c39

rhshadrach approved these changes May 10, 2025

mroeschke added this to the 3.0 milestone May 15, 2025

mroeschke approved these changes May 15, 2025

mroeschke merged commit 29e0146 into pandas-dev:main May 15, 2025
43 of 44 checks passed
