str.concat should return fixed size data (not 0 or 1 length series) #12053

Closed
2 tasks done
Julian-J-S opened this issue Oct 26, 2023 · 1 comment
Labels
bug Something isn't working python Related to Python Polars

Comments

@Julian-J-S
Contributor

Julian-J-S commented Oct 26, 2023

Checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl

# PROBLEM:
# An aggregation should return a scalar, not a Series of varying size (see problems and explanation below)
pl.Series(['a', 'b'], dtype=pl.Utf8).str.concat() # -> Series (1,) ["a-b"]
pl.Series([], dtype=pl.Utf8).str.concat() # -> Series (0,) []

# other aggregation (max)
pl.Series([1, 2], dtype=pl.UInt32).max() # -> 2
pl.Series([], dtype=pl.UInt32).max() # -> None

Log output

No response

Issue description

str.concat should behave like other vertical aggregation functions and always return a single value.
Otherwise it leads to confusing and problematic behaviour, as seen in #12030.

Example Problem from linked issue:

df = pl.DataFrame(
    {
        "id": ["1", "1", "2"],
        "text": ["a", "b", "c"],    # df1
        # "text": ["a", "b", None], # df2: None instead of "c"
    }
)

df.group_by("id").agg(
    list=pl.col("text").drop_nulls(),
    concat=pl.col("text").drop_nulls().str.concat(),
)

# df1: expected behaviour
┌─────┬────────────┬────────┐
│ id  ┆ list       ┆ concat │
│ --- ┆ ---        ┆ ---    │
│ str ┆ list[str]  ┆ str    │
╞═════╪════════════╪════════╡
│ 1   ┆ ["a", "b"] ┆ a-b    │
│ 2   ┆ ["c"]      ┆ c      │
└─────┴────────────┴────────┘

# df2: whhhyyy? =)
┌─────┬────────────┬───────────┐
│ id  ┆ list       ┆ concat    │
│ --- ┆ ---        ┆ ---       │
│ str ┆ list[str]  ┆ list[str] │ # >>>>> expect "str"
╞═════╪════════════╪═══════════╡
│ 1   ┆ ["a", "b"] ┆ ["a-b"]   │ # >>>>> expect "a-b" like above
│ 2   ┆ []         ┆ []        │ # >>>>> expect "" because `str.concat` on an empty list should be "", not a Shape (0,) Series
└─────┴────────────┴───────────┘

Doing the same where the aggregation always produces a single value works fine:

pl.DataFrame(
    {
        "id": ["1", "1", "2"],
        "text": [10, 20, None],
    }
).group_by("id").agg(
    list=pl.col("text").drop_nulls(),
    concat=pl.col("text").drop_nulls().sum(),
)
┌─────┬───────────┬────────┐
│ id  ┆ list      ┆ concat │
│ --- ┆ ---       ┆ ---    │
│ str ┆ list[i64] ┆ i64    │
╞═════╪═══════════╪════════╡
│ 1   ┆ [10, 20]  ┆ 30     │
│ 2   ┆ []        ┆ 0      │ # <<<<< working as expected because [].sum = 0
└─────┴───────────┴────────┘
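
Why the `sum` case stays flat can be modelled without Polars. The following is a minimal pure-Python sketch, not Polars' implementation: `group_agg` is a hypothetical helper that drops nulls per group and applies a reducer. Because `sum` returns exactly one value even for an empty group, every group contributes one scalar.

```python
from collections import defaultdict

def group_agg(ids, values, reducer):
    """Hypothetical helper: group values by id, drop None, reduce each group."""
    groups = defaultdict(list)
    for key, value in zip(ids, values):
        bucket = groups[key]  # creates the group even if all its values are None
        if value is not None:
            bucket.append(value)
    return {key: reducer(vals) for key, vals in groups.items()}

# Same data as the example above: group "2" is empty after dropping nulls.
sums = group_agg(["1", "1", "2"], [10, 20, None], sum)
print(sums)  # {'1': 30, '2': 0}
```

The empty group still yields a value (`0`), so the result has one entry of the same type per group.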

Expected behavior

str.concat should always return a single value

pl.Series(['a', 'b'], dtype=pl.Utf8).str.concat() # -> "a-b"
pl.Series([], dtype=pl.Utf8).str.concat() # -> "" (mimics join in rust & python on empty list)
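
The difference between the reported and the requested behaviour can be sketched in plain Python. Both functions below are hypothetical models written for illustration, not Polars code: `concat_varlen` mimics the reported result (a container of length 0 or 1), while `concat_scalar` mimics the requested one by relying on Python's `str.join`, which returns `""` for an empty list.

```python
def concat_varlen(values, delimiter="-"):
    """Models the reported behaviour: a 0- or 1-element result."""
    return [delimiter.join(values)] if values else []

def concat_scalar(values, delimiter="-"):
    """Models the requested behaviour: always exactly one string."""
    return delimiter.join(values)

# One non-empty group and one group that is empty after dropping nulls.
groups = [["a", "b"], []]
varlen = [concat_varlen(g) for g in groups]
scalar = [concat_scalar(g) for g in groups]
print(varlen)  # [['a-b'], []]
print(scalar)  # ['a-b', '']
```

With the variable-length model the per-group results can only be collected as lists, which matches the `list[str]` column in the df2 output; with the scalar model they collect into a flat column of strings.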

Installed versions

--------Version info---------
Polars:              0.19.11
Index type:          UInt32
Platform:            Windows-10-10.0.19045-SP0
Python:              3.11.6 (tags/v3.11.6:8b6ee5b, Oct  2 2023, 14:57:12) [MSC v.1935 64 bit (AMD64)]

@Julian-J-S Julian-J-S added bug Something isn't working python Related to Python Polars labels Oct 26, 2023
@reswqa
Collaborator

reswqa commented Oct 30, 2023

Closed as complete in #12066.

@reswqa reswqa closed this as completed Oct 30, 2023