Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Drop_nulls in group_by creates list now? #12030

Closed
2 tasks done
ion-elgreco opened this issue Oct 25, 2023 · 10 comments
Closed
2 tasks done

Drop_nulls in group_by creates list now? #12030

ion-elgreco opened this issue Oct 25, 2023 · 10 comments
Labels
bug Something isn't working python Related to Python Polars

Comments

@ion-elgreco
Copy link
Contributor

Checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

df = pl.DataFrame({
    "id":["1","1","3","2","2"],
    "text":['a','b', None,'c','d']
})

df.groupby("id").agg(
        pl.col('text').drop_nulls().str.concat()
)

shape: (3, 2)
┌─────┬───────────┐
│ idtext      │
│ ------       │
│ strlist[str] │
╞═════╪═══════════╡
│ 3   ┆ []        │
│ 1   ┆ ["a-b"]   │
│ 2   ┆ ["c-d"]   │
└─────┴───────────┘

Log output

No response

Issue description

Current .drop_nulls() before str.concat creates a list column. But it didn't this in v0.18.15.

I do like that I can now remove .drop_nulls and str.concat ignores the null while concatenating, so I can get the same and improved behavour by doing this:

df.groupby("id").agg(
        pl.col('text').str.concat()
)

But I don't think it should create a list when you do drop_nulls, but I may be wrong :D

Expected behavior

Don't create a list column but give same output as if you did:

df.groupby("id").agg(
        pl.col('text').str.concat()
)

shape: (3, 2)
┌─────┬──────┐
│ idtext │
│ ------  │
│ strstr  │
╞═════╪══════╡
│ 3null │
│ 2c-d  │
│ 1a-b  │
└─────┴──────┘

Installed versions

--------Version info---------
Polars:              0.19.11
Index type:          UInt32
Platform:            Linux-5.15.90.1-microsoft-standard-WSL2-x86_64-with-glibc2.31
Python:              3.10.12 (main, Aug  9 2023, 14:47:34) [GCC 9.4.0]

----Optional dependencies----
adbc_driver_sqlite:  <not installed>
cloudpickle:         2.2.1
connectorx:          <not installed>
deltalake:           0.12.0
fsspec:              2023.9.2
gevent:              <not installed>
matplotlib:          <not installed>
numpy:               1.23.5
openpyxl:            <not installed>
pandas:              1.5.3
pyarrow:             11.0.0
pydantic:            1.10.13
pyiceberg:           <not installed>
pyxlsb:              <not installed>
sqlalchemy:          2.0.22
xlsx2csv:            0.8.1
xlsxwriter:          3.1.9
@ion-elgreco ion-elgreco added bug Something isn't working python Related to Python Polars labels Oct 25, 2023
@reswqa
Copy link
Collaborator

reswqa commented Oct 25, 2023

The key point is that for group(id = 3), it will be empty after drop_nulls. We only auto explode in agg context if all list elements are unit length.

If you add another non-none element with id = 3, the result dtype will be str instead of list[str]

df = pl.DataFrame({
        "id":["1","1","3","3","2","2"],
        "text":['a','b', None,'e','c','d']
    })

df.group_by("id").agg(
        pl.col('text').drop_nulls().str.concat()
)
┌─────┬──────┐
│ idtext │
│ ------  │
│ strstr  │
╞═════╪══════╡
│ 2c-d  │
│ 1a-b  │
│ 3e    │
└─────┴──────┘

@ion-elgreco
Copy link
Contributor Author

The key point is that for group(id = 3), it will be empty after drop_nulls. We only auto explode in agg context if all list elements are unit length.

If you add another non-none element with id = 3, the result dtype will be str instead of list[str]

df = pl.DataFrame({
        "id":["1","1","3","3","2","2"],
        "text":['a','b', None,'e','c','d']
    })

df.group_by("id").agg(
        pl.col('text').drop_nulls().str.concat()
)
┌─────┬──────┐
│ idtext │
│ ------  │
│ strstr  │
╞═════╪══════╡
│ 2c-d  │
│ 1a-b  │
│ 3e    │
└─────┴──────┘

Hmm but that's new behavior then. in v0.18.15 it always auto exploded.

@reswqa
Copy link
Collaborator

reswqa commented Oct 25, 2023

All right, this is caused by we change the behavior of str.concat of empty input.

pl.select(
        pl.Series([], dtype=pl.Utf8).str.concat()
 )

0.19.11
image

0.18.15
image

As a result, it will not auto exploded. But I feel that the new behavior of star.concat is better.

@ion-elgreco
Copy link
Contributor Author

@reswqa
I do agree that new behaviour of str.concat is better! It was just unexcepted 😄 I am trying to migrate our code base from 0.18.15 to 0.19+ but there are many unexcepted behaviors in 0.19

@ion-elgreco
Copy link
Contributor Author

Closing it since this it was to be expected due to change in str.concat

@Julian-J-S
Copy link
Contributor

ugh, I hope this is still open for discussion :O

afaik @ritchie46 always wanted a single return type per expression and no "surprises" depending on data which makes total sense.
And if we get a single Utf8 (or null) per aggregation group there should be no need for a list?

df = pl.DataFrame(
    {
        "id": ["1", "1", "2"],
        "text": ["a", "b", "c"],    # df1
        # "text": ["a", "b", None], # df2: None instead of "c"
    }
)

df.group_by("id").agg(
    list=pl.col("text").drop_nulls(),
    concat=pl.col("text").drop_nulls().str.concat(),
)

# df1: expected behaviour
┌─────┬────────────┬────────┐
│ idlistconcat │
│ ---------    │
│ strlist[str]  ┆ str    │
╞═════╪════════════╪════════╡
│ 1   ┆ ["a", "b"] ┆ a-b    │
│ 2   ┆ ["c"]      ┆ c      │
└─────┴────────────┴────────┘

# df2: whhhyyy? =)
┌─────┬────────────┬───────────┐
│ idlistconcat    │
│ ---------       │
│ strlist[str]  ┆ list[str] │ # >>>>> expect "str"
╞═════╪════════════╪═══════════╡
│ 1   ┆ ["a", "b"] ┆ ["a-b"]   │ # >>>>> expect "a-b" like above2   ┆ []         ┆ []        │ # >>>>> expect null because `str.concat` on empyt list is null
└─────┴────────────┴───────────┘

I honestly think this is very dangerous for data transformations because imagine if your following transformation does some str.xxx manipulation on this column and suddendly you get a list[str] instead of str ?

@ion-elgreco ion-elgreco reopened this Oct 26, 2023
@ion-elgreco
Copy link
Contributor Author

ion-elgreco commented Oct 26, 2023

@JulianCologne @reswqa maybe it should also just auto explode when something is null, then you don't get unexpected results

@reswqa
Copy link
Collaborator

reswqa commented Oct 26, 2023

expect null because str.concat on empty list is null

No, It's an empty series indeed. This is also the reason why it can not be exploded now. If you get an aggregated list like [[1], [2],[None]], it can indeed be exploded to [1, 2, None], but if you get [[1], [2],[]], how should we deal with the last empty element? Maybe fill in None, maybe not. 🤔

maybe it should also just auto explode when something is null, then you don't get unexpected results

For aggregated list like [[1], [2],None], we can indeed make it exploded to [1,2,None], and this might be worth discussing.

@ion-elgreco
Copy link
Contributor Author

@JulianCologne we can close this now, right?

@reswqa
Copy link
Collaborator

reswqa commented Nov 1, 2023

I think we can.

Closed as complete in #12066.

@reswqa reswqa closed this as completed Nov 1, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working python Related to Python Polars
Projects
None yet
Development

No branches or pull requests

3 participants