Performance regression of n_unique in group_by/agg context since 1.3.0. #18661

Closed
2 tasks done
taozuoqiao opened this issue Sep 10, 2024 · 2 comments
Labels
bug Something isn't working needs triage Awaiting prioritization by a maintainer performance Performance issues or improvements python Related to Python Polars

taozuoqiao commented Sep 10, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

from time import perf_counter

import numpy as np
import polars as pl

print(f"polars: {pl.__version__}")

# Number of distinct groups per run: 2**10, 2**12, ..., 2**18.
n_lst = 2 ** np.arange(10, 20, 2)
costs = []  # wall-clock seconds per run
for n in n_lst:
    # Each value appears exactly twice, so every group has n_unique == 1.
    arr = np.repeat(np.arange(n), 2)
    df = pl.DataFrame({"a": arr, "b": arr})
    t = perf_counter()
    df.group_by("a").agg(pl.col("b").n_unique())
    t = perf_counter() - t
    costs.append(t)
    print(f"{n},{t}")

Log output

  • Linux Server
"""
polars: 1.2.1
1024,0.03431016765534878
4096,0.01690526120364666
16384,0.035870613530278206
65536,0.031103476881980896
262144,0.048228250816464424

polars: 1.3.0
1024,0.07050714455544949
4096,0.12646342255175114
16384,0.4367207493633032
65536,1.6754069868475199
262144,6.616759138181806

polars: 1.6.0
1024,0.06845919229090214
4096,0.12585468403995037
16384,0.4324886444956064
65536,1.6827915534377098
262144,6.5993651412427425
"""
  • Windows Desktop:
"""
polars: 1.2.1
1024,0.0018669001292437315
4096,0.0012595001608133316
16384,0.0032719001173973083
65536,0.008230299921706319
262144,0.024810299975797534
1048576,0.1400582001078874
4194304,0.5446433001197875

polars: 1.3.0
1024,0.00388179998844862
4096,0.004333199933171272
16384,0.012860999908298254
65536,0.05761150014586747
262144,0.32461009989492595
1048576,1.3081338999327272
4194304,5.261959800031036

polars: 1.6.0
1024,0.04421829991042614
4096,0.004168899962678552
16384,0.026148000033572316
65536,0.07865060004405677
262144,0.26833049999549985
1048576,0.34357220004312694
4194304,2.905561099993065

"""

Issue description

Since polars 1.3.0, n_unique shows a significant performance regression in the group_by/agg context. In my problem setting, this bottleneck makes the group_by/agg step take minutes or even hours to finish, whereas it takes only seconds in polars<=1.2.1.
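To put a number on "significant", the Linux timings from the log output can be compared directly. This is only a rough sketch over the single run reported above. The variable names are illustrative, and the per-group cost is a plain division of wall-clock time by group count, not a profiled figure.

```python
# Linux timings copied from the log output in this issue.
# n_groups is the number of distinct keys in column "a" for each run.
n_groups = [1024, 4096, 16384, 65536, 262144]
t_121 = [
    0.03431016765534878,
    0.01690526120364666,
    0.035870613530278206,
    0.031103476881980896,
    0.048228250816464424,
]
t_130 = [
    0.07050714455544949,
    0.12646342255175114,
    0.4367207493633032,
    1.6754069868475199,
    6.616759138181806,
]

# Slowdown of 1.3.0 relative to 1.2.1 at each size.
slowdowns = [new / old for old, new in zip(t_121, t_130)]

# Growth of the 1.3.0 timing each time the group count is multiplied by 4.
growth = [t_130[i + 1] / t_130[i] for i in range(len(t_130) - 1)]

for n, new, slowdown in zip(n_groups, t_130, slowdowns):
    per_group_us = new / n * 1e6  # naive wall-clock cost per group
    print(f"{n:>7} groups: x{slowdown:.1f} slower, {per_group_us:.1f} us/group on 1.3.0")

print("1.3.0 growth per 4x groups:", [round(g, 2) for g in growth])
```

If the larger sizes are representative, the 1.3.0 timings grow roughly in proportion to the number of groups, which would point at a fixed per-group overhead rather than a change in asymptotic behaviour. That reading is an inference from these numbers, not something confirmed in this thread.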

Expected behavior

Performance should be roughly the same as polars<=1.2.1.

Installed versions

--------Version info---------
Polars:              1.6.0
Index type:          UInt32
Platform:            Linux-5.14.0-362.24.1.el9_3.0.1.x86_64-x86_64-with-glibc2.34
Python:              3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:45:18) [GCC 12.3.0]

----Optional dependencies----
adbc_driver_manager  <not installed>
altair               5.3.0
cloudpickle          <not installed>
connectorx           <not installed>
deltalake            <not installed>
fastexcel            <not installed>
fsspec               <not installed>
gevent               <not installed>
great_tables         <not installed>
matplotlib           3.9.2
nest_asyncio         1.6.0
numpy                1.26.4
openpyxl             <not installed>
pandas               2.2.2
pyarrow              <not installed>
pydantic             2.7.4
pyiceberg            <not installed>
sqlalchemy           <not installed>
torch                2.3.0
xlsx2csv             <not installed>
xlsxwriter           <not installed>

--------Version info---------
Polars:              1.6.0
Index type:          UInt32
Platform:            Windows-10-10.0.22631-SP0
Python:              3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:40:08) [MSC v.1938 64 bit (AMD64)]

----Optional dependencies----
adbc_driver_manager  <not installed>
altair               5.4.1
cloudpickle          <not installed>
connectorx           0.3.3
deltalake            <not installed>
fastexcel            <not installed>
fsspec               2024.9.0
gevent               <not installed>
great_tables         0.11.0
matplotlib           3.9.2
nest_asyncio         1.6.0
numpy                1.26.4
openpyxl             <not installed>
pandas               2.2.2
pyarrow              17.0.0
pydantic             <not installed>
pyiceberg            <not installed>
sqlalchemy           <not installed>
torch                2.4.0+cpu
xlsx2csv             <not installed>
xlsxwriter           <not installed>
@taozuoqiao taozuoqiao added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Sep 10, 2024
@alexander-beedie alexander-beedie added the performance Performance issues or improvements label Sep 10, 2024
@ritchie46
Member

Can reproduce. We know what the culprit is. Coming up.

@ritchie46
Member

fixed by #18666
