Nulls are not encoded efficiently #16201

Closed
2 tasks done
adriangb opened this issue May 13, 2024 · 3 comments
Labels
bug (Something isn't working), needs triage (Awaiting prioritization by a maintainer), python (Related to Python Polars)

Comments

@adriangb

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import os
import polars as pl

df = pl.DataFrame(
    {'a': [1] + [None] * 1_000_000}
)

df.write_parquet('z1.parquet')
print(os.path.getsize('z1.parquet'))  # 125976

df.write_parquet('z2.parquet', use_pyarrow=True)
print(os.path.getsize('z2.parquet'))  # 600

Log output

No response

Issue description

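Nulls are not encoded efficiently by the native Parquet writer. In the example above, a column holding one value followed by 1,000,000 nulls produces a 125,976-byte file, versus 600 bytes when written with use_pyarrow=True.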

Expected behavior

The output file size should roughly match what pyarrow produces.

Installed versions

--------Version info---------
Polars:               0.20.25
Index type:           UInt32
Platform:             macOS-14.4.1-arm64-arm-64bit
Python:               3.12.3 (main, Apr  9 2024, 08:09:14) [Clang 15.0.0 (clang-1500.3.9.4)]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            0.17.4
fastexcel:            <not installed>
fsspec:               <not installed>
gevent:               <not installed>
hvplot:               <not installed>
matplotlib:           <not installed>
nest_asyncio:         1.6.0
numpy:                1.26.4
openpyxl:             <not installed>
pandas:               2.1.4
pyarrow:              16.0.0
pydantic:             2.7.1
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           2.0.24
torch:                <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>

adriangb added the bug, needs triage, and python labels on May 13, 2024
@s-banach
Contributor

s-banach commented May 14, 2024

The file size was smaller in the previous version, polars==0.20.24, which included #15959.
(I get 899 as the size for z1.parquet.)
However, that version had a data corruption bug, so it was reverted.
#16125 should be available in the next polars release.

The file size reduction doesn't really have anything to do with nulls per se, just repeated values.
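To illustrate the point about repeated values, here is a minimal sketch of run-length encoding in plain Python. It is not Polars' or pyarrow's actual writer; `run_length_encode` and `bit_packed_size_bytes` are hypothetical helpers written for this example. Parquet's real format uses a hybrid of run-length encoding and bit-packing for definition levels and dictionary indices, which this simplifies heavily. The sketch shows that a long run collapses to a single pair whether the repeated value is a null or an ordinary value.

```python
from itertools import groupby
from math import ceil


def run_length_encode(values):
    """Collapse consecutive equal values into (value, run_length) pairs."""
    return [(value, sum(1 for _ in group)) for value, group in groupby(values)]


def bit_packed_size_bytes(n_values, bit_width=1):
    """Bytes needed to store n_values at bit_width bits each, with no run compression."""
    return ceil(n_values * bit_width / 8)


n = 1_000_000
with_nulls = [1] + [None] * n     # the column from the report
without_nulls = [1] + [0] * n     # same shape, but with no nulls at all

# Both columns reduce to two runs, so nulls are not special here.
print(run_length_encode(with_nulls))
print(run_length_encode(without_nulls))

# Storing one bit per value without collapsing runs costs about n / 8 bytes.
print(bit_packed_size_bytes(n + 1))
```

The last figure, roughly 125 KB for 1,000,001 one-bit entries, is in the same range as the 125976 bytes reported for z1.parquet. That is consistent with the runs not being collapsed in that release, though the thread itself does not confirm this as the cause.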

@ritchie46
Member

I think we can close this then, as #16125 will be released soon.

@adriangb
Author

Thanks folks!
