polars.DataFrame.write_parquet() produces a larger parquet file than use_pyarrow, pandas, or pyarrow #16238

Closed
2 tasks done
mesner opened this issue May 15, 2024 · 2 comments
Labels: bug (Something isn't working), needs triage (Awaiting prioritization by a maintainer), python (Related to Python Polars)

mesner commented May 15, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl
import pandas as pd
import pyarrow.parquet as pq

# download yellow_tripdata_2024-01.parquet from below
# https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page
# https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-01.parquet

df = pl.read_parquet("yellow_tripdata_2024-01.parquet")
df.write_parquet("yellow_tripdata_2024-01.polars.parquet", compression="zstd")

pl.read_parquet("yellow_tripdata_2024-01.parquet").write_parquet("yellow_tripdata_2024-01.polars.pyarrow.parquet", compression="zstd", use_pyarrow=True)

pd.read_parquet("yellow_tripdata_2024-01.parquet").to_parquet("yellow_tripdata_2024-01.pandas.parquet", compression="zstd")

# Use a distinct name so the polars DataFrame `df` above is not shadowed by a pyarrow Table.
table = pq.read_table("yellow_tripdata_2024-01.parquet")
pq.write_table(table, "yellow_tripdata_2024-01.pyarrow.parquet", compression="zstd")

Log output

No response

Issue description

NOTE: I admit that this isn't a bug, as the documentation does not claim that write_parquet produces output equivalent to pyarrow's.

I noticed that parquet files written with polars were much larger (sometimes 60% larger) than those written with pandas. After exploring a little, I found that setting use_pyarrow=True produces file sizes similar to those from pandas and pyarrow, which is not surprising.

To demonstrate, I chose the yellow cab taxi dataset, where the polars file is only ~25% larger, so the size overhead is not consistent across datasets.

I've noticed a lot of variation in file size. For example, using Int32 instead of Int64 produces a larger file.

The directory listing after running the code follows (size in bytes, modification time, file name). Note that yellow_tripdata_2024-01.polars.parquet is ~25% larger than the others.

49977253 May 14 17:33 yellow_tripdata_2024-01.pandas.parquet
49961641 May 14 17:24 yellow_tripdata_2024-01.parquet
63199943 May 14 17:40 yellow_tripdata_2024-01.polars.parquet
52190307 May 14 17:34 yellow_tripdata_2024-01.polars.pyarrow.parquet
49970699 May 14 17:38 yellow_tripdata_2024-01.pyarrow.parquet
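For reference, the relative differences can be computed directly from the byte counts in the listing. This is a minimal sketch using only the standard library; the labels and the helper names `percent_larger` and `sizes_on_disk` are made up for illustration, and the sizes are copied from the listing above.

```python
import os

# File sizes in bytes, copied from the directory listing above.
SIZES = {
    "pandas": 49977253,
    "original": 49961641,
    "polars": 63199943,
    "polars+pyarrow": 52190307,
    "pyarrow": 49970699,
}


def percent_larger(size, baseline):
    """Return how much larger `size` is than `baseline`, in percent."""
    return (size / baseline - 1.0) * 100.0


def sizes_on_disk(paths):
    """Given {label: path}, return {label: size in bytes} read from disk."""
    return {label: os.path.getsize(path) for label, path in paths.items()}


if __name__ == "__main__":
    baseline = SIZES["pyarrow"]
    # Print the files from smallest to largest, relative to the pyarrow output.
    for label, size in sorted(SIZES.items(), key=lambda item: item[1]):
        diff = percent_larger(size, baseline)
        print(f"{label:>15}: {size:>11,d} bytes ({diff:+.1f}% vs pyarrow)")
```

To rerun the comparison on freshly written files, `sizes_on_disk` can be given a mapping from a label to each output path in place of the hard-coded `SIZES`.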

Expected behavior

One might expect the resulting parquet files to be similar in size regardless of which writer produced them.

Installed versions

--------Version info---------
Polars:               0.20.26
Index type:           UInt32
Platform:             Windows-10-10.0.22631-SP0
Python:               3.11.4 | packaged by Anaconda, Inc. | (main, Jul  5 2023, 13:47:18) [MSC v.1916 64 bit (AMD64)]

----Optional dependencies----
adbc_driver_manager:  0.7.0
cloudpickle:          3.0.0
connectorx:           0.3.2
deltalake:            0.14.0
fastexcel:            <not installed>
fsspec:               2023.4.0
gevent:               23.9.1
hvplot:               <not installed>
matplotlib:           3.8.0
nest_asyncio:         1.5.8
numpy:                1.24.1
openpyxl:             3.1.2
pandas:               2.2.2
pyarrow:              14.0.1
pydantic:             2.4.2
pyiceberg:            0.5.0
pyxlsb:               <not installed>
sqlalchemy:           2.0.22
torch:                2.1.0+cu121
xlsx2csv:             0.8.1
xlsxwriter:           3.1.8

mesner added the bug, needs triage, and python labels on May 15, 2024
@owenprough-sift

Probably related to the discussion in #15959. And possibly a duplicate of #10680?

mesner commented May 15, 2024

Yes, likely. Sorry, I didn't see #10680.

mesner closed this as completed on May 15, 2024