polars.DataFrame.write_parquet() produces a larger parquet file than use_pyarrow, pandas, or pyarrow #16238
Closed
Labels: bug, needs triage, python
Checks
Reproducible example
Log output
No response
Issue description
NOTE: I admit that this isn't strictly a bug, since the documentation makes no claim that write_parquet is equivalent to pyarrow.
I noticed that parquet files written with polars were much larger (sometimes 60% larger) than those written with pandas. I explored a little and found that setting use_pyarrow=True produces file sizes similar to those from pandas and pyarrow, which is not surprising.
To demonstrate, I chose the yellow cab taxi dataset, for which the polars file is only ~25% larger, so the overhead is not consistent from one dataset to another.
I've also noticed a lot of variation in file size. For example, using Int32 instead of Int64 produces a larger file.
Directory output follows after running the code. Note that yellow_tripdata_2024-01.polars.parquet is ~25% larger than the others.
Expected behavior
One might expect the resulting parquet file to be similar enough regardless of how one writes it.
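One way to check that expectation is to write the same table with each writer and compare the resulting sizes. The following is only a minimal sketch, not the code from this report: the table is synthetic stand-in data (not the taxi dataset), the function names and file names are hypothetical, and every writer is left at its own default settings, which I assume is what the original comparison did.

```python
import os


def percent_larger(sizes: dict[str, int]) -> dict[str, float]:
    """Return how much larger each file is than the smallest one, in percent."""
    smallest = min(sizes.values())
    return {name: 100.0 * (size - smallest) / smallest for name, size in sizes.items()}


def write_with_each_writer(out_dir: str, n_rows: int = 100_000) -> dict[str, int]:
    """Write one synthetic table with four writers; return file sizes in bytes."""
    # Third-party imports live here so percent_larger() works without them.
    import polars as pl
    import pyarrow.parquet as pq

    # Synthetic stand-in data (an assumption), NOT the dataset from the issue.
    df = pl.DataFrame(
        {
            "id": list(range(n_rows)),
            "fare": [(i % 500) / 4 for i in range(n_rows)],
            "zone": [f"zone_{i % 50}" for i in range(n_rows)],
        }
    )

    # Hypothetical file names, one per writer.
    names = ("polars", "polars_use_pyarrow", "pandas", "pyarrow")
    paths = {name: os.path.join(out_dir, f"sample.{name}.parquet") for name in names}

    df.write_parquet(paths["polars"])                                # native polars writer
    df.write_parquet(paths["polars_use_pyarrow"], use_pyarrow=True)  # polars via pyarrow
    df.to_pandas().to_parquet(paths["pandas"])                       # pandas writer
    pq.write_table(df.to_arrow(), paths["pyarrow"])                  # pyarrow writer

    return {name: os.path.getsize(path) for name, path in paths.items()}
```

Calling percent_larger(write_with_each_writer(some_dir)) on an existing, writable directory reports each file's overhead relative to the smallest one. The numbers will depend on the data and on the installed library versions, so they need not match the ~25% seen here.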
Installed versions