polars.DataFrame.write_parquet() produces a larger parquet file than use_pyarrow, pandas, or pyarrow #16238

mesner · 2024-05-15T12:36:56Z

Checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl
import pandas as pd
import pyarrow.parquet as pq

# download yellow_tripdata_2024-01.parquet from below
# https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page
# https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-01.parquet

# 1. Polars, native Parquet writer
df = pl.read_parquet("yellow_tripdata_2024-01.parquet")
df.write_parquet("yellow_tripdata_2024-01.polars.parquet", compression="zstd")

# 2. Polars, delegating the write to pyarrow
df.write_parquet("yellow_tripdata_2024-01.polars.pyarrow.parquet", compression="zstd", use_pyarrow=True)

# 3. pandas
pd.read_parquet("yellow_tripdata_2024-01.parquet").to_parquet("yellow_tripdata_2024-01.pandas.parquet", compression="zstd")

# 4. pyarrow directly
table = pq.read_table("yellow_tripdata_2024-01.parquet")
pq.write_table(table, "yellow_tripdata_2024-01.pyarrow.parquet", compression="zstd")

Log output

No response

Issue description

NOTE: I admit that this isn't strictly a bug, as the documentation makes no claim that write_parquet produces output equivalent to pyarrow's.

I noticed that parquet files written with polars were much larger (sometimes 60% larger) than those written with pandas. I explored a little and found that setting use_pyarrow=True produces file sizes similar to those from pandas and pyarrow, which is not surprising.

To demonstrate, I chose the yellow cab taxi dataset, for which the polars file is only ~25% larger. So the size difference is not consistent across datasets.

I've noticed a lot of variation in file size. For example, using Int32 instead of Int64 produces a larger file.

The directory listing after running the reproducible example follows (sizes in bytes). Note that yellow_tripdata_2024-01.polars.parquet is ~25% larger than the others.

49977253 May 14 17:33 yellow_tripdata_2024-01.pandas.parquet
49961641 May 14 17:24 yellow_tripdata_2024-01.parquet
63199943 May 14 17:40 yellow_tripdata_2024-01.polars.parquet
52190307 May 14 17:34 yellow_tripdata_2024-01.polars.pyarrow.parquet
49970699 May 14 17:38 yellow_tripdata_2024-01.pyarrow.parquet
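The relative sizes can be checked from the listing itself. The snippet below is only a small arithmetic sketch: the byte counts are copied from the listing above, the pyarrow-written file is arbitrarily taken as the baseline, and `overhead_pct` is a helper name introduced here for illustration.

```python
# File sizes in bytes, copied from the directory listing above.
sizes = {
    "pandas": 49977253,
    "original": 49961641,
    "polars": 63199943,
    "polars+pyarrow": 52190307,
    "pyarrow": 49970699,
}


def overhead_pct(size, baseline):
    """Percentage by which `size` exceeds `baseline` (negative if smaller)."""
    return (size / baseline - 1.0) * 100.0


# Assumption: use the pyarrow-written file as the reference point.
baseline = sizes["pyarrow"]
overheads = {name: overhead_pct(size, baseline) for name, size in sizes.items()}

# Print the files from smallest to largest relative to the baseline.
for name, pct in sorted(overheads.items(), key=lambda kv: kv[1]):
    print(f"{name:>15}: {sizes[name]:>10,d} bytes ({pct:+.1f}% vs pyarrow)")
```

Against this baseline, the native polars file comes out roughly a quarter larger, while the use_pyarrow=True file is within a few percent.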

Expected behavior

One might expect the resulting parquet files to be similar in size regardless of which library writes them.

Installed versions

--------Version info---------
Polars:               0.20.26
Index type:           UInt32
Platform:             Windows-10-10.0.22631-SP0
Python:               3.11.4 | packaged by Anaconda, Inc. | (main, Jul  5 2023, 13:47:18) [MSC v.1916 64 bit (AMD64)]

----Optional dependencies----
adbc_driver_manager:  0.7.0
cloudpickle:          3.0.0
connectorx:           0.3.2
deltalake:            0.14.0
fastexcel:            <not installed>
fsspec:               2023.4.0
gevent:               23.9.1
hvplot:               <not installed>
matplotlib:           3.8.0
nest_asyncio:         1.5.8
numpy:                1.24.1
openpyxl:             3.1.2
pandas:               2.2.2
pyarrow:              14.0.1
pydantic:             2.4.2
pyiceberg:            0.5.0
pyxlsb:               <not installed>
sqlalchemy:           2.0.22
torch:                2.1.0+cu121
xlsx2csv:             0.8.1
xlsxwriter:           3.1.8


owenprough-sift · 2024-05-15T12:46:11Z

Probably related to the discussion in #15959. And possibly a duplicate of #10680?

mesner · 2024-05-15T12:48:07Z

Yes, likely. Sorry, I didn't see #10680.

mesner added the bug, needs triage, and python labels May 15, 2024

mesner closed this as completed May 15, 2024
