Values being shifted around in `pl.write_parquet` or `pl.read_parquet` between rows (VERY BAD!) #16109

DeflateAwning · 2024-05-07T22:37:37Z

Checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import random
import polars as pl

df1 = pl.DataFrame(
	{
		"a": [random.randint(1, 5) for _ in range(100_000)],
		"b": [random.choice(['123', 'abc', 'xyz', '129010o']) for _ in range(100_000)],
		"c": [random.choice(['123', 'abc', 'xyz']) for _ in range(100_000)],
	}
)

df1.write_parquet('temp.parquet')
print('Done writing temp.parquet')
print(df1)

df2 = pl.read_parquet('temp.parquet')
print(df2)

# df1.equals(df2) may be False here, because the row order is not required to match
print(f"df1 equals df2, before sort? {df1.equals(df2)} (okay if it's False)")

df1 = df1.sort(df1.columns)
df2 = df2.sort(df2.columns)

assert df1.equals(df2), 'DataFrames contents are not equal'

Log output

No response

Issue description

A basic dataframe, written to a Parquet file and then read back, should equal the original dataframe (especially when sorted).

This is currently not the case.

Any data lake that relies on this library to write or read its Parquet files is broken if it upgraded. This needs to be added as a unit test.

This broke when upgrading to v0.20.24 from v0.20.23.

Expected behavior

A basic dataframe, written to a Parquet file and then read back, should equal the original dataframe (especially when sorted).

Installed versions

--------Version info---------
Polars:               0.20.24
Index type:           UInt32
Platform:             Linux-6.5.0-1022-oem-x86_64-with-glibc2.35
Python:               3.9.19 (main, Apr  6 2024, 17:57:55) 
[GCC 11.4.0]

----Optional dependencies----
adbc_driver_manager:  0.11.0
cloudpickle:          3.0.0
connectorx:           0.3.2
deltalake:            <not installed>
fastexcel:            0.10.4
fsspec:               2024.3.1
gevent:               <not installed>
hvplot:               <not installed>
matplotlib:           3.8.4
nest_asyncio:         1.6.0
numpy:                1.26.4
openpyxl:             3.1.2
pandas:               1.5.3
pyarrow:              11.0.0
pydantic:             <not installed>
pyiceberg:            <not installed>
pyxlsb:               1.0.10
sqlalchemy:           1.4.52
torch:                <not installed>
xlsx2csv:             0.8.2
xlsxwriter:           3.0.9


cmdlineluser · 2024-05-07T22:58:43Z

Can reproduce.

It seems to be non-deterministic for me, but always happens within a few runs.

import tempfile
import polars as pl

with tempfile.NamedTemporaryFile() as f:
    for n in range(10):
        print(f"Run #{n + 1}: ", end="")

        df = pl.DataFrame({
            "a": pl.Series(["123", "abc", "xyz"]).sample(50_000, with_replacement=True)
        }).with_row_index()

        df.write_parquet(f.name)
        f.seek(0)

        assert df.equals(pl.read_parquet(f.name))
        print("OK!")

# Run #1: OK!
# Run #2: OK!
# Run #3: 
# AssertionError

I imagine it will be labelled as P-high when seen by the team - but tagging @ritchie46 seems reasonable here.
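When the assertion fails, it can help to see where the two frames first disagree. The sketch below works on plain column dictionaries so it does not depend on any particular Polars version; `first_mismatch` is a hypothetical helper written for this thread, not a Polars function.

```python
from typing import Any, Optional


def first_mismatch(
    expected: dict[str, list[Any]],
    actual: dict[str, list[Any]],
) -> Optional[tuple[str, int, Any, Any]]:
    """Return (column, row, expected_value, actual_value) for the first
    difference found, scanning column by column, or None if all values match."""
    if expected.keys() != actual.keys():
        raise ValueError("column names differ")
    for name in expected:
        exp_col, act_col = expected[name], actual[name]
        if len(exp_col) != len(act_col):
            raise ValueError(f"column {name!r} has a different length")
        for row, (exp_val, act_val) in enumerate(zip(exp_col, act_col)):
            if exp_val != act_val:
                return name, row, exp_val, act_val
    return None


# Toy data standing in for a frame before and after the round trip.
before = {"a": ["123", "abc", "xyz"]}
after = {"a": ["123", "xyz", "abc"]}
print(first_mismatch(before, after))  # ('a', 1, 'abc', 'xyz')
```

With real frames, the inputs could come from `df.to_dict(as_series=False)` on the written frame and on the frame read back.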

DeflateAwning · 2024-05-08T00:13:07Z

I'd suggest that v0.20.24 should maybe be yanked, especially after this is patched. This is about as bad a bug as you can get - a difficult to detect bug that degrades data, with no indication. This falls dangerously close to the security vulnerability category of bug, imo.

nameexhaustion · 2024-05-08T03:27:15Z

Thanks for the report. Silent data mutation can be quite awful, will take a look.

ritchie46 · 2024-05-08T08:22:52Z

> I'd suggest that v0.20.24 should maybe be yanked, especially after this is patched. This is about as bad a bug as you can get - a difficult to detect bug that degrades data, with no indication. This falls dangerously close to the security vulnerability category of bug, imo.

I will patch and yank
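Since a yanked release can still be installed when it is pinned exactly, pipelines that write Parquet may want a startup check of their own. A minimal sketch follows, assuming only what this thread states, namely that v0.20.24 is affected and v0.20.23 is not; the names `AFFECTED_VERSIONS` and `is_affected_version` are made up for illustration.

```python
# Versions named in this thread as affected. This set is an assumption based
# on the report alone; extend it if other releases turn out to be affected.
AFFECTED_VERSIONS = frozenset({"0.20.24"})


def is_affected_version(version: str) -> bool:
    """Return True if the version string (with or without a leading 'v') is in the affected set."""
    return version.strip().lstrip("v") in AFFECTED_VERSIONS


print(is_affected_version("0.20.24"))  # True
print(is_affected_version("0.20.23"))  # False
```

In a real pipeline the argument would be `pl.__version__`, and the caller could refuse to write Parquet files when the check returns True.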

ritchie46 · 2024-05-08T10:06:19Z

fixed by #16113

DeflateAwning added the bug, needs triage, and python labels May 7, 2024

nameexhaustion added the accepted, P-critical, and A-io-parquet labels and removed the needs triage label May 8, 2024

github-project-automation bot added this to Backlog May 8, 2024

github-project-automation bot moved this to Ready in Backlog May 8, 2024

nameexhaustion self-assigned this May 8, 2024

nameexhaustion added the regression label May 8, 2024

ritchie46 closed this as completed May 8, 2024

github-project-automation bot moved this from Ready to Done in Backlog May 8, 2024

ritchie46 mentioned this issue May 8, 2024

feat(rust): Add RLE to RLE_DICTIONARY encoder #15959

Merged

DeflateAwning mentioned this issue May 8, 2024

Add tests for writing-then-reading randomly-generated dataframes #16121

Open

thalassemia mentioned this issue May 9, 2024

feat(rust,python): Add run-length encoding to Parquet writer #16125

Merged
