Values being shifted around in pl.write_parquet or pl.read_parquet between rows (VERY BAD!) #16109

Closed
2 tasks done
DeflateAwning opened this issue May 7, 2024 · 5 comments
Labels
A-io-parquet Area: reading/writing Parquet files accepted Ready for implementation bug Something isn't working P-critical Priority: critical python Related to Python Polars regression Issue introduced by a new release

Comments

@DeflateAwning
Contributor

DeflateAwning commented May 7, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import random
import polars as pl

df1 = pl.DataFrame(
	{
		"a": [random.randint(1, 5) for _ in range(100_000)],
		"b": [random.choice(['123', 'abc', 'xyz', '129010o']) for _ in range(100_000)],
		"c": [random.choice(['123', 'abc', 'xyz']) for _ in range(100_000)],
	}
)

df1.write_parquet('temp.parquet')
print('Done writing temp.parquet')
print(df1)

df2 = pl.read_parquet('temp.parquet')
print(df2)

# df1.equals(df2) is not required to be True here, because row order is not assumed to match
print(f"df1 equals df2, before sort? {df1.equals(df2)} (okay if it's False)")

df1 = df1.sort(df1.columns)
df2 = df2.sort(df2.columns)

assert df1.equals(df2), 'DataFrames contents are not equal'

Log output

No response

Issue description

A basic DataFrame written to a Parquet file and then read back should equal the original DataFrame (especially when both are sorted).

This is currently not the case.

Any data lake that relies on this library to read or write Parquet files is broken if it upgraded. A round-trip check like the one above needs to be added as a unit test.

This broke when upgrading to v0.20.24 from v0.20.23.

Expected behavior

A basic DataFrame written to a Parquet file and then read back should equal the original DataFrame (especially when both are sorted).

Installed versions

--------Version info---------
Polars:               0.20.24
Index type:           UInt32
Platform:             Linux-6.5.0-1022-oem-x86_64-with-glibc2.35
Python:               3.9.19 (main, Apr  6 2024, 17:57:55) 
[GCC 11.4.0]

----Optional dependencies----
adbc_driver_manager:  0.11.0
cloudpickle:          3.0.0
connectorx:           0.3.2
deltalake:            <not installed>
fastexcel:            0.10.4
fsspec:               2024.3.1
gevent:               <not installed>
hvplot:               <not installed>
matplotlib:           3.8.4
nest_asyncio:         1.6.0
numpy:                1.26.4
openpyxl:             3.1.2
pandas:               1.5.3
pyarrow:              11.0.0
pydantic:             <not installed>
pyiceberg:            <not installed>
pyxlsb:               1.0.10
sqlalchemy:           1.4.52
torch:                <not installed>
xlsx2csv:             0.8.2
xlsxwriter:           3.0.9


@DeflateAwning DeflateAwning added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels May 7, 2024
@cmdlineluser
Contributor

cmdlineluser commented May 7, 2024

Can reproduce.

It seems to be non-deterministic for me, but always happens within a few runs.

import tempfile
import polars as pl

with tempfile.NamedTemporaryFile() as f:
    for n in range(10):
        print(f"Run #{n + 1}: ", end="")

        df = pl.DataFrame({
            "a": pl.Series(["123", "abc", "xyz"]).sample(50_000, with_replacement=True)
        }).with_row_index()

        df.write_parquet(f.name)
        f.seek(0)

        assert df.equals(pl.read_parquet(f.name))
        print("OK!")
# Run #1: OK!
# Run #2: OK!
# Run #3: 
# AssertionError

I imagine it will be labelled as P-high when seen by the team - but tagging @ritchie46 seems reasonable here.

@DeflateAwning
Contributor Author

I'd suggest that v0.20.24 should maybe be yanked, especially after this is patched. This is about as bad a bug as you can get: a difficult-to-detect bug that degrades data with no indication. This falls dangerously close to the security vulnerability category of bug, imo.
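Until a release is yanked, a pipeline could refuse to run on it. The following is a minimal sketch of such a guard; `AFFECTED_VERSIONS` and `check_polars_version` are hypothetical names, and the set contains only the version named in this issue.

```python
# Releases reported in this issue as affected (an assumption; extend as needed).
AFFECTED_VERSIONS = frozenset({"0.20.24"})


def check_polars_version(version: str) -> None:
    """Raise if `version` is a release listed as affected by the Parquet bug."""
    if version in AFFECTED_VERSIONS:
        raise RuntimeError(
            f"Polars {version} can corrupt data on Parquet round trips; "
            "install a different release before reading or writing Parquet files."
        )
```

Calling `check_polars_version(pl.__version__)` once at startup, before any Parquet I/O, would stop the job before data is touched.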

@nameexhaustion nameexhaustion added accepted Ready for implementation P-critical Priority: critical A-io-parquet Area: reading/writing Parquet files and removed needs triage Awaiting prioritization by a maintainer labels May 8, 2024
@github-project-automation github-project-automation bot moved this to Ready in Backlog May 8, 2024
@nameexhaustion
Collaborator

Thanks for the report. Silent data mutation can be quite awful; I will take a look.

@nameexhaustion nameexhaustion self-assigned this May 8, 2024
@nameexhaustion nameexhaustion added the regression Issue introduced by a new release label May 8, 2024
@ritchie46
Member

> I'd suggest that v0.20.24 should maybe be yanked, especially after this is patched. This is about as bad a bug as you can get: a difficult-to-detect bug that degrades data with no indication. This falls dangerously close to the security vulnerability category of bug, imo.

I will patch and yank

@ritchie46
Member

fixed by #16113


4 participants