Join fails for scanned lazyframes when streaming=True #18820

Open
mikkelfo opened this issue Sep 18, 2024 · 2 comments
Labels
bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars

Comments

@mikkelfo

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

When using scan_parquet and .join with large LazyFrames, .collect(streaming=True) fails to return the correct joined result, which is returned correctly by collect(streaming=False). slice_pushdown=False fixes it for "smaller" large LazyFrames (seemingly under ~2.6 million rows), but the issue persists for bigger ones. Similar issues occur with scan_csv, but there slice_pushdown=False fixes it completely.

import polars as pl

N = 500_000
ids = list(range(N))

pl.DataFrame({"id": ids}).write_parquet("ids.parquet")
lf = pl.scan_parquet("ids.parquet")
unique_ids = lf.select("id").unique()

chunk_size = 100_000
for i in range(0, N, chunk_size):
    chunk_ids = unique_ids[i: i + chunk_size]
    chunk = lf.join(chunk_ids, on="id").collect(streaming=True)
    chunk_no_streaming = lf.join(chunk_ids, on="id").collect(streaming=False)

    # Get lengths of chunk_ids, chunk and chunk (no streaming)
    N_chunk_ids = len(chunk_ids.collect())
    N_chunk = len(chunk.select("id").unique())
    N_chunk_no_streaming = len(chunk_no_streaming.select("id").unique())
    print(N_chunk_ids, N_chunk, N_chunk_no_streaming)

Returning:
100000 100000 100000
100000 93484 100000
100000 95666 100000
100000 96602 100000
100000 98811 100000

The first chunk always seems to work properly for streaming=True, but the subsequent ones fail.

Setting slice_pushdown=False solves the issue for "small" large samples, but increasing N to 3_000_000 results in the latter chunks being empty, starting at around ~2.7 million ids no matter how large N is.
Returning:
100000 100000 100000
100000 100000 100000
.......................................
100000 100000 100000
100000 27270 100000
100000 0 100000
100000 0 100000

scan_csv

When using scan_csv instead of scan_parquet, collect(streaming=True) shows the same issues, but setting slice_pushdown=False seems to fix it completely, making it work as expected.

Log output

No response

Issue description

Same as the summary above the reproducible example: .collect(streaming=True) on a join of scanned LazyFrames returns incorrect results, and slice_pushdown=False only partially mitigates it for scan_parquet (it fixes scan_csv completely).

Expected behavior

Expect .collect(streaming=True) and .collect(streaming=False) to return equal results.

Installed versions

--------Version info---------
Polars:              1.7.1
Index type:          UInt32
Platform:            macOS-14.6.1-arm64-arm-64bit
Python:              3.10.7 (v3.10.7:6cc6b13308, Sep  5 2022, 14:02:52) [Clang 13.0.0 (clang-1300.0.29.30)]
@mikkelfo mikkelfo added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Sep 18, 2024
@mikkelfo mikkelfo changed the title Join fails for scanned lazyframes when collect(streaming=True) Join fails for scanned lazyframes when streaming=True Sep 18, 2024
@ritchie46
Member

I believe this is already fixed on main. Does this also occur on 1.6?

@mikkelfo
Author

mikkelfo commented Sep 19, 2024

> I believe this is already fixed on main. Does this also occur on 1.6?

@ritchie46 On Polars 1.6, slice_pushdown=False fixes the issue, but slice_pushdown=True (the default) still exhibits it.
