Join fails for scanned lazyframes when streaming=True
#18820
Labels
bug: Something isn't working
needs triage: Awaiting prioritization by a maintainer
python: Related to Python Polars
Reproducible example
When using scan_parquet and .join with large lazyframes, .collect(streaming=True) fails to return the correct joined lazyframe, which is correctly returned when using .collect(streaming=False). Setting slice_pushdown=False fixes it for "smaller" large lazyframes (seemingly <2.6 million), but the issue persists for bigger ones. Similar issues happen for scan_csv, but there slice_pushdown=False fixes it completely.
Returning:
100000 100000 100000
100000 93484 100000
100000 95666 100000
100000 96602 100000
100000 98811 100000
The first chunk always seems to work properly for streaming=True, but the subsequent ones fail.
Setting slice_pushdown=False solves the issue for "small" large samples, but increasing N to 3_000_000 results in the later chunks being empty, seemingly from around 2.7 million ids onward, no matter how big N is.
Returning:
100000 100000 100000
100000 100000 100000
.......................................
100000 100000 100000
100000 27270 100000
100000 0 100000
100000 0 100000
scan_csv
When using scan_csv files instead of scan_parquet, .collect(streaming=True) shows the same issues, but setting slice_pushdown=False seems to fix it completely, making it work as expected.
Log output
No response
Issue description
When using scan_parquet and .join with large lazyframes, .collect(streaming=True) fails to return the correct joined lazyframe, which is correctly returned when using .collect(streaming=False). Setting slice_pushdown=False fixes it for "smaller" large lazyframes (seemingly <2.6 million), but the issue persists for bigger ones. Similar issues happen for scan_csv, but there slice_pushdown=False fixes it completely.
Expected behavior
Expect .collect(streaming=True) and .collect(streaming=False) to return equal results.
Installed versions