list.set_intersection operates on the wrong columns when multiple columns are selected with pl.col #18795

Open
2 tasks done
caleb-lindgren opened this issue Sep 17, 2024 · 1 comment
Labels
bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars

Comments

@caleb-lindgren

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl

df = pl.LazyFrame({
	"a": [
		[1, 2],
		[1, 2],
	],
	"b": [
		[1, 3],
		[2, 3],
	],
	"c": [
		[1, 2],
		[1, 2],
	]
})

separate = (
	df.with_columns(pl.col("a").list.set_intersection("c"))
	.with_columns(pl.col("b").list.set_intersection("c"))
)

together = (
	df.with_columns(pl.col("a", "b").list.set_intersection("c"))
)

# Print the base dataframe
print(df.collect())

print("-" * 50)

# Print the result of performing the operations separately
print(separate)
print(separate.collect())

print("-" * 50)

# Print the result of trying to perform the operations together
print(together)
print(together.collect())

Output:

shape: (2, 3)
┌───────────┬───────────┬───────────┐
│ a         ┆ b         ┆ c         │
│ ---       ┆ ---       ┆ ---       │
│ list[i64] ┆ list[i64] ┆ list[i64] │
╞═══════════╪═══════════╪═══════════╡
│ [1, 2]    ┆ [1, 3]    ┆ [1, 2]    │
│ [1, 2]    ┆ [2, 3]    ┆ [1, 2]    │
└───────────┴───────────┴───────────┘
--------------------------------------------------
naive plan: (run LazyFrame.explain(optimized=True) to see the optimized plan)

 WITH_COLUMNS:
 [col("b").list.intersection([col("c")])]
   WITH_COLUMNS:
   [col("a").list.intersection([col("c")])]
    DF ["a", "b", "c"]; PROJECT */3 COLUMNS; SELECTION: None
shape: (2, 3)
┌───────────┬───────────┬───────────┐
│ a         ┆ b         ┆ c         │
│ ---       ┆ ---       ┆ ---       │
│ list[i64] ┆ list[i64] ┆ list[i64] │
╞═══════════╪═══════════╪═══════════╡
│ [1, 2]    ┆ [1]       ┆ [1, 2]    │
│ [1, 2]    ┆ [2]       ┆ [1, 2]    │
└───────────┴───────────┴───────────┘
--------------------------------------------------
naive plan: (run LazyFrame.explain(optimized=True) to see the optimized plan)

 WITH_COLUMNS:
 [col("a").list.intersection([col("b"), col("c")])]
  DF ["a", "b", "c"]; PROJECT */3 COLUMNS; SELECTION: None
shape: (2, 3)
┌───────────┬───────────┬───────────┐
│ a         ┆ b         ┆ c         │
│ ---       ┆ ---       ┆ ---       │
│ list[i64] ┆ list[i64] ┆ list[i64] │
╞═══════════╪═══════════╪═══════════╡
│ [1]       ┆ [1, 3]    ┆ [1, 2]    │
│ [2]       ┆ [2, 3]    ┆ [1, 2]    │
└───────────┴───────────┴───────────┘

Log output

No response

Issue description

I am trying to replace a with a ∩ c and b with b ∩ c. If I perform these two operations separately, it works. However, if I use pl.col to select both a and b and perform both operations at once, a is instead replaced with a ∩ b ∩ c and b is left unchanged. The naive plan above shows why: pl.col("a", "b") is not expanded into one expression per column, and b ends up as an extra argument to the intersection on a.
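
To make the difference concrete, here is a minimal plain-Python model of the two behaviours, using the same data as the reproducer. It does not call Polars at all; the helper `intersect_rows` is a hypothetical name introduced only for this illustration, and the "observed" branch is simply my reading of the printed plan, not a description of Polars internals.

```python
# Plain-Python model of the intended vs. observed results (no Polars needed).
# Each "column" is a list of rows, and each row is a list of integers.
a = [[1, 2], [1, 2]]
b = [[1, 3], [2, 3]]
c = [[1, 2], [1, 2]]


def intersect_rows(*columns):
    """Row-wise intersection of any number of list columns (hypothetical helper)."""
    return [sorted(set.intersection(*map(set, rows))) for rows in zip(*columns)]


# Intended: a and b are each intersected with c independently.
intended = {"a": intersect_rows(a, c), "b": intersect_rows(b, c)}

# Observed: one expression intersects a with both b and c; b is untouched.
observed = {"a": intersect_rows(a, b, c), "b": b}

print(intended)  # {'a': [[1, 2], [1, 2]], 'b': [[1], [2]]}
print(observed)  # {'a': [[1], [2]], 'b': [[1, 3], [2, 3]]}
```

The `intended` values match the output of the two separate `with_columns` calls, and the `observed` values match the output of the combined call.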

Expected behavior

When I use pl.col to select both a and b and take the set intersection of each with c, it should behave the same as when I compute the two intersections separately.

Installed versions

--------Version info---------
Polars:              1.7.1
Index type:          UInt32
Platform:            Linux-6.10.9-artix1-2-x86_64-with-glibc2.40
Python:              3.11.5 (main, Oct 18 2023, 09:37:15) [GCC 13.2.1 20230801]

----Optional dependencies----
adbc_driver_manager  <not installed>
altair               5.1.2
cloudpickle          <not installed>
connectorx           <not installed>
deltalake            <not installed>
fastexcel            <not installed>
fsspec               <not installed>
gevent               <not installed>
great_tables         <not installed>
matplotlib           3.8.0
nest_asyncio         1.5.8
numpy                1.26.1
openpyxl             3.1.2
pandas               2.1.1
pyarrow              16.0.0
pydantic             <not installed>
pyiceberg            <not installed>
sqlalchemy           <not installed>
torch                <not installed>
xlsx2csv             <not installed>
xlsxwriter           <not installed>
@cmdlineluser
Contributor

Can reproduce.

The bug may be easier to see with select, since b disappears from the output entirely.

df.select(pl.col("a", "b").list.set_intersection("c")).collect()
# shape: (2, 1)
# ┌───────────┐
# │ a         │
# │ ---       │
# │ list[i64] │
# ╞═══════════╡
# │ [1]       │
# │ [2]       │
# └───────────┘

It also happens with DataFrames, so it doesn't appear to be an optimizer issue.
