Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Return the truth values of ne_missing and eq_missing operations for struct instead of null #18930

Merged
merged 9 commits into from
Sep 29, 2024
Merged

fix: Return the truth values of ne_missing and eq_missing operations for struct instead of null #18930

merged 9 commits into from
Sep 29, 2024

Conversation

barak1412
Copy link
Contributor

@barak1412 barak1412 commented Sep 25, 2024

Fixes #18870.

@github-actions github-actions bot added fix Bug fix python Related to Python Polars rust Related to Rust Polars labels Sep 25, 2024
@barak1412
Copy link
Contributor Author

I saw the failures, will work on a fix.

@ritchie46 ritchie46 changed the title fix: Return the trurth values of ne_missing and eq_missing operations for struct instead of null fix: Return the truth values of ne_missing and eq_missing operations for struct instead of null Sep 26, 2024
@barak1412
Copy link
Contributor Author

barak1412 commented Sep 26, 2024

@ritchie46 I found the problem - the root of the problem is not in my code:
Before my change, the function assert_frame_equal was buggy due to the issue this PR attempts to solve.

For instance, this code won't fail:

import polars as pl
from polars.testing import assert_frame_equal


df1 = pl.DataFrame({
    "a": [None, {"foo": 0, "bar": "1"}],
})
df2 = pl.DataFrame({
    "a": [{"foo": 0, "bar": "1"}, {"foo": 0, "bar": "1"}],
})

assert_frame_equal(df1, df2)

So although the code in main branch passes, there is a bug in when().then().else() expression.
Do you want me to try solve the issue in this PR or open a new issue (and temporary remove the struct from the failed test) and try to solve it first?

EDIT:
I need some further investigation, although I am sure assert_frame_equal has bug due to this PR issue.

@barak1412
Copy link
Contributor Author

It looks like a regression bug (took code from my failed test):

import polars as pl


dtype = pl.Struct({"foo": pl.Int32, "bar": pl.String})
len = 2

broadcast = (False, False, True)
mask = [False, True]
if_true = [None, {'foo': 0, 'bar': '1'}]
if_false = [{'foo': 0, 'bar': '1'}, {'foo': 0, 'bar': '1'}]

py_mask, py_true, py_false = (
    [c[0]] * len if b else c
    for b, c in zip(broadcast, [mask, if_true, if_false])
)
pl_mask, pl_true, pl_false = (
    c.first() if b else c
    for b, c in zip(broadcast, [pl.col.mask, pl.col.if_true, pl.col.if_false])
)

expected = pl.DataFrame(
    {"if_true": [t if m else f for m, t, f in zip(py_mask, py_true, py_false)]},
    schema={"if_true": dtype},
)
df = pl.DataFrame(
    {
        "mask": mask,
        "if_true": if_true,
        "if_false": if_false,
    },
    schema={"mask": pl.Boolean, "if_true": dtype, "if_false": dtype},
)

ans = df.select(pl.when(pl.col.mask).then(pl.col.if_true).otherwise(pl.col.if_false.first()))

print(expected)
print(ans)

On main branch:

shape: (2, 1)
┌───────────┐
│ if_true   │
│ ---       │
│ struct[2] │
╞═══════════╡
│ {0,"1"}   │
│ {0,"1"}   │
└───────────┘
shape: (2, 1)
┌───────────┐
│ if_true   │
│ ---       │
│ struct[2] │
╞═══════════╡
│ null      │
│ {0,"1"}   │
└

On 1.7.0:

shape: (2, 1)
┌───────────┐
│ if_true   │
│ ---       │
│ struct[2] │
╞═══════════╡
│ {0,"1"}   │
│ {0,"1"}   │
└───────────┘
shape: (2, 1)
┌───────────┐
│ if_true   │
│ ---       │
│ struct[2] │
╞═══════════╡
│ {0,"1"}   │
│ {0,"1"}   │
└───────────┘

Can someone reproduce?

@barak1412
Copy link
Contributor Author

To wrap all things up, this is what happend:

  • Since some version after 1.7.0 and current main, a regression bug has occured in when().then().otherwise() expression.
  • Due to a bug in ne_missing which influences assert_frame_equal, the regression bug wasn't exposed.
  • When this PR fixed the ne_missing bug, assert_frame_equal was also fixed and the regression bug was exposed and made the test fail.

I plan tommorow to open an issue of the regression bug, and when it will get fixed (I hope I will have time to), come back to this PR.

@cmdlineluser
Copy link
Contributor

cmdlineluser commented Sep 26, 2024

I can reproduce the test behaviour.

On 1.8.1 I get {0,"1"}

On 1.8.2 I get null

Update: This seems to be where the behaviour changes: #18886

It seems there is a problem once you add another expression e.g. first/last/etc

df = pl.DataFrame({"foo": [dict(a=1), dict(a=1)]})

df.select(pl.col.foo == pl.col.foo)
# shape: (2, 1)
# ┌──────┐
# │ foo  │
# │ ---  │
# │ bool │
# ╞══════╡
# │ true │
# │ true │
# └──────┘
df.select(pl.col.foo == pl.col.foo.last())
# shape: (2, 1)
# ┌───────┐
# │ foo   │
# │ ---   │
# │ bool  │
# ╞═══════╡
# │ false │
# │ false │
# └───────┘

@ritchie46
Copy link
Member

@coastalwhite can you take a look?

Copy link

codecov bot commented Sep 29, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 79.88%. Comparing base (d0963db) to head (9b45bbd).

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #18930      +/-   ##
==========================================
- Coverage   79.89%   79.88%   -0.01%     
==========================================
  Files        1524     1524              
  Lines      207655   207661       +6     
  Branches     2904     2904              
==========================================
- Hits       165898   165896       -2     
- Misses      41210    41218       +8     
  Partials      547      547              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@@ -801,6 +804,7 @@ impl ChunkCompareEq<&StructChunked> for StructChunked {
|l, r| l.equal(r).unwrap(),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, this is not what this PR is trying to solve, but we made a policy that nested equality is always eq_missing. Maybe, this should changed here too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sense, thanks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think not all types support eq_missing, so the code reaches panic.
Maybe it is better to get fixed in separate PR?

@@ -759,6 +759,7 @@ fn struct_helper<F, R>(
op: F,
reduce: R,
value: bool,
apply_null_validity: bool,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this is named is_missing in other helpers. Maybe that should be here as well. Not really a blocker though

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I will rename to is_missing

@ritchie46
Copy link
Member

Thank you @barak1412 and thanks for the review @coastalwhite

@ritchie46
Copy link
Member

Failure is unrelated.

@ritchie46 ritchie46 merged commit c23266b into pola-rs:main Sep 29, 2024
24 of 25 checks passed
@barak1412
Copy link
Contributor Author

@ritchie46 Thanks! (by the way make pre-commit is broken?)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
fix Bug fix python Related to Python Polars rust Related to Rust Polars
Projects
None yet
Development

Successfully merging this pull request may close these issues.

.eq_missing() returns null instead of bool for structs
4 participants