
fix: Keep highest time unit precision in arithmetic between Datetime/Duration #12086

Closed
Giuzzilla wants to merge 15 commits from the fix-timeunit-downgrading branch

Conversation


@Giuzzilla Giuzzilla commented Oct 28, 2023

Closes #12023

Note that this would change the current behavior, which always "downgrades" the TimeUnit to either milliseconds or microseconds (the reported issue also occurs when mixing microseconds with milliseconds).

@github-actions github-actions bot added fix Bug fix rust Related to Rust Polars labels Oct 28, 2023

Giuzzilla commented Oct 28, 2023

Currently failing tests:

  • test_fill_null_temporal (uses fill_null with 'ns' literals in 'ms' column and expects non-'ns') in test_temporal.py
  • test_lazy_query_10 (subtracts nanoseconds from milliseconds and expects milliseconds) in queries.rs

@MarcoGorelli MarcoGorelli self-requested a review October 28, 2023 21:09

MarcoGorelli commented Oct 29, 2023

Thanks for your PR!

test_fill_null_temporal

I think this test needs updating.

On the latest release we see:

In [4]: df = pl.DataFrame({'ts': [None]}, schema={'ts': pl.Datetime('ns')})

In [5]: df
Out[5]:
shape: (1, 1)
┌──────────────┐
│ ts           │
│ ---          │
│ datetime[ns] │
╞══════════════╡
│ null         │
└──────────────┘

In [6]: df.fill_null(datetime(2020, 1, 1))
Out[6]:
shape: (1, 1)
┌─────────────────────┐
│ ts                  │
│ ---                 │
│ datetime[μs]        │
╞═════════════════════╡
│ 2020-01-01 00:00:00 │
└─────────────────────┘

which doesn't look right (why does fill_null upcast?), whereas on your branch one gets:

In [1]: df = pl.DataFrame({'ts': [None]}, schema={'ts': pl.Datetime('ns')})

In [2]: df.fill_null(datetime(2020, 1, 1))
Out[2]: 
shape: (1, 1)
┌─────────────────────┐
│ ts                  │
│ ---                 │
│ datetime[ns]        │
╞═════════════════════╡
│ 2020-01-01 00:00:00 │
└─────────────────────┘

which is what I'd have expected.

So, I think it would be correct to update the failing tests here.

Comment on lines 554 to 563
 pub fn get_time_units(tu_l: &TimeUnit, tu_r: &TimeUnit) -> TimeUnit {
     use TimeUnit::*;
     match (tu_l, tu_r) {
-        (Nanoseconds, Microseconds) => Microseconds,
-        (_, Milliseconds) => Milliseconds,
+        (_, Nanoseconds) => Nanoseconds,
+        (Milliseconds, Microseconds) => Microseconds,
         _ => *tu_l,
     }
 }
Collaborator

so this always results in the highest precision between tu_l and tu_r, right?

mn -> n
un -> n
nn -> n
mu -> u
uu -> u
nu -> n
mm -> m
um -> u
nm -> n

Author

Yes,

9 possible variants (3*3).

3 of them are L=R which is a trivial case.

The variants in which L != R (6):
Nanoseconds on the right results in Nanoseconds (first match case) = 2 cases
Nanoseconds on the left results in Nanoseconds (last match case) = 2 cases
ms, us => us (second match case)
us, ms => us (last match case)
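The case analysis above can be double-checked mechanically. Below is a small sketch (a hypothetical Python port of the Rust match arms, assuming the precision order ns > us > ms) verifying that all 9 combinations resolve to the higher-precision unit:

```python
from itertools import product

# Precision rank: larger value = finer unit (assumed order: ns > us > ms).
RANK = {"ms": 0, "us": 1, "ns": 2}

def get_time_units(tu_l: str, tu_r: str) -> str:
    # Hypothetical Python mirror of the Rust match arms discussed above.
    if tu_r == "ns":                  # (_, Nanoseconds) => Nanoseconds
        return "ns"
    if (tu_l, tu_r) == ("ms", "us"):  # (Milliseconds, Microseconds) => Microseconds
        return "us"
    return tu_l                       # _ => *tu_l

# All 9 (3 * 3) combinations should pick the higher precision.
for left, right in product(RANK, RANK):
    assert RANK[get_time_units(left, right)] == max(RANK[left], RANK[right])
print("all 9 cases keep the highest precision")
```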

Collaborator

9 variants isn't a huge number; maybe this is easier to reason about if you just write them all out?

@Giuzzilla Giuzzilla Jan 8, 2024

@MarcoGorelli added a new commit to explicitly list all cases. If anyone has a different opinion on that I can revert it easily

@@ -143,6 +143,51 @@ def test_add_duration_3786() -> None:
}


def test_datetime_ms_add_duration_us() -> None:
Collaborator

looks like you've made two tests:

  • ms and us
  • us and ns

why not parametrise over the time units?

so you'd have something like

@pytest.mark.parametrize(
    ('left_time_unit', 'right_time_unit', 'expected_time_unit'),
    [
        ('ms', 'ms', 'ms'),
        ...
    ],
)
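Rather than listing all 9 tuples by hand, the parameter grid could also be generated (a sketch with hypothetical names; the expected unit is simply whichever of the two is finer):

```python
from itertools import product

# Precision rank: larger value = finer unit.
RANK = {"ms": 0, "us": 1, "ns": 2}

# Build ('left', 'right', 'expected') triples for all 9 combinations;
# the expected supertype is the unit with the finer precision.
CASES = [
    (left, right, max(left, right, key=RANK.__getitem__))
    for left, right in product(RANK, RANK)
]

print(len(CASES))                    # 9
print(("ms", "us", "us") in CASES)   # True
```

Such a list could then be passed directly as the argvalues of pytest.mark.parametrize.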

Author

Makes sense, will do that

Author

@MarcoGorelli

I've parametrised the test over all the possible time unit combinations (I had to add some helper dictionaries to get the correct multipliers / parameters in each case).

I've made it cover two cases: a duration added on the right side and vice versa, i.e. datetime + duration and duration + datetime, given that the operations are not "perfectly" commutative. That could be a bit out of scope, though (let me know if you think it's overkill).


Giuzzilla commented Oct 30, 2023

(Quoting MarcoGorelli's review comment above about test_fill_null_temporal.)

Updated both failing tests to make them pass

@MarcoGorelli MarcoGorelli left a comment

The code changes look good. I'm just a little concerned about the complexity in the tests (especially logic like expected = dt_value * dt_multiplier + offset); wondering if there's a simpler/shorter way of writing them with less logic?

py-polars/tests/unit/namespaces/test_datetime.py (outdated; resolved)
Comment on lines 173 to 193
verbose_duration_unit = {
    "ms": "milliseconds",
    "us": "microseconds",
    "ns": "nanoseconds",
}[duration_time_unit]

duration = pl.duration(
    **{verbose_duration_unit: 1},
    time_unit=duration_time_unit,
)

df_add_right = df_datetimes.select((pl.col("ts") + duration).alias("ts"))
df_add_left = df_datetimes.select((duration + pl.col("ts")).alias("ts"))

multipliers = {
    "ms": 1_000_000,
    "us": 1_000,
    "ns": 1,
}
dt_multiplier = multipliers[dt_time_unit]
offset = multipliers[duration_time_unit]
Collaborator

This is looking quite complicated for a test; is there a simpler way to write it?

@Giuzzilla

@MarcoGorelli I will fiddle with it a bit more.

One simple way could be to get rid of (some?) parametrisation and be more verbose with more (explicit) tests.

Sadly, when you mix these time units freely you need a bit more creativity; I couldn't find an easier way to bring them onto common ground.


Giuzzilla commented Oct 31, 2023

@MarcoGorelli I've tried to simplify the tests further:

  • Made more explicit the mapping between the literals & the nanosecond conversion
  • Getting nanoseconds directly from duration instead of calculating it through the multiplier (so we can get rid of the multiplier dict altogether).
  • Made an explicit parameter for add-left vs add-right in the first test (less code repetition)
  • Got rid of made-up terms like "offset", which was indeed confusing


Giuzzilla commented Oct 31, 2023

For the add-left vs add-right, I could get rid of it altogether and simplify further if you want. It's not that fundamental

@MarcoGorelli

> For the add-left vs add-right, I could get rid of it altogether and simplify further if you want. It's not that fundamental

It's really good that you've included this, thanks!

I've just pushed a commit to further remove some logic from the test (including a hand-calculated expected_nanoseconds), and have used assert_frame_equal - fancy making a similar change to the test you added for subtraction between datetimes?

@Giuzzilla

(Quoting MarcoGorelli's comment above.)

Your changes look great. I've done the same thing on the subtraction of datetimes. I've kept the subtraction operation explicit in the parameters there just to make it clear where those numbers come from

@Giuzzilla Giuzzilla marked this pull request as ready for review October 31, 2023 22:25
@MarcoGorelli MarcoGorelli left a comment

Nice, thanks @Giuzzilla !

@Giuzzilla Giuzzilla force-pushed the fix-timeunit-downgrading branch from ef90999 to e7770fe Compare November 27, 2023 12:49
@MarcoGorelli

@ritchie46 gentle ping on this one (I think it should be ready, and fixes the linked issue nicely. The gist of the change is in crates/polars-core/src/utils/mod.rs, the rest is updating tests)

@Giuzzilla Giuzzilla force-pushed the fix-timeunit-downgrading branch from e7770fe to 96600d3 Compare January 8, 2024 17:25
@Giuzzilla Giuzzilla requested a review from c-peters as a code owner January 8, 2024 17:25
@stinodego stinodego changed the title fix(rust): us/ns TimeUnits downgraded to ms/us on arithmetic ops between Datetime/durations fix: us/ns TimeUnits downgraded to ms/us on arithmetic ops between Datetime/durations Feb 13, 2024
@github-actions github-actions bot added the python Related to Python Polars label Feb 13, 2024
@stinodego stinodego force-pushed the fix-timeunit-downgrading branch from 5138f20 to d05b0b4 Compare February 13, 2024 10:31
@stinodego stinodego changed the title fix: us/ns TimeUnits downgraded to ms/us on arithmetic ops between Datetime/durations fix: Keep highest time unit precision in arithmetic between Datetime/Duration Feb 13, 2024
@stinodego stinodego left a comment

I did some minor code cleanup.

However, I don't think this can be merged in its current form. The reason supertypes were defined as the lowest precision among the time units is overflow:

import polars as pl

df = pl.DataFrame(
    {"datetime": [0], "duration": [2**60]},
    schema={"datetime": pl.Datetime("ns"), "duration": pl.Duration("ms")},
)

result = df.select(pl.col("datetime") + pl.col("duration"))

print(result.dtypes[0].time_unit)
print(result.cast(pl.Int64))

If you run this on main, you get the correct result (a Datetime of type ms). In this PR, you get 0 as a result, which is clearly wrong.

So this PR must be updated to correctly handle overflows (error?). Or we should keep the existing behavior.
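The overflow can be reproduced with plain integer arithmetic, independent of Polars: 2^60 milliseconds fits in an i64, but the same span at nanosecond precision does not, and under wrapping 64-bit arithmetic it lands exactly on 0 (a sketch; the constants come from the example above):

```python
I64_MAX = 2**63 - 1

duration_ms = 2**60                    # the duration from the example, in ms
duration_ns = duration_ms * 1_000_000  # the same span expressed in ns

assert duration_ms <= I64_MAX          # fits at ms precision
assert duration_ns > I64_MAX           # overflows at ns precision

# 2**60 * 10**6 = 2**66 * 5**6 is divisible by 2**64, so wrapping
# 64-bit arithmetic yields exactly 0, matching the result seen on the branch.
assert duration_ns % 2**64 == 0
print("ns representation overflows and wraps to 0")
```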

@stinodego stinodego force-pushed the fix-timeunit-downgrading branch from 2e7afa4 to e5bfc7a Compare February 13, 2024 11:08
@stinodego

Thanks @Giuzzilla but we want to keep the existing behavior. See #12023 (comment) for explanation.

@stinodego stinodego closed this Feb 23, 2024
Labels: fix (Bug fix), python (Related to Python Polars), rust (Related to Rust Polars)
Development

Successfully merging this pull request may close these issues.

Adding duration changes "ns" time_unit to "us" (and loses info)
3 participants