fix: Keep highest time unit precision in arithmetic between Datetime/Duration #12086

Conversation
Currently failing tests:
Thanks for your PR!

I think this test needs updating. On the latest release we see:

```python
In [4]: df = pl.DataFrame({'ts': [None]}, schema={'ts': pl.Datetime('ns')})

In [5]: df
Out[5]:
shape: (1, 1)
┌──────────────┐
│ ts           │
│ ---          │
│ datetime[ns] │
╞══════════════╡
│ null         │
└──────────────┘

In [6]: df.fill_null(datetime(2020, 1, 1))
Out[6]:
shape: (1, 1)
┌─────────────────────┐
│ ts                  │
│ ---                 │
│ datetime[μs]        │
╞═════════════════════╡
│ 2020-01-01 00:00:00 │
└─────────────────────┘
```

which doesn't look right (why does `fill_null` upcast?), whereas on your branch one gets:

```python
In [1]: df = pl.DataFrame({'ts': [None]}, schema={'ts': pl.Datetime('ns')})

In [2]: df.fill_null(datetime(2020, 1, 1))
Out[2]:
shape: (1, 1)
┌─────────────────────┐
│ ts                  │
│ ---                 │
│ datetime[ns]        │
╞═════════════════════╡
│ 2020-01-01 00:00:00 │
└─────────────────────┘
```

which is what I'd have expected. So, I think it would be correct to update the failing tests here.
`crates/polars-core/src/utils/mod.rs` (outdated)
```diff
 pub fn get_time_units(tu_l: &TimeUnit, tu_r: &TimeUnit) -> TimeUnit {
     use TimeUnit::*;
     match (tu_l, tu_r) {
-        (Nanoseconds, Microseconds) => Microseconds,
-        (_, Milliseconds) => Milliseconds,
+        (_, Nanoseconds) => Nanoseconds,
+        (Milliseconds, Microseconds) => Microseconds,
         _ => *tu_l,
     }
 }
```
so this always results in the highest precision between `tu_l` and `tu_r`, right?

```
mn -> n
un -> n
nn -> n
mu -> u
uu -> u
nu -> n
mm -> m
um -> u
nm -> n
```
Yes. There are 9 possible variants (3 × 3). 3 of them have L = R, which is the trivial case. The variants in which L != R (6):

- Nanoseconds on the right results in Nanoseconds (first match case) = 2 cases
- Nanoseconds on the left results in Nanoseconds (last match case) = 2 cases
- ms, us => us (second match case)
- us, ms => us (last match case)
9 variants isn't a huge number, maybe this is easier to reason about if you just write them all out?
@MarcoGorelli added a new commit to explicitly list all cases. If anyone has a different opinion on that I can revert it easily
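For reference, the fully enumerated rule can be sketched as a tiny Python helper; this is a hypothetical illustration only (the actual change is the Rust `match` in `get_time_units`), but it reproduces the nine-case table above:

```python
# Hypothetical illustration of the "highest precision wins" rule;
# the real implementation is the Rust match in get_time_units.
PRECISION = {"ms": 0, "us": 1, "ns": 2}

def highest_time_unit(tu_l: str, tu_r: str) -> str:
    # Whichever unit has the finer resolution wins.
    return tu_l if PRECISION[tu_l] >= PRECISION[tu_r] else tu_r

# The nine cases from the table above:
assert highest_time_unit("ms", "ns") == "ns"  # mn -> n
assert highest_time_unit("us", "ns") == "ns"  # un -> n
assert highest_time_unit("ns", "ns") == "ns"  # nn -> n
assert highest_time_unit("ms", "us") == "us"  # mu -> u
assert highest_time_unit("us", "us") == "us"  # uu -> u
assert highest_time_unit("ns", "us") == "ns"  # nu -> n
assert highest_time_unit("ms", "ms") == "ms"  # mm -> m
assert highest_time_unit("us", "ms") == "us"  # um -> u
assert highest_time_unit("ns", "ms") == "ns"  # nm -> n
```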
```diff
@@ -143,6 +143,51 @@ def test_add_duration_3786() -> None:
     }
+
+
+def test_datetime_ms_add_duration_us() -> None:
```
looks like you've made two tests:

- ms and us
- us and ns

why not parametrise over the time units? so you'd have something like

```python
@pytest.mark.parametrize(
    ('left_time_unit', 'right_time_unit', 'expected_time_unit'),
    [
        ('ms', 'ms', 'ms'),
        ...
    ],
)
```
Makes sense, will do that
I've parametrised the test with all the possible time unit combinations (I had to add some helper dictionaries to get the correct multipliers/parameters in each case).

I've made it cover two cases, a duration added on the right side and vice versa, i.e. datetime + duration and duration + datetime, given that the operations are not "perfectly" commutative. That could be a bit out of scope, though (let me know if you think it's overkill).

Updated both failing tests to make them pass.
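For readers following along, here is a hedged sketch of the shape such a parametrised test could take. The identifiers and the (abbreviated) parameter list are illustrative, not the PR's exact code, and the expected units assume this branch's highest-precision behavior:

```python
from datetime import datetime

import polars as pl
import pytest


@pytest.mark.parametrize(
    ("dt_time_unit", "duration_time_unit", "expected_time_unit"),
    [
        # A few of the nine combinations; the PR parametrises all of them.
        ("ms", "us", "us"),
        ("us", "ms", "us"),
        ("us", "ns", "ns"),
        ("ns", "ms", "ns"),
    ],
)
def test_datetime_add_duration_time_unit(
    dt_time_unit: str, duration_time_unit: str, expected_time_unit: str
) -> None:
    df = pl.DataFrame({"ts": [datetime(2020, 1, 1)]}).with_columns(
        pl.col("ts").dt.cast_time_unit(dt_time_unit)
    )
    duration = pl.duration(milliseconds=1, time_unit=duration_time_unit)

    # datetime + duration and duration + datetime are separate code paths,
    # so check both operand orders.
    df_add_right = df.select((pl.col("ts") + duration).alias("ts"))
    df_add_left = df.select((duration + pl.col("ts")).alias("ts"))
    for result in (df_add_right, df_add_left):
        assert result.schema["ts"] == pl.Datetime(expected_time_unit)
```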
The code changes look good; I'm just a little concerned about the complexity in the tests (especially logic like `expected = dt_value * dt_multiplier + offset`). Wondering if there's a simpler/shorter way of writing them with less logic?
```python
verbose_duration_unit = {
    "ms": "milliseconds",
    "us": "microseconds",
    "ns": "nanoseconds",
}[duration_time_unit]

duration = pl.duration(
    **{verbose_duration_unit: 1},
    time_unit=duration_time_unit,
)

df_add_right = df_datetimes.select((pl.col("ts") + duration).alias("ts"))
df_add_left = df_datetimes.select((duration + pl.col("ts")).alias("ts"))

multipliers = {
    "ms": 1_000_000,
    "us": 1_000,
    "ns": 1,
}
dt_multiplier = multipliers[dt_time_unit]
offset = multipliers[duration_time_unit]
```
this is looking quite complicated for a test, is there a simpler way to write it?
@MarcoGorelli I will fiddle with it a bit more. One simple way could be to get rid of (some?) parametrisation and be more verbose, with more explicit tests. Sadly, when you mix these time units freely you need a bit more creativity; I couldn't find an easier way to get them on the same ground.
@MarcoGorelli I've tried to simplify the tests further.
As for the add-left vs add-right cases, I could get rid of them altogether and simplify further if you want; they're not that fundamental.
It's really good that you've included this, thanks! I've just pushed a commit to further remove some logic from the test (including a hand-calculated `expected`).
Your changes look great. I've done the same thing for the subtraction of datetimes. I've kept the subtraction operation explicit in the parameters there, just to make it clear where those numbers come from.
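Since the subtraction case is only mentioned in passing, here is a minimal hedged sketch of the idea (illustrative values, not the PR's exact test): subtracting two datetimes with different time units yields a Duration, and on this branch its unit would be the higher-precision one.

```python
from datetime import datetime

import polars as pl

# Two datetime columns with deliberately mismatched time units.
df = pl.DataFrame(
    {"start": [datetime(2020, 1, 1)], "end": [datetime(2020, 1, 2)]}
).with_columns(
    pl.col("start").dt.cast_time_unit("ms"),
    pl.col("end").dt.cast_time_unit("us"),
)

out = df.select((pl.col("end") - pl.col("start")).alias("elapsed"))
# On this PR's branch the result would be Duration("us"), the higher
# precision of ms and us; on main it would be Duration("ms").
print(out.schema)
```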
Nice, thanks @Giuzzilla!
Force-pushed from ef90999 to e7770fe
@ritchie46 gentle ping on this one (I think it should be ready, and it fixes the linked issue nicely; the gist of the change is in `get_time_units`).
Force-pushed from e7770fe to 96600d3
Force-pushed from 5138f20 to d05b0b4
I did some minor code cleanup.

However, I don't think this can be merged in its current form. The reason that supertypes were defined as the lowest precision among time units is overflows:

```python
import polars as pl

df = pl.DataFrame(
    {"datetime": [0], "duration": [2**60]},
    schema={"datetime": pl.Datetime("ns"), "duration": pl.Duration("ms")},
)
result = df.select(pl.col("datetime") + pl.col("duration"))
print(result.dtypes[0].time_unit)
print(result.cast(pl.Int64))
```

If you run this on `main`, you get the correct result (a Datetime of type `ms`). In this PR, you get `0` as a result, which is clearly wrong.

So this PR must be updated to correctly handle overflows (error?). Or we should keep the existing behavior.
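To make the overflow concrete, the conversion factor alone shows why the `ns` supertype cannot represent this duration (plain integer arithmetic, for illustration):

```python
I64_MAX = 2**63 - 1  # polars stores datetimes/durations as 64-bit integers

duration_ms = 2**60                    # the duration value from the example
duration_ns = duration_ms * 1_000_000  # converting ms -> ns multiplies by 1e6

print(duration_ns > I64_MAX)  # True: the ns representation overflows i64
```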
Force-pushed from 2e7afa4 to e5bfc7a
Commit: "…een Datetime/durations" (Expecting `us` representation, which is the max precision of a Python datetime)
Force-pushed from e5bfc7a to 4ca271c
Thanks @Giuzzilla, but we want to keep the existing behavior. See #12023 (comment) for the explanation.
Closes #12023

Note that this would change the current behavior, which always "downgrades" the TimeUnit to either milliseconds or microseconds (the reported issue also happens when mixing microseconds with milliseconds).
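To illustrate the downgrade being described, here is a minimal hedged sketch of the kind of precision loss reported (assumed to match the spirit of #12023; the issue's exact snippet may differ):

```python
from datetime import datetime

import polars as pl

df = pl.DataFrame({"ts": [datetime(2020, 1, 1)]}).with_columns(
    pl.col("ts").dt.cast_time_unit("ms")
)

out = df.select(pl.col("ts") + pl.duration(microseconds=1, time_unit="us"))
# With the lowest-precision supertype, the result stays Datetime("ms"),
# so the added microsecond is silently truncated away.
print(out.schema)
```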