fix: Keep highest time unit precision in arithmetic between Datetime/Duration #12086

Conversation
Currently failing tests:
Thanks for your PR!

I think this test needs updating. On the latest release we see:

```python
In [4]: df = pl.DataFrame({'ts': [None]}, schema={'ts': pl.Datetime('ns')})

In [5]: df
Out[5]:
shape: (1, 1)
┌──────────────┐
│ ts           │
│ ---          │
│ datetime[ns] │
╞══════════════╡
│ null         │
└──────────────┘

In [6]: df.fill_null(datetime(2020, 1, 1))
Out[6]:
shape: (1, 1)
┌─────────────────────┐
│ ts                  │
│ ---                 │
│ datetime[μs]        │
╞═════════════════════╡
│ 2020-01-01 00:00:00 │
└─────────────────────┘
```

which doesn't look right (why does `fill_null` upcast?), whereas on your branch one gets:

```python
In [1]: df = pl.DataFrame({'ts': [None]}, schema={'ts': pl.Datetime('ns')})

In [2]: df.fill_null(datetime(2020, 1, 1))
Out[2]:
shape: (1, 1)
┌─────────────────────┐
│ ts                  │
│ ---                 │
│ datetime[ns]        │
╞═════════════════════╡
│ 2020-01-01 00:00:00 │
└─────────────────────┘
```

which is what I'd have expected. So, I think it would be correct to update the failing tests here.
`crates/polars-core/src/utils/mod.rs` (outdated)
```diff
 pub fn get_time_units(tu_l: &TimeUnit, tu_r: &TimeUnit) -> TimeUnit {
     use TimeUnit::*;
     match (tu_l, tu_r) {
-        (Nanoseconds, Microseconds) => Microseconds,
-        (_, Milliseconds) => Milliseconds,
+        (_, Nanoseconds) => Nanoseconds,
+        (Milliseconds, Microseconds) => Microseconds,
         _ => *tu_l,
     }
 }
```
so this always results in the highest precision between `tu_l` and `tu_r`, right?

```
mn -> n
un -> n
nn -> n
mu -> u
uu -> u
nu -> n
mm -> m
um -> u
nm -> n
```
Yes. There are 9 possible variants (3 × 3). 3 of them have L = R, which is the trivial case. The variants in which L != R (6):

- Nanoseconds on the right results in Nanoseconds (first match case) = 2 cases
- Nanoseconds on the left results in Nanoseconds (last match case) = 2 cases
- ms, us => us (second match case)
- us, ms => us (last match case)
9 variants isn't a huge number, maybe this is easier to reason about if you just write them all out?
@MarcoGorelli added a new commit to explicitly list all cases. If anyone has a different opinion on that I can revert it easily
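For reference, the fully enumerated rule can be sketched as a tiny Python helper; this is a hypothetical illustration only (the actual change is the Rust `match` in `get_time_units`), but it reproduces the nine-case table above:

```python
# Hypothetical illustration of the "highest precision wins" rule;
# the real implementation is the Rust match in get_time_units.
PRECISION = {"ms": 0, "us": 1, "ns": 2}

def highest_time_unit(tu_l: str, tu_r: str) -> str:
    # Whichever unit has the finer resolution wins.
    return tu_l if PRECISION[tu_l] >= PRECISION[tu_r] else tu_r

# The nine cases from the table above:
assert highest_time_unit("ms", "ns") == "ns"  # mn -> n
assert highest_time_unit("us", "ns") == "ns"  # un -> n
assert highest_time_unit("ns", "ns") == "ns"  # nn -> n
assert highest_time_unit("ms", "us") == "us"  # mu -> u
assert highest_time_unit("us", "us") == "us"  # uu -> u
assert highest_time_unit("ns", "us") == "ns"  # nu -> n
assert highest_time_unit("ms", "ms") == "ms"  # mm -> m
assert highest_time_unit("us", "ms") == "us"  # um -> u
assert highest_time_unit("ns", "ms") == "ns"  # nm -> n
```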
```diff
@@ -143,6 +143,51 @@ def test_add_duration_3786() -> None:
     }
+
+
+def test_datetime_ms_add_duration_us() -> None:
```
looks like you've made two tests:

- ms and us
- us and ns

why not parametrise over the time units? so you'd have something like

```python
@pytest.mark.parametrize(
    ('left_time_unit', 'right_time_unit', 'expected_time_unit'),
    [
        ('ms', 'ms', 'ms'),
        ...
    ],
)
```
Makes sense, will do that
I've parametrised the test with all the possible time unit combinations (I had to add some helper dictionaries to get the correct multipliers/parameters in each case).

I've made it cover two cases, a duration added on the right side and vice versa, i.e. datetime + duration and duration + datetime, given that the operations are not "perfectly" commutative. That could be a bit out of scope, though (let me know if you think it's overkill).

Updated both failing tests to make them pass.
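For readers following along, here is a hedged sketch of the shape such a parametrised test could take. The identifiers and the (abbreviated) parameter list are illustrative, not the PR's exact code, and the expected units assume this branch's highest-precision behavior:

```python
from datetime import datetime

import polars as pl
import pytest


@pytest.mark.parametrize(
    ("dt_time_unit", "duration_time_unit", "expected_time_unit"),
    [
        # A few of the nine combinations; the PR parametrises all of them.
        ("ms", "us", "us"),
        ("us", "ms", "us"),
        ("us", "ns", "ns"),
        ("ns", "ms", "ns"),
    ],
)
def test_datetime_add_duration_time_unit(
    dt_time_unit: str, duration_time_unit: str, expected_time_unit: str
) -> None:
    df = pl.DataFrame({"ts": [datetime(2020, 1, 1)]}).with_columns(
        pl.col("ts").dt.cast_time_unit(dt_time_unit)
    )
    duration = pl.duration(milliseconds=1, time_unit=duration_time_unit)

    # datetime + duration and duration + datetime are separate code paths,
    # so check both operand orders.
    df_add_right = df.select((pl.col("ts") + duration).alias("ts"))
    df_add_left = df.select((duration + pl.col("ts")).alias("ts"))
    for result in (df_add_right, df_add_left):
        assert result.schema["ts"] == pl.Datetime(expected_time_unit)
```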
The code changes look good; I'm just a little concerned about the complexity in the tests (especially logic like `expected = dt_value * dt_multiplier + offset`). Wondering if there's a simpler/shorter way of writing them with less logic?
```python
verbose_duration_unit = {
    "ms": "milliseconds",
    "us": "microseconds",
    "ns": "nanoseconds",
}[duration_time_unit]

duration = pl.duration(
    **{verbose_duration_unit: 1},
    time_unit=duration_time_unit,
)

df_add_right = df_datetimes.select((pl.col("ts") + duration).alias("ts"))
df_add_left = df_datetimes.select((duration + pl.col("ts")).alias("ts"))

multipliers = {
    "ms": 1_000_000,
    "us": 1_000,
    "ns": 1,
}
dt_multiplier = multipliers[dt_time_unit]
offset = multipliers[duration_time_unit]
```
this is looking quite complicated for a test, is there a simpler way to write it?
@MarcoGorelli I will fiddle with it a bit more. One simple way could be to get rid of (some?) parametrisation and be more verbose, with more explicit tests. Sadly, when you mix these time units freely you need a bit more creativity; I couldn't find an easier way to get them on the same ground.
@MarcoGorelli I've tried to simplify the tests further.
As for the add-left vs add-right cases, I could get rid of them altogether and simplify further if you want; they're not that fundamental.
It's really good that you've included this, thanks! I've just pushed a commit to further remove some logic from the test (including a hand-calculated `expected`).
Your changes look great. I've done the same thing for the subtraction of datetimes. I've kept the subtraction operation explicit in the parameters there, just to make it clear where those numbers come from.
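Since the subtraction case is only mentioned in passing, here is a minimal hedged sketch of the idea (illustrative values, not the PR's exact test): subtracting two datetimes with different time units yields a Duration, and on this branch its unit would be the higher-precision one.

```python
from datetime import datetime

import polars as pl

# Two datetime columns with deliberately mismatched time units.
df = pl.DataFrame(
    {"start": [datetime(2020, 1, 1)], "end": [datetime(2020, 1, 2)]}
).with_columns(
    pl.col("start").dt.cast_time_unit("ms"),
    pl.col("end").dt.cast_time_unit("us"),
)

out = df.select((pl.col("end") - pl.col("start")).alias("elapsed"))
# On this PR's branch the result would be Duration("us"), the higher
# precision of ms and us; on main it would be Duration("ms").
print(out.schema)
```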
Nice, thanks @Giuzzilla!
Force-pushed from ef90999 to e7770fe
@ritchie46 gentle ping on this one (I think it should be ready, and it fixes the linked issue nicely; the gist of the change is in `get_time_units`).
Force-pushed from e7770fe to 96600d3
Force-pushed from 5138f20 to d05b0b4
I did some minor code cleanup.

However, I don't think this can be merged in its current form. The reason that supertypes were defined as the lowest precision among time units is overflows:

```python
import polars as pl

df = pl.DataFrame(
    {"datetime": [0], "duration": [2**60]},
    schema={"datetime": pl.Datetime("ns"), "duration": pl.Duration("ms")},
)
result = df.select(pl.col("datetime") + pl.col("duration"))
print(result.dtypes[0].time_unit)
print(result.cast(pl.Int64))
```

If you run this on `main`, you get the correct result (a Datetime of type `ms`). In this PR, you get `0` as a result, which is clearly wrong.

So this PR must be updated to correctly handle overflows (error?). Or we should keep the existing behavior.
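To make the overflow concrete, the conversion factor alone shows why the `ns` supertype cannot represent this duration (plain integer arithmetic, for illustration):

```python
I64_MAX = 2**63 - 1  # polars stores datetimes/durations as 64-bit integers

duration_ms = 2**60                    # the duration value from the example
duration_ns = duration_ms * 1_000_000  # converting ms -> ns multiplies by 1e6

print(duration_ns > I64_MAX)  # True: the ns representation overflows i64
```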
Force-pushed from 2e7afa4 to e5bfc7a
Commit: "…een Datetime/durations" (Expecting `us` representation, which is the max precision of a Python datetime)
Force-pushed from e5bfc7a to 4ca271c
Thanks @Giuzzilla, but we want to keep the existing behavior. See #12023 (comment) for the explanation.
Closes #12023

Note that this would change the current behavior, which always "downgrades" the TimeUnit to either milliseconds or microseconds (the reported issue also happens when mixing microseconds with milliseconds).
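To illustrate the downgrade being described, here is a minimal hedged sketch of the kind of precision loss reported (assumed to match the spirit of #12023; the issue's exact snippet may differ):

```python
from datetime import datetime

import polars as pl

df = pl.DataFrame({"ts": [datetime(2020, 1, 1)]}).with_columns(
    pl.col("ts").dt.cast_time_unit("ms")
)

out = df.select(pl.col("ts") + pl.duration(microseconds=1, time_unit="us"))
# With the lowest-precision supertype, the result stays Datetime("ms"),
# so the added microsecond is silently truncated away.
print(out.schema)
```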