Adding duration changes "ns" time_unit to "us" (and loses info) #12023
thanks @djbarker for the report

simpler reproducer:

```python
In [28]: df = pl.DataFrame({'ts': ['2020-01-01T00:00:00.123456789']}).select(pl.col.ts.str.to_datetime(time_unit='ns'))

In [29]: df.with_columns(ts_shifted=pl.col('ts')+pl.duration(microseconds=10))
Out[29]:
shape: (1, 2)
┌───────────────────────────────┬────────────────────────────┐
│ ts                            ┆ ts_shifted                 │
│ ---                           ┆ ---                        │
│ datetime[ns]                  ┆ datetime[μs]               │
╞═══════════════════════════════╪════════════════════════════╡
│ 2020-01-01 00:00:00.123456789 ┆ 2020-01-01 00:00:00.123466 │
└───────────────────────────────┴────────────────────────────┘
```

Expected:

```
shape: (1, 2)
┌───────────────────────────────┬───────────────────────────────┐
│ ts                            ┆ ts_shifted                    │
│ ---                           ┆ ---                           │
│ datetime[ns]                  ┆ datetime[ns]                  │
╞═══════════════════════════════╪═══════════════════════════════╡
│ 2020-01-01 00:00:00.123456789 ┆ 2020-01-01 00:00:00.123466789 │
└───────────────────────────────┴───────────────────────────────┘
```
I am not sure what we should do here, because keeping the highest-precision time unit may cause overflows. The code below works correctly on main, but would be broken if we used nanosecond precision in the result.

```python
import polars as pl

df = pl.DataFrame(
    {"datetime": [0], "duration": [2**60]},
    schema={"datetime": pl.Datetime("ns"), "duration": pl.Duration("ms")},
)
result = df.select(pl.col("datetime") + pl.col("duration"))
print(result.dtypes[0].time_unit)  # ms
print(result.cast(pl.Int64))  # 2**60
```

@MarcoGorelli what do you think is the correct approach here?
thanks, I hadn't noticed that. will take a close look
@stinodego I agree that the proposed change makes it more likely to have overflows:

```python
In [18]: import polars as pl
    ...:
    ...: df = pl.DataFrame(
    ...:     {"datetime": [0], "duration": [2**62]},
    ...:     schema={"datetime": pl.Datetime("ns"), "duration": pl.Duration("ms")},
    ...: )
    ...:
    ...: result = df.select(pl.col("datetime") + pl.col("duration") + pl.col("duration"))
    ...:
    ...: print(result.dtypes[0].time_unit)
    ...: print(result.cast(pl.Int64))
ms
shape: (1, 1)
┌──────────────────────┐
│ datetime             │
│ ---                  │
│ i64                  │
╞══════════════════════╡
│ -9223372036854775808 │
└──────────────────────┘
```

or summing two very big "ns" datetimes/durations:

```python
In [35]: import polars as pl
    ...:
    ...: df = pl.DataFrame(
    ...:     {"datetime": [2**62], "duration": [2**62]},
    ...:     schema={"datetime": pl.Datetime("ns"), "duration": pl.Duration("ns")},
    ...: )
    ...:
    ...: result = df.select(pl.col("datetime") + pl.col("duration"))
    ...:
    ...: print(result.dtypes[0].time_unit)
    ...: print(result.cast(pl.Int64))
ns
shape: (1, 1)
┌──────────────────────┐
│ datetime             │
│ ---                  │
│ i64                  │
╞══════════════════════╡
│ -9223372036854775808 │
└──────────────────────┘
```
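The `-9223372036854775808` results above are just two's-complement wraparound of the underlying 64-bit integer. A quick plain-Python sketch (no Polars involved) shows why both examples land on exactly that value: `2**62 + 2**62` is `2**63`, one past `i64::MAX`, which wraps to `i64::MIN`.

```python
def wrap_i64(x: int) -> int:
    """Simulate wrapping (two's complement) addition in a signed 64-bit integer,
    which is the physical representation behind Datetime/Duration values."""
    return (x + 2**63) % 2**64 - 2**63

# 2**62 + 2**62 == 2**63, one past i64::MAX, so it wraps to i64::MIN.
print(wrap_i64(2**62 + 2**62))  # -9223372036854775808

# Values that fit in i64 pass through unchanged.
print(wrap_i64(2**63 - 1))  # 9223372036854775807
```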
Definitely. Overflows can happen when adding two large integers, too. However, when adding integers of different types, a supertype is defined to try to prevent data loss due to overflows. Basically, we have to weigh the precision loss against the chance of overflow. Alternatively, we could not define a supertype at all and raise when trying to operate on datatypes with different time units. Or there could be a strict cast somewhere that raises only if using the supertype would result in an overflow.
Just for reference, it looks like numpy also silently overflows here.

However, I think I prefer your suggestion.

Is there a precursor issue that needs fixing here, though? The following silently overflows in Polars even with the (default) strict cast:

Conversely, casting an integer from Int16 to Int8 raises:
The underlying issue there is that casting to a more precise subsecond unit involves multiplying the underlying integer, and we do not have checked arithmetic, so I don't think we have a way to detect these overflows at the moment. EDIT: see #14655
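To see the scale of the problem: casting "ms" to "ns" multiplies the underlying integer by 1,000,000, so values that fit comfortably in i64 as milliseconds can overflow after the cast. A plain-Python magnitude check, using the `2**60` ms duration from the earlier example:

```python
I64_MAX = 2**63 - 1  # largest value representable in a signed 64-bit integer

# 2**60 ms fits comfortably in i64 ...
print(2**60 <= I64_MAX)  # True

# ... but expressed in nanoseconds (x * 1_000_000) it no longer does,
# which is the unchecked multiplication described above.
print(2**60 * 1_000_000 <= I64_MAX)  # False
```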
After discussion today: this is not a bug, and we want to keep the existing behavior. Overflows are worse than precision loss. The correct approach here is to explicitly cast the duration to the desired time unit:

```python
X.select(pl.col("t") + d.dt.cast_time_unit("ns"))
```
Checks
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of Polars.
Reproducible example
Log output
No response
Issue description
When adding a duration to a "ns" timestamp column, the column we get back is in "us" and any nanosecond information is truncated. This happens even if the duration is specified with a non-zero nanosecond component. I did not find an issue logged for exactly this, but it looks related to #11747.
Expected behavior
Adding the duration should maintain the timestamps as nanoseconds (even if the duration is >= 1us).
Installed versions