Adding duration changes "ns" time_unit to "us" (and loses info) #12023

djbarker · 2023-10-25T13:52:01Z

Checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl
import datetime as dt
t = pl.datetime_range(dt.date.fromisoformat("2022-01-01"), dt.date.fromisoformat("2022-01-02"), interval="30m", eager=True, time_unit="ns")
d = pl.duration(microseconds=10)
X = pl.DataFrame({"t": t})
X.select(t_ns = (pl.col("t") + d).dt.cast_time_unit("ns"), t_us = pl.col("t") + d)

Log output

No response

Issue description

When adding a duration to a "ns" timestamp column the column we get back is in "us" and any nanos information is truncated.
This happens even if the duration is specified with some non-zero nanos component.

I did not find an issue logged for exactly this but it looks related to this issue: Issue 11747

Expected behavior

Should add the duration and maintain the timestamps as nanos (even if the duration is >= 1us).

Installed versions

--------Version info---------
Polars:              0.19.11
Index type:          UInt32
Platform:            Linux-5.14.0-162.6.1.el9_1.x86_64-x86_64-with-glibc2.34
Python:              3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0]

----Optional dependencies----
adbc_driver_sqlite:  <not installed>
cloudpickle:         2.2.1
connectorx:          <not installed>
deltalake:           <not installed>
fsspec:              2023.9.2
gevent:              <not installed>
matplotlib:          3.8.0
numpy:               1.25.2
openpyxl:            <not installed>
pandas:              1.5.3
pyarrow:             12.0.1
pydantic:            2.4.2
pyiceberg:           <not installed>
pyxlsb:              <not installed>
sqlalchemy:          2.0.22
xlsx2csv:            <not installed>
xlsxwriter:          <not installed>

The text was updated successfully, but these errors were encountered:

MarcoGorelli · 2023-10-28T16:36:54Z

thanks @djbarker for the report

simpler reproducer:

In [28]: df = pl.DataFrame({'ts': ['2020-01-01T00:00:00.123456789']}).select(pl.col.ts.str.to_datetime(time_unit='ns'))

In [29]: df.with_columns(ts_shifted=pl.col('ts')+pl.duration(microseconds=10))
Out[29]:
shape: (1, 2)
┌───────────────────────────────┬────────────────────────────┐
│ ts                            ┆ ts_shifted                 │
│ ---                           ┆ ---                        │
│ datetime[ns]                  ┆ datetime[μs]               │
╞═══════════════════════════════╪════════════════════════════╡
│ 2020-01-01 00:00:00.123456789 ┆ 2020-01-01 00:00:00.123466 │
└───────────────────────────────┴────────────────────────────┘

Expected:

shape: (1, 2)
┌───────────────────────────────┬───────────────────────────────┐
│ ts                            ┆ ts_shifted                    │
│ ---                           ┆ ---                           │
│ datetime[ns]                  ┆ datetime[ns]                  │
╞═══════════════════════════════╪═══════════════════════════════╡
│ 2020-01-01 00:00:00.123456789 ┆ 2020-01-01 00:00:00.123466789 │
└───────────────────────────────┴───────────────────────────────┘

stinodego · 2024-02-13T11:08:13Z

I am not sure what we should do here. Because keeping the highest precision time unit may cause overflows. The code below works correctly on main, but would be broken if we used nanosecond precision in the result.

import polars as pl

df = pl.DataFrame(
    {"datetime": [0], "duration": [2**60]},
    schema={"datetime": pl.Datetime("ns"), "duration": pl.Duration("ms")},
)

result = df.select(pl.col("datetime") + pl.col("duration"))

print(result.dtypes[0].time_unit)  # ms
print(result.cast(pl.Int64))  # 2**60

@MarcoGorelli what do you think is the correct approach here?

MarcoGorelli · 2024-02-13T11:09:00Z

thanks, I hadn't noticed that. will take a close look

Giuzzilla · 2024-02-13T12:57:51Z

@stinodego
Thanks a lot for your review on the PR.

I agree that the proposed change makes it more likely to have overflows.
About the general behaviour, would it be correct to say that even currently some operations between datetimes/durations "may" cause overflows?
e.g. tweaking the numbers of your example. (Note I'm not able to instantiate directly a duration that big so for my not-so-natural example I summed it twice)

In [18]: import polars as pl
    ...: 
    ...: df = pl.DataFrame(
    ...:     {"datetime": [0], "duration": [2**62]},
    ...:     schema={"datetime": pl.Datetime("ns"), "duration": pl.Duration("ms")},
    ...: )
    ...: 
    ...: result = df.select(pl.col("datetime") + pl.col("duration") + pl.col("duration"))
    ...: 
    ...: print(result.dtypes[0].time_unit)
    ...: print(result.cast(pl.Int64))
ms
shape: (1, 1)
┌──────────────────────┐
│ datetime             │
│ ---                  │
│ i64                  │
╞══════════════════════╡
│ -9223372036854775808 │
└──────────────────────┘

or summing between two very big ns datetime/durations:

In [35]: import polars as pl
    ...: 
    ...: df = pl.DataFrame(
    ...:     {"datetime": [2**62], "duration": [2**62]},
    ...:     schema={"datetime": pl.Datetime("ns"), "duration": pl.Duration("ns")},
    ...: )
    ...: 
    ...: result = df.select(pl.col("datetime") + pl.col("duration"))
    ...: 
    ...: print(result.dtypes[0].time_unit)
    ...: print(result.cast(pl.Int64))
ns
shape: (1, 1)
┌──────────────────────┐
│ datetime             │
│ ---                  │
│ i64                  │
╞══════════════════════╡
│ -9223372036854775808 │
└──────────────────────┘

stinodego · 2024-02-13T13:35:23Z

About the general behaviour, would it be correct to say that even currently some operations between datetimes/durations "may" cause overflows?

Definitely. Overflows can happen, also when adding two large integers, for example.

However, when adding integers of different types, a supertype is defined to try to prevent data loss due to overflows, e.g. Int32 + UInt32 = Int64. This is lossless. Another example is Int8 + UInt64 = Float64, which is potentially lossy.

Basically, we have to weigh the precision loss versus the overflow chance. Or we could just not define a supertype and raise when trying to operate on datatypes with different time units. Or there should be a strict cast somewhere that raises the error only if using the supertype will result in an overflow.

MarcoGorelli · 2024-02-21T18:37:11Z

Just for reference, looks like numpy also silently overflows here

In [45]: np.datetime64('1970-01-01', 'ns') + np.timedelta64(2**60, 'ms')
Out[45]: numpy.datetime64('1970-01-01T00:00:00.000000000')

In [46]: _.dtype
Out[46]: dtype('<M8[ns]')

However, I think I prefer your suggestion:

Or there should be a strict cast somewhere that raises the error only if using the supertype will result in an overflow.

Is there a precursor issue that needs fixing here though? The following silently overflows in Polars even with the (default) strict cast:

In [71]: pl.Series([2**63-1], dtype=pl.Duration('ms')).cast(pl.Duration('us'))
Out[71]:
shape: (1,)
Series: '' [duration[μs]]
[
        -1ms
]

Conversely, casting an integer from Int16 to Int8 raises:

In [72]: pl.Series([128], dtype=pl.Int16).cast(pl.Int8)
---------------------------------------------------------------------------
ComputeError                              Traceback (most recent call last)
Cell In[72], line 1
----> 1 pl.Series([128], dtype=pl.Int16).cast(pl.Int8)

File ~/tmp/.venv/lib/python3.10/site-packages/polars/series/series.py:4126, in Series.cast(self, dtype, strict)
   4124 # Do not dispatch cast as it is expensive and used in other functions.
   4125 dtype = py_type_to_dtype(dtype)
-> 4126 return self._from_pyseries(self._s.cast(dtype, strict))

ComputeError: conversion from `i16` to `i8` failed in column '' for 1 out of 1 values: [128]

stinodego · 2024-02-22T07:53:50Z

The underlying issue there is that casting to a more precise subsecond involves multiplication of the underlying integer, and we do not have checked arithmetic. So we don't have a way to detect these overflows at the moment, I think.

EDIT: See #14655

stinodego · 2024-02-23T21:52:48Z

After discussion today, this is not a bug and we want to keep the existing behavior. Overflows are worse than precision loss.

The correct approach here is to explicitly cast to ns if you want to keep the nanosecond precision:

X.select(pl.col("t") + d.dt.cast_time_unit("ns"))

djbarker added bug Something isn't working python Related to Python Polars labels Oct 25, 2023

MarcoGorelli added the A-temporal Area: date/time functionality label Oct 28, 2023

Giuzzilla mentioned this issue Oct 28, 2023

fix: Keep highest time unit precision in arithmetic between Datetime/Duration #12086

Closed

stinodego added the needs triage Awaiting prioritization by a maintainer label Jan 13, 2024

stinodego added P-medium Priority: medium and removed needs triage Awaiting prioritization by a maintainer labels Feb 13, 2024

github-project-automation bot added this to Backlog Feb 13, 2024

github-project-automation bot moved this to Ready in Backlog Feb 13, 2024

stinodego closed this as not planned Won't fix, can't repro, duplicate, stale Feb 23, 2024

github-project-automation bot moved this from Ready to Done in Backlog Feb 23, 2024

stinodego removed this from Backlog Feb 23, 2024

stinodego added the invalid A bug report that is not actually a bug label Feb 23, 2024

github-project-automation bot added this to Backlog Feb 23, 2024

github-project-automation bot moved this to Ready in Backlog Feb 23, 2024

stinodego removed the P-medium Priority: medium label Feb 23, 2024

MarcoGorelli mentioned this issue Dec 22, 2024

Upcast date + duration and date - duration operations to datetime #20397

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Adding duration changes "ns" time_unit to "us" (and loses info) #12023

Adding duration changes "ns" time_unit to "us" (and loses info) #12023

djbarker commented Oct 25, 2023

MarcoGorelli commented Oct 28, 2023

stinodego commented Feb 13, 2024 •

edited

Loading

MarcoGorelli commented Feb 13, 2024

Giuzzilla commented Feb 13, 2024 •

edited

Loading

stinodego commented Feb 13, 2024

MarcoGorelli commented Feb 21, 2024

stinodego commented Feb 22, 2024 •

edited

Loading

stinodego commented Feb 23, 2024

Adding duration changes "ns" time_unit to "us" (and loses info) #12023

Adding duration changes "ns" time_unit to "us" (and loses info) #12023

Comments

djbarker commented Oct 25, 2023

Checks

Reproducible example

Log output

Issue description

Expected behavior

Installed versions

MarcoGorelli commented Oct 28, 2023

stinodego commented Feb 13, 2024 • edited Loading

MarcoGorelli commented Feb 13, 2024

Giuzzilla commented Feb 13, 2024 • edited Loading

stinodego commented Feb 13, 2024

MarcoGorelli commented Feb 21, 2024

stinodego commented Feb 22, 2024 • edited Loading

stinodego commented Feb 23, 2024

stinodego commented Feb 13, 2024 •

edited

Loading

Giuzzilla commented Feb 13, 2024 •

edited

Loading

stinodego commented Feb 22, 2024 •

edited

Loading