Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding duration changes "ns" time_unit to "us" (and loses info) #12023

Closed
2 tasks done
djbarker opened this issue Oct 25, 2023 · 8 comments
Closed
2 tasks done

Adding duration changes "ns" time_unit to "us" (and loses info) #12023

djbarker opened this issue Oct 25, 2023 · 8 comments
Labels
A-temporal Area: date/time functionality bug Something isn't working invalid A bug report that is not actually a bug python Related to Python Polars

Comments

@djbarker
Copy link

Checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl
import datetime as dt
t = pl.datetime_range(dt.date.fromisoformat("2022-01-01"), dt.date.fromisoformat("2022-01-02"), interval="30m", eager=True, time_unit="ns")
d = pl.duration(microseconds=10)
X = pl.DataFrame({"t": t})
X.select(t_ns = (pl.col("t") + d).dt.cast_time_unit("ns"), t_us = pl.col("t") + d)

Log output

No response

Issue description

When adding a duration to a "ns" timestamp column the column we get back is in "us" and any nanos information is truncated.
This happens even if the duration is specified with some non-zero nanos component.

I did not find an issue logged for exactly this but it looks related to this issue: Issue 11747

Expected behavior

Should add the duration and maintain the timestamps as nanos (even if the duration is >= 1us).

Installed versions

--------Version info---------
Polars:              0.19.11
Index type:          UInt32
Platform:            Linux-5.14.0-162.6.1.el9_1.x86_64-x86_64-with-glibc2.34
Python:              3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0]

----Optional dependencies----
adbc_driver_sqlite:  <not installed>
cloudpickle:         2.2.1
connectorx:          <not installed>
deltalake:           <not installed>
fsspec:              2023.9.2
gevent:              <not installed>
matplotlib:          3.8.0
numpy:               1.25.2
openpyxl:            <not installed>
pandas:              1.5.3
pyarrow:             12.0.1
pydantic:            2.4.2
pyiceberg:           <not installed>
pyxlsb:              <not installed>
sqlalchemy:          2.0.22
xlsx2csv:            <not installed>
xlsxwriter:          <not installed>
@djbarker djbarker added bug Something isn't working python Related to Python Polars labels Oct 25, 2023
@MarcoGorelli
Copy link
Collaborator

thanks @djbarker for the report

simpler reproducer:

In [28]: df = pl.DataFrame({'ts': ['2020-01-01T00:00:00.123456789']}).select(pl.col.ts.str.to_datetime(time_unit='ns'))

In [29]: df.with_columns(ts_shifted=pl.col('ts')+pl.duration(microseconds=10))
Out[29]:
shape: (1, 2)
┌───────────────────────────────┬────────────────────────────┐
│ tsts_shifted                 │
│ ------                        │
│ datetime[ns]                  ┆ datetime[μs]               │
╞═══════════════════════════════╪════════════════════════════╡
│ 2020-01-01 00:00:00.1234567892020-01-01 00:00:00.123466 │
└───────────────────────────────┴────────────────────────────┘

Expected:

shape: (1, 2)
┌───────────────────────────────┬───────────────────────────────┐
│ tsts_shifted                    │
│ ------                           │
│ datetime[ns]                  ┆ datetime[ns]                  │
╞═══════════════════════════════╪═══════════════════════════════╡
│ 2020-01-01 00:00:00.1234567892020-01-01 00:00:00.123466789 │
└───────────────────────────────┴───────────────────────────────┘

@MarcoGorelli MarcoGorelli added the A-temporal Area: date/time functionality label Oct 28, 2023
@stinodego stinodego added the needs triage Awaiting prioritization by a maintainer label Jan 13, 2024
@stinodego stinodego added P-medium Priority: medium and removed needs triage Awaiting prioritization by a maintainer labels Feb 13, 2024
@github-project-automation github-project-automation bot moved this to Ready in Backlog Feb 13, 2024
@stinodego
Copy link
Contributor

stinodego commented Feb 13, 2024

I am not sure what we should do here. Because keeping the highest precision time unit may cause overflows. The code below works correctly on main, but would be broken if we used nanosecond precision in the result.

import polars as pl

df = pl.DataFrame(
    {"datetime": [0], "duration": [2**60]},
    schema={"datetime": pl.Datetime("ns"), "duration": pl.Duration("ms")},
)

result = df.select(pl.col("datetime") + pl.col("duration"))

print(result.dtypes[0].time_unit)  # ms
print(result.cast(pl.Int64))  # 2**60

@MarcoGorelli what do you think is the correct approach here?

@MarcoGorelli
Copy link
Collaborator

thanks, I hadn't noticed that. will take a close look

@Giuzzilla
Copy link

Giuzzilla commented Feb 13, 2024

@stinodego
Thanks a lot for your review on the PR.

I agree that the proposed change makes it more likely to have overflows.
About the general behaviour, would it be correct to say that even currently some operations between datetimes/durations "may" cause overflows?
e.g. tweaking the numbers of your example. (Note I'm not able to instantiate directly a duration that big so for my not-so-natural example I summed it twice)

In [18]: import polars as pl
    ...: 
    ...: df = pl.DataFrame(
    ...:     {"datetime": [0], "duration": [2**62]},
    ...:     schema={"datetime": pl.Datetime("ns"), "duration": pl.Duration("ms")},
    ...: )
    ...: 
    ...: result = df.select(pl.col("datetime") + pl.col("duration") + pl.col("duration"))
    ...: 
    ...: print(result.dtypes[0].time_unit)
    ...: print(result.cast(pl.Int64))
ms
shape: (1, 1)
┌──────────────────────┐
│ datetime             │
│ ---                  │
│ i64                  │
╞══════════════════════╡
│ -9223372036854775808 │
└──────────────────────┘

or summing between two very big ns datetime/durations:

In [35]: import polars as pl
    ...: 
    ...: df = pl.DataFrame(
    ...:     {"datetime": [2**62], "duration": [2**62]},
    ...:     schema={"datetime": pl.Datetime("ns"), "duration": pl.Duration("ns")},
    ...: )
    ...: 
    ...: result = df.select(pl.col("datetime") + pl.col("duration"))
    ...: 
    ...: print(result.dtypes[0].time_unit)
    ...: print(result.cast(pl.Int64))
ns
shape: (1, 1)
┌──────────────────────┐
│ datetime             │
│ ---                  │
│ i64                  │
╞══════════════════════╡
│ -9223372036854775808 │
└──────────────────────┘

@stinodego
Copy link
Contributor

About the general behaviour, would it be correct to say that even currently some operations between datetimes/durations "may" cause overflows?

Definitely. Overflows can happen, also when adding two large integers, for example.

However, when adding integers of different types, a supertype is defined to try to prevent data loss due to overflows, e.g. Int32 + UInt32 = Int64. This is lossless. Another example is Int8 + UInt64 = Float64, which is potentially lossy.

Basically, we have to weigh the precision loss versus the overflow chance. Or we could just not define a supertype and raise when trying to operate on datatypes with different time units. Or there should be a strict cast somewhere that raises the error only if using the supertype will result in an overflow.

@MarcoGorelli
Copy link
Collaborator

Just for reference, looks like numpy also silently overflows here

In [45]: np.datetime64('1970-01-01', 'ns') + np.timedelta64(2**60, 'ms')
Out[45]: numpy.datetime64('1970-01-01T00:00:00.000000000')

In [46]: _.dtype
Out[46]: dtype('<M8[ns]')

However, I think I prefer your suggestion:

Or there should be a strict cast somewhere that raises the error only if using the supertype will result in an overflow.

Is there a precursor issue that needs fixing here though? The following silently overflows in Polars even with the (default) strict cast:

In [71]: pl.Series([2**63-1], dtype=pl.Duration('ms')).cast(pl.Duration('us'))
Out[71]:
shape: (1,)
Series: '' [duration[μs]]
[
        -1ms
]

Conversely, casting an integer from Int16 to Int8 raises:

In [72]: pl.Series([128], dtype=pl.Int16).cast(pl.Int8)
---------------------------------------------------------------------------
ComputeError                              Traceback (most recent call last)
Cell In[72], line 1
----> 1 pl.Series([128], dtype=pl.Int16).cast(pl.Int8)

File ~/tmp/.venv/lib/python3.10/site-packages/polars/series/series.py:4126, in Series.cast(self, dtype, strict)
   4124 # Do not dispatch cast as it is expensive and used in other functions.
   4125 dtype = py_type_to_dtype(dtype)
-> 4126 return self._from_pyseries(self._s.cast(dtype, strict))

ComputeError: conversion from `i16` to `i8` failed in column '' for 1 out of 1 values: [128]

@stinodego
Copy link
Contributor

stinodego commented Feb 22, 2024

The underlying issue there is that casting to a more precise subsecond involves multiplication of the underlying integer, and we do not have checked arithmetic. So we don't have a way to detect these overflows at the moment, I think.

EDIT: See #14655

@stinodego
Copy link
Contributor

After discussion today, this is not a bug and we want to keep the existing behavior. Overflows are worse than precision loss.

The correct approach here is to explicitly cast to ns if you want to keep the nanosecond precision:

X.select(pl.col("t") + d.dt.cast_time_unit("ns"))

@stinodego stinodego closed this as not planned Won't fix, can't repro, duplicate, stale Feb 23, 2024
@github-project-automation github-project-automation bot moved this from Ready to Done in Backlog Feb 23, 2024
@stinodego stinodego removed this from Backlog Feb 23, 2024
@stinodego stinodego added the invalid A bug report that is not actually a bug label Feb 23, 2024
@github-project-automation github-project-automation bot moved this to Ready in Backlog Feb 23, 2024
@stinodego stinodego removed the P-medium Priority: medium label Feb 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-temporal Area: date/time functionality bug Something isn't working invalid A bug report that is not actually a bug python Related to Python Polars
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

4 participants