-
-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor datetime and timedelta encoding for increased robustness #9498
base: main
Are you sure you want to change the base?
Refactor datetime and timedelta encoding for increased robustness #9498
Conversation
* Implement robust units checking for lazy encoding * Remove dtype casting, which interferes with missing value encoding
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After line 332, we could have an int32
value for num_dates
(e.g. if the user provided an int32
value to begin with). This will be passed down to _decode_datetime_with_pandas
, which, upon calling pd.to_timedelta(flat_num_dates.min(), time_units)
will raise OutOfBoundsTimedelta: Cannot cast 24 from h to 'ns' without overflow
due to pandas not handling the int32
first argument. This is preventable by casting num_dates
to int64
.
I believe this is related to pandas-dev/pandas#56996
Thanks for pointing this out @langmore. Indeed I think we actually have tests that would fail because of this in our suite, but we don't (yet) have a CI environment that has NumPy >= 2, but no cftime. In other words the fact that we can fall back to decoding times with cftime masks this. Your suggested workaround makes sense to me; we may just want to make sure that we cast signed integers to |
@spencerkclark I won't have time to fix this, since we're also falling back on cftime so it works for us, for now. |
This PR makes the proposed updates in #9488 (comment), removing the guard against overflow in dtype casting. In so doing I have added units checking / potential modification to the cftime code path to bring things up to speed with the NumPy code path, and also added a regression test for #9134.
Now we more robustly raise an error in the case of chunked times where an integer dtype encoding is specified, but the units do not allow for an accurate roundtrip. This was the original intent of the code, though we can also discuss whether it is better to warn instead of raise in this instance—one way or another this would also make it safer to switch the default units with which we encode chunked times (#9154).
As @kmuehlbauer notes, this would be another way to fix #9488; in some ways it is complementary to #9497.
whats-new.rst