Removing pint-pandas dependency #234
Comments
Might be a good time to think about #72 as well.
My instinct is "remove the pint_pandas dependency in exchange for lower-level pint usage". Pint-pandas has always been pretty imperfect in my opinion, most clearly illustrated by the fact that these three MRs have gone round in circles then died in a ditch:
The tl;dr of those is that pandas has all sorts of weird quirks and whacking them all away without the pandas devs paying much attention is way too hard, and my current impression is the pandas devs aren't that interested. The nuclear option is switching to an xarray backend and letting that do the work. It looks like the xarray devs are putting huge effort into this (https://xarray.dev/blog/introducing-pint-xarray) so they're probably the right horse to jump on, but that's a much bigger change.
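To make the "lower-level pint usage" idea a bit more concrete, here is a minimal sketch of doing unit-aware arithmetic with plain pint on numpy values, tracking units alongside the data rather than inside a pandas extension array (the variable names, data and units are made up for illustration, this is not the actual ops code):
import numpy as np
import pint

ureg = pint.UnitRegistry()

# timeseries values stored as plain numpy arrays, units tracked separately
emissions = ureg.Quantity(np.array([10.0, 15.0, 18.0]), "Gt / yr")
time_step = ureg.Quantity(5.0, "yr")

# unit handling happens on the Quantity wrappers, not via pandas
cumulative = emissions * time_step
print(cumulative.to("Gt"))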
If anyone wants to play around, some xarray exploration I just did is below. My thinking is that tweaking our backend to be xarray would be fiddly, but maybe doable, and we would get so much support effectively for free.
import numpy as np
import pint
import pint_xarray
import xarray as xr
time = np.array([2010, 2015, 2020])
temperature = xr.DataArray(
[10, 12, 15],
coords={"time": time},
).pint.quantify("K")
speed = xr.DataArray(
[4, 8, 12],
coords={"time": time},
).pint.quantify("m / s")
emissions = xr.DataArray(
[10, 15, 18],
coords={"time": time},
).pint.quantify("t / s")
ds = xr.Dataset({"T": temperature, "v": speed, "e": emissions}).pint.quantify({"time": "yr"})
display(ds)
ds + ds
ds / ds
ds ** 2
ds / pint.Quantity(3, "m")
# This silently drops anything that doesn't line up, it's not clear to me
# whether we would want that or not (the pandas behaviour is to have nan
# instead)
ds / ds[["T", "v"]]
# This doesn't really do what we want because it isn't aware of the units in
# time. I expect adding a PR for this probably wouldn't be that hard (existing
# issue is here https://github.com/xarray-contrib/pint-xarray/issues/205)
ds.integrate("time")
temperature_b = xr.DataArray(
[1, 2, 3],
coords={"time": time},
).pint.quantify("mK")
ds_b = xr.Dataset({"T": temperature_b, "v": speed * 2, "e": emissions * 4})
# so smooth
display(ds + ds_b)
ds_c = xr.Dataset({"T": temperature_b ** 2, "v": speed, "e": emissions})
# blows up as expected, error message isn't perfect but I'm sure PRs would be
# accepted
ds + ds_c
temperature_mi = xr.DataArray(
[[10, 12, 15], [3, 4, 8]],
coords={"_index": [0, 1], "time": time},
).pint.quantify("K")
speed_mi = xr.DataArray(
[[4, 8, 12]],
coords={"_index": [0], "time": time},
).pint.quantify("m / s")
emissions_mi = xr.DataArray(
[[10, 15, 18]],
coords={"_index": [0], "time": time},
).pint.quantify("t / s")
# This isn't perfect because you have nans in "v" and "e" for _index equal to
# 1. However, we tend to have datasets where the variables are the same across
# all other dimensions (e.g. we have datasets where we have a bunch of
# scenarios, all of which report the same variables). In this 'majority use
# case' this solution wouldn't lead to sparse arrays. The only time we would
# get sparse arrays would be when we have datasets where the variables aren't
# consistent across all other dimensions. It's basically the same problem as
# what we solved with our `to_xarray` method via the `extras` argument, but
# we'd just have to apply it internally too. The question of course is whether
# PRIMAP is now a majority use case and whether this solution would just fall
# over in PRIMAP land...
ds_mi = xr.Dataset({"T": temperature_mi, "v": speed_mi, "e": emissions_mi})
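Continuing from the ds built above, getting back to a plain pandas representation is roughly one line via pint-xarray's dequantify plus xarray's own pandas conversion. This is just a sketch of what a pandas-facing layer over an xarray backend could look like, not a worked-out scmdata API:
# strip the pint quantities back out; the units end up in the attrs
df = ds.pint.dequantify().to_dataframe()
display(df)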
Resolving this would be great as there's a new issue with pymagicc. The issue is (I think) that pymagicc requires pint-pandas 0.4, which does not work with pandas 2.0. Jumping to pint-pandas 0.5 requires pint 0.22.
Ever since pandas-dev/pandas#44626, pandas has been behaving nicely with pint-pandas.
That's good to know, thanks! We may still drop to pint internally because casting back and forth with pandas is a relatively expensive operation for what we want to do. We're not moving very fast on this anyway, so let's see where we are once we do actually make a push. If you see a way to do operations on a basis that is bigger than column-by-column I'd be interested to hear it. I'm not sure it's possible given pandas' design and the trickiness of carrying units around with stuff, but maybe there is a way (I think pandas broadcasts internally to do e.g. …). I started playing around with profiling and different ops implementations here if anyone is interested: https://github.com/openscm/scmdata/tree/ops-profiling
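For reference, one rough sketch of what operations "on a basis bigger than column-by-column" could look like if units are tracked per timeseries outside pandas: group timeseries by unit, convert each group to a common unit once, then apply a single vectorised numpy operation to the whole block. All names and data here are illustrative, not scmdata internals, and this only works for multiplicative units (not offset units like degC):
import numpy as np
import pint

ureg = pint.UnitRegistry()

# pretend data: rows are timeseries, columns are time points
values = np.array(
    [
        [1.0, 2.0, 3.0],  # Gt / yr
        [4.0, 5.0, 6.0],  # Gt / yr
        [7.0, 8.0, 9.0],  # Mt / yr
    ]
)
units = ["Gt / yr", "Gt / yr", "Mt / yr"]
target = "Gt / yr"

# convert each unit group to the target unit once (multiplicative units only)
converted = values.astype(float).copy()
for unit in set(units):
    rows = [i for i, u in enumerate(units) if u == unit]
    factor = ureg.Quantity(1.0, unit).to(target).magnitude
    converted[rows] = values[rows] * factor

# a single vectorised operation then covers every timeseries at once
print(converted * 2.0)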
Is your feature request related to a problem? Please describe.
Pint-pandas doesn't play nicely with the latest version of pint, as seen in #233 where things explode on import. This also means that scmdata can't be used in PRIMAP2 due to pint version conflicts.
Do we really need the pint-pandas dependency here? At a glance it is only used in ops.
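For context, the kind of pint-pandas usage in question looks roughly like this (a generic illustration, not the actual ops code, and it assumes a pint / pint-pandas combination that imports cleanly):
import pandas as pd
import pint_pandas  # registers the "pint[...]" extension dtype

a = pd.Series([1.0, 2.0, 3.0], dtype="pint[t/yr]")
b = pd.Series([400.0, 500.0, 600.0], dtype="pint[kg/yr]")

# unit-aware arithmetic goes through the pandas extension array machinery
print(a + b)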
Describe the solution you'd like
We have to do one of the following:
I haven't dived into the relative effort required for an upstream fix.
@mikapfl @znicholls, being the pint experts, is there a preferred path here?