Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Removing pint-pandas dependency #234

Open
lewisjared opened this issue May 5, 2023 · 7 comments
Open

Removing pint-pandas dependency #234

lewisjared opened this issue May 5, 2023 · 7 comments
Labels
bug Something isn't working dependencies Pull requests that update a dependency file

Comments

@lewisjared
Copy link
Collaborator

Is your feature request related to a problem? Please describe.
Pint-pandas doesn't play nice with the latest version of pint as seen in #233 where things explode on import. This also means that scmdata can't be used in PRIMAP2 due to pint version conflicts.

Do we really need the pint-pandas dependency here? At a glance it is only used in ops.

Describe the solution you'd like

We have to do one of the following:

  • Remove the pint_pandas dependency in exchange with lower-level pint usage
  • Make an upstream fix

I haven't dived into the relative effort required for an upstream fix.

@mikapfl @znicholls being the pint experts is there a preferred path here?

@lewisjared lewisjared added bug Something isn't working dependencies Pull requests that update a dependency file labels May 5, 2023
@lewisjared
Copy link
Collaborator Author

Might be a good time to think about #72 as well

@znicholls
Copy link
Collaborator

@mikapfl @znicholls being the pint experts is there a preferred path here?

My instinct is "remove the pint_pandas dependency in exchange with lower-level pint usage". Pint-pandas has always been pretty imperfect in my opinion, most clearly illustrated by the fact that these three MRs have gone round in circles then died in a ditch:

The tl;dr of those is that pandas has all sorts of weird quirks and whacking them all away without the pandas devs paying much attention is way too hard, and my current impression is the pandas devs aren't that interested.

The nuclear option is switching to an xarray backend and letting that do the work. It looks like the xarray devs are putting huge effort into this (https://xarray.dev/blog/introducing-pint-xarray) so they're probably the right horse to jump on, but that's a much bigger change.

@znicholls
Copy link
Collaborator

If anyone wants to play around, some xarray exploration I just did is below. My thinking is that tweaking our backend to be xarray would be fiddly, but maybe doable and we would get so much support effectively for free.

xarray exploration here
import numpy as np
import pint
import pint_xarray
import xarray as xr

time  = np.array([2010, 2015, 2020])

temperature = xr.DataArray(
    [10, 12, 15],
    coords={"time": time},
).pint.quantify("K")

speed = xr.DataArray(
    [4, 8, 12],
    coords={"time": time},
).pint.quantify("m / s")

emissions = xr.DataArray(
    [10, 15, 18],
    coords={"time": time},
).pint.quantify("t / s")


ds = xr.Dataset({"T": temperature, "v": speed, "e": emissions}).pint.quantify({"time": "yr"})
display(ds)

ds + ds

ds / ds

ds ** 2

ds / pint.Quantity(3, "m")

# This silently drops anything that doesn't line up, it's not clear to me
# whether we would want that or not (the pandas behaviour is to have nan
# instead)
ds / ds[["T", "v"]]

# This doesn't really do what we want because it isn't aware of the units in
# time. I expect adding a PR for this probably wouldn't be that hard (existing
# issue is here https://github.com/xarray-contrib/pint-xarray/issues/205)
ds.integrate("time")

temperature_b = xr.DataArray(
    [1, 2, 3],
    coords={"time": time},
).pint.quantify("mK")

ds_b = xr.Dataset({"T": temperature_b, "v": speed * 2, "e": emissions * 4})

# so smooth
display(ds + ds_b)

ds_c = xr.Dataset({"T": temperature_b ** 2, "v": speed, "e": emissions})

# blows up as expected, error message isn't perfect but I'm sure PRs would be
# accepted
ds + ds_c


temperature_mi = xr.DataArray(
    [[10, 12, 15], [3, 4, 8]],
    coords={"_index": [0, 1], "time": time},
).pint.quantify("K")

speed_mi = xr.DataArray(
    [[4, 8, 12]],
    coords={"_index": [0], "time": time},
).pint.quantify("m / s")

emissions_mi = xr.DataArray(
    [[10, 15, 18]],
    coords={"_index": [0], "time": time},
).pint.quantify("t / s")

# This isn't perfect because you have nans in "v" and "e" for _index equal to
# 1. However, we tend to have datasets where the variables are the same across
# all other dimensions (e.g. we have datasets where we have a bunch of
# scenarios, all of which report the same variables). In this 'majority use
# case' this solution wouldn't lead to sparse arrays. The only time we would
# get sparse arrays would be when we have datasets where the variables aren't
# consistent across all other dimensions. It's basically the same problem as
# what we solved with our `to_xarray` method via the `extras` argument, but
# we'd just have to apply it internally too. The question of course is whether
# PRIMAP is now a majority use case and whether this solution would just fall
# over in PRIMAP land...
ds_mi = xr.Dataset({"T": temperature_mi, "v": speed_mi, "e": emissions_mi})

@znicholls
Copy link
Collaborator

Decision from discussion: drop down to using pint directly internally

Follow up issues: #236 #237

@phackstock
Copy link

Resolving this would be great as there's a new issue with pymagicc. The issue is (I think) that pymagicc requires pint-pandas 0.4, which does not work with pandas 2.0. Jumping to pint-pandas 0.5 requires pint 0.22.

@andrewgsavage
Copy link

@mikapfl @znicholls being the pint experts is there a preferred path here?

My instinct is "remove the pint_pandas dependency in exchange with lower-level pint usage". Pint-pandas has always been pretty imperfect in my opinion, most clearly illustrated by the fact that these three MRs have gone round in circles then died in a ditch:

The tl;dr of those is that pandas has all sorts of weird quirks and whacking them all away without the pandas devs paying much attention is way too hard, and my current impression is the pandas devs aren't that interested.

The nuclear option is switching to an xarray backend and letting that do the work. It looks like the xarray devs are putting huge effort into this (https://xarray.dev/blog/introducing-pint-xarray) so they're probably the right horse to jump on, but that's a much bigger change.

Ever since pandas-dev/pandas#44626 pandas has been behaving nicely with pint-pandas

@znicholls
Copy link
Collaborator

Ever since pandas-dev/pandas#44626 pandas has been behaving nicely with pint-pandas

That's good to know, thanks! We may still drop to pint internally because casting back and forth with pandas is a relatively expensive operation for what we want to do. We're not moving very fast on this anyway so let's see where we are once we do actually make a push.

If you see a way to do operations on a basis that is bigger than column-by-column I'd be interested to hear it. I'm not sure it's possible given pandas' design and the trickiness of carrying units around with stuff, but maybe there is a way (I think pandas broadcasts internally to do e.g. .add and that would cause havoc with units I think, but I should probably try and do some profiling before coming to strong conclusions).

I started playing around with profiling and different ops implementations here if anyone is interested: https://github.com/openscm/scmdata/tree/ops-profiling

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working dependencies Pull requests that update a dependency file
Projects
None yet
Development

No branches or pull requests

4 participants