-
-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Preserve order of variables in combine_by_coords #9070
Conversation
I think the reason I originally put the sort call in there was because in the
But if it seems to work without that then I guess it's fine? |
I think this was from times where we had to deal with unsorted dict. At least that was what I understood from a previous comment at that code position. |
Now, in light of this... But somehow it works. |
OK, let me add some more testing to be sure this works in datasets of any order. |
the reason |
Thanks @keewis. AFAICT, the sorting before groupby makes sure that all Datasets with same variables are grouped together regardless of the variable order within each Dataset. Update: And the position in the object list. Removing the sorting will result in more groups but doesn't break the tests. Does that mean we are undertesting?. Or is it just fixed by the subsequent merge? The issue which should be solved by this PR is that this sorting rearranges the input objects and when using |
After reading on itertools and collections I've found that it's possible to change itertools.groupby with a collections.defaultdict implementation to preserve order and with some intermediate performance gain: import xarray as xr
import collections
import itertools
def vars_as_keys(ds):
return tuple(sorted(ds))
def groupby_defaultdict(iter, key=lambda x: x):
idx = collections.defaultdict(list)
for i, obj in enumerate(iter):
idx[key(obj)].append(i)
for k, ix in idx.items():
yield k, (iter[i] for i in ix)
def groupby_itertools(iter, key=lambda x: x):
iter = sorted(iter, key=vars_as_keys)
return itertools.groupby(iter, key=vars_as_keys)
x1 = xr.Dataset({"a": (("y", "x"), [[1]]),
"c": (("y", "x"), [[1]]),
"b": (("y", "x"), [[1]]),},
coords={"y": [0], "x": [0]})
x2 = xr.Dataset({"d": (("y", "x"), [[1]]),
"a": (("y", "x"), [[2]]),},
coords={"y": [0], "x": [0]})
x3 = xr.Dataset({"a": (("y", "x"), [[3]]),
"d": (("y", "x"), [[2]]),},
coords={"y": [0], "x": [1]})
data_objects = [x2, x1, x3] %%timeit
grouped_by_vars = groupby_defaultdict(data_objects, key=vars_as_keys)
274 ns ± 0.934 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each) %%timeit
grouped_by_vars = groupby_itertools(data_objects, key=vars_as_keys)
4.98 µs ± 14 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each) |
It looks like this can be replaced here too: Lines 258 to 261 in 447e5a3
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, some type hints works be nice!
From all what I read about the comparison Both have The major PRO for I'm not sure if there is another solution including |
Co-authored-by: Michael Niklas <[email protected]>
for more information, see https://pre-commit.ci
for more information, see https://pre-commit.ci
I've fixed the typing issue and replaced the second occurrence here. For the above shown minibenchmark there is a slight performance gain. The asv bench doesn't show a performance change. The major win is that it fixes #8828. Ready for another review. |
Do you see any blockers here @keewis, @TomNicholas? |
I've added this to #10002 and will merge this tomorrow, if there are no complaints. |
Sorry for taking so long to reply to this 😓
IIRC the point of the groupby is to group datasets into sets whose variables have common concatenation dimensions. See CWorthy-ocean/roms-tools#223 (comment) for an example of a use case.
I think as long as those are just subgroups of the original groups (i.e. those created by the code that does sorting) then yes the subsequent merge should fix it. The reason being that any subset of a group containing variables with common concatenation dimensions should also have those same common concatenation dimensions? |
No worries :-) . So, essentially the proposed code does the same groupby, but keeps the initial order. Did you had a chance to test your notebook against this PR? |
Just for the record, the example from above with the according output. We can see that the defaultdict_groupby generates the same groups and keeps them in their natural order. I think this is along the lines the behaviour @keewis mentioned for import xarray as xr
import collections
import itertools
def vars_as_keys(ds):
return tuple(sorted(ds))
def groupby_defaultdict(iter, key=lambda x: x):
idx = collections.defaultdict(list)
for i, obj in enumerate(iter):
idx[key(obj)].append(i)
for k, ix in idx.items():
yield k, (iter[i] for i in ix)
def groupby_itertools(iter, key=lambda x: x):
iter = sorted(iter, key=vars_as_keys)
return itertools.groupby(iter, key=vars_as_keys)
x1 = xr.Dataset({"a": (("y", "x"), [[1]]),
"c": (("y", "x"), [[1]]),
"b": (("y", "x"), [[1]]),},
coords={"y": [0], "x": [0]})
x2 = xr.Dataset({"d": (("y", "x"), [[1]]),
"a": (("y", "x"), [[2]]),},
coords={"y": [0], "x": [0]})
x3 = xr.Dataset({"a": (("y", "x"), [[3]]),
"d": (("y", "x"), [[2]]),},
coords={"y": [0], "x": [1]})
data_objects = [x2, x1, x3]
for vars, ds_with_same_vars in groupby_defaultdict(data_objects, key=vars_as_keys):
print(vars, tuple(ds_with_same_vars))
print("---------------------------------------------------------------------")
for vars, ds_with_same_vars in groupby_itertools(data_objects, key=vars_as_keys):
print(vars, tuple(ds_with_same_vars)) ('a', 'd') (<xarray.Dataset> Size: 32B
Dimensions: (y: 1, x: 1)
Coordinates:
* y (y) int64 8B 0
* x (x) int64 8B 0
Data variables:
d (y, x) int64 8B 1
a (y, x) int64 8B 2, <xarray.Dataset> Size: 32B
Dimensions: (y: 1, x: 1)
Coordinates:
* y (y) int64 8B 0
* x (x) int64 8B 1
Data variables:
a (y, x) int64 8B 3
d (y, x) int64 8B 2)
('a', 'b', 'c') (<xarray.Dataset> Size: 40B
Dimensions: (y: 1, x: 1)
Coordinates:
* y (y) int64 8B 0
* x (x) int64 8B 0
Data variables:
a (y, x) int64 8B 1
c (y, x) int64 8B 1
b (y, x) int64 8B 1,)
---------------------------------------------------------------------
('a', 'b', 'c') (<xarray.Dataset> Size: 40B
Dimensions: (y: 1, x: 1)
Coordinates:
* y (y) int64 8B 0
* x (x) int64 8B 0
Data variables:
a (y, x) int64 8B 1
c (y, x) int64 8B 1
b (y, x) int64 8B 1,)
('a', 'd') (<xarray.Dataset> Size: 32B
Dimensions: (y: 1, x: 1)
Coordinates:
* y (y) int64 8B 0
* x (x) int64 8B 0
Data variables:
d (y, x) int64 8B 1
a (y, x) int64 8B 2, <xarray.Dataset> Size: 32B
Dimensions: (y: 1, x: 1)
Coordinates:
* y (y) int64 8B 0
* x (x) int64 8B 1
Data variables:
a (y, x) int64 8B 3
d (y, x) int64 8B 2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can see that the
groupby_defaultdict
generates the same groups and keeps them in their natural order
This sounds fine to me.
whats-new.rst