Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Preserve monotonic descending index order when merging #2972

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

Huite
Copy link
Contributor

@Huite Huite commented May 18, 2019

  • Addresses GH2947

  • When indexes were joined in a dataset merge, they would always get sorted in ascending order. This is awkward for geospatial grids, which are nearly always descending in the "y" coordinate.

  • This also caused an inconsistency: when a merge is called on datasets with identical descending indexes, the resulting index is descending. When a merge is called with non-identical descending indexes, the resulting index is ascending.

  • When indexes are mixed ascending and descending, or non-monotonic, the resulting index is still sorted in ascending order.

Comments

I was doing some work and I kept running into the issue described at #2947, so I had a try at a fix. It was somewhat of a hassle to understand the issue because I kept running into seeming inconsistencies. This is caused by the fact that the joiner doesn't sort with a single index:

def _get_joiner(join):
    if join == 'outer':
        return functools.partial(functools.reduce, operator.or_)

That makes sense, since I'm guessing pandas.Index.union isn't get called at all. (I still find the workings of functools a little hard to infer.)

I also noticed that an outer join gets called with e.g. an .isel operation, even though there's only one index (so there's not really anything to join). However, skipping the join completely in that case makes several tests fail since dimension labels end up missing (I guess the joiner call takes care of it).

It's just checking for the specific case now, but it feels like an very specific issue anyway...

The merge behavior is slightly different now, which is reflected in the updated test outcomes in test_dataset.py. These tests were turning monotonic decreasing indexes into an increasing index; now the decreasing order is maintained.

* Addresses GH2947

* When indexes were joined in a dataset merge, they would always get sorted in ascending order. This is awkward for geospatial grids, which are nearly always descending in the "y" coordinate.

* This also caused an inconsistency: when a merge is called on datasets with identical descending indexes, the resulting index is descending. When a merge is called with non-identical descending indexes, the resulting index is ascending.

* When indexes are mixed ascending and descending, or non-monotonic, the resulting index is still sorted in ascending order.
@shoyer
Copy link
Member

shoyer commented May 19, 2019

I'm not entirely sure this is a good idea -- a discrepancy in the behavior of an outer join between pandas seems non-ideal. Perhaps somebody else has opinions about whether this would be worthwhile?

If we do want to do this, let's put the change inside the _get_joiner function. Specifically, instead of

def _get_joiner(join):
    if join == 'outer':
        return functools.partial(functools.reduce, operator.or_)

you could write something like:

def _outer_join(indexes):
    index = functools.reduce(indexes, operator.or_)
    # your logic for reverse sorting the result
    return index

def _get_joiner(join):
    if join == 'outer':
        return _outer_join

Note that operator.or_ on a pandas.Index is the same thing as pandas.Index.union, so this could be equivalently written (maybe more clearly) as:

def _outer_join(indexes):
    index = indexes[0]
    for other in indexes[1:]
        index = index.union(other)
    # your logic for reverse sorting the result
    return index

@Huite
Copy link
Contributor Author

Huite commented May 19, 2019

I'm definitely not convinced it's a great idea either; a pull request is hopefully the best way to get some discussion going! Putting it in the _get_joiner is definitely the sensible choice, and the last suggestion is definitely more clear to me.

I've had a bit more thought and maybe it's useful to consider what xarray's philosophy about these things is. I think flexibility is a primary concern and generally, things *just work*. The reason I'm running into issues here is because I'm writing some code which operates on DataArrays, which isn't nearly as robust or flexible, and doesn't just work.

The answer in this case might be that I should be using xarray's powerful alignment machinery to do the work for me, rather than to assume/demand certain features of the data. Of course, that requires some digging into how xarray does alignment. But I'd end up with a more flexible tool in the end.

Perhaps that should be the general rule: if you're extending xarray, like xarray, don't rely too much about your coordinates staying the way they are. Maybe such a description could belong in the xarray Internals page just to make it explicit.

@dcherian
Copy link
Contributor

From a user perspective, I think it's plain weird to merge a bunch of datasets with montonically decreasing coordinates and land up with a new dataset that has monotonically increasing coordinates.

@shoyer
Copy link
Member

shoyer commented May 27, 2019

Yes, I do think this deviation from pandas would make sense -- though it would be even better if we could convince pandas to make the same change!

We already have low-level tools for controlling how alignment works (construct the index yourself and use .reindex), but we should make the default behavior as user friendly as possible.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

xr.merge always sorts indexes ascending
3 participants