
Commit 6b33ad8

TomNicholas authored and shoyer committed
API for N-dimensional combine (#2616)
* Concatenates along a single dimension
* Wrote function to find correct tile_IDs from nested list of datasets
* Wrote function to check that combined_tile_ids structure is valid
* Added test of 2D concatenation
* Tests now check that dataset ordering is correct
* Test concatenation along a new dimension
* Started generalising auto_combine to N-D by integrating the N-D concatenation algorithm
* All unit tests now passing
* Fixed a failing test which I didn't notice because I don't have pseudoNetCDF
* Began updating open_mfdataset to handle N-D input
* Refactored to remove duplicate logic in open_mfdataset & auto_combine
* Implemented shoyer's suggestion in #2553 to rewrite the recursive nested list traverser as an iterator
* --amend
* Now raises ValueError if input not ordered correctly before concatenation
* Added some more prototype tests defining desired behaviour more clearly
* Now raises informative errors on invalid forms of input
* Refactoring to also merge along each dimension
* Refactored to literally just apply the old auto_combine along each dimension
* Added unit tests for open_mfdataset
* Removed TODOs
* Removed format strings
* test_get_new_tile_ids now doesn't assume dicts are ordered
* Fixed failing tests on Python 3.5 caused by accidentally assuming dict was ordered
* Test for getting new tile ID
* Fixed itertoolz import so that it's compatible with older versions
* Increased test coverage
* Added toolz as an explicit dependency to pass tests on Python 2.7
* Updated "what's new"
* No longer attempts to shortcut all concatenation at once if concat_dims=None
* Rewrote using itertools.groupby instead of toolz.itertoolz.groupby to remove hidden dependency on toolz
* Fixed erroneous removal of utils import
* Updated docstrings to include an example of multidimensional concatenation
* Clarified auto_combine docstring for N-D behaviour
* Added unit test for nested list of Datasets with different variables
* Minor spelling and PEP 8 fixes
* Started working on a new API with both auto_combine and manual_combine
* Wrote basic function to infer concatenation order from coords; needs better error handling though
* Attempt at finalised version of public-facing API. All the internals still need to be redone to match though.
* No longer uses entire old auto_combine internally, only concat or merge
* Updated what's new
* Removed unneeded addition to what's new for old release
* Fixed incomplete merge in docstring for open_mfdataset
* Tests for manual combine passing
* Tests for auto_combine now passing
* xfailed weird behaviour with manual_combine trying to determine concat_dim
* Add auto_combine and manual_combine to API page of docs
* Tests now passing for open_mfdataset
* Completed merge so that #2648 is respected, and added tests. Also moved concat to its own file to avoid a circular dependency
* Separated the tests for concat and both combines
* Some PEP 8 fixes
* Pre-empting a test which will fail with opening uamiv format
* Satisfy pep8speaks bot
* Python 3.5 compatible after changing some error string formatting
* Order coords using pandas.Index objects
* Fixed performance bug from GH #2662
* Removed TODOs about natural sorting of string coords
* Generalized auto_combine to handle monotonically-decreasing coords too
* Added more examples to docstring for manual_combine
* Added note about globbing aspect of open_mfdataset
* Removed auto-inferring of concatenation dimension in manual_combine
* Added example to docstring for auto_combine
* Minor correction to docstring
* Another very minor docstring correction
* Added test to guard against issue #2777
* Started deprecation cycle for auto_combine
* Fully reverted open_mfdataset tests
* Updated what's new to match deprecation cycle
* Reverted uamiv test
* Removed dependency on itertools
* Deprecation tests fixed
* Satisfy pycodestyle
* Started deprecation cycle of auto_combine
* Added specific error for edge case combine_manual can't handle
* Check that global coordinates are monotonic
* Highlighted weird behaviour when concatenating with no data variables
* Added test for impossible-to-auto-combine coordinates
* Removed unneeded test
* Satisfy linter
* Added airspeed velocity benchmark for combining functions
* Benchmark will take longer now
* Updated version numbers in deprecation warnings to fit with recent release of 0.12
* Updated API docs for new function names
* Fixed docs build failure
* Revert "Fixed docs build failure" (this reverts commit ddfc6dd)
* Updated documentation with section explaining new functions
* Suppressed deprecation warnings in test suite
* Resolved TODO by pointing to issue with concat, see #2975
* Various docs fixes
* Slightly renamed tests to match new name of tested function
* Included minor suggestions from shoyer
* Removed trailing whitespace
* Simplified error message for case combine_manual can't handle
* Removed filter for deprecation warnings, and added test for if user doesn't supply concat_dim
* Simple fixes suggested by shoyer
* Change deprecation warning behaviour
* Linting
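One step above, rewriting the recursive nested-list traverser, can be sketched in plain Python. This is an illustrative, hypothetical helper (not the actual xarray code): it walks a nested list of objects and yields a tile ID recording each object's position along every concatenation dimension.

```python
def infer_tile_ids(entry, current_id=()):
    """Recursively yield (tile_id, obj) pairs from a nested list.

    tile_id is a tuple of indices, one per nesting level, so a
    doubly-nested list produces 2-D tile IDs like (row, column).
    """
    if isinstance(entry, list):
        for i, item in enumerate(entry):
            # Descend one level, appending this item's index to the ID.
            yield from infer_tile_ids(item, current_id + (i,))
    else:
        # Leaf object: its position along all dimensions is now known.
        yield current_id, entry

grid = [['a0', 'a1'], ['b0', 'b1']]
print(dict(infer_tile_ids(grid)))
# {(0, 0): 'a0', (0, 1): 'a1', (1, 0): 'b0', (1, 1): 'b1'}
```

The generator form avoids building intermediate lists for deeply nested input, which is why the commit swapped the plain recursion for an iterator.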
1 parent 76adf13 commit 6b33ad8

File tree

13 files changed: +2066 −1077 lines changed


asv_bench/benchmarks/combine.py

+37
@@ -0,0 +1,37 @@
import numpy as np
import xarray as xr


class Combine:
    """Benchmark concatenating and merging large datasets"""

    def setup(self):
        """Create 4 datasets with two different variables"""

        t_size, x_size, y_size = 100, 900, 800
        t = np.arange(t_size)
        data = np.random.randn(t_size, x_size, y_size)

        self.dsA0 = xr.Dataset(
            {'A': xr.DataArray(data, coords={'T': t},
                               dims=('T', 'X', 'Y'))})
        self.dsA1 = xr.Dataset(
            {'A': xr.DataArray(data, coords={'T': t + t_size},
                               dims=('T', 'X', 'Y'))})
        self.dsB0 = xr.Dataset(
            {'B': xr.DataArray(data, coords={'T': t},
                               dims=('T', 'X', 'Y'))})
        self.dsB1 = xr.Dataset(
            {'B': xr.DataArray(data, coords={'T': t + t_size},
                               dims=('T', 'X', 'Y'))})

    def time_combine_manual(self):
        datasets = [[self.dsA0, self.dsA1], [self.dsB0, self.dsB1]]

        xr.combine_manual(datasets, concat_dim=[None, 't'])

    def time_auto_combine(self):
        """Also has to load and arrange t coordinate"""
        datasets = [self.dsA0, self.dsA1, self.dsB0, self.dsB1]

        xr.combine_auto(datasets)
doc/api.rst

+3
@@ -19,6 +19,9 @@ Top-level functions
    broadcast
    concat
    merge
+   auto_combine
+   combine_auto
+   combine_manual
    where
    set_options
    full_like

doc/combining.rst

+76-2
@@ -11,9 +11,10 @@ Combining data
     import xarray as xr
     np.random.seed(123456)

-* For combining datasets or data arrays along a dimension, see concatenate_.
+* For combining datasets or data arrays along a single dimension, see concatenate_.
 * For combining datasets with different variables, see merge_.
 * For combining datasets or data arrays with different indexes or missing values, see combine_.
+* For combining datasets or data arrays along multiple dimensions, see combining.multi_.

 .. _concatenate:

@@ -77,7 +78,7 @@ Merge
 ~~~~~

 To combine variables and coordinates between multiple ``DataArray`` and/or
-``Dataset`` object, use :py:func:`~xarray.merge`. It can merge a list of
+``Dataset`` objects, use :py:func:`~xarray.merge`. It can merge a list of
 ``Dataset``, ``DataArray`` or dictionaries of objects convertible to
 ``DataArray`` objects:

@@ -237,3 +238,76 @@ coordinates as long as any non-missing values agree or are disjoint:
 Note that due to the underlying representation of missing values as floating
 point numbers (``NaN``), variable data type is not always preserved when merging
 in this manner.
+
+.. _combining.multi:
+
+Combining along multiple dimensions
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. note::
+
+    There are currently three combining functions with similar names:
+    :py:func:`~xarray.auto_combine`, :py:func:`~xarray.combine_auto`, and
+    :py:func:`~xarray.combine_manual`. This is because ``auto_combine`` is in
+    the process of being deprecated in favour of the other two functions,
+    which are more general. If your code currently relies on ``auto_combine``,
+    then you will be able to get similar functionality by using
+    ``combine_manual``.
+
+For combining many objects along multiple dimensions xarray provides
+:py:func:`~xarray.combine_manual` and :py:func:`~xarray.combine_auto`. These
+functions use a combination of ``concat`` and ``merge`` across different
+variables to combine many objects into one.
+
+:py:func:`~xarray.combine_manual` requires specifying the order in which the
+objects should be combined, while :py:func:`~xarray.combine_auto` attempts to
+infer this ordering automatically from the coordinates in the data.
+
+:py:func:`~xarray.combine_manual` is useful when you know the spatial
+relationship between each object in advance. The datasets must be provided in
+the form of a nested list, which specifies their relative position and
+ordering. A common task is collecting data from a parallelized simulation where
+each processor wrote out data to a separate file. A domain which was decomposed
+into 4 parts, 2 each along both the x and y axes, requires organising the
+datasets into a doubly-nested list, e.g.:
+
+.. ipython:: python
+
+    arr = xr.DataArray(name='temperature', data=np.random.randint(5, size=(2, 2)), dims=['x', 'y'])
+    arr
+    ds_grid = [[arr, arr], [arr, arr]]
+    xr.combine_manual(ds_grid, concat_dim=['x', 'y'])
+
+:py:func:`~xarray.combine_manual` can also be used to explicitly merge datasets
+with different variables. For example, if we have 4 datasets, which are divided
+along two times, and contain two different variables, we can pass ``None``
+to ``concat_dim`` to specify the dimension of the nested list over which
+we wish to use ``merge`` instead of ``concat``:
+
+.. ipython:: python
+
+    temp = xr.DataArray(name='temperature', data=np.random.randn(2), dims=['t'])
+    precip = xr.DataArray(name='precipitation', data=np.random.randn(2), dims=['t'])
+    ds_grid = [[temp, precip], [temp, precip]]
+    xr.combine_manual(ds_grid, concat_dim=['t', None])
+
+:py:func:`~xarray.combine_auto` is for combining objects which have dimension
+coordinates which specify their relationship to and order relative to one
+another, for example a linearly-increasing 'time' dimension coordinate.
+
+Here we combine two datasets using their common dimension coordinates. Notice
+they are concatenated in order based on the values in their dimension
+coordinates, not on their position in the list passed to ``combine_auto``.
+
+.. ipython:: python
+    :okwarning:
+
+    x1 = xr.DataArray(name='foo', data=np.random.randn(3), coords=[('x', [0, 1, 2])])
+    x2 = xr.DataArray(name='foo', data=np.random.randn(3), coords=[('x', [3, 4, 5])])
+    xr.combine_auto([x2, x1])
+
+These functions can be used by :py:func:`~xarray.open_mfdataset` to open many
+files as one dataset. The particular function used is specified by setting the
+argument ``combine`` to ``'auto'`` or ``'manual'``. This is useful for
+situations where your data is split across many files in multiple locations,
+which have some known relationship between one another.
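The ordering step that ``combine_auto`` performs can be illustrated with a small pure-Python sketch. ``order_by_coord`` is a hypothetical helper, not the library's implementation: it checks that each object's coordinate is monotonic and then sorts the objects by their coordinate values rather than by list position.

```python
def order_by_coord(datasets, dim):
    """Sort dataset-like mappings by their `dim` coordinate values,
    refusing non-monotonic coordinates (which make ordering ambiguous)."""
    for ds in datasets:
        c = ds[dim]
        increasing = all(a <= b for a, b in zip(c, c[1:]))
        decreasing = all(a >= b for a, b in zip(c, c[1:]))
        if not (increasing or decreasing):
            raise ValueError('coordinate %r is not monotonic' % dim)
    # Order by the first coordinate value of each object.
    return sorted(datasets, key=lambda ds: ds[dim][0])

tiles = [{'x': [3, 4, 5]}, {'x': [0, 1, 2]}]
print(order_by_coord(tiles, 'x'))
# [{'x': [0, 1, 2]}, {'x': [3, 4, 5]}]
```

This mirrors the documented behaviour above: ``combine_auto([x2, x1])`` concatenates in coordinate order even though ``x2`` is passed first.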

doc/io.rst

+6-2
@@ -766,7 +766,10 @@ Combining multiple files

 NetCDF files are often encountered in collections, e.g., with different files
 corresponding to different model runs. xarray can straightforwardly combine such
-files into a single Dataset by making use of :py:func:`~xarray.concat`.
+files into a single Dataset by making use of :py:func:`~xarray.concat`,
+:py:func:`~xarray.merge`, :py:func:`~xarray.combine_manual` and
+:py:func:`~xarray.combine_auto`. For details on the difference between these
+functions see :ref:`combining data`.

 .. note::

@@ -779,7 +782,8 @@ files into a single Dataset by making use of :py:func:`~xarray.concat`.
 This function automatically concatenates and merges multiple files into a
 single xarray dataset.
 It is the recommended way to open multiple files with xarray.
-For more details, see :ref:`dask.io` and a `blog post`_ by Stephan Hoyer.
+For more details, see :ref:`combining.multi`, :ref:`dask.io` and a
+`blog post`_ by Stephan Hoyer.

 .. _dask: http://dask.pydata.org
 .. _blog post: http://stephanhoyer.com/2015/06/11/xray-dask-out-of-core-labeled-arrays/
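When the files come from a decomposed domain, a flat list of paths (e.g. from a glob) must first be arranged into the nested-list shape that a manual combine expects. A pure-Python sketch, assuming hypothetical file names of the form `run.<row>.<col>.nc` and that paths are sorted first:

```python
def nest(paths, row_length):
    """Split a flat, sorted path list into rows for a nested-list combine."""
    return [paths[i:i + row_length] for i in range(0, len(paths), row_length)]

# Globbing returns paths in arbitrary order, so sort before nesting.
paths = sorted(['run.1.1.nc', 'run.0.0.nc', 'run.0.1.nc', 'run.1.0.nc'])
print(nest(paths, 2))
# [['run.0.0.nc', 'run.0.1.nc'], ['run.1.0.nc', 'run.1.1.nc']]
```

Note this relies on lexicographic path order matching the domain layout; numeric suffixes without zero-padding would need a natural-sort key instead.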

doc/whats-new.rst

+21
@@ -56,6 +56,23 @@ Enhancements
   helpful for avoiding file-lock errors when trying to write to files opened
   using ``open_dataset()`` or ``open_dataarray()``. (:issue:`2887`)
   By `Dan Nowacki <https://github.com/dnowacki-usgs>`_.
+- Combining datasets along N dimensions:
+  Datasets can now be combined along any number of dimensions,
+  instead of just a one-dimensional list of datasets.
+
+  The new ``combine_manual`` will accept the datasets as a nested
+  list-of-lists, and combine by applying a series of concat and merge
+  operations. The new ``combine_auto`` will instead use the dimension
+  coordinates of the datasets to order them.
+
+  ``open_mfdataset`` can use either ``combine_manual`` or ``combine_auto`` to
+  combine datasets along multiple dimensions, by specifying the argument
+  ``combine='manual'`` or ``combine='auto'``.
+
+  This means that the original function ``auto_combine`` is being deprecated.
+  To avoid FutureWarnings switch to using ``combine_manual`` or ``combine_auto``
+  (or set the ``combine`` argument in ``open_mfdataset``). (:issue:`2159`)
+  By `Tom Nicholas <http://github.com/TomNicholas>`_.
 - Better warning message when supplying invalid objects to ``xr.merge``
   (:issue:`2948`). By `Mathias Hauser <https://github.com/mathause>`_.
 - Added ``strftime`` method to ``.dt`` accessor, making it simpler to hand a

@@ -203,6 +220,10 @@ Other enhancements
   report showing what exactly differs between the two objects (dimensions /
   coordinates / variables / attributes) (:issue:`1507`).
   By `Benoit Bovy <https://github.com/benbovy>`_.
+- Resampling of standard and non-standard calendars indexed by
+  :py:class:`~xarray.CFTimeIndex` is now possible. (:issue:`2191`).
+  By `Jwen Fai Low <https://github.com/jwenfai>`_ and
+  `Spencer Clark <https://github.com/spencerkclark>`_.
 - Add ``tolerance`` option to ``resample()`` methods ``bfill``, ``pad``,
   ``nearest``. (:issue:`2695`)
   By `Hauke Schulz <https://github.com/observingClouds>`_.
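The deprecation cycle described in this entry follows a standard pattern: the old entry point keeps working but emits a ``FutureWarning`` and delegates to the replacement. A minimal sketch with hypothetical stand-in functions (not xarray's actual code):

```python
import warnings

def combine_auto(datasets):
    # Stand-in for the real combining logic.
    return sorted(datasets)

def auto_combine(datasets):
    # Deprecated entry point: warn, then delegate to the replacement.
    warnings.warn('auto_combine is deprecated; use combine_auto or '
                  'combine_manual instead', FutureWarning)
    return combine_auto(datasets)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    result = auto_combine([3, 1, 2])

print(result, caught[0].category.__name__)  # [1, 2, 3] FutureWarning
```

``FutureWarning`` (rather than ``DeprecationWarning``) is the conventional choice when the warning should be visible to end users by default.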

xarray/__init__.py

+2-1
@@ -7,7 +7,8 @@

 from .core.alignment import align, broadcast, broadcast_arrays
 from .core.common import full_like, zeros_like, ones_like
-from .core.combine import concat, auto_combine
+from .core.concat import concat
+from .core.combine import combine_auto, combine_manual, auto_combine
 from .core.computation import apply_ufunc, dot, where
 from .core.extensions import (register_dataarray_accessor,
                               register_dataset_accessor)
