Commit a0a3860

rabernat authored and shoyer committed
Multidimensional groupby (#818)
* multidimensional groupby and binning
* added time dimension to multidim groupby tests
* updated docs
* fixed binning
* add groupby_bins method
* doc update
* test for non-monotonic 2d coordinates
* bin coordinate name changed
* updated docs and example
* fixed style issues and whats-new
1 parent 0d0ae9d commit a0a3860

14 files changed

+773
-19
lines changed

doc/api.rst

+2
Original file line numberDiff line numberDiff line change
@@ -109,6 +109,7 @@ Computation
109109
Dataset.apply
110110
Dataset.reduce
111111
Dataset.groupby
112+
Dataset.groupby_bins
112113
Dataset.resample
113114
Dataset.diff
114115

@@ -245,6 +246,7 @@ Computation
245246

246247
DataArray.reduce
247248
DataArray.groupby
249+
DataArray.groupby_bins
248250
DataArray.rolling
249251
DataArray.resample
250252
DataArray.get_axis_num

doc/examples.rst

+1
@@ -7,3 +7,4 @@ Examples
77
examples/quick-overview
88
examples/weather-data
99
examples/monthly-means
10+
examples/multidimensional-coords
+201
@@ -0,0 +1,201 @@
1+
.. _examples.multidim:
2+
3+
Working with Multidimensional Coordinates
4+
=========================================
5+
6+
Author: `Ryan Abernathey <http://github.com/rabernat>`__
7+
8+
Many datasets have *physical coordinates* which differ from their
9+
*logical coordinates*. Xarray provides several ways to plot and analyze
10+
such datasets.
11+
12+
.. code:: python
13+
14+
%matplotlib inline
15+
import numpy as np
16+
import pandas as pd
17+
import xarray as xr
18+
import cartopy.crs as ccrs
19+
from matplotlib import pyplot as plt
20+
21+
print("numpy version : ", np.__version__)
22+
print("pandas version : ", pd.__version__)
23+
print("xarray version : ", xr.version.version)
24+
25+
26+
.. parsed-literal::
27+
28+
('numpy version : ', '1.11.0')
29+
('pandas version : ', u'0.18.0')
30+
('xarray version : ', '0.7.2-32-gf957eb8')
31+
32+
33+
As an example, consider this dataset from the
34+
`xarray-data <https://github.com/pydata/xarray-data>`__ repository.
35+
36+
.. code:: python
37+
38+
! curl -L -O https://github.com/pydata/xarray-data/raw/master/RASM_example_data.nc
39+
40+
.. code:: python
41+
42+
ds = xr.open_dataset('RASM_example_data.nc')
43+
ds
44+
45+
46+
47+
48+
.. parsed-literal::
49+
50+
<xarray.Dataset>
51+
Dimensions: (time: 36, x: 275, y: 205)
52+
Coordinates:
53+
* time (time) datetime64[ns] 1980-09-16T12:00:00 1980-10-17 ...
54+
yc (y, x) float64 16.53 16.78 17.02 17.27 17.51 17.76 18.0 18.25 ...
55+
xc (y, x) float64 189.2 189.4 189.6 189.7 189.9 190.1 190.2 190.4 ...
56+
* x (x) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 ...
57+
* y (y) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 ...
58+
Data variables:
59+
Tair (time, y, x) float64 nan nan nan nan nan nan nan nan nan nan ...
60+
Attributes:
61+
title: /workspace/jhamman/processed/R1002RBRxaaa01a/lnd/temp/R1002RBRxaaa01a.vic.ha.1979-09-01.nc
62+
institution: U.W.
63+
source: RACM R1002RBRxaaa01a
64+
output_frequency: daily
65+
output_mode: averaged
66+
convention: CF-1.4
67+
references: Based on the initial model of Liang et al., 1994, JGR, 99, 14,415- 14,429.
68+
comment: Output from the Variable Infiltration Capacity (VIC) model.
69+
nco_openmp_thread_number: 1
70+
NCO: 4.3.7
71+
history: history deleted for brevity
72+
73+
74+
75+
In this example, the *logical coordinates* are ``x`` and ``y``, while
76+
the *physical coordinates* are ``xc`` and ``yc``, which represent the
77+
longitude and latitude of the data.
78+
79+
.. code:: python
80+
81+
print(ds.xc.attrs)
82+
print(ds.yc.attrs)
83+
84+
85+
.. parsed-literal::
86+
87+
OrderedDict([(u'long_name', u'longitude of grid cell center'), (u'units', u'degrees_east'), (u'bounds', u'xv')])
88+
OrderedDict([(u'long_name', u'latitude of grid cell center'), (u'units', u'degrees_north'), (u'bounds', u'yv')])
89+
90+
91+
Plotting
92+
--------
93+
94+
Let's examine these coordinate variables by plotting them.
95+
96+
.. code:: python
97+
98+
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(14,4))
99+
ds.xc.plot(ax=ax1)
100+
ds.yc.plot(ax=ax2)
101+
102+
103+
104+
105+
.. parsed-literal::
106+
107+
<matplotlib.collections.QuadMesh at 0x118688fd0>
108+
109+
110+
111+
.. parsed-literal::
112+
113+
/Users/rpa/anaconda/lib/python2.7/site-packages/matplotlib/collections.py:590: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
114+
if self._edgecolors == str('face'):
115+
116+
117+
118+
.. image:: multidimensional_coords_files/xarray_multidimensional_coords_8_2.png
119+
120+
121+
Note that the variables ``xc`` (longitude) and ``yc`` (latitude) are
122+
two-dimensional scalar fields.
123+
124+
If we try to plot the data variable ``Tair``, by default we get the
125+
logical coordinates.
126+
127+
.. code:: python
128+
129+
ds.Tair[0].plot()
130+
131+
132+
133+
134+
.. parsed-literal::
135+
136+
<matplotlib.collections.QuadMesh at 0x11b6da890>
137+
138+
139+
140+
141+
.. image:: multidimensional_coords_files/xarray_multidimensional_coords_10_1.png
142+
143+
144+
In order to visualize the data on a conventional latitude-longitude
145+
grid, we can take advantage of xarray's ability to apply
146+
`cartopy <http://scitools.org.uk/cartopy/index.html>`__ map projections.
147+
148+
.. code:: python
149+
150+
plt.figure(figsize=(14,6))
151+
ax = plt.axes(projection=ccrs.PlateCarree())
152+
ax.set_global()
153+
ds.Tair[0].plot.pcolormesh(ax=ax, transform=ccrs.PlateCarree(), x='xc', y='yc', add_colorbar=False)
154+
ax.coastlines()
155+
ax.set_ylim([0,90]);
156+
157+
158+
159+
.. image:: multidimensional_coords_files/xarray_multidimensional_coords_12_0.png
160+
161+
162+
Multidimensional Groupby
163+
------------------------
164+
165+
The above example allowed us to visualize the data on a regular
166+
latitude-longitude grid. But what if we want to do a calculation that
167+
involves grouping over one of these physical coordinates (rather than
168+
the logical coordinates), for example, calculating the mean temperature
169+
at each latitude? This can be achieved using xarray's ``groupby``
170+
function, which accepts multidimensional variables. By default,
171+
``groupby`` will use every unique value in the variable, which is
172+
probably not what we want. Instead, we can use the ``groupby_bins``
173+
function to specify the output coordinates of the group.
174+
175+
.. code:: python
176+
177+
# define two-degree wide latitude bins
178+
lat_bins = np.arange(0,91,2)
179+
# define a label for each bin corresponding to the central latitude
180+
lat_center = np.arange(1,90,2)
181+
# group according to those bins and take the mean
182+
Tair_lat_mean = ds.Tair.groupby_bins('yc', lat_bins, labels=lat_center).mean()
183+
# plot the result
184+
Tair_lat_mean.plot()
185+
186+
187+
188+
189+
.. parsed-literal::
190+
191+
[<matplotlib.lines.Line2D at 0x11cb92e90>]
192+
193+
194+
195+
196+
.. image:: multidimensional_coords_files/xarray_multidimensional_coords_14_1.png
197+
198+
199+
Note that the resulting coordinate for the ``groupby_bins`` operation
200+
got the ``_bins`` suffix appended: ``yc_bins``. This helps us distinguish
201+
it from the original multidimensional variable ``yc``.
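
Reviewer sketch (not part of the commit): the binned grouping described in this example can be tried without downloading the RASM file. The snippet below builds a small, hypothetical 3x4 grid with a two-dimensional ``lat`` coordinate and made-up values, then averages the field in two latitude bins. The array contents, the names ``temp`` and ``zonal_mean``, and the bin edges are illustrative assumptions; only ``groupby_bins`` and its ``labels`` argument come from the change under review.

```python
import numpy as np
import xarray as xr

# Hypothetical 2-D latitude coordinate on a 3x4 logical grid (assumed values).
lat = np.array([[1.0, 1.5, 2.5, 3.0],
                [1.2, 2.2, 3.5, 3.8],
                [0.5, 2.8, 3.2, 3.9]])

# Made-up data field defined on the logical dimensions (y, x).
temp = xr.DataArray(
    np.arange(12, dtype=float).reshape(3, 4),
    dims=["y", "x"],
    coords={"lat": (["y", "x"], lat)},
    name="temp",
)

# Two bins, (0, 2] and (2, 4], each labeled by its central latitude.
lat_bins = [0, 2, 4]
lat_center = [1, 3]

# Group every grid cell by the bin its latitude falls in, then average.
zonal_mean = temp.groupby_bins("lat", lat_bins, labels=lat_center).mean()
print(zonal_mean)
```

The result is one-dimensional along a new coordinate named ``lat_bins``, that is, the name of the grouping variable with the ``_bins`` suffix.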

doc/groupby.rst

+62-4
@@ -14,10 +14,11 @@ __ http://www.jstatsoft.org/v40/i01/paper
1414
- Combine your groups back into a single data object.
1515

1616
Group by operations work on both :py:class:`~xarray.Dataset` and
17-
:py:class:`~xarray.DataArray` objects. Currently, you can only group by a single
18-
one-dimensional variable (eventually, we hope to remove this limitation). Also,
19-
note that for one-dimensional data, it is usually faster to rely on pandas'
20-
implementation of the same pipeline.
17+
:py:class:`~xarray.DataArray` objects. Most of the examples focus on grouping by
18+
a single one-dimensional variable, although support for grouping
19+
over a multi-dimensional variable has recently been implemented. Note that for
20+
one-dimensional data, it is usually faster to rely on pandas' implementation of
21+
the same pipeline.
2122

2223
Split
2324
~~~~~
@@ -63,6 +64,33 @@ You can also iterate over groups in ``(label, group)`` pairs:
6364
Just like in pandas, creating a GroupBy object is cheap: it does not actually
6465
split the data until you access particular values.
6566

67+
Binning
68+
~~~~~~~
69+
70+
Sometimes you don't want to use all the unique values to determine the groups
71+
but instead want to "bin" the data into coarser groups. You could always create
72+
a customized coordinate, but xarray facilitates this via the
73+
:py:meth:`~xarray.Dataset.groupby_bins` method.
74+
75+
.. ipython:: python
76+
77+
x_bins = [0,25,50]
78+
ds.groupby_bins('x', x_bins).groups
79+
80+
The binning is implemented via `pandas.cut`__, whose documentation details how
81+
the bins are assigned. As seen in the example above, by default, the bins are
82+
labeled with strings using interval notation to precisely identify the bin limits. To
83+
override this behavior, you can specify the bin labels explicitly. Here we
84+
choose ``float`` labels which identify the bin centers:
85+
86+
.. ipython:: python
87+
88+
x_bin_labels = [12.5,37.5]
89+
ds.groupby_bins('x', x_bins, labels=x_bin_labels).groups
90+
91+
__ http://pandas.pydata.org/pandas-docs/version/0.17.1/generated/pandas.cut.html
92+
93+
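
Reviewer sketch (not part of the commit): since the Binning section defers to ``pandas.cut`` for how values are assigned to bins, it may help to look at ``pandas.cut`` directly. The sample values in ``x`` are assumptions chosen to hit the bin edges; the bin edges and labels mirror the ones used in the section above.

```python
import pandas as pd

# Sample values chosen to land on and between the bin edges (assumed data).
x = [0, 5, 25, 30, 50]
x_bins = [0, 25, 50]

# Default behavior: bins are closed on the right, so the intervals are
# (0, 25] and (25, 50]; the value 0 sits on the open left edge.
default = pd.cut(x, x_bins)
print(default.categories)

# Explicit labels replace the interval labels with the bin centers.
centered = pd.cut(x, x_bins, labels=[12.5, 37.5])
print(list(centered))
```

With these defaults the leftmost edge is excluded, so a value equal to the first bin edge is not assigned to any bin and comes back as missing. ``pandas.cut`` accepts ``include_lowest=True`` to change that.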
6694
Apply
6795
~~~~~
6896

@@ -149,3 +177,33 @@ guarantee that all original dimensions remain unchanged.
149177

150178
You can always squeeze explicitly later with the Dataset or DataArray
151179
:py:meth:`~xarray.DataArray.squeeze` methods.
180+
181+
.. _groupby.multidim:
182+
183+
Multidimensional Grouping
184+
~~~~~~~~~~~~~~~~~~~~~~~~~
185+
186+
Many datasets have a multidimensional coordinate variable (e.g. longitude)
187+
which is different from the logical grid dimensions (e.g. nx, ny). Such
188+
variables are valid under the `CF conventions`__. Xarray supports groupby
189+
operations over multidimensional coordinate variables:
190+
191+
__ http://cfconventions.org/cf-conventions/v1.6.0/cf-conventions.html#_two_dimensional_latitude_longitude_coordinate_variables
192+
193+
.. ipython:: python
194+
195+
da = xr.DataArray([[0,1],[2,3]],
196+
coords={'lon': (['ny','nx'], [[30,40],[40,50]] ),
197+
'lat': (['ny','nx'], [[10,10],[20,20]] ),},
198+
dims=['ny','nx'])
199+
da
200+
da.groupby('lon').sum()
201+
da.groupby('lon').apply(lambda x: x - x.mean(), shortcut=False)
202+
203+
Because multidimensional groups have the ability to generate a very large
204+
number of bins, coarse-binning via :py:meth:`~xarray.Dataset.groupby_bins`
205+
may be desirable:
206+
207+
.. ipython:: python
208+
209+
da.groupby_bins('lon', [0,45,50]).sum()
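
Reviewer sketch (not part of the commit): one way to reason about grouping over a two-dimensional coordinate is to flatten both the data and the coordinate, then group the flattened values by each unique coordinate value. The NumPy-only snippet below reproduces that idea for the 2x2 array used above. It illustrates the semantics only and is not a description of how xarray implements the operation.

```python
import numpy as np

# Same values and 2-D longitude coordinate as the DataArray example above.
values = np.array([[0, 1], [2, 3]])
lon = np.array([[30, 40], [40, 50]])

# Flatten, then map every cell to the index of its unique longitude.
labels, inverse = np.unique(lon.ravel(), return_inverse=True)
inverse = inverse.ravel()

# Per-group sum, analogous to da.groupby('lon').sum().
sums = np.bincount(inverse, weights=values.ravel())

# Per-group mean, then anomalies on the original grid, analogous to
# subtracting each group's mean from its members.
means = sums / np.bincount(inverse)
anomalies = values - means[inverse].reshape(values.shape)

print(labels, sums)
print(anomalies)
```

The two cells with longitude 40 end up in the same group even though they sit in different rows and columns, which is the point of grouping by a physical rather than a logical coordinate.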

doc/whats-new.rst

+6
@@ -30,6 +30,12 @@ Breaking changes
3030
Enhancements
3131
~~~~~~~~~~~~
3232

33+
- Groupby operations now support grouping over multidimensional variables. A new
34+
method called :py:meth:`~xarray.Dataset.groupby_bins` has also been added to
35+
allow users to specify bins for grouping. The new features are described in
36+
:ref:`groupby.multidim` and :ref:`examples.multidim`.
37+
By `Ryan Abernathey <http://github.com/rabernat>`_.
38+
3339
- DataArray and Dataset method :py:meth:`where` now supports a ``drop=True``
3440
option that clips coordinate elements that are fully masked. By
3541
`Phillip J. Wolfram <https://github.com/pwolfram>`_.
