Return a scalar instead of DataArray when the return value is a scalar #987

Closed
joonro opened this issue Aug 26, 2016 · 11 comments
@joonro

joonro commented Aug 26, 2016

Hi,

I'm not sure how devs will feel about this, but I wanted to ask because I'm getting into this issue frequently.

Currently, many methods such as .min(), .max(), and .mean() return a DataArray even in cases where the return value is a scalar. For example,

import numpy as np
import xarray as xr
test = xr.DataArray(data=np.ones((10, 10)))

In [6]: test.min()
Out[6]: 
<xarray.DataArray ()>
array(1.0)

which breaks a lot of downstream code, so I have to use test.min().values or float(test.min()).
I think it would be great if these methods returned a scalar when the return value is a scalar. For example,

In [7]: np.ones((10, 10)).mean()
Out[7]: 1.0
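As a side note, one workaround that already covers this case is .item(), which unwraps a zero-dimensional DataArray all the way to a plain Python scalar (a minimal sketch of the idea):

```python
import numpy as np
import xarray as xr

test = xr.DataArray(data=np.ones((10, 10)))

# .values on a 0-d DataArray yields a 0-d NumPy array;
# .item() unwraps it to a plain Python float instead.
v = test.min().item()
print(type(v), v)  # <class 'float'> 1.0
```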

Thank you!

@shoyer
Member

shoyer commented Aug 26, 2016

I agree that this can be annoying. The downside in making this switch is that we would lose xarray specific fields like coords and attrs that are currently preserved, e.g.,

>>> array = xr.DataArray([1, 2, 3], coords=[('x', ['a', 'b', 'c'])])
>>> array
<xarray.DataArray (x: 3)>
array([1, 2, 3])
Coordinates:
  * x        (x) |S1 'a' 'b' 'c'
>>> array[0]
<xarray.DataArray ()>
array(1)
Coordinates:
    x        |S1 'a'
>>> array[0].coords['x'].item()
'a'

Also, strictly from a simplicity point of view for xarray, it's nice for every function to return fixed types.

NumPy solved this problem by creating its own scalar types (e.g., np.float64) that define fields like shape and dtype while also subclassing Python's builtin numeric types. We could do the same, but it could lead to a different set of subtle cross-compatibility issues.
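This dual nature of NumPy's scalar types can be checked directly: np.float64 exposes array-like fields such as shape and dtype while also being a genuine subclass of Python's builtin float:

```python
import numpy as np

x = np.float64(1.0)
print(isinstance(x, float))  # True: subclasses the builtin float
print(x.shape, x.dtype)      # array-like fields: () float64
```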

@joonro
Author

joonro commented Aug 26, 2016

I see - thanks a lot for the quick response. I knew there was a good reason for this.

I wonder if it would be reasonable to return a scalar when there are neither coords nor attrs associated with the return value, or whether that would be too ad hoc. For example, in the original example the return value was <xarray.DataArray ()>, which does not carry any useful information.

I think this might be reasonable because I only get into this issue when I'm doing an array-wide operation and I know I'm going to get an aggregate scalar and forget to use .values.

@shoyer
Member

shoyer commented Aug 26, 2016

> I wonder if it would be reasonable to return a scalar when there are neither coords nor attrs associated with the return value, or whether that would be too ad hoc. For example, in the original example the return value was <xarray.DataArray ()>, which does not carry any useful information.

This is a bad path to go down :). Now your code might suddenly break when you add a metadata field!

In principle, we could pick some subset of operations for which to always do this and others for which to never do this (e.g., aggregating out all dimensions, but not indexing out all dimensions), but I think this inconsistency would be even more surprising. It's pretty easy to see how this could lead to bugs, too. At least now you know you always need to type .values or .item()!

@darothen

@joonro, I think there's a strong case to be made about returning a DataArray with some metadata appended. Referring to the latest draft of the CF Metadata Conventions, there is a clear way to indicate when operations such as mean, max, or min have been applied to a variable by using the cell_methods attribute.

It might be more prudent to add this attribute whenever we apply these operations to a DataArray (or perhaps variable-wise when applied to a Dataset). That way, there is a clear reason to not return a scalar - the documentation of what operations were applied to produce that final result.

I can whip up a working example/pull request if people think this is a direction to go. I'd probably build a decorator which handles inspection of the operator name and arguments and uses that to add the cell_methods attribute, that way people can add the same functionality to homegrown methods/operators.
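A rough sketch of what such a decorator might look like (the names here are hypothetical, not existing xarray API; it simply records the applied operation in an attrs entry in cell_methods style):

```python
import functools

import numpy as np
import xarray as xr

def record_cell_method(op_name):
    """Hypothetical decorator: wrap an aggregation so the result's
    attrs record which operation was applied, cell_methods-style."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(da, dim=None, **kwargs):
            result = func(da, dim=dim, **kwargs)
            over = dim if dim is not None else "all dims"
            prev = da.attrs.get("cell_methods", "")
            result.attrs["cell_methods"] = (prev + " " + f"{over}: {op_name}").strip()
            return result
        return wrapper
    return decorator

@record_cell_method("mean")
def mean(da, dim=None):
    return da.mean(dim=dim)

da = xr.DataArray(np.ones((3, 4)), dims=("x", "y"))
print(mean(da, dim="x").attrs["cell_methods"])  # "x: mean"
```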

@shoyer
Member

shoyer commented Aug 27, 2016

@darothen Let's discuss this over in #988.

@joonro
Author

joonro commented Aug 27, 2016

Thanks a lot for the discussions. I agree it is very important to be consistent and explicit. Another thing was that sometimes .values makes a line of code really long - especially when I want to index a DataArray with another DataArray with some conditions, as I often have to use .values for each of them.

Currently I do not have a good idea about how to improve this - I will report back if one occurs to me. Thanks again!

@shoyer
Member

shoyer commented Aug 27, 2016

Can you give an example of how you need to use .values in xarray operations? Within xarray, we should be able to remove the need to use that.

@joonro
Author

joonro commented Aug 27, 2016

Sure. My actual usage is usually much more complicated, but basically, with

import numpy as np
import xarray as xr
X = xr.DataArray(np.random.normal(size=(10, 10)),
                 coords=[range(10), range(10)],)

if I want to choose only values larger than 0 from X, it seems I cannot do X[X > 0]; I have to do X.values[X.values > 0] instead. You can see how this can quickly get long if I'm doing this for assignment with multidimensional xarrays - something like

X.loc[:, :, :, 'variable'].values[X.loc[:, :, :, 'variable'].values > 0] = Y.loc[:, :, :, 'variable'].values[Y.loc[:, :, :, 'variable'].values > 0] 

Maybe I'm mistaken and there is a way to do this more nicely, but I haven't been able to figure it out.
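For the selection part (not the assignment), one label-preserving alternative that xarray does provide is DataArray.where, which keeps the shape and masks non-matching entries with NaN rather than dropping them (a minimal sketch):

```python
import numpy as np
import xarray as xr

X = xr.DataArray(np.random.normal(size=(10, 10)),
                 coords=[range(10), range(10)])

# Unlike X.values[X.values > 0], this keeps labels and shape,
# replacing non-positive entries with NaN instead of dropping them.
positives = X.where(X > 0)
```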

Thank you!

@shoyer
Member

shoyer commented Aug 28, 2016

@joonro Yes, this does get messy. We'll eventually support indexing like X[X > 0] directly, which will help significantly.

In the meantime, you can still break things up onto multiple lines by saving temporary variables:

condition = X.loc[..., 'variable'].values > 0
X.loc[..., 'variable'].values[condition] = Y.loc[..., 'variable'].values[condition]

Using abbreviations like ... for :, :, : (assuming 'variable' is along the last axis) can also help.
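The Ellipsis shorthand behaves just as it does in plain NumPy: ... expands to as many full slices as needed, so the two spellings select the same data. A quick check in NumPy alone:

```python
import numpy as np

a = np.arange(24).reshape(2, 3, 4)

# `...` stands in for `:, :` here, selecting index 1 along the last axis.
assert np.array_equal(a[..., 1], a[:, :, 1])
```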

@joonro
Author

joonro commented Aug 28, 2016

@shoyer I think I saw ... a long time ago and must have forgotten about it. Thank you so much for reminding me - I was really hoping for something like ... for a while.

Btw, I must say that not only is xarray extremely useful for much of my research, but the devs' responses on issues have been superb. Definitely one of the most pleasant experiences I have had with developers. Thank you.

@shoyer
Member

shoyer commented Aug 28, 2016

Thanks @joonro, you are very kind!

I'm going to close this issue since I think we resolved the original question.
