Return a scalar instead of DataArray when the return value is a scalar #987

Closed
joonro opened this issue Aug 26, 2016 · 11 comments
@joonro

joonro commented Aug 26, 2016

Hi,

I'm not sure how devs will feel about this, but I wanted to ask because I'm getting into this issue frequently.

Currently, many methods such as .min(), .max(), and .mean() return a DataArray even in cases where the return value is a scalar. For example,

import numpy as np
import xarray as xr
test = xr.DataArray(data=np.ones((10, 10)))

In [6]: test.min()
Out[6]: 
<xarray.DataArray ()>
array(1.0)

which breaks a lot of downstream code, so I have to use test.min().values or float(test.min()).
I think it would be great if these methods returned a scalar when the return value is a scalar. For example,

In [7]: np.ones((10, 10)).mean()
Out[7]: 1.0
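As a side note, one workaround that already covers this case is .item(), which unwraps a zero-dimensional DataArray all the way to a plain Python scalar (a minimal sketch of the idea):

```python
import numpy as np
import xarray as xr

test = xr.DataArray(data=np.ones((10, 10)))

# .values on a 0-d DataArray yields a 0-d NumPy array;
# .item() unwraps it to a plain Python float instead.
v = test.min().item()
print(type(v), v)  # <class 'float'> 1.0
```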

Thank you!

@shoyer
Member

shoyer commented Aug 26, 2016

I agree that this can be annoying. The downside in making this switch is that we would lose xarray specific fields like coords and attrs that are currently preserved, e.g.,

>>> array = xr.DataArray([1, 2, 3], coords=[('x', ['a', 'b', 'c'])])
>>> array
<xarray.DataArray (x: 3)>
array([1, 2, 3])
Coordinates:
  * x        (x) |S1 'a' 'b' 'c'
>>> array[0]
<xarray.DataArray ()>
array(1)
Coordinates:
    x        |S1 'a'
>>> array[0].coords['x'].item()
'a'

Also, strictly from a simplicity point of view for xarray, it's nice for every function to return fixed types.

NumPy solved this problem by creating its own scalar types (e.g., np.float64) that define fields like shape and dtype while also subclassing Python's builtin numeric types. We could do the same, but it could lead to a different set of subtle cross-compatibility issues.
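This dual nature of NumPy's scalar types can be checked directly: np.float64 exposes array-like fields such as shape and dtype while also being a genuine subclass of Python's builtin float:

```python
import numpy as np

x = np.float64(1.0)
print(isinstance(x, float))  # True: subclasses the builtin float
print(x.shape, x.dtype)      # array-like fields: () float64
```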

@joonro
Author

joonro commented Aug 26, 2016

I see - thanks a lot for the quick response. I knew there was a good reason for this.

I wonder if it would be reasonable to return a scalar when there are neither coords nor attrs associated with the return value, or whether that would be too ad hoc. For example, in the original example the return value was <xarray.DataArray ()>, which does not carry any useful information.

I think this might be reasonable because I only get into this issue when I'm doing an array-wide operation and I know I'm going to get an aggregate scalar and forget to use .values.

@shoyer
Member

shoyer commented Aug 26, 2016

> I wonder if it would be reasonable to return a scalar when there are neither coords nor attrs associated with the return value, or whether that would be too ad hoc. For example, in the original example the return value was <xarray.DataArray ()>, which does not carry any useful information.

This is a bad path to go down :). Now your code might suddenly break when you add a metadata field!

In principle, we could pick some subset of operations for which to always do this and others for which to never do this (e.g., aggregating out all dimensions, but not indexing out all dimensions), but I think this inconsistency would be even more surprising. It's pretty easy to see how this could lead to bugs, too. At least now you know you always need to type .values or .item()!

@darothen

@joonro, I think there's a strong case to be made about returning a DataArray with some metadata appended. Referring to the latest draft of the CF Metadata Conventions, there is a clear way to indicate when operations such as mean, max, or min have been applied to a variable by using the cell_methods attribute.

It might be more prudent to add this attribute whenever we apply these operations to a DataArray (or perhaps variable-wise when applied to a Dataset). That way, there is a clear reason to not return a scalar - the documentation of what operations were applied to produce that final result.

I can whip up a working example/pull request if people think this is a direction to go. I'd probably build a decorator which handles inspection of the operator name and arguments and uses that to add the cell_methods attribute, that way people can add the same functionality to homegrown methods/operators.
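A rough sketch of what such a decorator might look like (the names here are hypothetical, not existing xarray API; it simply records the applied operation in an attrs entry in cell_methods style):

```python
import functools

import numpy as np
import xarray as xr

def record_cell_method(op_name):
    """Hypothetical decorator: wrap an aggregation so the result's
    attrs record which operation was applied, cell_methods-style."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(da, dim=None, **kwargs):
            result = func(da, dim=dim, **kwargs)
            over = dim if dim is not None else "all dims"
            prev = da.attrs.get("cell_methods", "")
            result.attrs["cell_methods"] = (prev + " " + f"{over}: {op_name}").strip()
            return result
        return wrapper
    return decorator

@record_cell_method("mean")
def mean(da, dim=None):
    return da.mean(dim=dim)

da = xr.DataArray(np.ones((3, 4)), dims=("x", "y"))
print(mean(da, dim="x").attrs["cell_methods"])  # "x: mean"
```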

@shoyer
Member

shoyer commented Aug 27, 2016

@darothen Let's discuss this over in #988.

@joonro
Author

joonro commented Aug 27, 2016

Thanks a lot for the discussions. I agree it is very important to be consistent and explicit. Another thing was that sometimes .values makes a line of code really long - especially when I want to index a DataArray with another DataArray with some conditions, as I often have to use .values for each of them.

Currently I do not have a good idea about how to improve this - I will report back if one occurs to me. Thanks again!

@shoyer
Member

shoyer commented Aug 27, 2016

Can you give an example of how you need to use .values in xarray operations? Within xarray, we should be able to remove the need to use that.

@joonro
Author

joonro commented Aug 27, 2016

Sure. My actual usage is usually much more complicated, but basically, with

import numpy as np
import xarray as xr
X = xr.DataArray(np.random.normal(size=(10, 10)),
                 coords=[range(10), range(10)],)

if I want to choose only values larger than 0 from X, it seems I cannot do X[X > 0]; I have to do X.values[X.values > 0] instead. You can see how this can quickly get long if I'm doing this for assignment with multidimensional xarrays - something like

X.loc[:, :, :, 'variable'].values[X.loc[:, :, :, 'variable'].values > 0] = Y.loc[:, :, :, 'variable'].values[Y.loc[:, :, :, 'variable'].values > 0] 

Maybe I'm mistaken and there is a way to do this more nicely, but I haven't been able to figure it out.
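For the selection part (not the assignment), one label-preserving alternative that xarray does provide is DataArray.where, which keeps the shape and masks non-matching entries with NaN rather than dropping them (a minimal sketch):

```python
import numpy as np
import xarray as xr

X = xr.DataArray(np.random.normal(size=(10, 10)),
                 coords=[range(10), range(10)])

# Unlike X.values[X.values > 0], this keeps labels and shape,
# replacing non-positive entries with NaN instead of dropping them.
positives = X.where(X > 0)
```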

Thank you!

@shoyer
Member

shoyer commented Aug 28, 2016

@joonro Yes, this does get messy. We'll eventually support indexing like X[X > 0] directly, which will help significantly.

In the meantime, you can still break things up onto multiple lines by saving temporary variables:

condition = X.loc[..., 'variable'].values > 0
X.loc[..., 'variable'].values[condition] = Y.loc[..., 'variable'].values[condition]

Using abbreviations like ... for :, :, : (assuming 'variable' is along the last axis) can also help.
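The Ellipsis shorthand behaves just as it does in plain NumPy: ... expands to as many full slices as needed, so the two spellings select the same data. A quick check in NumPy alone:

```python
import numpy as np

a = np.arange(24).reshape(2, 3, 4)

# `...` stands in for `:, :` here, selecting index 1 along the last axis.
assert np.array_equal(a[..., 1], a[:, :, 1])
```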

@joonro
Author

joonro commented Aug 28, 2016

@shoyer I think I saw ... a long time ago and must have forgotten about it. Thank you so much for reminding me - I was really hoping for something like ... for a while.

Btw, I must say that not only is xarray extremely useful for much of my research, but the devs' responses on issues have been superb. Definitely one of the most pleasant experiences I have had with developers. Thank you.

@shoyer
Member

shoyer commented Aug 28, 2016

Thanks @joonro, you are very kind!

I'm going to close this issue since I think we resolved the original question.
