implement interp() #2104

Merged
merged 56 commits into pydata:master on Jun 8, 2018

Conversation

fujiisoup
Member

@fujiisoup fujiisoup commented May 4, 2018

  • Closes New feature: interp1d #2079 (remove if there is no corresponding issue, which should only be the case for minor changes)
  • Tests added (for all bug fixes or enhancements)
  • Tests passed (for all non-documentation changes)
  • Fully documented, including whats-new.rst for all changes and api.rst for new API (remove if this change should not be visible to users, e.g., if it is an internal clean-up, or if this is part of a larger project that will be documented later)

I started working on adding interpolate_at to xarray, as discussed in issue #2079 (but without caching).

I think I need to take care of more edge cases, but before finishing up this PR, I want to discuss what the best API is.

I would like this method to work similarly to isel, which may support vectorized interpolation.
Currently, this works as follows:

In [1]: import numpy as np
   ...: import xarray as xr
   ...: 
   ...: da = xr.DataArray([0, 0.1, 0.2, 0.1], dims='x', coords={'x': [0, 1, 2, 3]})
   ...: 

In [2]: # simple linear interpolation
   ...: da.interpolate_at(x=[0.5, 1.5])
   ...: 
Out[2]: 
<xarray.DataArray (x: 2)>
array([0.05, 0.15])
Coordinates:
  * x        (x) float64 0.5 1.5

In [3]: # with cubic spline interpolation
   ...: da.interpolate_at(x=[0.5, 1.5], method='cubic')
   ...: 
Out[3]: 
<xarray.DataArray (x: 2)>
array([0.0375, 0.1625])
Coordinates:
  * x        (x) float64 0.5 1.5

In [4]: # interpolation at one single position
   ...: da.interpolate_at(x=0.5)
   ...: 
Out[4]: 
<xarray.DataArray ()>
array(0.05)
Coordinates:
    x        float64 0.5

In [5]: # interpolation with broadcasting
   ...: da.interpolate_at(x=xr.DataArray([[0.5, 1.0], [1.5, 2.0]], dims=['y', 'z']))
   ...: 
Out[5]: 
<xarray.DataArray (y: 2, z: 2)>
array([[0.05, 0.1 ],
       [0.15, 0.2 ]])
Coordinates:
    x        (y, z) float64 0.5 1.0 1.5 2.0
Dimensions without coordinates: y, z

In [6]: da = xr.DataArray([[0, 0.1, 0.2], [1.0, 1.1, 1.2]], 
   ...:                   dims=['x', 'y'], 
   ...:                   coords={'x': [0, 1], 'y': [0, 10, 20]})
   ...: 

In [7]: # multidimensional interpolation
   ...: da.interpolate_at(x=[0.5, 1.5], y=[5, 15])
   ...: 
Out[7]: 
<xarray.DataArray (x: 2, y: 2)>
array([[0.55, 0.65],
       [ nan,  nan]])
Coordinates:
  * x        (x) float64 0.5 1.5
  * y        (y) int64 5 15

In [8]: # multidimensional interpolation with broadcasting
   ...: da.interpolate_at(x=xr.DataArray([0.5, 1.5], dims='z'), 
   ...:                   y=xr.DataArray([5, 15], dims='z'))
   ...: 
Out[8]: 
<xarray.DataArray (z: 2)>
array([0.55,  nan])
Coordinates:
    x        (z) float64 0.5 1.5
    y        (z) int64 5 15
Dimensions without coordinates: z
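For reference, the broadcasting result in Out[5] can be reproduced with plain NumPy (a hedged sketch, not code from this PR): np.interp is applied to the flattened target locations and the result reshaped.

```python
import numpy as np

# Pointwise linear interpolation along x, as in Out[5] above.
# np.interp only handles 1-D linear interpolation; broadcasting is
# emulated by flattening the target locations and reshaping the result.
values = np.array([0.0, 0.1, 0.2, 0.1])
x = np.array([0.0, 1.0, 2.0, 3.0])
new_x = np.array([[0.5, 1.0], [1.5, 2.0]])  # dims ('y', 'z')

result = np.interp(new_x.ravel(), x, values).reshape(new_x.shape)
# result is [[0.05, 0.1], [0.15, 0.2]], matching Out[5]
```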

Design question

  1. How many interpolate methods should we support?
    Currently, I implemented only scipy.interpolate.interp1d for 1-dimensional interpolation and scipy.interpolate.RegularGridInterpolator for multidimensional interpolation.
    I think 90% of use cases are linear, but there are more methods in scipy.

  2. How do we handle nan?
    Currently this raises a ValueError if NaN is present. It may be possible to carry out the interpolation while skipping NaN, but in that case performance would drop significantly because it cannot be vectorized.

  3. Do we support interpolation along a dimension without a coordinate?
    In that case, do we attach a new coordinate to the object?

  4. What should we do if the new coordinate has a dimensional coordinate for the dimension to be interpolated?
    e.g. in the following case,

da = xr.DataArray([0, 0.1, 0.2, 0.1], dims='x', coords={'x': [0, 1, 2, 3]})
rslt = da.interpolate_at(x=xr.DataArray([0.5, 1.5], dims=['x'], coords={'x': [1, 3]}))

what would be rslt['x']?

I appreciate any comments.

@@ -6,7 +6,8 @@
import numpy as np
import pandas as pd

from . import computation, groupby, indexing, ops, resample, rolling, utils
from . import (computation, dtypes, groupby, indexing, ops, resample, rolling,
Contributor:

F401 '.dtypes' imported but unused


import numpy as np
from .computation import apply_ufunc
from .pycompat import (OrderedDict, dask_array_type)
Contributor:

F401 '.pycompat.OrderedDict' imported but unused

imin = x.to_index().get_loc(np.min(new_x), method='ffill')
imax = x.to_index().get_loc(np.max(new_x), method='bfill')

idx = slice(np.maximum(imin-1, 0), imax+1)
Contributor:

E226 missing whitespace around arithmetic operator


_assert_single_chunks(obj, [-1])
chunks = obj.chunks[:-len(x)] + new_x[0].shape
drop_axis = range(obj.ndim-len(x), obj.ndim)
Contributor:

E226 missing whitespace around arithmetic operator

_assert_single_chunks(obj, [-1])
chunks = obj.chunks[:-len(x)] + new_x[0].shape
drop_axis = range(obj.ndim-len(x), obj.ndim)
new_axis = range(obj.ndim-len(x), obj.ndim-len(x)+new_x[0].ndim)
Contributor:

E226 missing whitespace around arithmetic operator

xarray/interp.py Outdated
array: np.ndarray or da.ndarray. Should not contain nan.
old_coords, new_coords: list of np.ndarrays.
"""
from scipy import interpolate
Contributor:

F401 'scipy.interpolate' imported but unused

xarray/interp.py Outdated



def ghosted_interp_func(array, old_coords, new_coords, method):
Contributor:

E303 too many blank lines (3)

x = np.linspace(0, 1, 100)
y = np.linspace(0, 0.1, 30)
data = xr.DataArray(
    np.sin(x[:, np.newaxis])*np.cos(y), dims=['x', 'y'],
Contributor:

E226 missing whitespace around arithmetic operator

y = np.linspace(0, 0.1, 30)
z = np.linspace(0.1, 0.2, 10)
return xr.DataArray(
    np.sin(x[:, np.newaxis, np.newaxis])*np.cos(y[:, np.newaxis])*z,
Contributor:

E226 missing whitespace around arithmetic operator


if dim == 'y' and case == 1:
    with pytest.raises(ValueError):
        actual = da.interpolate_at(**{dim: xdest}, method=method)
Contributor:

E999 SyntaxError: invalid syntax

@fmaussion
Member

@fujiisoup this is awesome! So glad you are looking into this.

To point 1: How many interpolate methods should we support?.

I think that your implementation is a very good first step. From the geosciences point of view I think that Spline interpolation would be a very good candidate too, but it only works in the 2-dimensional case which makes it a no-go for the general case...

@jhamman
Member

jhamman commented May 4, 2018

@fujiisoup - This is great. I wonder if we can draw more from the interpolator adapters I built in missing.py:

def _get_interpolator(method, **kwargs):
    '''helper function to select the appropriate interpolator class

    returns a partial of wrap_interpolator
    '''
    interp1d_methods = ['linear', 'nearest', 'zero', 'slinear', 'quadratic',
                        'cubic', 'polynomial']
    valid_methods = interp1d_methods + ['barycentric', 'krog', 'pchip',
                                        'spline', 'akima']
    if (method == 'linear' and not
            kwargs.get('fill_value', None) == 'extrapolate'):
        kwargs.update(method=method)
        interp_class = NumpyInterpolator
    elif method in valid_methods:
        try:
            from scipy import interpolate
        except ImportError:
            raise ImportError(
                'Interpolation with method `%s` requires scipy' % method)
        if method in interp1d_methods:
            kwargs.update(method=method)
            interp_class = ScipyInterpolator
        elif method == 'barycentric':
            interp_class = interpolate.BarycentricInterpolator
        elif method == 'krog':
            interp_class = interpolate.KroghInterpolator
        elif method == 'pchip':
            interp_class = interpolate.PchipInterpolator
        elif method == 'spline':
            kwargs.update(method=method)
            interp_class = SplineInterpolator
        elif method == 'akima':
            interp_class = interpolate.Akima1DInterpolator
        else:
            raise ValueError('%s is not a valid scipy interpolator' % method)
    else:
        raise ValueError('%s is not a valid interpolator' % method)

    return partial(wrap_interpolator, interp_class, **kwargs)

I just gave your implementation a quick skim and I think there is currently some duplication of efforts so maybe we look there to see what can be combined.

@shoyer
Member

shoyer commented May 4, 2018

First of all, this is awesome!

One question: since I think this new interpolation method will be used quite frequently (more often than interpolate_na and reindex), perhaps we should give it a shorter name. Maybe interp?

We should also consider adding interp_like or interpolate_like by analogy to reindex_like.

I would like this method to work similarly to isel, which may support vectorized interpolation.

Yes, I agree. Interpolation should be vectorized similarly to isel/sel.

In particular, in the long term I think we should aim to make interpolate_at(..., method='nearest') and reindex(..., method='nearest') equivalent.
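The intended equivalence can be sketched with plain NumPy (a hedged illustration, not this PR's code): nearest-neighbor interpolation just picks the value at the closest coordinate, which is exactly what reindex(method='nearest') does.

```python
import numpy as np

# 'nearest' interpolation: pick the value at the closest source coordinate
x = np.array([0.0, 1.0, 2.0, 3.0])
values = np.array([0.0, 0.1, 0.2, 0.1])
new_x = np.array([0.4, 2.6])

# index of the nearest source coordinate for each target point
nearest_idx = np.abs(x[:, None] - new_x[None, :]).argmin(axis=0)
result = values[nearest_idx]
# result is [0.0, 0.1]: 0.4 snaps to x=0, 2.6 snaps to x=3
```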

How many interpolate methods should we support?

I think this is fine to start (we can always add more later).

How do we handle nan?

Which NaN are you referring to?

  1. data on the arrays being indexed
  2. Coordinates on the arrays being indexed
  3. Points at which to interpolate

Case (1) should definitely be supported. Especially if data is stored in dask arrays, we cannot necessarily check if there are NaNs.

Cases (2) and (3) are not important, because there are relatively few use-cases for NaN in coordinate arrays.

Do we support interpolation along dimension without coordinate?
In that case, do we attach new coordinate to the object?

Currently we don't support this for reindex(), but I suppose we could do so unambiguously.

What should we do if the new coordinate has a dimensional coordinate for the dimension to be interpolated?

I think the new coordinates should take priority, and the dimension coordinates on the new coordinate should be dropped. This is similar to what we do for sel():

In [10]: da = xr.DataArray([0, 0.1, 0.2, 0.1], dims='x', coords={'x': [0, 1, 2, 3]})
    ...:

In [11]: da.sel(x=xr.DataArray([1, 2], dims=['x'], coords={'x': [1, 3]}))
    ...:
Out[11]:
<xarray.DataArray (x: 2)>
array([0.1, 0.2])
Coordinates:
  * x        (x) int64 1 2

@fujiisoup
Member Author

@jhamman,

I just gave your implementation a quick skim and I think there is currently some duplication of efforts so maybe we look there to see what can be combined.

Agreed. I will try to generalize your interpolator-adaptor more.

@shoyer

perhaps we should give it a shorter name. Maybe interp?

Agreed. interp sounds nice.

Which NaN are you referring to?

  1. data on the arrays being indexed
  2. Coordinates on the arrays being indexed
  3. Points at which to interpolate

I meant 1.

scipy.interpolate.interp1d does not support arrays with NaN.
So if we want to support this, we need to interpolate the array line by line as interpolate_na does, i.e. with no vectorization.
For multidimensional interpolation of arrays with NaN, it is equivalent to interpolation on non-gridded data. It is possible, but performance will drop significantly.


idx = slice(np.maximum(imin - 1, 0), imax + 1)
index_coord[dim] = (x[idx], new_x)
obj = obj.isel(**{dim: idx})
Member Author:

Currently, only a small portion of the array is used for interpolation with method='linear' / 'nearest'. So if there are no NaNs in this region, we can interpolate the array.
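The window-selection idea in the snippet above can be sketched with np.searchsorted in place of pandas' get_loc (a simplified, hedged illustration; the actual code uses to_index().get_loc with ffill/bfill and keeps one extra ghost point via imin - 1).

```python
import numpy as np

# For linear/nearest interpolation only the source coordinates that
# bracket new_x matter, so slice the array down to that window first.
x = np.array([0, 1, 2, 3, 4, 5])
new_x = np.array([2.2, 2.8])

imin = np.searchsorted(x, new_x.min(), side='right') - 1  # floor index
imax = np.searchsorted(x, new_x.max(), side='left')       # ceil index
idx = slice(max(imin, 0), imax + 1)
# x[idx] is [2, 3]: only these two points are needed
```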

@fujiisoup
Member Author

Maybe the possible API would be

da.interp(x=[0, 1, 2], skipna=True)

for nan-skipping interpolation and

da.interp(x=[0, 1, 2], skipna=False)

for faster interpolation.

@shoyer
Member

shoyer commented May 5, 2018

I still need to look at the code here, but I don't think we need separate code paths for NaN vs no-NaNs. For example, suppose we have a DataArray with a coordinate x=[0, 1, 2], then da.interp(x=0.5, method='linear') should be equal to 0.5 * (da.sel(x=0) + da.sel(x=1)), regardless of the values of the DataArray. If either da[0] or da[1] is NaN, then the result would be NaN, too.
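This NaN propagation falls out of the arithmetic itself, as a plain-NumPy check shows (a hedged sketch, not this PR's code):

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0])
y = np.array([0.0, np.nan, 2.0])

# Linear interpolation is a weighted sum of the two bracketing values,
# so one NaN neighbour makes the interpolated value NaN.
r1 = np.interp(0.5, x, y)                   # neighbour y[1] is NaN -> NaN
r2 = np.interp(0.5, x[[0, 2]], y[[0, 2]])   # with the NaN point removed -> 0.5
```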

@fujiisoup
Member Author

@shoyer ,
Yes, linear and nearest can handle NaN intuitively, but other methods need to invert some matrices. For example, cubic spline interpolation cannot be used with NaN-containing arrays:

In [1]: import numpy as np
   ...: from scipy.interpolate import interp1d
   ...: 
   ...: x = np.arange(4)
   ...: y = np.array([0, 1, np.nan, 3])
   ...: 

In [2]: # 'linear' works as expected
   ...: interp1d(x, y, kind='linear')([0.5, 1.5])
   ...: 
Out[2]: array([0.5, nan])

In [3]: # 'cubic' returns an all nan array
   ...: interp1d(x, y, kind='cubic')([0.5, 1.5])
   ...: 
Out[3]: array([nan, nan])

@fmaussion
Member

linear and nearest can handle NaN intuitively, but other methods need to invert some matrices

Why not leave the NaN handling to the actual function doing the interpolation? RegularGridInterpolator handles NaNs pretty much the same way @shoyer describes it.

@fujiisoup
Member Author

RegularGridInterpolator handles NaNs pretty much the same way @shoyer describes it.

I had not fully realized how RegularGridInterpolator works, but it seems to work with NaN-containing arrays too, as you pointed out.
But higher-order interpolations (RegularGridInterpolator is linear), such as RectBivariateSpline, do not work (they return an all-NaN array) when even a single NaN is included in the target array.
The same holds for higher-order methods in scipy.interpolate.interp1d such as cubic.

Why don't leave the NaN handling part to the actual function doing the interpolation?

It is the original behavior of the scipy.interpolate functions (except for the linear and nearest methods), which perform matrix inversions in their algorithms.

I think the addition of linear interpolation is still nice, but it would be nicer if we could also implement NaN-skipping interpolation, which would enable us to use higher-order interpolation with NaN-containing arrays.

@fmaussion
Member

I think the addition of linear interpolation is still nice, but it would be nicer if we could also implement NaN-skipping interpolation, which would enable us to use higher-order interpolation with NaN-containing arrays.

Do you mean you want to use unstructured data interpolation routines? (https://docs.scipy.org/doc/scipy/reference/interpolate.html#multivariate-interpolation)

@fujiisoup
Member Author

Yes, I am thinking of this for multidimensional interpolation of NaN-containing arrays.
(For 1-dimensional interpolation, we can interpolate them row by row.)
Although this should be another piece of work (PR), I am wondering what the best API is.

Is skipna=True option handy? If so, which should be the default?
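For the 1-D case, the row-by-row NaN-skipping loop mentioned above could look like this (a hedged sketch of a possible skipna=True path; the helper name is hypothetical, not code from this PR):

```python
import numpy as np

def interp_skipna_1d(x, arr2d, new_x):
    # Interpolate each row separately, dropping that row's NaNs first.
    # The per-row loop cannot be vectorized, hence the performance cost
    # discussed above.
    out = np.empty((arr2d.shape[0], len(new_x)))
    for i, row in enumerate(arr2d):
        ok = ~np.isnan(row)
        out[i] = np.interp(new_x, x[ok], row[ok])
    return out
```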

@shoyer
Member

shoyer commented May 7, 2018

My inclination would be to limit this method to interpolation on regular grids, i.e., the methods supported by scipy.interpolate.interpn or interp1d.

I think a skipna=True option could be useful, but I don't know how we could do that for regular interpolation with multi-dimensional data. So I think it might make sense to save skipna=True for interpolation on unstructured grids.

if isinstance(obj, dask_array_type):
    import dask.array as da

    _assert_single_chunk(obj, range(obj.ndim-len(x), obj.ndim))
Contributor:

E226 missing whitespace around arithmetic operator


import xarray as xr
from xarray.tests import assert_allclose, assert_equal, requires_scipy
from . import has_dask, has_scipy
Contributor:

F401 '.has_scipy' imported but unused

@fujiisoup
Member Author

fujiisoup commented Jun 4, 2018

Done!

But I am slightly worried that silently dropping non-numeric arrays may be surprising.
Do we have other methods that behave similarly?

Or maybe we can raise a warning?

@shoyer
Member

shoyer commented Jun 5, 2018

Do we have other methods that behave similarly?

We do the same thing for aggregations like mean(), behavior originally copied from pandas:
http://pandas.pydata.org/pandas-docs/version/0.22/generated/pandas.DataFrame.mean.html

We could add a numeric_only=True argument (like we do for reduction methods) to make this behavior more obvious and allow it to be controlled.
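The numeric_only filtering can be sketched with a dtype check (a hedged illustration; the helper name is hypothetical and this is not xarray's actual implementation):

```python
import numpy as np

def filter_numeric(variables, numeric_only=True):
    # Drop non-numeric variables before applying the operation,
    # mirroring what aggregations like mean() do by default.
    if not numeric_only:
        return dict(variables)
    return {name: arr for name, arr in variables.items()
            if np.issubdtype(arr.dtype, np.number)}
```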

@fujiisoup
Member Author

Thanks, @shoyer

We do the same thing for aggregations like mean(), behavior originally copied from pandas:
http://pandas.pydata.org/pandas-docs/version/0.22/generated/pandas.DataFrame.mean.html

Thanks. I was not surprised when mean dropped object-type arrays, so it should be OK for interp too.

Maybe this PR is ready, then.

BTW, I noticed that our mean() does not accept the numeric_only keyword argument.
Our docs do not mention the numeric_only keyword either.
Is this intended?
(This behavior looks already accepted, and I think it is OK if we do not support it.)

@shoyer
Member

shoyer commented Jun 6, 2018

Hmm. I guess numeric_only may currently only be an argument for the reduce method. We probably should add it for methods like mean, too.

@@ -1815,24 +1815,29 @@ def reindex(self, indexers=None, method=None, tolerance=None, copy=True,
return self._replace_vars_and_dims(variables, coord_names)

def interp(self, coords=None, method='linear', assume_sorted=False,
kwargs={}, **coords_kwargs):
keep_attrs=False, kwargs={}, **coords_kwargs):
Member:

My inclination would be to have keep_attrs=True by default, and maybe not even bother to include the argument at all. This would be consistent with reindex/sel/isel.

Member Author:

OK. Then, I would remove this argument.

@shoyer
Member

shoyer commented Jun 6, 2018

Do you want to add interp_like here or should we leave it for a later PR?

@fujiisoup
Member Author

interp_like sounds like a good idea. I'd prefer to do it in another PR, as this one is already huge.

@shoyer left a comment (Member)

I think the last thing to do here is remove keep_attrs?

if len(var.chunks[axis]) > 1 or var.chunks[axis][0] < var.shape[axis]:
    raise NotImplementedError(
        'Chunking along the dimension to be interpolated '
        '({}) is not supported.'.format(axis))
Member:

"not supported" -> "not yet supported"

scipy.interpolate.interp1d
"""
if not x:
    return var
Member:

For consistency on views/copies, it's better to return var.copy() here

    kwargs.update(method=method)
    interp_class = interpolate.interpn
else:
    raise ValueError('%s is not a valid interpolator' % method)
Member:

add: "for interpolating over multiple dimensions"

@shoyer shoyer mentioned this pull request Jun 7, 2018
4 tasks
@fujiisoup
Member Author

Thanks, @shoyer.

I think the last thing to do here is remove keep_attrs?

Done!
(I thought I reverted this change already...)

@fujiisoup fujiisoup mentioned this pull request Jun 7, 2018
@shoyer
Member

shoyer commented Jun 7, 2018

OK, let's merge this when CI passes.

@fujiisoup fujiisoup merged commit e397299 into pydata:master Jun 8, 2018
@fujiisoup
Member Author

Merged!

Thanks, @shoyer, @fmaussion, @jhamman, for the detailed review.

@rabernat
Contributor

rabernat commented Jun 8, 2018

Congratulations on this great new feature!

@yohai
Contributor

yohai commented Jun 11, 2018

@fujiisoup Great feature! thanks.
There's a typo ('rondomloy') in the documentation: https://github.com/pydata/xarray/blame/master/doc/interpolation.rst#L192

Successfully merging this pull request may close these issues: New feature: interp1d

9 participants