Selection of specific input files in xarray #36

milenaveneziani · 2016-10-25T16:14:51Z

At the moment, when loading multi-files with xarray, the input string cannot be a list or include wild-card characters of the kind, say [0-3].
This is not useful when wanting to load a reduced dataset when the goal is to compute climatologies over a certain number of years.

Say for example that we want to compute the following seasonal averages: DJF, MAM, JJA, SON, over year 31-40. We would want to load the following input files:
year3[0-9], year40
but instead xarray only allows something like year3?

xylar · 2016-10-25T16:41:41Z

@milenaveneziani, good call. xarray also allows you to supply a simple list of files, so we could probably use glob.glob, which supports regular expressions of the type you want, to get a list of file paths to pass on to xarray. @pwolfram, what do you think? Could this be added to mpas_xarray so we don't have to alter every call to xarray.open_mfdataset?

pwolfram · 2016-10-26T02:58:35Z

@xylar, yes this can be solved but we would have to use glob outside mpas_xarray / xarray because xarray expects a string on input, e.g., from their code:

if isinstance(paths, basestring):
    paths = sorted(glob(paths))
if not paths:
    raise IOError('no files to open')

Therefore, we could make a function like

def paths(*args):
    """ 
    Returns glob'd paths in list for arbitrary number of function arguments.
    Note, each expanded set of paths is sorted. 

    Phillip J. Wolfram
    10/25/2016
    """
    paths = []
    for aargs in args:
        paths += sorted(glob.glob(aargs))
    return paths

and call xarray via

ds = xr.open_mfdataset(paths('*year3[0-9]*nc', '*year40*nc'), ...)

I would say the syntax here is simple and fairly clean and will give you a list of paths that xarray should open.

I've submitted a PR: #37

xylar · 2016-10-27T12:14:08Z

cc @pwolfram, @milenaveneziani, @vanroekel (I think this comment got missed because I didn't explicitly put any names on it).

While #37 and #38 partly address this issue, there is not currently a way to specify (e.g. in the config file) the start and end times of the analysis. How would we want to specify start and end times? Presumably with date strings?

A related question is how we want to apply that range. It would seem most useful to do this via StreamsFile.readpath, since one still knows a reasonable amount about the date formatting at that point. But there are complications here, e.g. that at this stage we don't know what files are actually available.

An option would be for StreamsFile.readpath to become more sophisticated, taking optional start and end dates, (and possibly even a stride, though that gets complicated indeed). StreamsFile.readpath could use shared.io.utility.paths to get a list of all possible files but then it could remove files from the list that were not in range (meaning it would parse out the date again from each file name, and then perform date comparison -- again a bit tricky).

thoughts?

pwolfram · 2016-10-28T03:17:23Z

@xylar, if we have the general wild-carded paths we can use xarray to do the time slicing. I think this is the simplest approach because xarray does lazy evaluation and doesn't have to load all the data. However, we should keep this in mind in case we find that opening a ton of files creates either a crash or performance issues (which it can at certain scales, e.g., 1000+ files). If so, this is probably a brand new issue / discussion on how to deal with that problem.

pwolfram · 2016-10-28T03:18:36Z

I think we have addressed this issue with #37 and should close this issue unless there is something else we are missing-- would you agree @xylar and @milenaveneziani? If so, please feel free to close this issue.

xylar · 2016-10-28T09:57:16Z

@pwolfram, no, I don't think this has been addressed. See my comment above: #36 (comment)

xylar · 2016-10-28T20:39:00Z

@pwolfram, I now think this issue will be fully addressed by #37. Even so, I don't think the issue should be closed until #37 gets merged, since the issue would remain unaddressed if we decided to close #37 for some reason. Does that make sense to you? Or is it standard GitHub practice to close issues as soon as there is a PR that addresses them?

pwolfram · 2016-11-01T17:39:29Z

@xylar, my understanding is that the practice is to close the issue once the PR is merged. However, it looks like #38 is derivative from this issue so it shouldn't be closed just yet.

milenaveneziani · 2016-11-04T04:46:46Z

not closing this until we sort out the start/end time issue encountered in #38.

milenaveneziani · 2016-11-04T15:41:08Z

Since we now have #45, I think we can perhaps close this one?

xylar · 2016-11-04T15:43:08Z

Agreed.

pwolfram mentioned this issue Oct 26, 2016

Adds path function to simplify xarray open_mfdataset calls #37

Merged

xylar mentioned this issue Oct 28, 2016

Read paths from streams and config from namelist #38

Merged

xylar closed this as completed Nov 4, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Selection of specific input files in xarray #36

Selection of specific input files in xarray #36

milenaveneziani commented Oct 25, 2016 •

edited

Loading

xylar commented Oct 25, 2016

pwolfram commented Oct 26, 2016

xylar commented Oct 27, 2016 •

edited

Loading

pwolfram commented Oct 28, 2016

pwolfram commented Oct 28, 2016

xylar commented Oct 28, 2016

xylar commented Oct 28, 2016

pwolfram commented Nov 1, 2016 •

edited

Loading

milenaveneziani commented Nov 4, 2016

milenaveneziani commented Nov 4, 2016

xylar commented Nov 4, 2016

Selection of specific input files in xarray #36

Selection of specific input files in xarray #36

Comments

milenaveneziani commented Oct 25, 2016 • edited Loading

xylar commented Oct 25, 2016

pwolfram commented Oct 26, 2016

xylar commented Oct 27, 2016 • edited Loading

pwolfram commented Oct 28, 2016

pwolfram commented Oct 28, 2016

xylar commented Oct 28, 2016

xylar commented Oct 28, 2016

pwolfram commented Nov 1, 2016 • edited Loading

milenaveneziani commented Nov 4, 2016

milenaveneziani commented Nov 4, 2016

xylar commented Nov 4, 2016

milenaveneziani commented Oct 25, 2016 •

edited

Loading

xylar commented Oct 27, 2016 •

edited

Loading

pwolfram commented Nov 1, 2016 •

edited

Loading