Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Selection of specific input files in xarray #36

Closed
milenaveneziani opened this issue Oct 25, 2016 · 11 comments
Closed

Selection of specific input files in xarray #36

milenaveneziani opened this issue Oct 25, 2016 · 11 comments

Comments

@milenaveneziani
Copy link
Collaborator

milenaveneziani commented Oct 25, 2016

At the moment, when loading multi-files with xarray, the input string cannot be a list or include wild-card characters of the kind, say [0-3].
This is not useful when wanting to load a reduced dataset when the goal is to compute climatologies over a certain number of years.

Say for example that we want to compute the following seasonal averages: DJF, MAM, JJA, SON, over year 31-40. We would want to load the following input files:
year3[0-9], year40
but instead xarray only allows something like year3?

@xylar
Copy link
Collaborator

xylar commented Oct 25, 2016

@milenaveneziani, good call. xarray also allows you to supply a simple list of files, so we could probably use glob.glob, which supports regular expressions of the type you want, to get a list of file paths to pass on to xarray. @pwolfram, what do you think? Could this be added to mpas_xarray so we don't have to alter every call to xarray.open_mfdataset?

@pwolfram
Copy link
Contributor

@xylar, yes this can be solved but we would have to use glob outside mpas_xarray / xarray because xarray expects a string on input, e.g., from their code:

if isinstance(paths, basestring):
    paths = sorted(glob(paths))
if not paths:
    raise IOError('no files to open')

Therefore, we could make a function like

def paths(*args):
    """ 
    Returns glob'd paths in list for arbitrary number of function arguments.
    Note, each expanded set of paths is sorted. 

    Phillip J. Wolfram
    10/25/2016
    """
    paths = []
    for aargs in args:
        paths += sorted(glob.glob(aargs))
    return paths

and call xarray via

ds = xr.open_mfdataset(paths('*year3[0-9]*nc', '*year40*nc'), ...)

I would say the syntax here is simple and fairly clean and will give you a list of paths that xarray should open.

I've submitted a PR: #37

@xylar
Copy link
Collaborator

xylar commented Oct 27, 2016

cc @pwolfram, @milenaveneziani, @vanroekel (I think this comment got missed because I didn't explicitly put any names on it).

While #37 and #38 partly address this issue, there is not currently a way to specify (e.g. in the config file) the start and end times of the analysis. How would we want to specify start and end times? Presumably with date strings?

A related question is how we want to apply that range. It would seem most useful to do this via StreamsFile.readpath, since one still knows a reasonable amount about the date formatting at that point. But there are complications here, e.g. that at this stage we don't know what files are actually available.

An option would be for StreamsFile.readpath to become more sophisticated, taking optional start and end dates, (and possibly even a stride, though that gets complicated indeed). StreamsFile.readpath could use shared.io.utility.paths to get a list of all possible files but then it could remove files from the list that were not in range (meaning it would parse out the date again from each file name, and then perform date comparison -- again a bit tricky).

thoughts?

@pwolfram
Copy link
Contributor

@xylar, if we have the general wild-carded paths we can use xarray to do the time slicing. I think this is the simplest approach because xarray does lazy evaluation and doesn't have to load all the data. However, we should keep this in mind in case we find that opening a ton of files creates either a crash or performance issues (which it can at certain scales, e.g., 1000+ files). If so, this is probably a brand new issue / discussion on how to deal with that problem.

@pwolfram
Copy link
Contributor

I think we have addressed this issue with #37 and should close this issue unless there is something else we are missing-- would you agree @xylar and @milenaveneziani? If so, please feel free to close this issue.

@xylar
Copy link
Collaborator

xylar commented Oct 28, 2016

@pwolfram, no, I don't think this has been addressed. See my comment above: #36 (comment)

@xylar
Copy link
Collaborator

xylar commented Oct 28, 2016

@pwolfram, I now think this issue will be fully addressed by #37. Even so, I don't think the issue should be closed until #37 gets merged, since the issue would remain unaddressed if we decided to close #37 for some reason. Does that make sense to you? Or is it standard GitHub practice to close issues as soon as there is a PR that addresses them?

@pwolfram
Copy link
Contributor

pwolfram commented Nov 1, 2016

@xylar, my understanding is that the practice is to close the issue once the PR is merged. However, it looks like #38 is derivative from this issue so it shouldn't be closed just yet.

@milenaveneziani
Copy link
Collaborator Author

not closing this until we sort out the start/end time issue encountered in #38.

@milenaveneziani
Copy link
Collaborator Author

Since we now have #45, I think we can perhaps close this one?

@xylar
Copy link
Collaborator

xylar commented Nov 4, 2016

Agreed.

@xylar xylar closed this as completed Nov 4, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants