Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Supports ingesting arbitrary MPAS output files (in general, input info from namelist and streams files) #20

Closed
pwolfram opened this issue Sep 15, 2016 · 43 comments

Comments

@pwolfram
Copy link
Contributor

pwolfram commented Sep 15, 2016

The MPAS analysis framework is expected to support stand-alone MPAS workflows, as well as ACME workflows. As such, the analysis framework should be capable of ingesting MPAS output, regardless of how it was generated.

NOTE: This could imply that an input to the analysis framework should be the namelist and streams file used to generate the output, which can then be parsed to easily understand what output was generated, and where it should exist.

@milenaveneziani
Copy link
Collaborator

@pwolfram: this definitely needs to be prioritized, and made more general. For example, right now, our scripts will fail on ACME alpha8 output because we have changed the filename_templates in our default streams there, and the python scripts are not taking this change into account.
Other things that will be affected, as we talked about on Wed, are: variable filenames, existence of certain output depending on whether specific analysis member was run, details of mpas_xarray, regional computations (when we introduce them).

Can we discuss what would be the best way to solve this? At what level do you check for namelists and streams: in the run_analysis.py?

@milenaveneziani milenaveneziani changed the title Supports ingesting arbitrary MPAS output files Supports ingesting arbitrary MPAS output files (in general, input info from namelist and streams files) Sep 30, 2016
@xylar
Copy link
Collaborator

xylar commented Sep 30, 2016

@milenaveneziani and @pwolfram, this is another thing that I could help with once I'm back in 2 weeks. I'll check with you then to see where things stand.

@pwolfram
Copy link
Contributor Author

pwolfram commented Oct 1, 2016

@milenaveneziani, as far as I can tell, the main problems and potential solutions are:

Need Possible Solution
1. Automate selection of input files opened and used by the analysis function Obtain path to file from streams / namelist parsing
2. Ensure that a particular variable can be renamed and this is accounted for in the analysis without error Register new variable to analysis-specific keyword with python class / dictionary (shared file across analysis)
3. Check for existence of model output Needs to be tested for cases of 1 and 2 above

I'm a little unclear about what does / doesn't need changed or validated for mpas_xarray as well as what we need to do for regional computations. Additional details would be really helpful.

Once we have a full list of the problems I think we should be able to design some simple and clean solutions to the problems via a module that contains code that should streamline this process, to probably be put in shared/io.

I think we should be able to assume that namelist / streams are of proper format. The key things is their location and this is going to be one of the first things that is potentially checked, probably right about here after we check to see if the model directory exists: https://github.com/MPAS-Dev/MPAS-Analysis/blob/master/run_analysis.py#L21. If the namelist and streams files don't exist I'm thinking we issue a warning and just go with some set of defaults, recognizing the analysis may fail because additional information is needed but not provided (streams/namelist).

@milenaveneziani
Copy link
Collaborator

milenaveneziani commented Oct 4, 2016

@pwolfram, @xylar: here are the specific things I can think of right now:

  1. whether a certain AM is run or not (from namelist)
  2. filenames (from streams)
  3. variable names (from streams)
  4. for mpas_xarray: does xtime_start and xtime_end exist? and how are these quantities called exactly? (not sure if this is something we would get from the streams or the registry: tagging @jonwoodring here also). If xtime_start/xtime_end do not exist, use daySinceStartOfSim.
  5. and then there is the regions issue. What I am thinking of at the moment is not the regional stats AM, but rather the regional computations that we would like to do for some existing am's, such as the MOC, the MHT, the layerWeightedAverages and the surfaceAverages. At the moment, we have a placeholder for iregions, but we need to find a way (using the regional mask files and their attributes, I suppose) to identify the regions that are associated with a certain group, for example, and apply the analysis script to those regions only. I am still a bit vague about how this would be implemented in practice, but I think it will be useful to look at a regional mask file and the way that that is used in evolving AM's such as the MOC and the regional MHT.

@milenaveneziani
Copy link
Collaborator

Of the list above, we need to prioritize 1-3, because at the moment the scripts will break if running on ACME alpha8 output, for example (because we introduced different namelist/stream names defaults in alpha8).

@pwolfram
Copy link
Contributor Author

pwolfram commented Oct 6, 2016

@milenaveneziani and @xylar, #26 and #27 are partial parts of the code we need to handle this issue, specifically 1 and 2. I think variable names is much trickier and am envisioning that we use an approach that is something like the following:

We store a list of different possible variable names for each given quantity we plot. All variables with the mappings could be stored as a dictionary in an easy to edit file. Essentially, there would be a key word in the plotting script, e.g., sst that is the key in the dictionary. This key could be mapped to data from a dataset where a list of possible variable names, e.g., ['time_avg_avgValueWithinOceanRegion_avgSurfaceTemperature', ...] is iterated over for the dataset until the correct variable is found. Once this happens, this data is returned. This would be done with a function that could replace code like

SSTregions = ds.time_avg_avgValueWithinOceanRegion_avgSurfaceTemperature

with

SSTregions = get_dataset_data_from_generalized_name(generalvariablenames, 'regionalavgsst', ds)

noting that generalvariablenames['regionavgsst'] = ['time_avg_avgValueWithinOceanRegion_avgSurfaceTemperature', 'avgSurfaceTemperature'] is specified in a config file via a dictionary.

@xylar
Copy link
Collaborator

xylar commented Oct 21, 2016

@pwolfram wrote:

We store a list of different possible variable names for each given quantity we plot.

Let me make sure I understand the purpose here. Are we wanting to support backward compatibility with various ACME versions where variables have been renamed? If so, I think @pwolfram's dictionary idea might work okay. It seems to me that there is always a unique, correct variable name for each ACME version corresponding to a given variable name in the current ACME version. So if we have a way of knowing the ACME version, we should be able to map a current ACME variable name uniquely to its equivalent in previous ACME versions. This would be somewhat better than just having a list of possible variable names and picking whichever one you found first, since there is always the chance that the new name of one variable happens to be the old name of another.

Mapping variable names between ACME versions is going to be a bit of a pain. How important is this capability? How often are we going to want to use the latest MPAS-Analysis on old versions of ACME? I just want to make sure the effort is in proportion to the need.

@pwolfram
Copy link
Contributor Author

@xylar:

Let me make sure I understand the purpose here. Are we wanting to support backward compatibility with various ACME versions where variables have been renamed?

My understanding is 'yes' to this query, would you agree @milenaveneziani?

I'm also guessing that once we settle out of alpha / beta versions the need to support multiple versions of ACME will not be as great for the long term. I think it is ok to "break" as ACME is incremented from major version numbers, e.g., 1.0 to 2.0 etc. However, I'm not sure if everyone agrees with this assessment and it would be good to get feedback from @jonbob who has much better experience than I on how this works in both theory and practice.

@milenaveneziani
Copy link
Collaborator

@xylar, @pwolfram: yes, we do want some level of backward compatibility, especially now that we are still a moving target. Right now for example, I can already see 3 namelist/stream names versions: pre-alpha8, alpha8, and alpha9. We need to be able to support all these versions for now.
I figure when things will be more stable (especially as far as basic MPAS analysis members go), we will be less likely to break, and at that point we can decide whether we can declare the very old versions 'unsupported' and focus on the newer changes.

@xylar
Copy link
Collaborator

xylar commented Oct 25, 2016

@pwolfram and @milenaveneziani, now that #27 has been merged, what are the next steps for addressing this issue? Can we create a checklist? I'd be happy to do some of the work if we can break this into smaller tasks that are easier to wrap my head around.

@pwolfram
Copy link
Contributor Author

@xylar, I've started a checklist at the bottom of the description for the issue above. @milenaveneziani, can you please review these and edit them if necessary?

@pwolfram
Copy link
Contributor Author

@xylar and @milenaveneziani, it is actually cleaner to have this in a separate checklist: #32

@xylar
Copy link
Collaborator

xylar commented Nov 29, 2016

@milenaveneziani and @pwolfram, this issue is one I'm willing to work on as I have time in the next week or two. The logistics of mapping variable names, while tedious, seems relatively straightforward to me. What isn't so clear is how to identify which versions of each component are being used with a given analysis and, equally important, what the possible options are.

For this reason, I'm coming around to @pwolfram's idea of just have a list of possible names for each variable and try each (presumably in arbitrary order or perhaps in the order given in the config file). This approach would work as long as there aren't conflicts where (as I already mentioned) the new name of one variable is the old name of another. Perhaps this situation is too unlikely to be worth considering. This approach has the clear advantage of being easy to extend to new versions without much extra thought.

Another approach would be to have a section in the config file for each version of MPAS-O (and one for each version of MPAS-SI) that we want to support with some kind of an identifying name (perhaps something like ACME_alpha7 or v5.1). This section would give the variable mapping for that version. We would need to add the MPAS-O and MPAS-SI version names to the PreAndPostProcessingScripts repo somehow, e.g. by having an ACME version name that could be used to select the corresponding MPAS-O and MPAS-SI version.

My tendency is to first try implementing the first ("quick-and-dirty") approach that doesn't care about versions and to move to the more complex mapping that knows about version numbers if/when that seems necessary. I'll try to write this so we can switch to the second approach in the future if needed without needing to make modifications to the analysis scripts themselves.

Let me know if that all sounds okay to you two.

@milenaveneziani
Copy link
Collaborator

milenaveneziani commented Nov 29, 2016

Thanks @xylar.
At this point, I am still not sure what the best approach will be. But let me write down exactly where the problem arises going from alpha6 to beta0 and eventually beta, so perhaps we will have a better perspective:

  1. our priority metrics (Tier 1) are based on monthly averaged data (or climatologies computed from monthly averages);
  2. therefore, the main issue right now is to support different versions of the timeSeriesStats AM (for both MPAS-O and MPAS-SI) going from alpha6 to beta. The main changes from alpha6 to beta0 involve MPAS-O only: a) AM name changes and b) variable names change. a) is important to remember for the bit of code that looks into the streams for the output filename.. (different, but related issue);
  3. then, when going from beta0 to beta (and hopefully just v1), the changes in b) (only, I believe) will be found for the MPAS-SI timeSeriesStats AM as well.

I frankly do not believe that we will change variable names that easily in the future, especially for the core MPAS variables, so I do not think we should think of supporting very complicated scenarios.
Seems like a list of options could be sufficient. Something like:
timeSeriesStatsAMNames=['timeSeriesStats','timeSeriesStatsMonthly']
timeSeriesStatsVarNames_head=['time_avg','timeMonthly_avg','timeSeriesStatsMonthly_avg']
timeSeriesStatsVarNames_tail=['','_1']

Not sure if this is the best approach, but basically this is the extent of the problem right now.

@xylar
Copy link
Collaborator

xylar commented Nov 29, 2016

@milenaveneziani, that's very helpful, especially the bit about timeSeriesStats vs. timeSeriesStatsMonthly.

I don't know that I want to support the "head" and "tail" concept you suggested. If I make a complete list of vrariables that would fit this combination, could you let me know which actually occur in one version or another? I suspect not all 6 combinations actually occur.

[VarNames]
timeSeriesStatsAMNames = [timeSeriesStats, timeSeriesStatsMonthly]
timeSeriesStatsVarNames = [time_avg, time_avg_1, timeMonthly_avg, timeMonthly_avg_1, timeSeriesStatsMonthly_av, timeSeriesStatsMonthly_avg_1]

@xylar
Copy link
Collaborator

xylar commented Nov 29, 2016

Also, I didn't realize the problem as it currently exists is limited to just these 2 names. I'll try not to make this more complicated than it needs to be.

@xylar
Copy link
Collaborator

xylar commented Nov 29, 2016

Ah, I see. Things are a bit more complicated than what I wrote above. There are various specific variables that go between the prefix and suffix you listed above. So we would either need to use the prefix and suffix approach you suggested or do something more tedious like the following:

[VarNames]
timeSeriesStatsAMNames = [timeSeriesStats, timeSeriesStatsMonthly]
avgSurfaceTemperature = [time_avg_avgValueWithinOceanRegion_avgSurfaceTemperature, ...]
avgLayerTemperature = [time_avg_avgValueWithinOceanLayerRegion_avgLayerTemperature, ...]
sumLayerMaskValue = [time_avg_avgValueWithinOceanLayerRegion_sumLayerMaskValue, ...]
avgLayerArea = [time_avg_avgValueWithinOceanLayerRegion_avgLayerArea, ...]
...

While this approach seems tedious, I think it will be more robust in the long run. Not all changes in variable names will necessarily be restricted to a prefix and suffix.

@milenaveneziani, for one of the variables I listed above, could you fill in all the possible variable names we want to support (for now)? Then, I should be able to make sure the config file has the right list for all variables based on the example you give me.

@milenaveneziani
Copy link
Collaborator

milenaveneziani commented Nov 29, 2016

About heads and tails: 1) MPAS-SI never used the '_1'; 2) I am not 100% sure, but I think that MPAS-O still uses it in beta, but only if you specified more than 1 kind of monthly (or annual) average (this option makes more sense for the CUSTOM instance in my mind, so that's why I am not sure whether it is still valid for the current timeSeriesStats AM). So mostly, in beta, we will have ' ' as tail.

All the variables we use in our script are of the kind:
'timeSeriesStatsVarName_head' + 'mainMPASVarName' + 'timeSeriesStatsVarName_tail'
where mainMPASVarName is either the AM varname or the main MPAS varname (in most cases the former).

@xylar
Copy link
Collaborator

xylar commented Nov 29, 2016

@milenaveneziani, for one of the variables I listed above, could you fill in all the possible variable names we want to support (for now)? Then, I should be able to make sure the config file has the right list for all variables based on the example you give me.

@milenaveneziani, alternatively, could you point me to run output for each version from alpha6 to beta1 (or whatever the current name is) so I have example data for each version we want to be supporting?

@milenaveneziani
Copy link
Collaborator

We crossed messages..
Here is an example for MPAS-O:
avgSurfaceTemperature = [time_avg_avgValueWithinOceanRegion_avgSurfaceTemperature, time_avg_avgValueWithinOceanRegion_avgSurfaceTemperature_1, timeMonthly_avg_avgValueWithinOceanRegion_avgSurfaceTemperature]

for MPAS-SI:
['timeSeriesStatsMonthly_avg_iceAreaCell_1','timeMonthly_avg_iceAreaCell']
(@akturner: could you please confirm that this will be the case in MPAS-SI beta?)

And apologies: this statement was not true: > 1) MPAS-SI never used the '_1'

@xylar
Copy link
Collaborator

xylar commented Nov 29, 2016

I don't think @akturner will get messages related to this repo because he's not a participant.

@akturner
Copy link
Contributor

@xylar @milenaveneziani: Hi, I think I still get them if I'm explicitly named. The latest version on cice/develop uses the latest ocean/develop version of the time series stats so will have the exact same naming convention as the ocean. This current beta I think has the original sea ice version of time series stats since the upgrade didnt work in Jon's testing so we deleted the commit that brought it in.

@milenaveneziani
Copy link
Collaborator

Yes, so, for example, 'timeSeriesStatsMonthly_avg_iceAreaCell_1' is used up to beta0, and 'timeMonthly_avg_iceAreaCell' will be used in the future. Thanks Adrian.

@akturner
Copy link
Contributor

Yes, up to and including beta0 is my understanding.

@pwolfram
Copy link
Contributor Author

@xylar, my vote is to keep this as simple as possible. As far as I know, we don't have a robust way to handle specific version numbers. If we did, we could just keep track of specific lists of variables corresponding to each version number. In absence of that I think the cleanest approach is to have an ordered list of variables corresponding to each quantity as its name has evolved from version to version. However, this is obviously fragile because all the variable names for each specific version must be known and it is possible a mistake could be made in the ordering, specifically accidental overlooking of a rename. The big picture problem is that we will have to update specific names here in the analysis as the version numbers change and if we forget to do that an error will arise.

I think we can hedge our bets a little by having small test files corresponding to each version so that the list can be tested on essentially a minimalist, real output file.

@milenaveneziani
Copy link
Collaborator

If, instead of providing the full list of variables, we only list the affix and suffix parts (maybe one list for MPAS-O and one for MPAS-SI), couldn't we have a double loop that checks whether
affix+mainVarname+suffix exists?
I admit this is not too pretty, but perhaps more robust against the likelihood of missing some variables?
just my naive perspective here :)

@pwolfram
Copy link
Contributor Author

@milenaveneziani, I agree we must check to see if the variable name exists. Is there such a clean delineation for the variable name in terms of pre/suf-fixes? In general this may not necessarily be true and I'm think it is just clearer and simpler to keep track of the whole name in essentially a header file but this is obviously only a minor point. However, if we do have pre/suff-fix then this removes a lot of ambiguity but at the cost of potentially requiring an exception.

@xylar
Copy link
Collaborator

xylar commented Nov 29, 2016

I agree with @pwolfram (this is the point I was making earlier) that we don't want to restrict ourselves to name changes that are prefixes or suffixes. Doing so would make the code simpler but at a potentially high cost. As long as we have config files with appropriate lists of potential names, the only tedious step is creating and maintaining these lists as variable names change. If mistakes are made, the result will be that no matching variable is found, in which case I will make sure the code raises an exception. This should make the error easy to diagnose and fix.

@xylar
Copy link
Collaborator

xylar commented Nov 29, 2016

I'm think it is just clearer and simpler to keep track of the whole name in essentially a header file

I think this should just be a section in the config file, rather than in a separate header file.

@pwolfram
Copy link
Contributor Author

I think this should just be a section in the config file, rather than in a separate header file.

@xylar this will make it easier for users to modify if necessary but won't require updates to be submitted as PRs for inclusion. Hence, this is potential a double-edge sword. One way around this is to use a separate config file for "advanced" use that could easily be made into a PR if necessary but not overburden an average user with the option of trying to figure out which name should be used.

@xylar
Copy link
Collaborator

xylar commented Nov 29, 2016

@pwolfram, there is a default config file in the repo. It is not the expectation that the average user would edit parts of that config file they do not understand or need to know about, but they do need to include all the required fields. The new section with a list of valid variable names would be no different.

However, I get your point. If a user wanted to do analysis on a new ACME or MPAS-O/MPAS-SI version, their old config file might not work correctly because the list of variable names they have will be incomplete.

Nevertheless, the right way of handling this issue does seem to be with a config file. Could we add one to the repo for this specific purpose and ingest it automatically in addition to whichever config file is supplied on the command line? We had a requirement that all configuration would be done via a single file, but this seems like a case that lies outside that requirement.

@pwolfram
Copy link
Contributor Author

@xylar, I would think we could have a special one for this use case. The requirement of a single file was made to ensure that the average user would have an easy time using the analysis package. This is completely within that same spirit. In this case, the config file is basically used to replace a python-style header file with hard-coded dictionaries and it really isn't intended to be modified by users. Perhaps we could hide this inside a subdirectory to provide an additional layer of separation because it is essentially syntactic sugar for our use case. This keeps a single config file a the top level for user editing but provides a "hidden" config file as the mechanism for specifying the variable name changes.

@milenaveneziani, do you see downside to having a dedicated config file to handle the variable name changes?

@milenaveneziani
Copy link
Collaborator

@pwolfram, @xylar: would we have to add it to the command line to make things work?
In other words, would we run 'pyton run_analysis.py configfile1 configfile2'?

@xylar
Copy link
Collaborator

xylar commented Nov 29, 2016

@milenaveneziani, no, we wouldn't need to do that. Instead, it would just be applended to the list of config files within run_analysis.py

@milenaveneziani
Copy link
Collaborator

@pwolfram: I re-read your last comment more carefully and I think I answered my own questions (no).
sounds good to me then.

@xylar
Copy link
Collaborator

xylar commented Nov 29, 2016

@pwolfram, I'm not familiar with python header files. I am open to that solution if you want to explain a bit more what that might look like.

@milenaveneziani
Copy link
Collaborator

I like this idea:

Perhaps we could hide this inside a subdirectory to provide an additional layer of separation because it is essentially syntactic sugar for our use case. This keeps a single config file a the top level for user editing but provides a "hidden" config file as the mechanism for specifying the variable name changes.

@pwolfram
Copy link
Contributor Author

@xylar,

@pwolfram, I'm not familiar with python header files. I am open to that solution if you want to explain a bit more what that might look like.

Sorry about the confusion here. I used this very loosely, but meant that we would basically have a module that was a list of ordered dictionaries storing the variable definitions that we could essentially import as desired. Conceptual stuff like that would logically be stored in a C or C++ header file, hence why I called it a "header" file.

@xylar
Copy link
Collaborator

xylar commented Dec 4, 2016

@pwolfram and @milenaveneziani, so I have got a solution for this whole issue that I'm working on debugging. But I just rediscovered that config files don't preserve the case of the options. (I knew this already from my POPSICLES work but it somehow never occurred to me that this would be a major problem for storing the dictionaries we need in a config file.) So I'm going with @pwolfram's solution of having a python file that defines the dictionaries. Maybe that's better anyway. At the moment, I'm failing to see the downside, so I'm wondering why I didn't do it like that from the start.

@pwolfram
Copy link
Contributor Author

pwolfram commented Dec 5, 2016

@xylar, if I recall correctly, the decision to use config files was for general consistency using the general design heuristic of keeping things as consistent and as simple as possible. It is unfortunate, however, that config files don't preserve case. There could be a solution via a subclass of https://docs.python.org/2/library/configparser.html. However, I agree the best solution is probably just to have a python file with the dictionaries to keep this as simple and as clear as possible. It is unfortunate and maybe even a design flaw that config files drop case information...

@xylar
Copy link
Collaborator

xylar commented Dec 5, 2016

@pwolfram, at this point I don't think a version of ConfigParser is the right tool to handle this problem. While I don't love having dictionaries that live outside of any functions or scripts, I don't see any issues with that approach that aren't purely aesthetic. The current approach has the advantage of simplicity over the ConfigParser solution I originally had in mind. It makes more sense for something that we don't expect to be edited by an analysis user.

@milenaveneziani
Copy link
Collaborator

My thinking is that this issue could be closed, since #52 is fulfilled. We do have the list of tasks in #32, which include one or two of the features that I mentioned when this issue was created (and which are still to-do items), so we should be covered.

@pwolfram
Copy link
Contributor Author

I concur @milenaveneziani that we have the functionality to meet the needs to fulfill this issue. There may be some tweaking required but I'm going to close this for now. @xylar, please reopen if you disagree.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants