-
Notifications
You must be signed in to change notification settings - Fork 52
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Supports ingesting arbitrary MPAS output files (in general, input info from namelist and streams files) #20
Comments
@pwolfram: this definitely needs to be prioritized, and made more general. For example, right now, our scripts will fail on ACME alpha8 output because we have changed the filename_templates in our default streams there, and the python scripts are not taking this change into account. Can we discuss what would be the best way to solve this? At what level do you check for namelists and streams: in the run_analysis.py? |
@milenaveneziani and @pwolfram, this is another thing that I could help with once I'm back in 2 weeks. I'll check with you then to see where things stand. |
@milenaveneziani, as far as I can tell, the main problems and potential solutions are:
I'm a little unclear about what does / doesn't need changed or validated for mpas_xarray as well as what we need to do for regional computations. Additional details would be really helpful. Once we have a full list of the problems I think we should be able to design some simple and clean solutions to the problems via a module that contains code that should streamline this process, to probably be put in I think we should be able to assume that namelist / streams are of proper format. The key things is their location and this is going to be one of the first things that is potentially checked, probably right about here after we check to see if the model directory exists: https://github.com/MPAS-Dev/MPAS-Analysis/blob/master/run_analysis.py#L21. If the namelist and streams files don't exist I'm thinking we issue a warning and just go with some set of defaults, recognizing the analysis may fail because additional information is needed but not provided (streams/namelist). |
@pwolfram, @xylar: here are the specific things I can think of right now:
|
Of the list above, we need to prioritize 1-3, because at the moment the scripts will break if running on ACME alpha8 output, for example (because we introduced different namelist/stream names defaults in alpha8). |
@milenaveneziani and @xylar, #26 and #27 are partial parts of the code we need to handle this issue, specifically 1 and 2. I think variable names is much trickier and am envisioning that we use an approach that is something like the following: We store a list of different possible variable names for each given quantity we plot. All variables with the mappings could be stored as a dictionary in an easy to edit file. Essentially, there would be a key word in the plotting script, e.g.,
with
noting that |
@pwolfram wrote:
Let me make sure I understand the purpose here. Are we wanting to support backward compatibility with various ACME versions where variables have been renamed? If so, I think @pwolfram's dictionary idea might work okay. It seems to me that there is always a unique, correct variable name for each ACME version corresponding to a given variable name in the current ACME version. So if we have a way of knowing the ACME version, we should be able to map a current ACME variable name uniquely to its equivalent in previous ACME versions. This would be somewhat better than just having a list of possible variable names and picking whichever one you found first, since there is always the chance that the new name of one variable happens to be the old name of another. Mapping variable names between ACME versions is going to be a bit of a pain. How important is this capability? How often are we going to want to use the latest MPAS-Analysis on old versions of ACME? I just want to make sure the effort is in proportion to the need. |
My understanding is 'yes' to this query, would you agree @milenaveneziani? I'm also guessing that once we settle out of alpha / beta versions the need to support multiple versions of ACME will not be as great for the long term. I think it is ok to "break" as ACME is incremented from major version numbers, e.g., 1.0 to 2.0 etc. However, I'm not sure if everyone agrees with this assessment and it would be good to get feedback from @jonbob who has much better experience than I on how this works in both theory and practice. |
@xylar, @pwolfram: yes, we do want some level of backward compatibility, especially now that we are still a moving target. Right now for example, I can already see 3 namelist/stream names versions: pre-alpha8, alpha8, and alpha9. We need to be able to support all these versions for now. |
@pwolfram and @milenaveneziani, now that #27 has been merged, what are the next steps for addressing this issue? Can we create a checklist? I'd be happy to do some of the work if we can break this into smaller tasks that are easier to wrap my head around. |
@xylar, I've started a checklist at the bottom of the description for the issue above. @milenaveneziani, can you please review these and edit them if necessary? |
@xylar and @milenaveneziani, it is actually cleaner to have this in a separate checklist: #32 |
@milenaveneziani and @pwolfram, this issue is one I'm willing to work on as I have time in the next week or two. The logistics of mapping variable names, while tedious, seems relatively straightforward to me. What isn't so clear is how to identify which versions of each component are being used with a given analysis and, equally important, what the possible options are. For this reason, I'm coming around to @pwolfram's idea of just have a list of possible names for each variable and try each (presumably in arbitrary order or perhaps in the order given in the config file). This approach would work as long as there aren't conflicts where (as I already mentioned) the new name of one variable is the old name of another. Perhaps this situation is too unlikely to be worth considering. This approach has the clear advantage of being easy to extend to new versions without much extra thought. Another approach would be to have a section in the config file for each version of MPAS-O (and one for each version of MPAS-SI) that we want to support with some kind of an identifying name (perhaps something like My tendency is to first try implementing the first ("quick-and-dirty") approach that doesn't care about versions and to move to the more complex mapping that knows about version numbers if/when that seems necessary. I'll try to write this so we can switch to the second approach in the future if needed without needing to make modifications to the analysis scripts themselves. Let me know if that all sounds okay to you two. |
Thanks @xylar.
I frankly do not believe that we will change variable names that easily in the future, especially for the core MPAS variables, so I do not think we should think of supporting very complicated scenarios. Not sure if this is the best approach, but basically this is the extent of the problem right now. |
@milenaveneziani, that's very helpful, especially the bit about I don't know that I want to support the "head" and "tail" concept you suggested. If I make a complete list of vrariables that would fit this combination, could you let me know which actually occur in one version or another? I suspect not all 6 combinations actually occur.
|
Also, I didn't realize the problem as it currently exists is limited to just these 2 names. I'll try not to make this more complicated than it needs to be. |
Ah, I see. Things are a bit more complicated than what I wrote above. There are various specific variables that go between the prefix and suffix you listed above. So we would either need to use the prefix and suffix approach you suggested or do something more tedious like the following:
While this approach seems tedious, I think it will be more robust in the long run. Not all changes in variable names will necessarily be restricted to a prefix and suffix. @milenaveneziani, for one of the variables I listed above, could you fill in all the possible variable names we want to support (for now)? Then, I should be able to make sure the config file has the right list for all variables based on the example you give me. |
About heads and tails: All the variables we use in our script are of the kind: |
@milenaveneziani, alternatively, could you point me to run output for each version from alpha6 to beta1 (or whatever the current name is) so I have example data for each version we want to be supporting? |
We crossed messages.. for MPAS-SI: And apologies: this statement was not true: > 1) MPAS-SI never used the '_1' |
I don't think @akturner will get messages related to this repo because he's not a participant. |
@xylar @milenaveneziani: Hi, I think I still get them if I'm explicitly named. The latest version on cice/develop uses the latest ocean/develop version of the time series stats so will have the exact same naming convention as the ocean. This current beta I think has the original sea ice version of time series stats since the upgrade didnt work in Jon's testing so we deleted the commit that brought it in. |
Yes, so, for example, 'timeSeriesStatsMonthly_avg_iceAreaCell_1' is used up to beta0, and 'timeMonthly_avg_iceAreaCell' will be used in the future. Thanks Adrian. |
Yes, up to and including beta0 is my understanding. |
@xylar, my vote is to keep this as simple as possible. As far as I know, we don't have a robust way to handle specific version numbers. If we did, we could just keep track of specific lists of variables corresponding to each version number. In absence of that I think the cleanest approach is to have an ordered list of variables corresponding to each quantity as its name has evolved from version to version. However, this is obviously fragile because all the variable names for each specific version must be known and it is possible a mistake could be made in the ordering, specifically accidental overlooking of a rename. The big picture problem is that we will have to update specific names here in the analysis as the version numbers change and if we forget to do that an error will arise. I think we can hedge our bets a little by having small test files corresponding to each version so that the list can be tested on essentially a minimalist, real output file. |
If, instead of providing the full list of variables, we only list the affix and suffix parts (maybe one list for MPAS-O and one for MPAS-SI), couldn't we have a double loop that checks whether |
@milenaveneziani, I agree we must check to see if the variable name exists. Is there such a clean delineation for the variable name in terms of pre/suf-fixes? In general this may not necessarily be true and I'm think it is just clearer and simpler to keep track of the whole name in essentially a header file but this is obviously only a minor point. However, if we do have pre/suff-fix then this removes a lot of ambiguity but at the cost of potentially requiring an exception. |
I agree with @pwolfram (this is the point I was making earlier) that we don't want to restrict ourselves to name changes that are prefixes or suffixes. Doing so would make the code simpler but at a potentially high cost. As long as we have config files with appropriate lists of potential names, the only tedious step is creating and maintaining these lists as variable names change. If mistakes are made, the result will be that no matching variable is found, in which case I will make sure the code raises an exception. This should make the error easy to diagnose and fix. |
I think this should just be a section in the config file, rather than in a separate header file. |
@xylar this will make it easier for users to modify if necessary but won't require updates to be submitted as PRs for inclusion. Hence, this is potential a double-edge sword. One way around this is to use a separate config file for "advanced" use that could easily be made into a PR if necessary but not overburden an average user with the option of trying to figure out which name should be used. |
@pwolfram, there is a default config file in the repo. It is not the expectation that the average user would edit parts of that config file they do not understand or need to know about, but they do need to include all the required fields. The new section with a list of valid variable names would be no different. However, I get your point. If a user wanted to do analysis on a new ACME or MPAS-O/MPAS-SI version, their old config file might not work correctly because the list of variable names they have will be incomplete. Nevertheless, the right way of handling this issue does seem to be with a config file. Could we add one to the repo for this specific purpose and ingest it automatically in addition to whichever config file is supplied on the command line? We had a requirement that all configuration would be done via a single file, but this seems like a case that lies outside that requirement. |
@xylar, I would think we could have a special one for this use case. The requirement of a single file was made to ensure that the average user would have an easy time using the analysis package. This is completely within that same spirit. In this case, the config file is basically used to replace a python-style header file with hard-coded dictionaries and it really isn't intended to be modified by users. Perhaps we could hide this inside a subdirectory to provide an additional layer of separation because it is essentially syntactic sugar for our use case. This keeps a single config file a the top level for user editing but provides a "hidden" config file as the mechanism for specifying the variable name changes. @milenaveneziani, do you see downside to having a dedicated config file to handle the variable name changes? |
@milenaveneziani, no, we wouldn't need to do that. Instead, it would just be applended to the list of config files within |
@pwolfram: I re-read your last comment more carefully and I think I answered my own questions (no). |
@pwolfram, I'm not familiar with python header files. I am open to that solution if you want to explain a bit more what that might look like. |
I like this idea:
|
Sorry about the confusion here. I used this very loosely, but meant that we would basically have a module that was a list of ordered dictionaries storing the variable definitions that we could essentially import as desired. Conceptual stuff like that would logically be stored in a C or C++ header file, hence why I called it a "header" file. |
@pwolfram and @milenaveneziani, so I have got a solution for this whole issue that I'm working on debugging. But I just rediscovered that config files don't preserve the case of the options. (I knew this already from my POPSICLES work but it somehow never occurred to me that this would be a major problem for storing the dictionaries we need in a config file.) So I'm going with @pwolfram's solution of having a python file that defines the dictionaries. Maybe that's better anyway. At the moment, I'm failing to see the downside, so I'm wondering why I didn't do it like that from the start. |
@xylar, if I recall correctly, the decision to use config files was for general consistency using the general design heuristic of keeping things as consistent and as simple as possible. It is unfortunate, however, that config files don't preserve case. There could be a solution via a subclass of https://docs.python.org/2/library/configparser.html. However, I agree the best solution is probably just to have a python file with the dictionaries to keep this as simple and as clear as possible. It is unfortunate and maybe even a design flaw that config files drop case information... |
@pwolfram, at this point I don't think a version of ConfigParser is the right tool to handle this problem. While I don't love having dictionaries that live outside of any functions or scripts, I don't see any issues with that approach that aren't purely aesthetic. The current approach has the advantage of simplicity over the ConfigParser solution I originally had in mind. It makes more sense for something that we don't expect to be edited by an analysis user. |
I concur @milenaveneziani that we have the functionality to meet the needs to fulfill this issue. There may be some tweaking required but I'm going to close this for now. @xylar, please reopen if you disagree. |
The MPAS analysis framework is expected to support stand-alone MPAS workflows, as well as ACME workflows. As such, the analysis framework should be capable of ingesting MPAS output, regardless of how it was generated.
NOTE: This could imply that an input to the analysis framework should be the namelist and streams file used to generate the output, which can then be parsed to easily understand what output was generated, and where it should exist.
The text was updated successfully, but these errors were encountered: