Replies: 3 comments
-
From: "Taylor, Karl E."

Hi Martin, Paul, and Matt (and relying again on Matt to deliver the mail to Martin for me),

I’ve responded to Martin’s comments below. I don’t think we’re too far apart in our thinking, and I look forward to any further comments/suggestions you have.

Best regards,

(A) MJ: First on "realms": we dropped this from the initial call for data request submissions in CMIP6 and got very strong feedback that the community needed it, so I doubt they will be happy with losing it now. I expect we will have to define the realm attribute, whether or not it is in the branded name. Variable names are not objective, so I don't see that the absence of a fully objective method of assigning realm is a decisive argument here. As you note later, variables will have realms assigned. The question here is whether we use it or not.

KET responds: Regarding “the community needed it”, do you recall for what purpose? Does “realm” need to be in the branded name to fulfill the community’s needs?

KET responds: Regarding “doubt they will be happy with losing it”, I don’t propose doing away with the “realm” (or “modeling_realm”) attribute. I suggest removing it from the variable label. For many variables, I think “realm” should be assigned multiple values (as in CMIP6). The “realm” attribute could be useful for multiple purposes including: 1) identifying variables that are produced by a particular model component (atmosphere, land, sea ice, etc.), and 2) identifying variables that might be of special interest to a particular “community” (and used in filtering searches).

MJ: It is clear that the realm is often superfluous for identification. Tables such as Omon and CFmon combine variables from multiple realms which are uniquely identified by the name and table. The same is true of the abbreviated area types which you are proposing. Often superfluous, but one or the other of these categories is needed to have a robust system which allows flexibility.
It would be helpful, I think, to have a clear statement of what the branding is aiming to achieve. Such as:
KET responds: Regarding “area types have clear links to realms”, it’s true that you could assign each area type to one of a few realms. For example, both of the area types “sea” and “ice_free_sea” could be assigned to a single realm=“sea”, but then you would need two different variable roots to distinguish surface_temperature “where sea” from surface_temperature “where ice_free_sea”. I think it is cleaner to have a single variable root (“ts”) identify surface_temperature no matter where, and then use area type branding to distinguish between “sea” and “ice_free_sea”.

KET responds: Regarding “it [the ‘where’ element] also gives a systematic approach to identify a realm”, I originally thought along those lines, but when I tried to implement such an approach I failed. I think the “realm”, as we’ve defined it in the past, has been assigned considering two primary aspects: what model component was responsible for calculating the variable, and what users might associate with the variable (in filtering searches). The first decision to make is how many realms should be defined. For CMIP5 and CMIP6 we had eight: atmos, ocean, land, landIce, seaIce, aerosol, atmosChem, and ocnBgchem. One could argue for a finer granularity, for example, “land” split between “soil” and “vegetation” or simply detaching “landBgchem” from “land”. However these categories are defined, determination of the realm(s) associated with any variable can be difficult. Any variable defining a “flux” at an interface between two reservoirs (e.g., between atmosphere and ocean) might be assigned with equal justification to either of the two. We generally assigned these variables to both categories, but our determination of which category should be primary was somewhat arbitrary. Similarly, certain variables are clearly of interest to multiple realms.
Methane concentration in the atmosphere could be included in both “atmosChem” (because it impacts chemical reactions) and “atmos” (because it impacts radiative transfer). Which should be the primary realm for methane concentration? It was questions like these that led me to the much simpler and objective approach of defining the areaLabelDD for the purposes of uniquely identifying variables. We can define the realm for other purposes.

(B) MJ: I was suggesting "s", rather than "zs", but I think your comments apply equally. I wouldn't want to use either "z0" or "s" for the base of the thermocline or tropopause. Both of these are surfaces defined by a transition in stratification. We could either look for a letter or letters to stand for a stratification vertical coordinate, or use specific identifiers such as "tp" and "tc".

KET responds: Regarding the above, I was attempting to do the simplest thing here, so that the verticalLabelDD could be automatically generated by just determining if a vertical coordinate was or was not defined for a variable. If no vertical coordinate were defined, then “z0” would be assigned (but I’m open to suggestion for a different text string). [The “z0” was supposed to be analogous to z1, z2, z3, etc., which indicate reporting at 1, 2, and 3 depths in the vertical, respectively; z0 would indicate that there was no explicitly defined depth (i.e., no vertical coordinate).] Distinguishing among all the individual types of variables that don’t have vertical coordinates leads to an algorithm that is very (hopelessly?) complicated.
I note that among the various types of variables we need to account for are: fluxes at the top of the atmosphere, quantities at the interface between the atmosphere and the medium below (i.e., the “surface”), quantities at the interface between the ocean surface and sea ice, quantities at the interface between the land ice and the medium below it (land or sea), quantities at the interface between snow and the medium below, the “tropopause”, the “base of the thermocline”, at cloud top, at cloud base, at mean sea level, integrals through a column of the atmosphere, ocean, sea ice, and many, many more. (Note that not all of these surfaces coincide with “a transition in stratification”.) Each of these would need a different verticalLabelDD, which no one would likely be able to remember. I think I’d like to stick with “z0”, or some other generic label, that simply means “no vertical coordinate”. We might, for example, use “x” to indicate this, rather than “z0”, because “x” might be interpreted and remembered as “vertical coordinate excluded”.

MJ: Is "h010" easier to parse than "10m"? I'm not sure. I suppose your proposal can be parsed with a shorter piece of code, but I think the difference is marginal. For humans, however, there is a big difference in legibility.

KET responds: Regarding the options for representing a field reported on a single coordinate surface in the vertical, I was concerned that an ocean variable requested on a particular density surface would need a rather long label: for example, the density surface “1026 kg m-3” could be written “1026kgm-3”, but the hyphen might be problematic and the normal unit for this is “sigma-t” (density in kg m-3 - 1000 kg m-3), which I would have trouble abbreviating. Thus, I was led to standardizing the unit for each type of vertical coordinate, but omitting the unit from the label. Thus, instead of 1026kgm-3, we would have rho26. Perhaps that concern should not dominate our thinking here, though.
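For comparison, here is a rough sketch of how the two label styles under discussion might be parsed. It is only an illustration under stated assumptions: the function names and the two lookup tables are hypothetical, the sample labels are the ones quoted in this thread, and decimal points encoded as "p" are deliberately left out.

```python
import re

# Hypothetical prefix table: only these prefixes appear as examples in the thread.
PREFIX_KIND = {"h": "height", "p": "pressure", "rho": "density"}

# Hypothetical unit table for the unit-suffix style ("10m", "500hpa").
# A bare "m" cannot say whether the level is above or below the surface.
UNIT_KIND = {"m": "height or depth", "hpa": "pressure"}


def parse_prefixed(label):
    """Parse a prefix-style label such as 'h010' or 'p0500'.

    Returns (kind, value). 'z0' is treated as "no vertical coordinate".
    """
    if label == "z0":
        return ("none", None)
    match = re.fullmatch(r"([a-z]+)(\d+)", label)
    if match is None or match.group(1) not in PREFIX_KIND:
        raise ValueError(f"unrecognised vertical label: {label!r}")
    return (PREFIX_KIND[match.group(1)], int(match.group(2)))


def parse_with_unit(label):
    """Parse a unit-suffix label such as '10m' or '500hpa'."""
    match = re.fullmatch(r"(\d+)([a-z]+)", label)
    if match is None or match.group(2) not in UNIT_KIND:
        raise ValueError(f"unrecognised vertical label: {label!r}")
    return (UNIT_KIND[match.group(2)], int(match.group(1)))


for label in ("z0", "h010", "p0500", "rho26"):
    print(label, parse_prefixed(label))
for label in ("10m", "500hpa"):
    print(label, parse_with_unit(label))
```

Both parsers come out at about the same length here, which is consistent with the remark that the difference in parsing effort is marginal.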
By including the units, we might avoid having decimal points represented by “p” (under my scheme) and there would be no need for the “prefix” on the verticalLabelDD identifying what type of vertical coordinate we have (e.g., instead of p0500, we would have “500hpa” because “hpa” would imply a pressure coordinate). I note that if we adopt this alternative, “1 meter depth below the surface” and “1 meter height above the surface” would have the same label: “1m”. Could there be some case where this ambiguity would be a problem? If not, I think I now favor your approach here.

(C) MJ: OK, I think the 6 classes would be cleaner.

(D) MJ: OK .. perhaps we could offer two versions of the area type vocabulary for consultation: one short version and one expanded for legibility. I hope we can also offer an option of using the realms.

(E) MJ: OK, that sounds good,

From: Martin Juckes - STFC UKRI

Hi Karl,

(A) First on "realms": we dropped this from the initial call for data request submissions in CMIP6 and got very strong feedback that the community needed it, so I doubt they will be happy with losing it now. I expect we will have to define the realm attribute, whether or not it is in the branded name. Variable names are not objective, so I don't see that the absence of a fully objective method of assigning realm is a decisive argument here. As you note later, variables will have realms assigned. The question here is whether we use it or not.

It is clear that the realm is often superfluous for identification. Tables such as Omon and CFmon combine variables from multiple realms which are uniquely identified by the name and table. The same is true of the abbreviated area types which you are proposing. Often superfluous, but one or the other of these categories is needed to have a robust system which allows flexibility. It would be helpful, I think, to have a clear statement of what the branding is aiming to achieve. Such as:
Using the "where" element of the cell methods does give a systematic approach to identifying an area type suffix. However, as area types have clear links to realms, it also gives a systematic approach to identify a realm. I think realms win on intelligibility to humans and that requiring a dictionary to translate from area type abbreviations to area types will be a significant overhead for human and software parsers.

(B) I was suggesting "s", rather than "zs", but I think your comments apply equally. I wouldn't want to use either "z0" or "s" for the base of the thermocline or tropopause. Both of these are surfaces defined by a transition in stratification. We could either look for a letter or letters to stand for a stratification vertical coordinate, or use specific identifiers such as "tp" and "tc". Is "h010" easier to parse than "10m"? I'm not sure. I suppose your proposal can be parsed with a shorter piece of code, but I think the difference is marginal. For humans, however, there is a big difference in legibility.

(C) OK, I think the 6 classes would be cleaner.

(D) OK .. perhaps we could offer two versions of the area type vocabulary for consultation: one short version and one expanded for legibility. I hope we can also offer an option of using the realms.

(E) OK, that sounds good,

regards,

From: Taylor, Karl E.

Thanks for giving this your careful attention. I’ve responded, at least partially, below. As you can guess, I’ve devoted considerable thinking (and way too much time) to devising a reasonable, easy-to-apply methodology to uniquely define the CMIP6 variables. Even so, you have suggested a few modifications that are certainly worth considering. I am hopeful that with further iteration we can reach a consensus on the best way forward with a proposal that the modeling groups and users might also seriously consider and provide helpful feedback.
Best regards,

From: Martin Juckes - STFC UKRI

Hello Karl,

Thanks for the clarifications. I'm relieved that we are not necessarily considering renaming large numbers of variables. I was intrigued by your statement that using "--" would require "many more" variables to be renamed and took a closer look. By my reckoning there would be around 100 variables that needed to be adjusted. After looking a bit closer I've come to agree that something closer to your proposal would be better, especially because it would, as you have said, give more flexibility to accommodate an expected influx of new variables.

Among those 100 there are a few which appear to be there because of errors/inconsistencies in the specification of the primary realm in the data request: either variables masked by land which are specified as "atmos" or variables masked by snow which are specified as "landIce". The convention needs to be clarified here, but I believe it should be that when a global variable is masked in a way which restricts the data to be within a spatially defined realm (i.e. land, ocean, landice or seaice), the primary realm should be specified correspondingly.

KET: see comments under item (A) below.

MJ: Advantages of "realm" over "areaLabelDD":

KET response: I spent many days trying to find an objective, justifiable, and easy way to assign a primary realm and possibly some secondary realms to each variable. As you note, in the past “realms” helped group together variables that might be of particular interest to certain communities. But of course some variables are of interest to multiple communities, so there must be some arbitrariness in deciding which realm a variable belongs to (for example, surface temperature is of interest to those focusing on the atmosphere, but also to those studying any of the surface realms).
Or consider aerosol variables which could be considered to be in the “atmospheric physics” realm (affecting radiative transfer and cloud physics) and the “atmospheric chemistry” realm (affecting chemical reactions in the atmosphere). It can be difficult to decide which realm should be primary. KET response cont.: After failing to find an objective method by which a variable’s primary realm could invariably be determined, I realized that the “realm” is more often than not superfluous for unique identification of a variable. The realm is almost always determined by the standard name (e.g., air_temperature), which is already implied by the root name (already included in the branded variable label). So it is rarely necessary to additionally include the realm as a separate part of the branding suffix. What is needed for unique identification in some cases, however, is whether a variable has been sampled at all grid cells or only a subset of cells characterized by an area type that is specified in the “where” directive of a cell method. So it is the area type that is needed for unique identification, and that is why I have suggested using the area_types in the “where” directive for uniquely identifying variables. KET response cont.: That being said, I think it is essential that we continue to define for other purposes the realm or realms a variable belongs to. I think, for example, that a data request might want to group variables by realm (as we have in the past in distinguishing between Amon and Omon variables). Equally important, I think ESGF should include “realm” as a search facet so that users can filter the long list of variables to a subset that might be of most interest to them. I have not set down the rules for specifying the realm associated with each variable because I don’t think realm is needed to uniquely identify them. 
We will certainly need to do that in defining the full set of variable attributes, and in the spreadsheet you have access to (for CMIP6 variables), some of the hidden columns include realm designations determined through complex algorithms taking as input the standard_name and the cell_methods.

MJ: KET response: I think that your approach is probably easier for humans to understand, and my approach might be slightly easier for a computer to parse: if the first numerical digit encountered is 0, then the field is not a function of the vertical coordinate. The problem of substituting “zs” for “z0” is that we need to accommodate variables that represent a quasi-horizontal slice at say the “tropopause” or the “base of the thermocline”, which cannot be defined by a single coordinate value and obviously are not “surface” variables. Maybe you have an idea on how to represent these differently?

MJ: KET response: This, as you imply, is somewhat a judgement call. Those preparing data, I think, might like to know which variables represent time-means as opposed to some other more complex statistical measure like the maximum temperature within an interval, which generally would require a different sort of processing algorithm. That was the rationale for defining “tavg”, distinct from “tsum” or “tstat”. That being said, “tsum” is only needed for 3 of the CMIP6 variables, so perhaps it should be eliminated and be included in the “tstat” category. Thus, we would end up with 6 classes, with your “interval” expanded into “tavg” and “tstat” (or some other appropriate name).

MJ: KET response: Yes, we agree on this. We might also agree that moving forward we should not necessarily adopt all the practices of the past. We might construct the root names of new variables (without any established history) differently from the past.

MJ: KET response: “uax” is an interesting idea. Will think about it more.
MJ: KET response: Yes, the root name would contain information that would be redundant with its areaLabelDD. There are, I think, about 8 variable roots that currently describe the main properties of surface vegetation (veg. height, veg. carbon content, CO2 exchanges, gross primary productivity, net primary productivity, litter carbon content, soil carbon content, and respiration). For each of the variables in CMIP6, we define multiple root names to specify which vegetation type is being characterized (including grass, shrub, tree, crop, pasture, leaf, wood, c3veg, c4veg). In CMIP6 we had no “areaLabelDD”, and so had to include the type of vegetation in the root name. If we had defined an areaLabelDD, we could have reduced the number of root names pertaining to vegetation to about ¼ the original number. Moreover, users (and perhaps more importantly software) would be able to determine the vegetation type by explicit rules, rather than interpreting the root name.

KET response cont.: I do agree that 2-letter areaLabelDDs like “ng” may not be ideal. Perhaps these labels should be made longer (e.g., replace “ng” with “grass”). I also think we need to decide whether to modify the root names of past names that included the area type, or leave them unmodified, but recommend that future variables leave off the redundant information in the root because it will be recorded by areaLabelDD.

MJ: KET response: I agree that once the WIP has a solid proposal (perhaps with some different options also on the table), we should seek input from others.

regards,

From: Taylor, Karl E.

Thanks for reading the document so promptly and getting back to me. I agree that changing the root names of variables that have been in use for multiple phases of CMIP would be disruptive, and I think some of these at least should not, in fact, be changed.
(More radically, I would be happy to retain all the CMIP6 root names unchanged if that would make the proposal acceptable.) In the draft spreadsheet I was showing what could be done under the new branded naming convention, not that we would necessarily want to make all such changes.

I also wasn’t suggesting that we do away with tables. I was suggesting that the tables be constructed based on criteria independent of any use in uniquely identifying variables. So I don’t see that constructing unique names in a different way will impact the ease (or difficulty) MIPs have in constructing their data requests.

I would note that whether the identifying text strings used to uniquely define variables are included in the root or in a suffix should not be a show-stopper. If we are to move away from table names as a suffix, the identifying information the table names have conveyed in CMIP6 (regarding what a variable represents) must be moved either to the root or the suffix. This requirement means we must define labels to represent at least the frequency, the temporal sampling (to distinguish, for example, between “point” and time-mean variables), the vertical sampling (to distinguish, for example, between model-level and pressure-level variables), the “horizontal” grid (to distinguish, for example, between 2-d gridded data and zonal means), and the area restriction (to distinguish between a variable reported only over ice vs. one reported over land). Note that a branded variable consisting of only the root plus two modifiers (like “tas-atmos-mon”) would be insufficient to accommodate the CMIP6 variables unless many more new (and disruptive) root names were defined. Under my proposal, I have included all 5 of the short text-string modifiers in the suffix (choosing not to disrupt many of the current root names). With these labels defined, I think that moving forward there will be less of a need to define new root names.
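As an illustration of the root-plus-five-modifiers idea, the following minimal sketch composes and splits a branded label. The class name, field names, hyphen separator, and field order are assumptions, inferred only from the single example string "ta-mon-tavg-h02-hxy-x" quoted elsewhere in this thread; the proposal document itself may define them differently.

```python
from typing import NamedTuple


class BrandedVariable(NamedTuple):
    """Hypothetical container for a root name plus five sampling labels."""

    root: str        # e.g. "ta"
    frequency: str   # e.g. "mon"
    temporal: str    # e.g. "tavg" (time-mean)
    vertical: str    # e.g. "h02"
    horizontal: str  # e.g. "hxy"
    area: str        # e.g. "x"

    def label(self) -> str:
        # Assumed convention: fields joined by hyphens in declaration order.
        return "-".join(self)


def parse_branded(label: str) -> BrandedVariable:
    """Split a branded label back into its six parts."""
    parts = label.split("-")
    if len(parts) != len(BrandedVariable._fields):
        raise ValueError(f"expected 6 hyphen-separated parts, got {len(parts)}")
    return BrandedVariable(*parts)


var = BrandedVariable("ta", "mon", "tavg", "h02", "hxy", "x")
print(var.label())
print(parse_branded(var.label()).vertical)
```

A three-part string such as "tas-atmos-mon" is rejected by `parse_branded`, simply because this sketch expects all five modifiers to be present.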
To reiterate, we don’t have to modify existing root names to implement this new approach. Some existing root names could be simplified and shortened if the consensus is that doing so wouldn’t be too disruptive. For example, “ta700” could become “ta” since the pressure level would be recorded by the verticalLabel as “p0700”. But we might choose not to do this. In the future, if a project wanted to collect ta at say 20 hPa and another project wanted to collect ta at say 950 hPa, we might recommend that these new variables be assigned the root name “ta” and let the suffix alone define the pressure level (“p020” and “p0950”, respectively). If we proceeded in this way, I think users will come to appreciate that the new approach is scalable and easy to understand, and I think they will be happy not to memorize too many root names, which may to some seem obscure and difficult to decipher because these roots have been constructed by individuals unaware of any general rules or conventions for making them understandable.

As I noted in the proposal, for this to work smoothly with the past, two dictionaries will need to be created: one dictionary giving the original CMIP6 root name + table name and the corresponding new branded variable name; the other dictionary providing the same information but with the new branded name first. With this, existing codes could easily be modified to handle data stored with the new variable identifiers just as they handled these variables in CMIP6. (Note that similar tables could be constructed for CMIP5 and possibly CMIP3.)

I’m looking forward to modifying the proposal as needed to make it acceptable.

Best regards,

From: Martin Juckes - STFC UKRI

Dear Karl,

I appreciate the effort that has gone into this, but I am concerned that the scope has expanded substantially since we started discussing a rationalisation of the file naming convention.
In those initial discussions the view was that variable names would only need to be changed for a small number of variables. The initial plan was for a limited change between the CMIP cycles; what is now proposed looks like a disruptive change during the early stages of CMIP7. I don't think I can support it as it stands.

The document does not address the prime reason for the proliferation of tables in CMIP6, which was an insistence by some MIPs on using existing tables for defining requirements, which conflicted with a WGCM requirement that new variables from other MIPs shouldn't be added in to general requests automatically. We have, I hope, a clearer procedure now and, with CMIP IPO support, a better chance of avoiding communications going astray or being ignored. Still, I do agree that it would be a good idea to revise what has become a rather opaque system, and removing the MIP table labels from the file names is a good approach.

The current proposal appears to have renamed 301 out of 1313 root variable names. Some of these, such as "tas", "uas", etc, have been used by tens if not hundreds of thousands of scientists for years. For most of these people, who will have little interest in the details of the naming system, the change from "tas_Amon" to "ta-mon-tavg-h02-hxy-x" is likely to be confusing. Something like "tas-atmos-mon" would be more transparent and achieve the aim of doing away with the MIP tables. I suspect many thousands of scientists have many of the existing codes used repeatedly in data analysis and visualisation code and will find dealing with a transition very awkward. Sticking to a 3-part categorisation would mean having to use, as now, suffixes to distinguish between different versions of, for instance, monthly air temperature, but the approach used for commonly used variables has been in place for many years and does not need fixing.

I'm also concerned that there has been very little consultation.
A change from the names used in the past (when was "tas" introduced?) is going to require coding changes for thousands of users. I don't think we can introduce changes on this scale without consulting. It looks as if a desire to rationalise the way that, for instance, 6hrPlev fields are treated has influenced the design. This does not look proportionate to me: changes to really widely used variables which will inconvenience tens, perhaps hundreds of thousands of users being made for the sake of some abstract ideas about semantic structure of file names.

The section on cell methods appears out of place. I also find it confusing that it is presented in terms of adding over a discretised field, but it is mostly used, in the CMIP request, to describe the intended sub-grid processing for which there is not generally an explicit discretisation.

I do support the goal of retiring MIP tables from the file names, but I'm afraid I can't support this approach,

regards,

From: Taylor, Karl E

Dear Paul, Matt, and Martin,

(Matt, please forward to Martin for me!) (Let me know if the 3 MB attachment doesn't arrive)

I am seeking your support for my now completed and comprehensive proposal for defining unique labels for CMIP variables. These labels would replace the root+table labeling we've used in the past (labels like "ta_Amon"). The new labeling has been proposed because the old method led to a proliferation of tables, and also a larger than necessary number of distinct variable root names, which are often quite long (20% of the root names exceed 10 characters). My feeling is that continuing any longer with this increasingly unmanageable approach will become increasingly difficult because it will not easily scale to meet future demands.
The root names under the current system must distinguish variables that differ only in how the physical quantity is reported (e.g., both msftyrhompa and msftyzmpa represent the same physical variable, but one is reported on density surfaces while the other is reported on model levels). The new labeling allows both variables to share the same root and the distinction between them is made in the suffix, which in general specifies how the variable has been sampled. The new labels will facilitate community suggestions for new additions by clearly showing which variables have proved useful in the past. I think it also will prevent newcomers from wasting our time with suggestions of variables that already exist.

As outlined in the google doc (https://docs.google.com/document/d/1FHqbU2qikt92mApcaEgYY-10O2_oJrTWK_pO4omwpo8), which you have already helped in preparing, I suggest we host all the variables that have been found useful in a single table. The variables would be arranged in order of their standard_names. They might be rendered in several forms, including perhaps a spreadsheet like the one attached (but considerably simplified). A MIP's data request would be constructed by drawing on a subset of the variables in the master list. Note that defining this master list and its branded variables doesn't preclude a MIP from defining aliases for the branded variable labels and organizing the variables in tables (perhaps akin to the tables defined for CMIP6). It's only at the overarching foundational level that the branded variable labels must be maintained.

To test the feasibility of the branded variable approach, I have applied it retrospectively to the CMIP6 variables. The Excel spreadsheet attached displays in column BG the proposed branded variable labels.
In addition, I propose in column CP new "long_names" for each of the variables, which can for the most part be constructed using an algorithm that extracts information from attributes that are defined for each of the variables. These long names fully distinguish one branded variable from the next in text that is easily understood by scientists. Note that here we are only interested in the first of the 3 spreadsheets in the workbook (namely, the one named "CMIP6_branded_variables"). You'll notice that most of the columns in the spreadsheet have been hidden. The hidden columns contain information not directly relevant to defining the branded variable labels.

Any procedure that prepares or accesses variables in the CMIP archive would undoubtedly be affected by the transition from root+table labels to branded variable labels, but there will be a one-to-one correspondence between them, so a simple dictionary could be used to translate from one to the other.

I am convinced that as a foundation for data storage of MIP model output, the approach I've proposed is far superior to the present one, but we will need to make sure the modeling groups and users understand why we are doing it so they won't reject the changes (which don't need to be overly disruptive) just because they are not exactly the same as in the past. I think progress requires we work toward that goal now, and I think your endorsement and support will be essential. I hope you will soon read the google doc describing the approach and perhaps spend a little time reviewing the spreadsheet. I look forward to your suggestions on how to proceed and, if you would like, I would be happy to provide an overview remotely at your convenience.

best regards,
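The one-to-one translation described in the thread could be sketched roughly as below. The function name is hypothetical, and the single entry is the only old/new pairing quoted in this thread; a real mapping would presumably be generated from the spreadsheet rather than typed by hand.

```python
def invert_one_to_one(forward):
    """Build the reverse dictionary, refusing mappings that are not one-to-one."""
    inverse = {}
    for old, new in forward.items():
        if new in inverse:
            raise ValueError(
                f"{new!r} maps back to both {inverse[new]!r} and {old!r}"
            )
        inverse[new] = old
    return inverse


# Old root+table label -> branded label (pairing quoted in the thread).
cmip6_to_branded = {"tas_Amon": "ta-mon-tavg-h02-hxy-x"}
branded_to_cmip6 = invert_one_to_one(cmip6_to_branded)

print(cmip6_to_branded["tas_Amon"])
print(branded_to_cmip6["ta-mon-tavg-h02-hxy-x"])
```

Raising an error on a collision, instead of silently overwriting, is a design choice of this sketch: it makes any break in the one-to-one correspondence visible when the reverse dictionary is built.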
-
In applying the proposed approach to construct branded variable labels for the CMIP6 variables (see google doc), I have also produced human-readable full descriptions of each variable, built from the same information that the branded labels rely on. 90% of the descriptions are generated by code without human intervention. For the remaining 10%, short modifying descriptors are needed. The results of this exercise could in the future define the "long_name", and these names can be reviewed in column CO of this spreadsheet. The short modifiers supplied by me (for 10% of the variables) are found in column CN. Note that the primary purpose of this spreadsheet was to define the branded variable labels, which are given in column BG.
-
In response to an offline query, perhaps my verbose earlier postings obscured two points:
-
This discussion pulls an email thread out of inboxes