Additional metadata in attribute definition #17

Open
ashiklom opened this issue Sep 15, 2020 · 3 comments

@ashiklom
Contributor

Per the discussion today, were we looking for something like this? The general idea is that attributeDefinition has the format [variable_type]{Variable definition...}.

library(magrittr, include.only = "%>%")

attributes <- tibble::tribble(
  ~attributeName,      ~attributeDefinition,                               ~unit,                   ~formatString, ~numberType,  ~definition,
  "time",              "[dimension]{time}",                                "year",                  "YYYY-MM-DD",  "numberType", NA,
  "depth",             "[dimension]{depth in reservoir}",                  "meter",                 NA,            "real",       NA,
  "ensemble",          "[dimension]{index of ensemble member}",            "dimensionless",         NA,            "integer",    NA,
  "species_1",         "[statevariable]{Population density of species 1}", "numberPerMeterSquared", NA,            "real",       NA,
  "species_2",         "[statevariable]{Population density of species 2}", "numberPerMeterSquared", NA,            "real",       NA,
  "data_assimilation", "[flag]{Flag whether time step assimilated data}",  "dimensionless",         NA,            "integer",    NA
)
attributes
#> # A tibble: 6 x 6
#>   attributeName  attributeDefinition  unit    formatString numberType definition
#>   <chr>          <chr>                <chr>   <chr>        <chr>      <lgl>     
#> 1 time           [dimension]{time}    year    YYYY-MM-DD   numberType NA        
#> 2 depth          [dimension]{depth i… meter   <NA>         real       NA        
#> 3 ensemble       [dimension]{index o… dimens… <NA>         integer    NA        
#> 4 species_1      [statevariable]{Pop… number… <NA>         real       NA        
#> 5 species_2      [statevariable]{Pop… number… <NA>         real       NA        
#> 6 data_assimila… [flag]{Flag whether… dimens… <NA>         integer    NA

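parse_attribute_definition <- function(string) {
  # Capture the [variable_type] and {variable definition} parts.
  # `.*?` is non-greedy, so each group stops at the first closing delimiter.
  regex <- "\\[(.*?)\\]\\{(.*?)\\}"
  m <- regexec(regex, string)
  result <- regmatches(string, m)
  # Drop the full-match column; drop = FALSE keeps a matrix for length-1 input.
  output <- do.call(rbind, result)[, -1, drop = FALSE]
  colnames(output) <- c("variable_type", "variable_definition")
  output
}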

parse_attribute_definition(attributes$attributeDefinition)
#>      variable_type   variable_definition                      
#> [1,] "dimension"     "time"                                   
#> [2,] "dimension"     "depth in reservoir"                     
#> [3,] "dimension"     "index of ensemble member"               
#> [4,] "statevariable" "Population density of species 1"        
#> [5,] "statevariable" "Population density of species 2"        
#> [6,] "flag"          "Flag whether time step assimilated data"

Created on 2020-09-15 by the reprex package (v0.3.0)

@mdietze
Contributor

mdietze commented Sep 15, 2020

Looks good to me. I think we'd just want a concrete list of the allowable variable_types. I'd add: driver, parameter, random_effect, observation, observation_error, process_error (obviously we'd update this list if we update the uncertainty list), and diagnostic (since @rqthomas mentioned this was useful in his files). I have two (related) questions:

  • do we need to have an initial_condition type and a statevariable type or are they always one and the same? note: the current standard does propose an optional <attributeName> listing within the <assimilation> additionalMetadata to allow users to ID which variables are being updated, with the understanding that those names would have to match something in the <attributeList>.
  • Is it possible for a single variable to have more than one type?

@rqthomas
Contributor

One case to consider is a flux (so it isn't a state) that is assimilated (so it isn't a diagnostic). This would fall through the classification cracks. Also, is there an easier regex to parse? I just use a colon (":") to separate the variable_type from the actual long name. However, what you present is cleaner to read, and if the average user isn't going to have to write complex regex statements, then I am fine with your proposal.
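For comparison, here is a minimal sketch of what parsing the colon convention could look like. The function name parse_colon_definition() is hypothetical, and splitting at the first colon only is an assumption on my part, made so that colons inside the long name survive.

```r
# Hypothetical helper (not part of any package): split "type:long name"
# at the FIRST colon only, so colons inside the long name are preserved.
parse_colon_definition <- function(string) {
  pos <- regexpr(":", string, fixed = TRUE)
  # regexpr() returns -1 where no colon is found; treat that as an error.
  stopifnot(all(pos > 0))
  data.frame(
    variable_type = trimws(substr(string, 1, pos - 1)),
    variable_definition = trimws(substr(string, pos + 1, nchar(string))),
    stringsAsFactors = FALSE
  )
}

parse_colon_definition(c(
  "dimension:time",
  "statevariable:Population density of species 1"
))
```

This needs no regex at all, but it cannot tell a missing type from a long name that merely contains a colon, which is one argument for the bracketed format.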

@ashiklom
Contributor Author

ashiklom commented Sep 17, 2020

> I think we'd just want a concrete list of the allowable variable_types.

Yup, this can be implemented as a factor, and we can throw errors if the result has any NAs.
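A rough sketch of that idea, assuming the vocabulary is simply the types mentioned so far in this thread (it is not a finalized list) and using a hypothetical helper name, validate_variable_type():

```r
# Assumed vocabulary, collected from this thread; not a finalized list.
allowed_types <- c(
  "dimension", "statevariable", "flag", "driver", "parameter",
  "random_effect", "observation", "observation_error",
  "process_error", "diagnostic"
)

# Hypothetical helper: coerce to a factor and fail on unknown types.
validate_variable_type <- function(variable_type, levels = allowed_types) {
  # Values outside `levels` become NA when converted to a factor.
  result <- factor(variable_type, levels = levels)
  bad <- unique(variable_type[is.na(result)])
  if (length(bad) > 0) {
    stop("Unknown variable_type(s): ", paste(bad, collapse = ", "),
         call. = FALSE)
  }
  result
}

validate_variable_type(c("dimension", "statevariable", "flag"))
```

Reporting the offending values in the error message, rather than only checking for NAs, should make it easier for users to find the typo in their metadata.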

> do we need to have an initial_condition type and a statevariable type or are they always one and the same?

I'm inclined to think they're the same, but I'm open to counterexamples.

> Is it possible for a single variable to have more than one type?

I think we should define our types to avoid this if at all possible (i.e., if this is possible, then we haven't defined our types well). From an implementation standpoint, there's no reason we couldn't implement multiple types with either [type1][type2]{description} or [type1|type2]{description} (or similar), but everything is simpler (conceptually and for implementation) if a variable can only have one type.
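If we did end up needing multiple types, the [type1|type2]{description} variant would only add one step after the regex capture. A sketch, with a hypothetical helper name (split_variable_types()) and assuming "|" never appears inside a type name:

```r
# Hypothetical helper: split an already-captured type string on "|".
# Returns a list with one character vector of types per input element.
split_variable_types <- function(variable_type) {
  lapply(strsplit(variable_type, "|", fixed = TRUE), trimws)
}

split_variable_types(c("statevariable", "statevariable|observation"))
```

The result is a list rather than a character vector, which is part of why everything downstream gets more complicated once a variable can have more than one type.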

> One case to consider is a flux (so it isn't a state) that is assimilated (so it isn't a diagnostic)

Even though it breaks ontologies, I'd probably be OK calling that a "state".

> Also, is there an easier regex to parse?

I picked this regex specifically for its parseability. As long as we define a few simple rules (the [type] has to come first, no [] characters inside the type, and no characters after the description), the following should be pretty robust to just about any input. Note that the ? in the first .*? makes that quantifier non-greedy, so it matches the shortest string before a ] (rather than the default greedy behavior, which finds the longest match and could slurp up any [] in the description). I also added a few " *" patterns to make this robust to surrounding whitespace.

"^ *\\[(.*?)\\] *\\{(.*)\\} *$"

> if the average user isn't going to have to write complex regex statements

Yeah, definitely not. The regex will be hard-coded in a parse_variable() or similar function in this package or elsewhere.
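Something along these lines, as a sketch only: the final function name and error handling are still open, and perl = TRUE is my assumption here so that the non-greedy .*? is handled by PCRE rather than the default engine.

```r
# Sketch of a parse_variable() built around the anchored regex above.
parse_variable <- function(string) {
  regex <- "^ *\\[(.*?)\\] *\\{(.*)\\} *$"
  m <- regmatches(string, regexec(regex, string, perl = TRUE))
  # A successful match yields 3 strings: full match, type, definition.
  matched <- lengths(m) == 3
  if (!all(matched)) {
    stop("Malformed attributeDefinition: ",
         paste(string[!matched], collapse = "; "), call. = FALSE)
  }
  # Drop the full-match column; drop = FALSE keeps a matrix for one input.
  output <- do.call(rbind, m)[, -1, drop = FALSE]
  colnames(output) <- c("variable_type", "variable_definition")
  output
}

parse_variable(c(
  "[dimension]{time}",
  "  [statevariable] {Density [per m2] of {species 1}}  "
))
```

Because the second group is greedy and anchored by the trailing \\} *$, brackets and braces inside the description are kept intact, and anything that does not follow the rules raises an error instead of being silently dropped.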
