Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Null point estimate output type IDs in configs instead of NAs #109

Open
annakrystalli opened this issue Oct 21, 2024 · 3 comments
Open

Null point estimate output type IDs in configs instead of NAs #109

annakrystalli opened this issue Oct 21, 2024 · 3 comments
Assignees
Labels

Comments

@annakrystalli
Copy link
Member

annakrystalli commented Oct 21, 2024

@zkamvar posted the following note on #103

Something that was brought up in response to reichlab/variant-nowcast-hub#117 (comment) is that the "NA" is a bit confusing because it sure looks like a character, but when we expand the grid the output_type_id columns become NA (which is an intentional move by Ooms described in section 2.1.1 of the JSONlite package paper)

Now that we are using is_required for point estimate types, we might be able to take this opportunity to set the required property to a single element null array. This will have exactly the same result as the "NA" array, but with the following advantages:

  1. inter-language compatibility can be achieved since null is a concept that even JSON can understand
  2. we can clearly communicate that this should be a missing value in the data as opposed to a character string.

This is what I think it would look like in the schema:

"required": {
    "description": "Not relevant for point estimate output types. Must be a single array of null"
    "type": "array"
    "items": {
        "const": null,
        "maxItems": 1
    }
}

Demo

Here's a demo that shows that ["NA"] and [null] are equivalent by modifying a tasks.json file and reading them in with jsonlite

hub_con <- hubData::connect_hub(
  system.file("testhubs/simple", package = "hubUtils")
)
hp <- attributes(hub_con)$hub_path
# Get the tasks file which has a `mean` output type
tasks <- fs::path(hp, "hub-config", "tasks.json")

# copy the file to a new temp file
tmp <- withr::local_tempfile()
fs::file_copy(tasks, tmp, overwrite = TRUE)
# The two files are identical
unname(tools::md5sum(tmp) == tools::md5sum(tasks))
#> [1] TRUE

cfg <- readLines(tmp)
writeLines(cfg[231:240]) # optional is NA
#>                 "output_type": {
#>                     "mean": {
#>                         "output_type_id": {
#>                             "required": null,
#>                             "optional": ["NA"]
#>                         },
#>                         "value": {
#>                             "type": "integer",
#>                             "minimum": 0
#>                         }

# Change '["NA"]' to '[null]'
nullcfg <- sub('["NA"]', '[null]', cfg, fixed = TRUE)
writeLines(nullcfg[231:240]) # optional is now null
#>                 "output_type": {
#>                     "mean": {
#>                         "output_type_id": {
#>                             "required": null,
#>                             "optional": [null]
#>                         },
#>                         "value": {
#>                             "type": "integer",
#>                             "minimum": 0
#>                         }
writeLines(nullcfg, con = tmp) # changing to null

# The two files are no longer identical
unname(tools::md5sum(tmp) == tools::md5sum(tasks))
#> [1] FALSE
waldo::compare(cfg, nullcfg)
#> old[81:87] vs new[81:87]
#>   "                    \"mean\": {"
#>   "                        \"output_type_id\": {"
#>   "                            \"required\": null,"
#> - "                            \"optional\": [\"NA\"]"
#> + "                            \"optional\": [null]"
#>   "                        },"
#>   "                        \"value\": {"
#>   "                            \"type\": \"integer\","
#> 
#> old[232:238] vs new[232:238]
#>   "                    \"mean\": {"
#>   "                        \"output_type_id\": {"
#>   "                            \"required\": null,"
#> - "                            \"optional\": [\"NA\"]"
#> + "                            \"optional\": [null]"
#>   "                        },"
#>   "                        \"value\": {"
#>   "                            \"type\": \"integer\","


# The resulting R object after reading in JSON are identical
identical(
  jsonlite::fromJSON(tasks, simplifyDataFrame = FALSE), 
  jsonlite::fromJSON(tmp, simplifyDataFrame = FALSE)
)
#> [1] TRUE

Created on 2024-10-18 with reprex v2.1.1

Originally posted by @zkamvar in #103 (comment)

@annakrystalli
Copy link
Member Author

annakrystalli commented Oct 21, 2024

I'm totally onboard with this! 💯

The non-interlanguage compatibility of NA is something that's been brought up in the past, especially by @LucieContamin, discussed, but somewhat shelved until the bridge needed crossing. I agree this seems like a very good opportunity to cross that bridge.

I think we could easily even go one step cleaner and rather than a [null] array, just go for a null property all together (i.e. required: null. In terms of implementation, expand_model_out_grid() already has functionality to convert task ID optional and required properties which are both null (used for task IDs which are relevant to one modeling task but might not be to another in the same round) to NAs so I think it would be straightforward to apply that to output_type_ids too.

Other areas that would require work would be:

  • to update documentation in much more detail, with consideration of what that value might actually look like in other languages but overall I think this would be really helpful for modeling teams.
  • update configs in example hubs (but I imagine that could just happen as part of updating example hubs to v4 anyways so now major additional work.
  • Double check that no dynamic checks performed during config validation are now obsolete or affected.

Overall, I think it's much cleaner, clearer and would make it much easier to communicate and explain the expectations for point estimate output type ids so worth the effort!

@nickreich
Copy link
Contributor

I basically agree conceptually with the arguments. Could someone explain what a change like this would mean for what a submission file would look like? I.e. what would be included (if anything) in the output_type_id column when output_type="mean"?

What would the implications be for existing hubs that might try to migrate to this new schema format? E.g. are we at an early enough stage with the variant hub that we could just make this change and edit a few files and it would be fine? Does FluSight solicit mean forecasts and would they be stuck with schema 3.x?

@annakrystalli
Copy link
Member Author

annakrystalli commented Oct 22, 2024

Nothing should really change for submissions, either current or future. We are still using R to validate so any missing values used should translate to NAs when read into R. This is the case for, e.g. values generated in python with pd.NA.

So from my perspective, if R is interpreting the values as NA, there should be no problems with validation or with opening datasets as a whole. I think this will be even more stable with parquet files as missing values are encoded in the parquet format and therefore are fully interoperable regardless of the framework generated. Having said the above it may well be wise to experiment with csv a bit to ensure mixing and matching such values from different framework doesn't cause problems, especially downstream when trying to open a dataset.

I've added some suggestions for testing in hubverse-org/hubDocs#198

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Status: Reviewed/Ready to Merge
Development

No branches or pull requests

2 participants