Null point estimate output type IDs in configs instead of `NA`s #109

annakrystalli · 2024-10-21T07:58:27Z

@zkamvar posted the following note on #103

Something that was brought up in response to reichlab/variant-nowcast-hub#117 (comment) is that the "NA" is a bit confusing because it sure looks like a character, but when we expand the grid the output_type_id columns become NA (which is an intentional move by Ooms described in section 2.1.1 of the JSONlite package paper)

Now that we are using is_required for point estimate types, we might be able to take this opportunity to set the required property to a single element null array. This will have exactly the same result as the "NA" array, but with the following advantages:

inter-language compatibility can be achieved since null is a concept that even JSON can understand
we can clearly communicate that this should be a missing value in the data as opposed to a character string.

This is what I think it would look like in the schema:

"required": {
    "description": "Not relevant for point estimate output types. Must be a single array of null"
    "type": "array"
    "items": {
        "const": null,
        "maxItems": 1
    }
}

Demo

Here's a demo that shows that ["NA"] and [null] are equivalent by modifying a tasks.json file and reading them in with jsonlite

hub_con <- hubData::connect_hub(
  system.file("testhubs/simple", package = "hubUtils")
)
hp <- attributes(hub_con)$hub_path
# Get the tasks file which has a `mean` output type
tasks <- fs::path(hp, "hub-config", "tasks.json")

# copy the file to a new temp file
tmp <- withr::local_tempfile()
fs::file_copy(tasks, tmp, overwrite = TRUE)
# The two files are identical
unname(tools::md5sum(tmp) == tools::md5sum(tasks))
#> [1] TRUE

cfg <- readLines(tmp)
writeLines(cfg[231:240]) # optional is NA
#>                 "output_type": {
#>                     "mean": {
#>                         "output_type_id": {
#>                             "required": null,
#>                             "optional": ["NA"]
#>                         },
#>                         "value": {
#>                             "type": "integer",
#>                             "minimum": 0
#>                         }

# Change '["NA"]' to '[null]'
nullcfg <- sub('["NA"]', '[null]', cfg, fixed = TRUE)
writeLines(nullcfg[231:240]) # optional is now null
#>                 "output_type": {
#>                     "mean": {
#>                         "output_type_id": {
#>                             "required": null,
#>                             "optional": [null]
#>                         },
#>                         "value": {
#>                             "type": "integer",
#>                             "minimum": 0
#>                         }
writeLines(nullcfg, con = tmp) # changing to null

# The two files are no longer identical
unname(tools::md5sum(tmp) == tools::md5sum(tasks))
#> [1] FALSE
waldo::compare(cfg, nullcfg)
#> old[81:87] vs new[81:87]
#>   "                    \"mean\": {"
#>   "                        \"output_type_id\": {"
#>   "                            \"required\": null,"
#> - "                            \"optional\": [\"NA\"]"
#> + "                            \"optional\": [null]"
#>   "                        },"
#>   "                        \"value\": {"
#>   "                            \"type\": \"integer\","
#> 
#> old[232:238] vs new[232:238]
#>   "                    \"mean\": {"
#>   "                        \"output_type_id\": {"
#>   "                            \"required\": null,"
#> - "                            \"optional\": [\"NA\"]"
#> + "                            \"optional\": [null]"
#>   "                        },"
#>   "                        \"value\": {"
#>   "                            \"type\": \"integer\","


# The resulting R object after reading in JSON are identical
identical(
  jsonlite::fromJSON(tasks, simplifyDataFrame = FALSE), 
  jsonlite::fromJSON(tmp, simplifyDataFrame = FALSE)
)
#> [1] TRUE

^{Created on 2024-10-18 with reprex v2.1.1}

Originally posted by @zkamvar in #103 (comment)

The text was updated successfully, but these errors were encountered:

annakrystalli · 2024-10-21T08:11:53Z

I'm totally onboard with this! 💯

The non-interlanguage compatibility of NA is something that's been brought up in the past, especially by @LucieContamin, discussed, but somewhat shelved until the bridge needed crossing. I agree this seems like a very good opportunity to cross that bridge.

I think we could easily even go one step cleaner and rather than a [null] array, just go for a null property all together (i.e. required: null. In terms of implementation, expand_model_out_grid() already has functionality to convert task ID optional and required properties which are both null (used for task IDs which are relevant to one modeling task but might not be to another in the same round) to NAs so I think it would be straightforward to apply that to output_type_ids too.

Other areas that would require work would be:

to update documentation in much more detail, with consideration of what that value might actually look like in other languages but overall I think this would be really helpful for modeling teams.
update configs in example hubs (but I imagine that could just happen as part of updating example hubs to v4 anyways so now major additional work.
Double check that no dynamic checks performed during config validation are now obsolete or affected.

Overall, I think it's much cleaner, clearer and would make it much easier to communicate and explain the expectations for point estimate output type ids so worth the effort!

nickreich · 2024-10-21T16:47:57Z

I basically agree conceptually with the arguments. Could someone explain what a change like this would mean for what a submission file would look like? I.e. what would be included (if anything) in the output_type_id column when output_type="mean"?

What would the implications be for existing hubs that might try to migrate to this new schema format? E.g. are we at an early enough stage with the variant hub that we could just make this change and edit a few files and it would be fine? Does FluSight solicit mean forecasts and would they be stuck with schema 3.x?

annakrystalli · 2024-10-22T09:02:24Z

Nothing should really change for submissions, either current or future. We are still using R to validate so any missing values used should translate to NAs when read into R. This is the case for, e.g. values generated in python with pd.NA.

So from my perspective, if R is interpreting the values as NA, there should be no problems with validation or with opening datasets as a whole. I think this will be even more stable with parquet files as missing values are encoded in the parquet format and therefore are fully interoperable regardless of the framework generated. Having said the above it may well be wise to experiment with csv a bit to ensure mixing and matching such values from different framework doesn't cause problems, especially downstream when trying to open a dataset.

I've added some suggestions for testing in hubverse-org/hubDocs#198

github-project-automation bot added this to hubverse Development overview Oct 21, 2024

github-project-automation bot moved this to Todo in hubverse Development overview Oct 21, 2024

This was referenced Oct 22, 2024

Test and document how to produce typed NA/missing values in Python. hubverse-org/hubDocs#198

Open

clarify missingness of point estimates; add examples for determining output_type_id format hubverse-org/hubDocs#197

Merged

annakrystalli added a commit that referenced this issue Oct 25, 2024

Encode point estimate output type IDs with null. Resolves #109

bfabc6d

annakrystalli moved this from Todo to Ready for Review in hubverse Development overview Oct 25, 2024

annakrystalli self-assigned this Oct 25, 2024

annakrystalli added the v4.0.0 label Oct 25, 2024

zkamvar mentioned this issue Oct 25, 2024

Encode point estimate output type IDs with null. #111

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Null point estimate output type IDs in configs instead of `NA`s #109

Null point estimate output type IDs in configs instead of `NA`s #109

annakrystalli commented Oct 21, 2024 •

edited

Loading

annakrystalli commented Oct 21, 2024 •

edited

Loading

nickreich commented Oct 21, 2024

annakrystalli commented Oct 22, 2024 •

edited

Loading

Null point estimate output type IDs in configs instead of NAs #109

Null point estimate output type IDs in configs instead of NAs #109

Comments

annakrystalli commented Oct 21, 2024 • edited Loading

Demo

annakrystalli commented Oct 21, 2024 • edited Loading

nickreich commented Oct 21, 2024

annakrystalli commented Oct 22, 2024 • edited Loading

Null point estimate output type IDs in configs instead of `NA`s #109

Null point estimate output type IDs in configs instead of `NA`s #109

annakrystalli commented Oct 21, 2024 •

edited

Loading

annakrystalli commented Oct 21, 2024 •

edited

Loading

annakrystalli commented Oct 22, 2024 •

edited

Loading