Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add way to refer to ref_time_value within f arg to epix_slide #170

Closed
brookslogan opened this issue Jul 27, 2022 · 6 comments
Closed

Add way to refer to ref_time_value within f arg to epix_slide #170

brookslogan opened this issue Jul 27, 2022 · 6 comments
Assignees
Labels
op-semantics Operational semantics; many potentially breaking changes here P1 medium priority

Comments

@brookslogan
Copy link
Contributor

brookslogan commented Jul 27, 2022

Currently, epix_slide does not make the current ref_time_value (t in the current code) available to the supplied f. This is a problem, e.g., when forecasting, where the targets for ref_time_value should be observations at ref_time_value + aheads, but there is data reporting latency, so data are only available for time_values through ref_time_value - latency. We are left with either incorrectly pretending that max(.x$time_value) is the same as ref_time_value, or making some guess or inconvenient calculation of the latency, so we can say ref_time_value = max(.x$time_value) + latency_guess_or_calculation.

Instead, we should make ref_time_value available to f. Approaches to passing:

  • Metadata of .x: doesn't work, we don't provide an epi_df. [but same issue of potentially being overlooked]
  • Attributes of .x: works across all forms of f, but is easily overlooked.
  • Extra args/pronouns: probably the best user interface, but more complicated to implement. Notes below assume that we take this approach.

Current epi[x]_slide f functions take:

  • window data
  • group key (1-row tibble with values of the grouping cols)
  • any additional args the user would like to pass via dots

Since epi_slide implementation might be trickier and take longer, it makes sense to make epix_slide's f functions take the following:

  • window data
  • group key (1-row tibble with values of the grouping cols)
  • ref_time_value
  • any additional args the user would like to pass via dots

(If instead we implement the corresponding epi_slide change simultaneously, we could place the ref_time_value before the group key if we felt it made more sense. I don't think one approach is dominant in this case, though. Often we group by geo, so group key followed by ref_time_value is sort of a geo value followed by sort of a time value, which seems natural given how we order epi_df columns. However, sometimes we'll group by nothing, group key followed by ref_time_value would be something useless followed by something useful, which is a bit unnatural.)

--- Implementation ideas: ---

Passing ref_time_value as a third arg when f is a function may require:

Passing ref_time_value when f is a formula might require:

  • Nothing beyond the above: maybe we would already be able to reference it in the formula as ..3 or ref_time_value
  • Somehow introducing a new data pronoun or data mask (using rlang::as_data_{pronoun,mask} + other rlang stuff?), still using purrr::map_dfr.
  • Some approach not using purrr::map_dfr with more freedom.

Passing ref_time_value when f is missing & ... contains summarize-like operations:

  • Somehow introduce a new data pronoun or data mask?
  • Change the execution environment, e.g., by setting the environment of f to a new child of itself, and binding ref_time_value within that child environment. (Careful, check whether we need to do something to copy f to avoid modifying the real f's environment.)
  • [Make cur_group() work and mimic it with, e.g., cur_ref_time_value().]
@brookslogan brookslogan added P1 medium priority op-semantics Operational semantics; many potentially breaking changes here labels Jul 27, 2022
@brookslogan
Copy link
Contributor Author

This is already possible in the current interface, as epix_slide now provides computations their first input as an epi_df containing the appropriate as_of metadata. (A point of uncertainty: if we are grouping by geo_value which breaks the validity of the epi_df this might not be the case or might be brittle.) We still may want a more convenient/robust/unified interface, though.

@nmdefries
Copy link
Contributor

Here are two possible ways to be able to access ref_time_value from expressions in slide ... based on the suggestions above for "[p]assing ref_time_value when f is missing & ... contains summarize-like operations":

  • Add pronoun to data mask used in eval_tidy. Use like epix_slide(xx1, before = 2, slide_value = sum(time_value) - .env$ref_time_value) (.env interferes with rlang::.env but that's a detail of implementation + easy to change)
  • Modify quo's env. Use like epix_slide(xx1, before = 2, slide_value = sum(time_value) - ref_time_value)

@nmdefries
Copy link
Contributor

On further discussion with Logan, we'd prefer to match dplyr's handling, where cur_group() fetches the grouping variable values for the current group. We would make our own cur_*-like function for the ref time value.

There are additional dplyr functions we may want to support: cur_data() (referring to the non-grouping columns for that group) and cur_data_all() (referring to to all columns for that group) that have been deprecated, and pick() added as a replacement to cover some use cases.

This looks pretty complicated to implement and would also depend on non-exported dplyr functions and classes.

dplyr::cur_group() is a wrapper for a method of the internal dplyr::DataMask R6 class. The data mask is fetched from the environment by name and, if not found, the computation exits with an error (currently happens when trying to use cur_group() in a dots-provided slide computation). The data mask is created in each dplyr function call with settings specific to the requested computation type (via verb).

@nmdefries
Copy link
Contributor

nmdefries commented May 10, 2023

We could support use of cur_group(), etc, using the env-modifying solution above so that the interface matches dplyr but the implementation doesn't (since the backend is obfuscated anyway). Cons here are that we'd have to exhaustively implement the functionality of dplyr tidy usage instead of just inheriting it, so we could end up with syntax that only partially matches that of dplyr -- probably more confusing that just not matching dplyr at all.

@nmdefries
Copy link
Contributor

On further discussion with @ryantibs, plan to implement one of the easy approaches above to make the ... act like a formula and consider revisiting the cur_group() support later.

@nmdefries
Copy link
Contributor

Closed by #313 and #317.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
op-semantics Operational semantics; many potentially breaking changes here P1 medium priority
Projects
None yet
Development

No branches or pull requests

2 participants