Skip to content

group_by() for epi_archive objects #64

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
ryantibs opened this issue Apr 5, 2022 · 10 comments
Closed

group_by() for epi_archive objects #64

ryantibs opened this issue Apr 5, 2022 · 10 comments
Assignees
Labels
op-semantics Operational semantics; many potentially breaking changes here P0 high priority

Comments

@ryantibs
Copy link
Member

ryantibs commented Apr 5, 2022

Should we implement group_by() as public method in the epi_archive object? We could do that since it's an R6 object, and when group_by() is called, we could just have it set a private field called group_keys to whatever variables are passed.

The only downstream behavior this would affect is the as_of() and slide() methods for the epi_archive object. (These are its own public methods.) This would either return a group epi_df snapshot, or do a grouped sliding computation, respectively.

Pros:

  • Would allow you for an epi_archive object x to do:
     x %>% group_by(geo_value) %>% epix_slide(col = my_fun(...))
    
    to do a sliding computation grouped by geo_value. This would be precisely analogous to how we would do it on an epi_df object with epi_slide(). Currently we have to set the grouping variables as an argument to epix_slide().

Cons:

  • After calling epix_slide(), the object x above would still be grouped. This is because R6 objects obey reference semantics. So we'd have to ungroup() it, which may be a bit counterintuitive since it's not how dplyr verbs work on data frames or tibbles.

@lcbrooks @dajmcdon @jacobbien What do you guys think?

@dajmcdon
Copy link
Contributor

dajmcdon commented Apr 7, 2022

I feel like the Pro is pretty major, and the Con is pretty minor. But maybe I'm not representative.

My potential confusion:

For dplyr, group_by() followed by summarize() results in an ungrouped object while it remains grouped after mutate(). It feels to me that _slide() is more like mutate() then summarize(), so maybe leaving it grouped isn't so odd? (With the caveat that all of this involves assignment to a result).

On the other hand, the grouping/not behaviour in dplyr is one of those things that I often forget about, and I find myself erroneously assuming something is ungrouped. This has happened many times, so maybe I'm an outlier in terms of actually absorbing the dplyr logic.

@brookslogan
Copy link
Contributor

brookslogan commented Apr 7, 2022

I think the drawback Ryan is pointing out is not that we need to group(....) %>% epix_slide(.....) %>% ungroup(....) to get no groups, but rather, is this situation:

x = <ungrouped epi_archive>
z1 = x %>% epix_slide(<stuff1>)
y = x %>% group_by(grpvar) %>% epix_slide(<stuff2>)
z2 = x %>% epix_slide(<stuff1>) # different from z1!

epi_slide is like mutate, but epix_slide is more like summarize [except it still partially broadcasts]; while the former only adds columns, the latter will be dropping nonkey nongroup columns + outputting a different class result (epi_archive --> epi_df). I believe that, until recently, summarize left things grouped anyway or wouldn't drop them completely; then it moved to .groups="drop_last" + a message by default; now it seems to be .groups="drop_last" + no message by default. So users should be used to worrying about what happens to their groups... but maybe we could provide a .groups option with a noisy default as well?

[This con] would be [re]solved by moving away from R6+data.table for epi_archive and instead using list+lazy_dt, or having epi_archive be a non-reference-semantics wrapper on top of an R6 EpiArchive backend. We decided this move was low priority though.

@brookslogan
Copy link
Contributor

brookslogan commented Apr 7, 2022

Addressing/ameliorating the Con:

  • Higher effort: move to list + lazy_dt
  • Medium effort: group_by should produce a "grouped_epi_archive" object, which records the groups + a pointer back to the (unmodified) epi_archive
  • Low effort: as Ryan described, but have first step of epix_slide be to record the groups, then (side-effectfully) removing the group setting from the epi_archive. The con is now if someone did grouped_x = group_by(x, grpvar); grouped_x %>% epix_slide(<stuff1>); grouped_x %>% epix_slide(<stuff1>) they'd get different answers. Based on how I've used group_by and summarize in the past, I think this would be less likely to encounter compared to situation I noted in the post above [.... but I'm not sure on second thought]. (((We could prevent this by warning or stopping if there is no group setting in epix_slide. (To use no groups they would have to do x %>% group_by() %>% epix_slide(.....) or something like that.))))

@dajmcdon
Copy link
Contributor

dajmcdon commented Apr 7, 2022

Oh, I see. Thanks Logan!

@ryantibs
Copy link
Member Author

ryantibs commented Apr 15, 2022

@lcbrooks @dajmcdon What is your current thinking on this?

Based on our conversation yesterday, it seems like we want to abide by the rule:

Functions applied to R6 objects should not be side-effectful. Only public member functions should be side-effectful.

So that means that we could go with something like Logan's "medium" solution: group_by() should clone everything in the epi_archive object EXCEPT the underlying data.table, and return it as an grouped_epi_archive object.

Are we OK with the fact that we don't clone the underlying data.table? I think so, provided we record this very clearly in the documentation. Plus, we could think of this as not an "R6 special case handling", but a "data table special case handling". Data tables are not cloned, they are typically handled in memory.

Thoughts?

@brookslogan
Copy link
Contributor

brookslogan commented Apr 15, 2022

Thinking about this clone-based "medium" solution:

Just regular archive$clone() might work; I think it's shallow by default (and don't know whether the deep clone feature actually recognizes data.tables and does the special copy operation that would be required). Eventually we may want to move to the "higher effort" solution... we might want to document that while currently, it will be pointing to the same DT as the original, this might change in the future to use something that will copy on write, so users shouldn't rely on this pointer behavior?

There's a bit of complication due to the combination of S3 and R6. A couple of decisions to make:

  • Does group_by output S3 class c("grouped_epi_archive","R6"), c("grouped_epi_archive", "epi_archive","R6"), or c("epi_archive","R6")?
    • A: The first one requires writing an implementation for every S3 method that epi_archive implements and forwarding to the epi_archive implementation. If users write an S3 implementation for an epi_archive they will need to also write one for grouped_epi_archives. But we will also not accidentally automatically get non-grouped behavior for any methods that we forgot to write special grouped implementations for.
    • B: The second one has this behavior automatically and matches dplyr's approach, but will have an S3 triple class which might not be normal for R6 classes, which could be confusing
    • C: Similar to B, but could be inappropriate if there are S3 functions that should only apply to grouped epi archives
  • Does group_by output R6 class grouped_epi_archive or epi_archive (with grouping support built into epi_archive)?
    • 1: the first one requires thinking about / dealing with R6 inheritance
    • 2: the second one might be inappropriate if there are R6 functions that should only apply to grouped epi archives

Top contenders are

  • A1: S3 class is c("grouped_epi_archive","R6"), R6 class is "grouped_epi_archive"
  • B1: S3 class is c("grouped_epi_archive","epi_archive","R6"), R6 class is "grouped_epi_archive"
  • C2: S3 class is c("epi_archive","R6"), R6 class is "epi_archive"

A1 involves more S3 implementations, but could also catch some issues when we don't want to directly inherit epi_archive behavior. B1 is closest to dplyr, but might or might not confuse R6 users with the triple class. C2 might be a little confusing with group methods listed for ungrouped archives.

I can't see a clear winner here. Maybe A1 or B1 over C2.

@ryantibs
Copy link
Member Author

ryantibs commented May 1, 2022

@lcbrooks Thanks Logan for the super detailed proposal and analysis!

I'm in favor of B1. The only negative you point out is that it has a triple class, but I don't really see this as a negative to be honest.

Tagging @dajmcdon to see if he has any further thoughts, but barring that, I think to get the ball rolling, you could go ahead and implement this. Seems like it could be a good idea to address #67 in the same PR since it's a related issue.

@brookslogan brookslogan self-assigned this May 9, 2022
@brookslogan brookslogan added P2 low priority op-semantics Operational semantics; many potentially breaking changes here P0 high priority and removed P2 low priority labels May 24, 2022
@brookslogan
Copy link
Contributor

Returning to this. I was trying to write a justification for B1 in a comment, and in doing so convinced myself to switch to A1, because looking at, e.g., the backcasting preprocessors and compactify, I think we're going to want to force ourselves&users be precise about when we're dealing with an ungrouped vs. a grouped epi_archive. And the overhead is comparable to B1, because with B1, we still need to check every inherited function to make sure it's compatible; it's just that if we forget to implement a method in A1, we get errors about unimplemented methods, which can be worked around, and if we forget to specialize/disable in B1, we get invalid results. I'm also adding a safeguard against re-grouping a grouped epi_archive. The epipredict development should hopefully give us an idea of how this all works out.

@brookslogan
Copy link
Contributor

brookslogan commented Aug 3, 2022

I have what looks like a working implementation on lcb/grouped_epi_archive. I originally hoped to keep the group_by argument to epix_slide, but with a deprecation warning until we discussed what to do with it; that is what is currently implemented. However, with the group_by function making this more like dplyr, we have the dplyr-based (and epi_slide-based) expectation that the default grouping will be no groups, not the key minus version&time_value. So if we are already making a breaking change to the default epix_slide grouping, it might make sense to simultaneously make the breaking change of removing the group_by parameter, especially since the existing approach adds a groups parameter (mirroring summarize's .groups), which might get a little confusing next to group_by. Assuming no objections, the remaining steps are:

  • default epix_slide grouping: change from non-version, non-time part of the key with message ---> no grouping, no message --- consistent with summarize
  • change from deprecated group_by arg (allow, but warn) ---> remove
  • consider whether to try to catch a group_by arg being passed and raise error informing of interface change.
  • [file as a separate issue; should consider naming together with any similar selectors for epi_df] add tidyselect helpers for the key columns, similar to how tidymodels has all_predictors()
  • resolve or highlight differences between args of this group_by... they are tidyselect rather than data masking, and don't allow definition of new columns
  • consider %>% group_by vs. $group_by desired behavior
  • [not now; too much effort for what seems like no real additional functionality] add method forwarding for additional methods (with appropriate checks, e.g., for merge and fill_through_version, probably that the grouping vars are among the nonversion key vars if we'd do something incompatible with groupsplit-map-bind)? Currently, the only thing to do with a grouped_epi_archive is to print, slide, group_by with .add=TRUE, or ungroup. We could keep it that way to indicate the only operations that are worth grouping for, or forward/implement more groupsplit-map-bind compatible methods.
  • merge in n -> before
  • address inconsistency of proposed epix_slide with epi_slide and existing epix_slide regarding whether grouping variables are accessible in .x vs. put into .y; ties into whether .x is a valid epi_df
    • The change from group_modify to summarize didn't include passing a dplyr::cur_group(), so the group key isn't received. [ --- Addressed by moving back to group_modify in epix_slide implementation.]
    • Using dplyr::cur_data_all() includes the columns we'd need to keep to ensure the first arg to each f call is an epi_df, but actually doesn't keep the epi_df class or metadata, and including these columns also differs from the current epi_slide behavior, impacting the output class and metadata. [Might be an epi_df S3 method issue.] -----> Make it (like) dplyr::cur_data(), with epi_df class if it's valid, else a tbl_df. Check that this is consistent with epi_slide --- it is, but it's not consistent with the old epix_slide behavior (which used .keep=TRUE vs. epi_slide's .keep=FALSE) --- a change to be documented. Might need to go back to group_modify if getting cur_data() takes too much work to get in right format, but this change will require some way of getting the groups argument behavior (+ checking treatment of group_by additional args) --- or to just not have this groups behavior and mirror group_modify again rather than summarize --- but neither is compatible with the definite ungroup-ing of old epix_slide, which also did not match epi_slide. [ --- Addressed by moving back to group_modify in epix_slide implementation and removing groups arg.]
    • The current groups default behavior is incompatible with the behavior of epi_slide. [But that matches with the mutate-like vs. summarize-like behavior of epi_slide vs. epix_slide.] [ --- Addressed by removing ungrouping in epix_slide implementation.]
  • fix changes being made only to one of the two group_modifys doing the same thing for different types of f, ... (TODO: check for / file issue to refactor so there is no duplication, here and in epi_slide)
  • fix test failure due to cur_data_all() dropping class & metadata
  • note broadcasting differences vs. epi_slide
  • fix references to n time steps remaining in n -> before, after; maybe just here to avoid remerge. Also fix "running window", "over variables in x". Also check in epi_slide docs after merging n -> before, after there.
  • fix "are a similar but different:"
  • "2. Note that" -> "2."
  • group epix_slide tests together
  • if still using summarize for sliding, check whether dplyr use in f could lead to this issue and address / document how to avoid [--- did not check; back to using group_modify because cur_data_all() drops epi_dfness, unlike group_modify first arg to computation]
  • update&add docs
  • update&add tests [some of the tests to add: grouped_epi_archive grouping behavior, maybe some alias&mutate contracts&assumptions]
  • update vignettes
  • update changelog; make sure to mention breaking changes like default group_by behavior, default groups behavior, removed group_by parameter, ways to access grouping variables in f
  • [, ..cols] -> [, cols, with=FALSE] to silence false positive check() note [ --- did nothing; no longer seeing this note; maybe updated checks don't trigger]
  • rename groups parameter -> .groups, to match summarize? or do with rest of Consider dot-prefixing all parameter names in functions where we take in user-controlled named ... #162 [no longer applies; groups parameter has been removed]
  • Change $print for grouped_epi_archive to print [forwarded/specialized methods, additional methods, missing methods] wrt to epi_archive.
  • Update "#' 2. epix_slide() uses a group_by to specify the grouping upfront (in"
  • Consider providing an epix_group_by function delegating to group_by and epix_ungroup for ungroup? ---> No, at least not without also creating epi_group_by and epi_ungroup for epi_dfs.
  • Forbid "renaming" in ungroup, or match dplyr behavior of just not renaming (would require fetching current names at integer indices rather than "new" name indices).
  • Check behavior of partial ungroup on non-grouping variables. [--- we match dplyr behavior of allowing non-current-grouping variables in the vars-to-"remove"-from-grouping; not sure if this is the best decision]
  • Address epix_slide won't grab the latest version #153 to make simple examples possible.
  • Adjust or deprecate all_rows for epix_slide. [migrated]
  • ungroup after slides in vignettes.
  • Repair vignettes and pkgdown.
  • Consider replacing data objects with compactified versions if that's possible without breaking tests now. (& re-updating vignettes) [migrated]
  • Maybe also implement groups and ungroup for ungrouped epi_archives. Good for making routines that work on both without needing a %>% group_by() to add trivial grouping, although at the cost of making it easier to introduce bugs related to using some method exclusive to grouped or ungrouped methods and thinking it works generally because it worked on one of them. [hold off]
  • Follow through on making epix_slide more summarize-like, without implementing an epi_slide_summarize, and discussing in documentation, being careful to remove descriptions of size stability and exact forecast compatibility with epi_slide if not true [migrated]
  • Debug installation issue that sometimes appears, described below.
Installation issue
renv::install("cmu-delphi/epiprocess@lcb/grouped_epi_archive")
Retrieving 'https://api.github.com/repos/cmu-delphi/epiprocess/tarball/ae02a8774a11f4b952e82360622759e7fdad3ab9' ...
	OK [downloaded 934.5 Kb in 1 secs]
Installing tidyselect [1.2.0] ...
	OK [linked cache]
Installing epiprocess [0.5.0.9999] ...
	FAILED
Error installing package 'epiprocess':
======================================

* installing *source* package ‘epiprocess’ ...
** using staged installation
** R
** data
*** moving datasets to lazyload DB
** byte-compile and prepare package for lazy loading
** help
*** installing help indices
  converting help for package ‘epiprocess’
    finding HTML links ... done
    archive_cases_dv_subset                 html  
    as_epi_archive                          html  
    as_epi_df                               html  
    as_tsibble.epi_df                       html  
    detect_outlr                            html  
    detect_outlr_rm                         html  
    detect_outlr_stl                        html  
    epi_archive                             html  
    epi_cor                                 html  
    epi_df                                  html  
    epi_slide                               html  
    finding level-2 HTML links ... done

Rd warning: /tmp/RtmpnCpBWJ/renv-package-new-4c8c436a8145/epiprocess/man/epi_slide.Rd:23: missing link ‘grouped’
    epiprocess                              html  
    epix_as_of                              html  
    epix_fill_through_version               html  
    epix_merge                              html  
    epix_slide                              html  
    group_by.epi_archive                    html  
Error: /tmp/RtmpnCpBWJ/renv-package-new-4c8c436a8145/epiprocess/man/group_by.epi_archive.Rd:34: Bad \link text
* removing ‘/home/fullname/files/tooling/live-dogfooding-2023-02-01/renv/staging/1/epiprocess’
Error: install of package 'epiprocess' failed [error code 1]
Traceback (most recent calls last):
11: renv::install("cmu-delphi/epiprocess@lcb/grouped_epi_archive")
10: renv_install_impl(records)
 9: renv_install_staged(records)
 8: renv_install_default(records)
 7: handler(package, renv_install_package(record))
 6: renv_install_package(record)
 5: withCallingHandlers(renv_install_package_impl(record), error = function(e) {
        vwritef("\tFAILED")
        writef(e$output)
    })
 4: renv_install_package_impl(record)
 3: r_cmd_install(package, path)
 2: r_exec_error(package, output, "install", status)
 1: stop(error)

sessionInfo()
R version 4.1.3 (2022-03-10)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: Fedora Linux 36 (Workstation Edition)

Matrix products: default
BLAS/LAPACK: /usr/lib64/libflexiblas.so.3.2

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.8.3          lubridate_1.8.0       tsibble_1.1.1        
 [4] lattice_0.20-45       tidyr_1.2.0           pak_0.3.0            
 [7] prettyunits_1.1.1     ps_1.7.1              assertthat_0.2.1     
[10] rprojroot_2.0.3       utf8_1.2.2            R6_2.5.1             
[13] genlasso_1.5          ggplot2_3.3.6         pillar_1.7.0         
[16] rlang_1.0.4           rstudioapi_0.13       data.table_1.14.2    
[19] callr_3.7.0           Matrix_1.4-0          slider_0.2.2         
[22] desc_1.4.1            devtools_2.4.3        stringr_1.4.0        
[25] fabletools_0.3.2      igraph_1.3.2          munsell_0.5.0        
[28] anytime_0.3.9         compiler_4.1.3        xfun_0.31            
[31] pkgconfig_2.0.3       pkgbuild_1.3.1        tidyselect_1.1.2     
[34] tibble_3.1.7          roxygen2_7.2.0        fansi_1.0.3          
[37] crayon_1.5.1          dplyr_1.0.9           withr_2.5.0          
[40] grid_4.1.3            distributional_0.3.0  gtable_0.3.0         
[43] lifecycle_1.0.1       DBI_1.1.3             magrittr_2.0.3       
[46] scales_1.2.0          warp_0.2.0            cli_3.6.0            
[49] stringi_1.7.6         cachem_1.0.6          farver_2.1.0         
[52] renv_0.15.5           fs_1.5.2              remotes_2.4.2        
[55] xml2_1.3.3            ellipsis_0.3.2        epiprocess_0.5.0.9999
[58] generics_0.1.2        vctrs_0.4.1           tools_4.1.3          
[61] glue_1.6.2            purrr_0.3.4           processx_3.6.1       
[64] pkgload_1.3.0         fastmap_1.1.0         colorspace_2.0-3     
[67] sessioninfo_1.2.2     memoise_2.0.1         knitr_1.39           
[70] feasts_0.2.2          usethis_2.1.6        

@brookslogan
Copy link
Contributor

Musings on group_by vs. $group_by:

Having $group_by mutate the DT with set may not make sense, as it just wastes space with copies:

library(data.table)
DT = data.table(a = 2:6, b=3:7)
c = 1:5 # potentially uses ALTREP
d = c(1,5,2,6,3) # probably no ALTREP
old_DT_address = address(DT)
old_a_address = address(DT$a)
set(DT, , c("b", "c", "d"), list(NULL, c, d))
address(DT) == old_DT_address
#> [1] TRUE
address(DT$a) == old_a_address
#> [1] TRUE
address(DT$c) == address(c)
#> [1] FALSE
address(DT$d) == address(d)
#> [1] FALSE

Created on 2022-09-14 with reprex v2.0.2

It would stay more true to the data.table model, though, but I'm not sure if there's a case where that is useful. I believe that the data.table model is that generally (but not in all cases as with as.list), if you have a unique pointer to the data.table, it should contain unique pointers to each of the columns. So aliasing some columns would generally be forbidden (and so set copies the columns before adding). But it seems very rare to directly mutate column contents rather than column pointers, so trying to reduce the cases where the former could be problematic at the cost of extra copies in the latter, much more common, scenario, doesn't seem that useful.

So maybe we have $group_by break the typical data.table model and make the user worry about aliases not only when mutating things that could be aliasing data tables, but also contents of vectors that could be aliasing data table columns. But it could still be different than group_by; e.g., $group_by might set the DT field to its mutated value. There might be some debate about what's most easily flexible, especially if we don't expose the partially-aliasing limited detailed mutate. However, probably a stronger argument is to not allow this because it would be mutating one thing (original archive's DT) and returning a different thing (grouped archive), rather than returning the same thing invisibly.

--- Conclusion: ---
Having $group_by act in the same way as group_by seems to be the cleanest approach. (This is the current approach in the feature branch, so we can just check this item off the list.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
op-semantics Operational semantics; many potentially breaking changes here P0 high priority
Projects
None yet
Development

No branches or pull requests

3 participants