Skip to content

Check/test all_rows, improve docs #147

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
3 tasks
brookslogan opened this issue Jul 17, 2022 · 7 comments
Closed
3 tasks

Check/test all_rows, improve docs #147

brookslogan opened this issue Jul 17, 2022 · 7 comments
Labels
op-semantics Operational semantics; many potentially breaking changes here P1 medium priority

Comments

@brookslogan
Copy link
Contributor

brookslogan commented Jul 17, 2022

  • The join in all_rows [in epix_slide] looks like it will result in a huge expansion of rows if group_by isn't all non-version non-time key cols. Check whether this is an issue.
  • Improve documentation. I can't understand what it's doing from the docs. It looks like maybe the difference between mutate and summarize, except maybe still requiring 1-row outputs on the "mutate" version? [No, it's not. Or there is a bug when specifying a single ref time value and/or other situations causing a lot of extra rows to be added even when not in the above situation.]
  • Consider renaming all_rows
@brookslogan brookslogan added P1 medium priority op-semantics Operational semantics; many potentially breaking changes here labels Jul 17, 2022
@brookslogan brookslogan changed the title Check/test all_rows, consider changing to just return a filtered archive Check/test all_rows, improve docs Jul 21, 2022
@nmdefries
Copy link
Contributor

The join in all_rows looks like it will result in a huge expansion of rows if group_by isn't all non-version non-time key cols. Check whether this is an issue.

If group_by isn't set by the user, the auto-generated value never includes time_value or version. If the user tries to set group_by so that it includes time_value or version , they get an error (either "Error in dplyr::group_by(): ! Must group by variables found in .data. ✖️ Column version is not found." or "Error in dplyr::group_modify(): ! The returned data frame cannot contain the original grouping variables: time_value."). So it's not possible to group by either of those fields.

We could implement our own error, checking if group_by includes either of those columns, to make the desired behavior explicit.

In any case, I don't think there's a concern about accidentally generating a ton of duplicate or unincluded rows.

With current behavior, toggling on all_rows takes the slide output (for whatever reference dates were specified), and adds missing (NA) values for all other group+time_value combinations that exist in the archive. The all_rows processing and the slideing use the same grouping variables to produce one obs per group (either for all groups or for a subset of groups). Since they each contain only unique groups and we're joining on the columns used to produce the groups, we can't create new/duplicate/etc obs.

I also noticed that when all_rows is on vs off the output is a different class -- on is data.table, off is tibble.

@nmdefries
Copy link
Contributor

when specifying a single ref time value and/or other situations... a lot of extra rows [are] added

I'm not seeing this. Do you have an example?

@brookslogan
Copy link
Contributor Author

TL;DR: there are some little things here and there making it hard to test/understand minimal examples here that I'd like to work out. We should maybe try to address those first and then revisit this to make sure there's no action needed. I'm still missing what the use cases we're trying to fulfill here are, which is part of why I am so confused.

Originally, I think I was misled by

                # If not a single row, should be the right length, else abort
                else if (nrow(comp_value) != count) {
                  Abort("If the slide computation returns a data frame, then it must have a single row, or else one row per appearance of the reference time value in the local window.")
                }

which seems to be inaccurate: the computation must return 1 row or a number of rows matching the number of distinct non-time_value-non-version-key values encountered. So we shouldn't expect duplicate key values out of this (and don't seem to get them even if we happen to trick the check by having a coincidental match in the number of time values and non-time-non-version key values and group by time_value, at least in my testing so far). But not sure if we need to be concerned about group_by vars that are selected from outside the key vars.

Maybe the major surprise here (that also makes this harder to debug) is that all_rows=TRUE doesn't work as you might expect with ref_time_values specified; one might expect the output to not include time_values outside of ref_time_values but this is not the case.

Additionally, constructing a minimal example is complicated by the silent filtering here:

            else {
              ref_time_values = ref_time_values[ref_time_values %in%
                                                unique(self$DT$time_value)]
            }

which we will likely want to change to raise an error if any of the ref_time_values is > x$versions_end. But all_rows will then also need updated to be resolve the above surprise, because otherwise all_rows would be throwing out the extra ref_time_values we'd be allowing here. Although maybe that's touching on the whole discussion in #163.

@brookslogan
Copy link
Contributor Author

brookslogan commented Aug 24, 2022

A couple random notes from trying to write out other issues:

  • Current all_rows=FALSE docs [for epix_slide] seem inaccurate when we group_by at least one column; this can have multiple rows per time value.
  • I'm guessing the purpose of all_rows=TRUE is to prevent the disappearance of some of the default ref_time_values due to having no data available as of one/more of these ref time values. It seems tailored to the default ref_time_values rather than the input ref_time_values. [Probably it was actually from trying to mirror epi_slide, where the all_rows name seems a bit more natural.]
  • [all_rows seems a little similar in flavor to group_by(drop=FALSE); maybe that's something to think about in/after https://github.com/group_by() for epi_archive objects #64.] [However, it looks like f does receive the empty snapshots, it's just that somehow the result row(s) might be lost?]

So I'm guessing this is a task to do:

  • Adjust behavior of all_rows when ref_time_values is provided. [This shouldn't introduce rows with time_values outside ref_time_values. I'm not sure whether this constraint fully specifies the behavior we should get, though.]

[all_rows on epi_slide (which is more like mutate) is more natural to understand, and is actually seems meant to be paired with manual ref_time_values to do computations on certain rows while keeping the other rows around (with old columns + missing computations).]

@brookslogan
Copy link
Contributor Author

brookslogan commented Oct 21, 2022

Even more notes:

  • epi_slide all_rows docs may have a bug: "otherwise, there will be one row for each time value in x that acts as a reference time value" --- this may not be true when one isn't grouping by all of the key variables; it may be one row per nongrouping key variable combination (including time_value). [Same issue with in the code comments as well.]
  • epix_slide may inherit / have a similar documentation bug.
  • I may just remove the broadcasting and all_rows from epix_slide as I think the primary use cases don't want broadcasting; the all_rows implementation is tied to broadcasting, doesn't make as much sense for a more summarize-like epix_slide, and might contribute to mixups between time_value and version (plus it involves one already, joining times to versions-labeled-times, probably cutting off the most recent computations in a different way than the ref_time_values filter). We might want to make a more mutate-like variant of epix_slide at some point, but this might require some heavier thought; see Make slide function broadcasting controllable. #227, group_by() for epi_archive objects #64; current workaround for a user missing info from the other columns in the slide window is to just add more columns containing that data to the result of their f function.

@brookslogan
Copy link
Contributor Author

In #290 the plan is to just remove all_rows from epix_slide.

@brookslogan brookslogan closed this as not planned Won't fix, can't repro, duplicate, stale Apr 21, 2023
@nmdefries
Copy link
Contributor

Closing associated PR #184.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
op-semantics Operational semantics; many potentially breaking changes here P1 medium priority
Projects
None yet
2 participants