dplyr 0.8.0 introduces a breaking change in how group_by treats factor variables. As described in the release notes, data frames whose grouping variables include factors now get a group for every unique combination of the factor levels, including combinations that do not appear in the data. This change has evidently been under discussion for quite some time (cf. issue #341), and I understand that there are many situations where it is very useful to preserve zero-length groups. However, I think there are also situations where it is helpful to retain the behavior from previous versions, such as the one described in issue #4061.
Would it be possible to either A) provide a function such as strip_empty_groups that drops groups with no data or B) include an option in group_by for back-compatibility? A situation that I am puzzling over how to handle is when my data is grouped on multiple factors, but the levels are nested rather than crossed. Here's a simple example:
library(dplyr) # version 0.8.0.9000

df <- tibble(
  f1 = factor(rep(letters[1:2], each = 2^3)),
  f2 = factor(rep(letters[3:6], each = 2^2)),
  f3 = factor(rep(letters[7:14], each = 2)),
  x = rnorm(2^4)
)

df %>%
  group_by(f1, f2, f3) %>%
  summarise(x = mean(x))
#> # A tibble: 64 x 4
#> # Groups:   f1, f2 [8]
#>    f1    f2    f3          x
#>    <fct> <fct> <fct>   <dbl>
#>  1 a     c     g      -0.309
#>  2 a     c     h      -0.226
#>  3 a     c     i     NaN
#>  4 a     c     j     NaN
#>  5 a     c     k     NaN
#>  6 a     c     l     NaN
#>  7 a     c     m     NaN
#>  8 a     c     n     NaN
#>  9 a     d     g     NaN
#> 10 a     d     h     NaN
#> # ... with 54 more rows
The result is a grouped_df that is four times longer than the original, because the factors have all been crossed. The new group_trim function does not help here (it has no effect at all) because each of the factors has fully populated levels if one looks across the entire data frame.
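To make the arithmetic concrete, here is a minimal base-R sketch that counts the crossed combinations against the ones that actually occur, using the same nested layout as the example. The object names (`keys`, `crossed`, `observed`, `unused`) are only illustrative, and the `droplevels()` check is my reading of why trimming unused levels has nothing to remove, not a description of dplyr internals.

```r
# Same nested factor layout as the example (x is omitted; only the keys matter here).
keys <- data.frame(
  f1 = factor(rep(letters[1:2], each = 2^3)),
  f2 = factor(rep(letters[3:6], each = 2^2)),
  f3 = factor(rep(letters[7:14], each = 2))
)

# Number of groups when all factor levels are crossed: 2 * 4 * 8.
crossed <- prod(vapply(keys, nlevels, integer(1)))

# Number of level combinations that actually occur in the data.
observed <- nrow(unique(keys))

# Check each factor on its own for unused levels. None has any, so
# dropping unused levels factor by factor cannot shrink the crossed grid.
unused <- vapply(keys, function(f) nlevels(droplevels(f)) < nlevels(f), logical(1))

crossed
observed
unused
```

The empty groups come entirely from the crossing of fully used factors, which is why they survive any per-factor cleanup.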
For this specific example it is easy enough to get back to the original behavior of group_by by converting the factors to character vectors:
df %>%
  mutate_at(vars(f1:f3), as.character) %>%
  group_by(f1, f2, f3) %>%
  summarise(x = mean(x))
#> # A tibble: 8 x 4
#> # Groups:   f1, f2 [4]
#>   f1    f2    f3         x
#>   <chr> <chr> <chr>  <dbl>
#> 1 a     c     g     -0.309
#> 2 a     c     h     -0.226
#> 3 a     d     i      0.441
#> 4 a     d     j      0.759
#> 5 b     e     k      0.730
#> 6 b     e     l      0.778
#> 7 b     f     m     -0.992
#> 8 b     f     n     -0.402
This might be fine for one-off analysis scripts, but it does have downsides. Coercing to character loses information from the factor levels (like ordering). And if I later need to join my summarized table with some other results, then I have to deal with type-compatibility of the keys. Would it be acceptable to have something like one of the following?
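In the meantime, a stopgap that keeps the factor columns is to carry the group size through the summary and filter on it afterwards. This is only a sketch of a workaround, not one of the proposals: the `n` column and the `res` name are my own, and the explicit `.drop = FALSE` is an assumption that the installed dplyr is a released 0.8.0 or later, where `group_by()` has that argument. It is there only to request the crossed grouping explicitly.

```r
library(dplyr)

df <- tibble(
  f1 = factor(rep(letters[1:2], each = 2^3)),
  f2 = factor(rep(letters[3:6], each = 2^2)),
  f3 = factor(rep(letters[7:14], each = 2)),
  x = rnorm(2^4)
)

res <- df %>%
  # Assumption: `.drop` is available (released dplyr 0.8.0 and later).
  # FALSE keeps a group for every crossed combination of factor levels.
  group_by(f1, f2, f3, .drop = FALSE) %>%
  # Record the group size alongside the mean; empty groups get n = 0.
  summarise(x = mean(x), n = n()) %>%
  ungroup() %>%
  # Keep only combinations that actually occur in the data.
  filter(n > 0)

res
```

The keys stay factors with their original levels, so ordering is preserved and a later join does not need any type coercion. The cost is an extra column and a filter step in every pipeline that summarises nested factors.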
This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/
lockbot locked and limited conversation to collaborators on Jul 7, 2019.
Created on 2019-01-06 by the reprex package (v0.2.1)