Feature request: function to remove unpopulated groups #4084

jepusto · 2019-01-06T21:51:47Z

dplyr 0.8.0 introduces a breaking change in how group_by treats factor variables. As described in the release notes, when the grouping variables include factors, the grouped data frame now contains a group for every combination of the factor levels, including combinations that do not appear in the data. This change has evidently been under discussion for quite some time (cf. issue #341), and I understand that there are many situations where it is very useful to preserve zero-length groups. However, I think there are also situations where it is helpful to retain the behavior from previous versions, such as the one described in issue #4061.

Would it be possible to either A) provide a function such as strip_empty_groups that drops groups with no data or B) include an option in group_by for back-compatibility? A situation that I am puzzling over how to handle is when my data is grouped on multiple factors, but the levels are nested rather than crossed. Here's a simple example:

library(dplyr) # version 0.8.0.9000

df <- tibble(
  f1 = factor(rep(letters[1:2], each = 2^3)),
  f2 = factor(rep(letters[3:6], each = 2^2)),
  f3 = factor(rep(letters[7:14], each = 2)),
  x = rnorm(2^4)
)

df %>%
  group_by(f1, f2, f3) %>%
  summarise(x = mean(x))

#> # A tibble: 64 x 4
#> # Groups:   f1, f2 [8]
#>    f1    f2    f3           x
#>    <fct> <fct> <fct>    <dbl>
#>  1 a     c     g       -0.309
#>  2 a     c     h       -0.226
#>  3 a     c     i      NaN    
#>  4 a     c     j      NaN    
#>  5 a     c     k      NaN    
#>  6 a     c     l      NaN    
#>  7 a     c     m      NaN    
#>  8 a     c     n      NaN    
#>  9 a     d     g      NaN    
#> 10 a     d     h      NaN    
#> # ... with 54 more rows

Created on 2019-01-06 by the reprex package (v0.2.1)

The result is a grouped_df with 64 rows, four times as long as the original 16-row data frame and eight times the 8 level combinations that actually occur, because the factors have all been crossed. The new group_trim function does not help here (it has no effect at all) because each factor has fully populated levels when one looks across the entire data frame.
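To make the counts concrete, here is a minimal base-R sketch (no dplyr needed) of the same nested layout. It uses `droplevels()` as a stand-in for dropping unused factor levels, which is an assumption about the analogy with group_trim rather than a description of dplyr internals; `keys` and the `n_*` names are introduced only for this illustration.

```r
# Same nested factor layout as in the example above (base R only).
keys <- data.frame(
  f1 = factor(rep(letters[1:2], each = 2^3)),
  f2 = factor(rep(letters[3:6], each = 2^2)),
  f3 = factor(rep(letters[7:14], each = 2))
)

# Number of groups if all factor levels are crossed.
n_crossed <- prod(vapply(keys, nlevels, integer(1)))

# Number of level combinations that actually occur in the data.
n_observed <- nrow(unique(keys))

# Every level of every factor is used somewhere, so dropping unused
# levels changes nothing and the crossed count stays the same.
n_after_drop <- prod(vapply(droplevels(keys), nlevels, integer(1)))

print(c(crossed = n_crossed, observed = n_observed, after_drop = n_after_drop))
```

The gap between the crossed and observed counts comes entirely from the nesting, not from unused levels, which is why trimming levels cannot close it.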

For this specific example it is easy enough to get back to the original behavior of group_by by converting the factors to character vectors:

df %>%
  mutate_at(vars(f1:f3), as.character) %>%
  group_by(f1, f2, f3) %>%
  summarise(x = mean(x))
#> # A tibble: 8 x 4
#> # Groups:   f1, f2 [4]
#>   f1    f2    f3         x
#>   <chr> <chr> <chr>  <dbl>
#> 1 a     c     g     -0.309
#> 2 a     c     h     -0.226
#> 3 a     d     i      0.441
#> 4 a     d     j      0.759
#> 5 b     e     k      0.730
#> 6 b     e     l      0.778
#> 7 b     f     m     -0.992
#> 8 b     f     n     -0.402

Created on 2019-01-06 by the reprex package (v0.2.1)

This might be fine for one-off analysis scripts, but it does have downsides. Coercing to character loses information from the factor levels (like ordering). And if I later need to join my summarized table with some other results, then I have to deal with type-compatibility of the keys. Would it be acceptable to have something like one of the following?

df %>%
  group_by(f1, f2, f3) %>%
  strip_empty_groups() %>%
  summarise(x = mean(x))

df %>%
  group_by(f1, f2, f3, .drop = TRUE) %>%
  summarise(x = mean(x))
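For illustration, option A could be approximated in user code, assuming a dplyr version whose `group_by()` accepts a `.drop` argument (that is an assumption about the installed version; the development version used above may behave differently). The helper below is hypothetical and not part of dplyr: it regroups on the same variables with `.drop = TRUE`, so only combinations present in the data remain and the keys stay factors. The example passes `.drop = FALSE` explicitly to reproduce the crossed grouping.

```r
library(dplyr)

# Hypothetical helper (not a dplyr function): rebuild the grouping on the
# same variables, keeping only level combinations that occur in the data.
# Assumes a dplyr version whose group_by() has a `.drop` argument.
strip_empty_groups <- function(.data) {
  vars <- groups(.data)  # grouping variables as a list of symbols
  .data %>%
    ungroup() %>%
    group_by(!!!vars, .drop = TRUE)
}

df <- tibble(
  f1 = factor(rep(letters[1:2], each = 2^3)),
  f2 = factor(rep(letters[3:6], each = 2^2)),
  f3 = factor(rep(letters[7:14], each = 2)),
  x = rnorm(2^4)
)

# Force the crossed behaviour described above, then strip the empty groups.
crossed <- df %>% group_by(f1, f2, f3, .drop = FALSE)
stripped <- strip_empty_groups(crossed)

n_groups(crossed)   # counts every combination of factor levels
n_groups(stripped)  # counts only combinations present in the data

# The summary keeps f1, f2, f3 as factors, so level order is preserved.
res <- stripped %>% summarise(x = mean(x))
```

Regrouping rather than filtering keeps the helper independent of what the summary computes, at the cost of rebuilding the group index once.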

hadley · 2019-01-07T13:15:16Z

Duplicate of #4061

lock · 2019-07-07T00:10:38Z

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

romainfrancois mentioned this issue Jan 7, 2019

rework breaking changes section #4077

Closed

hadley marked this as a duplicate of #4061 Jan 7, 2019

hadley closed this as completed Jan 7, 2019

lock bot locked and limited conversation to collaborators Jul 7, 2019
