Feature request: function to remove unpopulated groups #4084

Closed
jepusto opened this issue Jan 6, 2019 · 2 comments

jepusto commented Jan 6, 2019

dplyr 0.8.0 introduces a breaking change in how group_by treats factor variables. As described in the release notes, when the grouping variables include factors, the grouped data frame now contains a group for every combination of the factor levels, including combinations that do not appear in the data. This change has evidently been under discussion for quite some time (cf. issue #341), and I understand that there are many situations where it is very useful to preserve zero-length groups. However, I think there are also situations where it is helpful to retain the behavior from previous versions, such as the one described in issue #4061.

Would it be possible to either A) provide a function such as strip_empty_groups that drops groups with no data or B) include an option in group_by for back-compatibility? A situation that I am puzzling over how to handle is when my data is grouped on multiple factors, but the levels are nested rather than crossed. Here's a simple example:

library(dplyr) # version 0.8.0.9000

df <- tibble(
  f1 = factor(rep(letters[1:2], each = 2^3)),
  f2 = factor(rep(letters[3:6], each = 2^2)),
  f3 = factor(rep(letters[7:14], each = 2)),
  x = rnorm(2^4)
)

df %>%
  group_by(f1, f2, f3) %>%
  summarise(x = mean(x))

#> # A tibble: 64 x 4
#> # Groups:   f1, f2 [8]
#>    f1    f2    f3           x
#>    <fct> <fct> <fct>    <dbl>
#>  1 a     c     g       -0.309
#>  2 a     c     h       -0.226
#>  3 a     c     i      NaN    
#>  4 a     c     j      NaN    
#>  5 a     c     k      NaN    
#>  6 a     c     l      NaN    
#>  7 a     c     m      NaN    
#>  8 a     c     n      NaN    
#>  9 a     d     g      NaN    
#> 10 a     d     h      NaN    
#> # ... with 54 more rows

Created on 2019-01-06 by the reprex package (v0.2.1)

The result is a grouped_df with 64 rows, four times as long as the 16-row original, because the factor levels have all been crossed. The new group_trim function does not help here (it has no effect at all) because every level of each factor is populated somewhere in the data frame.
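The row count follows directly from the number of levels per factor. Here is a minimal base-R sketch, assuming the same factor layout as df above (the x column is left out because it does not affect the count; the names n_crossed and n_observed are only for illustration):

```r
# Same nested factor layout as in the example above.
f1 <- factor(rep(letters[1:2], each = 2^3))
f2 <- factor(rep(letters[3:6], each = 2^2))
f3 <- factor(rep(letters[7:14], each = 2))

# Number of groups when every level is crossed with every other level.
n_crossed <- nlevels(f1) * nlevels(f2) * nlevels(f3)

# Number of level combinations that actually occur in the data.
n_observed <- nrow(unique(data.frame(f1, f2, f3)))

n_crossed   # 64
n_observed  # 8

# Every level of f3 is used somewhere, so dropping unused levels
# changes nothing.
identical(levels(droplevels(f3)), levels(f3))  # TRUE
```

Only 8 of the 64 crossed combinations are populated, yet no individual factor has an unused level, so dropping unused levels factor by factor leaves everything in place.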

For this specific example it is easy enough to get back to the original behavior of group_by by converting the factors to character vectors:

df %>%
  mutate_at(vars(f1:f3), as.character) %>%
  group_by(f1, f2, f3) %>%
  summarise(x = mean(x))
#> # A tibble: 8 x 4
#> # Groups:   f1, f2 [4]
#>   f1    f2    f3         x
#>   <chr> <chr> <chr>  <dbl>
#> 1 a     c     g     -0.309
#> 2 a     c     h     -0.226
#> 3 a     d     i      0.441
#> 4 a     d     j      0.759
#> 5 b     e     k      0.730
#> 6 b     e     l      0.778
#> 7 b     f     m     -0.992
#> 8 b     f     n     -0.402

Created on 2019-01-06 by the reprex package (v0.2.1)

This might be fine for one-off analysis scripts, but it does have downsides. Coercing to character loses information from the factor levels (like ordering). And if I later need to join my summarized table with some other results, then I have to deal with type-compatibility of the keys. Would it be acceptable to have something like one of the following?

df %>%
  group_by(f1, f2, f3) %>%
  strip_empty_groups() %>%
  summarise(x = mean(x))

df %>%
  group_by(f1, f2, f3, .drop = TRUE) %>%
  summarise(x = mean(x))
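
In the meantime, one workaround that keeps the factor columns might be to count the rows in each group and discard the empty groups after summarising. This is only a sketch, not an existing dplyr feature: the helper column n, the result name res, and the seed are made up for illustration.

```r
library(dplyr)

set.seed(1)  # arbitrary seed, only so the sketch is reproducible

df <- tibble(
  f1 = factor(rep(letters[1:2], each = 2^3)),
  f2 = factor(rep(letters[3:6], each = 2^2)),
  f3 = factor(rep(letters[7:14], each = 2)),
  x = rnorm(2^4)
)

res <- df %>%
  group_by(f1, f2, f3) %>%
  # Record the group size alongside the summary statistic.
  summarise(x = mean(x), n = n()) %>%
  ungroup() %>%
  # Keep only groups that contain at least one row, then drop the helper.
  filter(n > 0) %>%
  select(-n)

nrow(res)       # 8
class(res$f3)   # "factor"
```

Filtering on the group size rather than on is.nan(x) avoids throwing away populated groups whose mean happens to be NaN. If group_by already drops empty groups, the filter simply keeps every row.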

hadley commented Jan 7, 2019

Duplicate of #4061

@hadley hadley marked this as a duplicate of #4061 Jan 7, 2019
@hadley hadley closed this as completed Jan 7, 2019

lock bot commented Jul 7, 2019

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

@lock lock bot locked and limited conversation to collaborators Jul 7, 2019