Documentation Request: How do I drop empty groups? #4061
Comments
From reading the code, I think there is no drop argument... Converting to character was an option from the discussion in #341. Is filtering after the operation not enough, or not as simple as desired?
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
packageVersion("dplyr")
#> [1] '0.7.99.9000'
d <-
  data.frame(
    A=factor(
      rep(c("10 mg", "20 mg", "100 mg"), 4),
      levels=c("10 mg", "20 mg", "100 mg"),
      ordered=TRUE
    ),
    B=rep(factor(1:6), 2),
    C=1:12
  )
d %>%
  group_by(A, B) %>%
  summarize(
    Mean=mean(C)
  ) %>%
  filter(!is.nan(Mean))
#> # A tibble: 6 x 3
#> # Groups: A [3]
#> A B Mean
#> <ord> <fct> <dbl>
#> 1 10 mg 1 4
#> 2 10 mg 4 7
#> 3 20 mg 2 5
#> 4 20 mg 5 8
#> 5 100 mg 3 6
#> 6 100 mg 6 9
Created on 2018-12-31 by the reprex package (v0.2.1)
If we consider the previous behaviour incorrect because it automatically filtered out groups with no rows in the data, causing missing data, then filtering manually when needed, now that groups are preserved, seems like the right move. I believe that custom functions that assume non-empty groups could return What do you think? |
If this is the only method for getting close to prior behavior, my request is that it be documented (in a way that doesn't require reading the code).
Converting to character is mostly reasonable for unordered factors, but it loses information for ordered factors. Sometimes setting the level order is nontrivial. (All of these are possible, but would need
I have a difference of opinion about the previous behavior being incorrect/correct; I think that they are simply filling two different needs. The prior method fills the need for a summary of each group where data exists when all combinations are not expected (this is my common use case described above), and the future behavior helps elucidate missing groups when all combinations are expected. Previously, when I needed the future use case, I integrated explicit missing rows, but I acknowledge that solution isn't right or best for everyone.
I'm not the package author, so I'm not advocating that my opinion must be followed; I'm trying to understand the new coding method and how to work with it or around it. Since my most common use case is that all combinations are not expected, I will need to rewrite all R functions used for summarization and then filter based on no data. Also, since NA and NaN can each exist separately within my data, I would need to add a number-of-observations column to filter on instead. This is possible but onerous:
min_maybe_none <- function(x, ...)
  UseMethod("min_maybe_none")
min_maybe_none.numeric <- function(x, ...) {
  if (length(x) == 0) {
    NA_real_
  } else {
    min(x, ...)
  }
}
# Repeat for logical, factor, character so that you get the correct type of NA output for each class.
# Repeat all of the above for max, median, mean, sd, ...
I would need to do this because I often have >50 groups, and with standard warning reporting I would not see unexpected warnings; standard warning reporting would just show warnings about the empty groups. |
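The "number of observations column" mentioned above can be sketched as follows. This is only an illustration, assuming a dplyr version (0.8.0 or later) in which group_by() accepts a `.drop` argument; `.drop = FALSE` is used here solely to make the empty factor combinations appear, and the column name `N` is a hypothetical choice.

```r
library(dplyr)

d <- data.frame(
  A = factor(
    rep(c("10 mg", "20 mg", "100 mg"), 4),
    levels = c("10 mg", "20 mg", "100 mg"),
    ordered = TRUE
  ),
  B = rep(factor(1:6), 2),
  C = 1:12
)

res <- d %>%
  # Assumption: .drop = FALSE keeps every A x B factor combination, even empty ones
  group_by(A, B, .drop = FALSE) %>%
  summarize(
    N = n(),          # row count per group; 0 for empty groups
    Mean = mean(C)    # NaN for empty groups
  ) %>%
  ungroup() %>%
  # Filter on the count, so genuine NA/NaN results in non-empty groups survive
  filter(N > 0)
```

Filtering on the count rather than on `is.nan(Mean)` keeps groups whose summary is legitimately NA or NaN, at the cost of carrying one extra column through the pipeline.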
I had a few ideas overnight of how the prior behavior could be achieved. I will reiterate the ones above in addition to the new ideas so that the list can be (semi-)comprehensive.
Additional Discussion of Specific Options
Option 1
Conversion to character strings and back could be simplified by implementation of something like tidyverse/forcats#144 and maybe a data.frame mirroring method like the following (typed directly into GitHub; there could be bugs):
fct_mirror.data.frame <- function(.f, .x) {
  # Find factor columns in the original data that are also in the current data
  matching_cols <- intersect(colnames(.f), colnames(.x)[sapply(X=.x, FUN=is.factor)])
  for (current_col in matching_cols) {
    if (!is.factor(.f[[current_col]])) {
      .f[[current_col]] <- fct_mirror(.f[[current_col]], .x[[current_col]])
    }
  }
  .f
}
Option 2
For 2, the simplest would be if grouping of factors is detected by something simple like
as.hidden_factor <- function(x)
  UseMethod("as.hidden_factor")
as.hidden_factor.factor <- function(x) {
  class(x) <- unique(c("hidden_factor", class(x)))
  x
}
as.hidden_factor.default <- function(x) {
  message("No need to convert x to a hidden_factor")
  x
}
# Illustrative only: base is.factor() and as.factor() are not S3 generics,
# so the two methods below would not be dispatched as written.
is.factor.hidden_factor <- function(x) FALSE
as.factor.hidden_factor <- function(x) {
  class(x) <- setdiff(class(x), "hidden_factor")
  x
}
That would break some other applications looking for factors, but it or something similar could result in the same behaviour as the prior version. |
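For comparison, the manual character round trip that Option 1 aims to simplify might look like the sketch below. It uses only existing base R and dplyr calls; the helper variables `a_levels`, `a_ordered`, and `b_levels` are hypothetical names introduced for illustration, and this is not the `fct_mirror()` proposal itself.

```r
library(dplyr)

d <- data.frame(
  A = factor(
    rep(c("10 mg", "20 mg", "100 mg"), 4),
    levels = c("10 mg", "20 mg", "100 mg"),
    ordered = TRUE
  ),
  B = rep(factor(1:6), 2),
  C = 1:12
)

# Record each factor's definition before converting it to character
a_levels <- levels(d$A)
a_ordered <- is.ordered(d$A)
b_levels <- levels(d$B)

res <- d %>%
  # Character grouping columns only produce groups that exist in the data
  mutate(A = as.character(A), B = as.character(B)) %>%
  group_by(A, B) %>%
  summarize(Mean = mean(C)) %>%
  ungroup() %>%
  # Restore the original level order and ordered-ness afterwards
  mutate(
    A = factor(A, levels = a_levels, ordered = a_ordered),
    B = factor(B, levels = b_levels)
  ) %>%
  arrange(A, B)
```

The bookkeeping of levels and ordered-ness for every grouping column is the tracking burden described in the issue; it grows with each additional factor.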
Is it possible to implement an option for
The empty groups don't become a problem when I feel something like this
d %>%
  group_by(A, B) %>%
  summarize(
    Mean=mean(C),
    .skip_empty = TRUE
  )
is safer than this (or other ways using some sentinel values to represent empty groups):
d %>%
  group_by(A, B) %>%
  summarize(
    Mean=mean(C)
  ) %>%
  filter(!is.nan(Mean))
(I also want other verbs like |
@yutannihilation, If applied to all verbs, something like As it relates to the data, If applied at the grouping level, |
I met with @romainfrancois this morning and we're going to add |
(@yutannihilation I don't think |
This is on my list for tomorrow morning. I guess I'm not sure how this impacts the group_by that are implicit, e.g. a group by that is due to a |
@hadley and @romainfrancois, Thank you!!! @romainfrancois, for Another intermediate case would be when |
@romainfrancois, if you implement the Two options that occur to me are:
|
Cool! I agree that |
To preserve backward compatibility, can the default be |
Nope ;-) We still believe that what |
@romainfrancois Perfectly reasonable, I appreciate the accommodation! P.S. I was wondering if I was asking too much there, but I'm a strong believer that you never get anything that you don't ask for. |
Hello @hadley and @romainfrancois I am planning how/when to modify my package to ensure it is compatible with dplyr 0.8 (currently many automated tests are failing against the RC). Adding @romainfrancois - I also emailed you directly (not sure if you prefer such questions here or via email?) - sorry for the duplication. |
Actually, we reverted the default behaviour so that The fix is in this PR, which hopefully I should merge today or this week. |
actually @billdenney we made |
@romainfrancois, Thanks! (And, cool new hair! :)) |
This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/ |
In a large amount of code across both my PKNCA package and data analysis projects, my code assumes that groups have at least one row. I understand that the behavior is by design, but the documentation does not suggest any methods to get back to the prior method of dropping empty groups. I typically use filter() to remove groups not of interest and group_by() to enable various applications of analysis on groups which are filled. The only method I can see to get back to the prior behavior is to convert every factor to character and then convert back at the end, which makes it pretty intensive to track which columns have been modified and need to be modified back.
For most of my use cases, this makes use of dplyr with factor grouping very difficult. A common use case for me has treatments and subjects in clinical trials described by ordered and unordered factors, respectively (reprex at the bottom).
Is there any method to revert back to the original behaviour? Specifically, in #341, drop = FALSE was suggested, but if it exists, it's not documented in the help page for group_by(). The .preserve argument of filter() doesn't suggest that it does the same, but it looks close. The documentation of the drop argument for grouped_df() simply says "deprecated", suggesting it's not the right fix.
The compatibility vignette (https://github.com/tidyverse/dplyr/blob/master/vignettes/compatibility.Rmd) appears to be the place something like this should be documented, but I don't see it there either.
Created on 2018-12-31 by the reprex package (v0.2.0).