The case for filter(.missing = NULL, .how = c("keep", "drop"))
#6891
Comments
We can also lean into the ambiguity of
Or maybe df %>% filter(.for = c(TRUE, NA))
Although that means the equivalent of
The
A key idea while thinking about this is realizing that both
The common behavior between them is that both of them treat
The only algorithmic difference between them is whether or not we use a
Then:
keep_rows(df, a, b) == df[result,]
drop_rows(df, a, b) == df[!result,]
This gives a nice theoretical result where the following rbind recreates
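As a rough illustration of the relationship above, here is a minimal base R sketch. It assumes that result is the combined condition with NA treated as FALSE, and it computes both outputs by direct indexing, because keep_rows() and drop_rows() are only proposals in this thread and do not exist in dplyr. The data and the names keep_out and drop_out are made up for the example.

```r
# Assumption: `result` is the `&`-combined condition with NA treated as FALSE.
# keep_rows() and drop_rows() are proposals only, so plain indexing stands in.
df <- data.frame(
  id = 1:5,
  a = c(TRUE, TRUE, FALSE, NA, TRUE),
  b = c(TRUE, FALSE, TRUE, TRUE, NA)
)

result <- df$a & df$b
result[is.na(result)] <- FALSE

keep_out <- df[result, ]   # what keep_rows(df, a, b) would return
drop_out <- df[!result, ]  # what drop_rows(df, a, b) would return

# Every row of df lands in exactly one of the two pieces, so binding them
# back together recovers the original rows (up to row order).
recombined <- rbind(keep_out, drop_out)
recombined <- recombined[order(recombined$id), ]
```

Because the NA rows are folded into FALSE before indexing, they end up in drop_out rather than disappearing from both pieces.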
The theoretical
The full grid of possible things to "keep" are:
So only 4 of these seem needed - the 4 that go with "Keep NA only" and "Drop NA only" aren't actually as useful as they might appear, because they require that the columns are already logical vectors to work. i.e.
I think this also reveals that the signatures for
keep_rows(data, ..., missing = c("drop", "keep", "error"))
drop_rows(data, ..., missing = c("keep", "drop", "error"))
which do what you'd expect them to do by default
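One possible reading of those two signatures, sketched in base R. Everything here is an assumption for illustration: the helpers take a precomputed logical vector (result) instead of tidy-eval conditions, and the meaning given to each missing option ("drop" and "keep" describing what happens to rows whose condition is NA) is inferred from the defaults rather than confirmed anywhere in the thread.

```r
# Hypothetical stand-ins: not dplyr functions, and `result` replaces `...`.
keep_rows <- function(data, result, missing = c("drop", "keep", "error")) {
  missing <- match.arg(missing)
  if (missing == "error" && anyNA(result)) {
    stop("The condition contains missing values.")
  }
  # "drop" treats NA like FALSE; "keep" treats NA like TRUE
  result[is.na(result)] <- (missing == "keep")
  data[result, , drop = FALSE]
}

drop_rows <- function(data, result, missing = c("keep", "drop", "error")) {
  missing <- match.arg(missing)
  if (missing == "error" && anyNA(result)) {
    stop("The condition contains missing values.")
  }
  # "keep" treats NA like FALSE (row survives); "drop" treats NA like TRUE
  result[is.na(result)] <- (missing == "drop")
  data[!result, , drop = FALSE]
}

df <- data.frame(id = 1:3)
result <- c(TRUE, FALSE, NA)

keep_default <- keep_rows(df, result)                   # NA row is not kept
keep_with_na <- keep_rows(df, result, missing = "keep") # NA row is kept
drop_default <- drop_rows(df, result)                   # NA row is not dropped
drop_with_na <- drop_rows(df, result, missing = "drop") # NA row is dropped
```

Under this reading the two defaults are mirror images: in both verbs a missing condition behaves like FALSE unless the caller opts in.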
What about a
Another confusion btw sometimes comes from how
filter(
  .data,
  ...,
  .missing = NULL,
  .how = c("keep", "drop")
)
This signature may work, where
This way when you are trying to drop rows, it still only drops rows where the condition is
I like that that value of
I also like that we use the same
There have been quite a few requests in the past for an "anti filter", i.e. I want to specify a set of conditions that determine which rows to drop. Additionally, it has traditionally been somewhat difficult to explain that filter() is about specifying rows to keep; that isn't really explained clearly in the verb name. We've also seen in the past that it is mildly confusing that select() is about columns and filter() is about rows; again, there isn't anything in the verb names to describe the difference.
One thing we could consider doing is to add two new very explicit verbs:
keep_rows(data, ..., by = , missing = )
drop_rows(data, ..., by = , missing = )
Where keep_rows() is equivalent to filter(), and drop_rows() is the opposite.
To be very clear, filter() would never disappear. We would, however, consider superseding it in favor of these if they prove to be successful, which really only means we'd start using them in docs and workshops instead of filter(). We'd even consider not superseding filter() at all, since many people find that scary. Instead we'd just be aliasing keep_rows() as filter().
The biggest annoyance when writing a "drop" style expression with filter() is that you first have to write a "keep" expression and then painfully invert it. i.e.:
"drop rows from df where a and b and c are TRUE"
Note that even the seemingly correct "drop" expression is actually wrong when it comes to handling missing values. It is fairly hard to get this right.
The drop_rows() version would be:
Where NA isn't considered something you "drop" by default, but would be if missing was tweaked to whatever we decide means "treat a missing value like TRUE".
A few other notes:
- missing is from "filter(.missing = ) option to optionally retain missing values" (#6560) and controls how missing values are treated. By default, both functions would treat an NA as FALSE (i.e. missing values are never kept or dropped), but could be made to treat them as TRUE or an error. Though I don't think missing = c("keep", "drop", "error") works uniformly for both verbs, so we'd need to think of another parameterization.
- These would work with if_all() and if_any(), which I think form nice natural sentences. "drop rows if any are NA" sounds pretty good for drop_rows(df, if_any(c(a, b), is.na)). That is like tidyr::drop_na().
- These would not work with across(), which we have been deprecating from filter() for a little while now.
- Multiple conditions would be combined with &, as that is typically the natural way to combine multiple conditions, and you can always get | behavior with either an explicit | or by using multiple calls to the function. i.e. df %>% drop_rows(x > 5 | y > 6) is the same as df %>% drop_rows(x > 5) %>% drop_rows(y > 6) (and you can't do that split trick with &). if_any() can also work for | when you need to apply the same function to multiple columns.
- by
Some issues and questions related to this: