Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The case for filter(.missing = NULL, .how = c("keep", "drop")) #6891

Open
DavisVaughan opened this issue Jul 24, 2023 · 4 comments
Open

The case for filter(.missing = NULL, .how = c("keep", "drop")) #6891

DavisVaughan opened this issue Jul 24, 2023 · 4 comments
Labels
feature a feature request or enhancement rows ↕️ Operations on rows: filter(), slice(), arrange() verbs 🏃‍♀️

Comments

@DavisVaughan
Copy link
Member

DavisVaughan commented Jul 24, 2023

There have been quite a few requests in the past for an "anti filter", i.e. I want to specify a set of conditions that determine which rows to drop. Additionally, it has traditionally been somewhat difficult to explain that filter() is about specifying rows to keep; that isn't really explained clearly in the verb name. Also, we've also seen in the past that it is mildly confusing that select() is about columns and filter() is about rows, again, there isn't anything in the verb names to describe the difference.


One thing we could consider doing is to add two new very explicit verbs:

  • keep_rows(data, ..., by = , missing = )
  • drop_rows(data, ..., by = , missing = )

Where keep_rows() is equivalent to filter(), and drop_rows() is the opposite.

To be very clear, filter() would never disappear. We would, however, consider superseding it in favor of these if they prove to be successful, which really only means we'd start using them in docs and workshops instead of filter(). We'd even consider not even superseding filter(), which many people find scary. Instead we'd just be aliasing keep_rows() as filter().

The biggest annoyance when writing a "drop" style expression with filter() is that you first have to write a "keep" expression and then painfully invert it. i.e.:

"drop rows from df where a and b and c are TRUE"

library(dplyr)

df <- tibble(
  id = c(1, 2, 3),
  a = c(TRUE, NA, TRUE),
  b = c(FALSE, TRUE, TRUE),
  c = c(TRUE, TRUE, TRUE)
)
df
#> # A tibble: 3 × 4
#>      id a     b     c    
#>   <dbl> <lgl> <lgl> <lgl>
#> 1     1 TRUE  FALSE TRUE 
#> 2     2 NA    TRUE  TRUE 
#> 3     3 TRUE  TRUE  TRUE

# "keep rows where a and b and c are TRUE"
# this nicely drops NAs because they don't match our specified criteria
filter(df, a, b, c)
#> # A tibble: 1 × 4
#>      id a     b     c    
#>   <dbl> <lgl> <lgl> <lgl>
#> 1     3 TRUE  TRUE  TRUE

# "drop rows where a and b and c are TRUE"
# this AWFULLY drops NAs because of how filter() works!
# NA doesn't match our criteria so shouldnt be seen as something to drop
filter(df, !(a & b & c))
#> # A tibble: 1 × 4
#>      id a     b     c    
#>   <dbl> <lgl> <lgl> <lgl>
#> 1     1 TRUE  FALSE TRUE

Note that even the seemingly correct "drop" expression is actually wrong when it comes to handling missing values. It is fairly hard to get this right.

The drop_rows() version would be:

df %>% drop_rows(a, b, c)
#> # A tibble: 2 × 4
#>      id a     b     c    
#>   <dbl> <lgl> <lgl> <lgl>
#> 1     1 TRUE  FALSE TRUE 
#> 2     2 NA    TRUE  TRUE 

Where NA isn't considered something you "drop" by default, but would be if missing was tweaked to whatever we decide means "treat a missing value like TRUE".


A few other notes:

  • missing is from filter(.missing = ) option to optionally retain missing values #6560 and controls how missing values are treated. By default, both functions would treat an NA as FALSE (i.e. missing values are never kept or dropped), but could be made to treat them as TRUE or an error. Though I don't think missing = c("keep", "drop", "error") works uniformly for both verbs so we'd need to think of another parameterization.
  • Both functions would support if_all() and if_any(), which I think form nice natural sentences. "drop rows if any are NA" sounds pretty good for drop_rows(df, if_any(c(a, b), is.na)). That is like tidyr::drop_na().
  • Neither would support across(), which we have been deprecating from filter() for a little while now.
  • Both functions would combine multiple conditions using &, as that is typically the natural way to combine multiple conditions and you can always get | behavior with either an explicit | or by using multiple calls to the function. i.e. df %>% drop_rows(x > 5 | y > 6) is the same as df %>% drop_rows(x > 5) %>% drop_rows(y > 6) (and you can't do that split trick with &). if_any() can also work for | when you need to apply the same function to multiple columns.
  • Both functions would support by

Some issues and questions related to this:

@DavisVaughan DavisVaughan added feature a feature request or enhancement verbs 🏃‍♀️ rows ↕️ Operations on rows: filter(), slice(), arrange() labels Jul 24, 2023
@DavisVaughan
Copy link
Member Author

DavisVaughan commented Jul 24, 2023

We can also lean into the ambiguity of filter() and think about adding an argument like filter(.for = c(TRUE, FALSE, NA)) (which may also partly (completely?) handle #6560)

Or maybe

df %>% filter(.for = c(TRUE, NA))

Although that means the equivalent of df %>% drop_rows(a, b, c) is df %>% filter(a, b, c, .for = c(FALSE, NA)) which I think is still too hard to remember

The .for argument means you can technically do all 8 possible combinations of things to "keep", but I prove in the table in the comment below that that isn't actually useful.

@DavisVaughan
Copy link
Member Author

DavisVaughan commented Jul 24, 2023

A key idea while thinking about this is realizing that both keep_rows(df, a, b) and drop_rows(df, a, b) would apply the exact same internal algorithm to combine a, b into a single logical result vector.

The common behavior between them is that both of them treat NA as FALSE by default. But:

  • For keep_rows() that means NA is dropped
  • For drop_rows() that means NA is kept

The only algorithmic difference between them is whether or not we use a ! at the very end before slicing.

a b result
T T T
F F F
NA NA F
T F F
F T F
T NA F
NA T F
F NA F
NA F F

Then:

keep_rows(df, a, b) == df[result,]
drop_rows(df, a, b) == df[!result,]

This gives a nice theoretical result where the following rbind recreates df (ignoring row ordering) even when NAs are in the mix:

rbind(
  keep_rows(df, a, b),
  drop_rows(df, a, b)
)

The theoretical missing argument would just mean that if missing = "keep" then NA is treated as TRUE instead.


The full grid of possible things to "keep" are:

name .for TRUE FALSE NA Useful?
keep_rows() filter(.for = TRUE)
keep_rows() + keep NA filter(.for = c(TRUE, NA))
drop_rows() filter(.for = c(FALSE, NA))
drop_rows() + drop NA filter(.for = FALSE)
Keep NA only filter(.for = NA) ❌, keep_rows(df, if_all(cols, is.na))
Drop NA only filter(.for = c(TRUE, FALSE)) ❌, drop_rows(df, if_all(cols, is.na)), tidyr::drop_na()
Keep everything filter(.for = c(TRUE, FALSE, NA))
Drop everything filter(.for = logical())

So only 4 of these seem needed - the 4 that go with keep/drop_rows() already.

"Keep NA only" and "Drop NA only" aren't actually as useful as they might appear, because they require that the columns are already logical vectors to work. i.e. drop_na(df, x) works if x is a double column, but filter(df, x, .for = c(TRUE, FALSE)) doesn't. Since you can accomplish what you actually want with if_all(cols, is.na) or just drop_na(), we don't need to worry about these.

I think this also reveals that the signatures for missing could be:

keep_rows(data, ..., missing = c("drop", "keep", "error"))
drop_rows(data, ..., missing = c("keep", "drop", "error"))

which do what you'd expect them to do by default

@DavisVaughan DavisVaughan changed the title The case for drop_rows() and keep_rows() to supersede filter() The case for drop_rows() and keep_rows() in relation to filter() Jul 24, 2023
@moodymudskipper
Copy link

moodymudskipper commented Jul 30, 2023

What about a .how = c("keep", "drop") arg ? The signature of the function would signal what the name doesn't, and no need for more aliases and deprecation, the .missing arg takes care of the other ambiguity. I find .for confusing.

Another confusion btw sometimes comes from how ... are combined, filter() does an intersection, while select() does a union, here also a new arg would both make the function more flexible and make the signature self explanatory regarding the default behavior.

@DavisVaughan
Copy link
Member Author

filter(
  .data,
  ...,
  .missing = NULL,
  .how = c("keep", "drop")
)

This signature may work, where .missing can still take on one of c("keep", "drop", "error"), but would default to:

  • .missing = "drop" for .how = "keep" (current behavior of filter())
  • .missing = "keep" for .how = "drop"

This way when you are trying to drop rows, it still only drops rows where the condition is TRUE.

I like that that value of .missing makes filter(df, cond, .how = "keep") and filter(df, cond, .how = "drop") completely symmetric, even regarding missing values (i.e. you can rbind the results and get df back).

I also like that we use the same keep / drop terms in both arguments for consistency

@DavisVaughan DavisVaughan changed the title The case for drop_rows() and keep_rows() in relation to filter() The case for filter(.missing = NULL, .how = c("keep", "drop")) Nov 3, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature a feature request or enhancement rows ↕️ Operations on rows: filter(), slice(), arrange() verbs 🏃‍♀️
Projects
None yet
Development

No branches or pull requests

2 participants