Recommended way to split data by a variable, apply a function, and return a bound dataframe (i.e., a non-experimental do() replacement) #7666

Closed
@jack-davison

Hi folks,

My question hinges on this sort of situation. Obviously this example is pretty artificial, but it's a situation in which you have a function that acts on a dataframe and returns a dataframe, doesn't return the grouping variables, but might need access to them.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

fun <- function(x) {
  if (any(x$Species == "setosa")) {
    tail(x, n = 3) |> select(Petal.Length)
  } else {
    head(x, n = 3) |> select(Petal.Length)
  }
}

iris |>
  group_by(Species) |>
  do(fun(.))
#> # A tibble: 9 × 2
#> # Groups:   Species [3]
#>   Species    Petal.Length
#>   <fct>             <dbl>
#> 1 setosa              1.4
#> 2 setosa              1.5
#> 3 setosa              1.4
#> 4 versicolor          4.7
#> 5 versicolor          4.5
#> 6 versicolor          4.9
#> 7 virginica           6  
#> 8 virginica           5.1
#> 9 virginica           5.9

Created on 2025-03-01 with reprex v2.1.1

do() can deal with this admirably, but I'm unsure what the modern equivalent is.

purrr::map() doesn't behave the same: fun() drops the grouping variable and nothing adds it back, so you don't know which rows belong to which group:

> purrr::map(
+   split(iris, ~Species),
+   fun
+ ) |>
+   dplyr::bind_rows()
  Petal.Length
1          1.4
2          1.5
3          1.4
4          4.7
5          4.5
6          4.9
7          6.0
8          5.1
9          5.9
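For reference, a minimal sketch of one way to recover the group labels in the split/map approach, assuming the list names produced by split() are an acceptable stand-in for the grouping variable. The object names `pieces` and `res_map` are just illustrative. Note that the `.id` column comes back as character, not as the original factor.

```r
library(dplyr)

# Same function as in the reprex above
fun <- function(x) {
  if (any(x$Species == "setosa")) {
    tail(x, n = 3) |> select(Petal.Length)
  } else {
    head(x, n = 3) |> select(Petal.Length)
  }
}

# split() names each list element after its Species level,
# and each piece still contains the Species column for fun() to inspect
pieces <- split(iris, ~Species)

# bind_rows(.id = ) turns the list names into a column
res_map <- purrr::map(pieces, fun) |>
  bind_rows(.id = "Species")

res_map
```

This only works for a single grouping variable whose values survive as list names, so it is not a general replacement for do().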

nest_by() also doesn't really work because the nested dataframe has no access to the Species column:

> iris |>
+   nest_by(Species) |>
+   mutate(data = list(fun(data)))
# A tibble: 3 × 2
# Rowwise:  Species
  Species    data            
  <fct>      <list>          
1 setosa     <tibble [3 × 1]>
2 versicolor <tibble [3 × 1]>
3 virginica  <tibble [3 × 1]>
Warning message:
There were 3 warnings in `mutate()`.
The first warning was:
In argument: `data = list(fun(data))`.
In row 1.
Caused by warning:
! Unknown or uninitialised column: `Species`.
Run dplyr::last_dplyr_warnings() to see the 2 remaining warnings.

group_modify() is recommended by the do() documentation, but it is experimental, so I don't want to put it in packages until it's stable. It also doesn't use .by, which may imply this isn't a function that's going to be taken forward? It seems to be the closest thing, however.
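For concreteness, a sketch of what the group_modify() version might look like. By default the data passed to the function excludes the grouping variables, so this assumes `.keep = TRUE` is what's needed for fun() to see Species. The name `res_gm` is just illustrative.

```r
library(dplyr)

# Same function as in the reprex above
fun <- function(x) {
  if (any(x$Species == "setosa")) {
    tail(x, n = 3) |> select(Petal.Length)
  } else {
    head(x, n = 3) |> select(Petal.Length)
  }
}

res_gm <- iris |>
  group_by(Species) |>
  # x is the group's rows, key is a one-row tibble of the group keys;
  # .keep = TRUE leaves the grouping columns in x so fun() can read them
  group_modify(function(x, key) fun(x), .keep = TRUE)

res_gm
```

The result is still grouped by Species, as with do(), so an ungroup() may be wanted afterwards.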

I believe reframe() can only act on columns, so I don't think that's quite right either. It is also "experimental".

The answer may lie in pick(), but I'm not quite sure how to apply it to this specific use case. It also seems to not 'nest' the grouping variables, so it has the same issue as the above nest() example.
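A rough sketch of the pick() idea combined with reframe() and .by, assuming cur_group() can be used to re-attach the group keys that pick() leaves out. I haven't confirmed this is an intended pattern, and `res_rf` is just an illustrative name.

```r
library(dplyr)

# Same function as in the reprex above
fun <- function(x) {
  if (any(x$Species == "setosa")) {
    tail(x, n = 3) |> select(Petal.Length)
  } else {
    head(x, n = 3) |> select(Petal.Length)
  }
}

res_rf <- iris |>
  reframe(
    # cur_group() is a one-row tibble of the keys; tibble() recycles it
    # against the group's rows and splices both data frames into columns
    fun(tibble(cur_group(), pick(everything()))),
    .by = Species
  )

res_rf
```

Unlike do() and group_modify(), this returns an ungrouped tibble, and rebuilding the group's dataframe on every call is fairly clunky.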

So I'm at a bit of a loss as to what the new do() actually is!

Cheers.