Recommended way to split data by a variable, apply a function, and return a bound dataframe (i.e., a non-experimental `do()` replacement) #7666

Hi folks,

My question hinges on the following sort of situation. The example is pretty artificial, but it illustrates the case: you have a function that acts on a dataframe and returns a dataframe. It doesn't return the grouping variables, but it might need access to them.

``` r
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

fun <- function(x) {
  if (any(x$Species == "setosa")) {
    tail(x, n = 3) |> select(Petal.Length)
  } else {
    head(x, n = 3) |> select(Petal.Length)
  }
}

iris |>
  group_by(Species) |>
  do(fun(.))
#> # A tibble: 9 × 2
#> # Groups:   Species [3]
#>   Species    Petal.Length
#>   <fct>             <dbl>
#> 1 setosa              1.4
#> 2 setosa              1.5
#> 3 setosa              1.4
#> 4 versicolor          4.7
#> 5 versicolor          4.5
#> 6 versicolor          4.9
#> 7 virginica           6  
#> 8 virginica           5.1
#> 9 virginica           5.9
```

<sup>Created on 2025-03-01 with [reprex v2.1.1](https://reprex.tidyverse.org)</sup>

`do()` can deal with this admirably, but I'm unsure what the modern equivalent is.

`purrr::map()` doesn't behave the same way: the bound result no longer carries the grouping variable, so you can't tell which rows came from which group:

```r
purrr::map(
  split(iris, ~Species),
  fun
) |>
  dplyr::bind_rows()
#>   Petal.Length
#> 1          1.4
#> 2          1.5
#> 3          1.4
#> 4          4.7
#> 5          4.5
#> 6          4.9
#> 7          6.0
#> 8          5.1
#> 9          5.9
```
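For what it's worth, the group labels can be recovered in this approach from the names of the list that `split()` returns. Below is a minimal sketch using the `.id` argument of `bind_rows()`; `fun()` is repeated so the block runs on its own, and `res` is just an illustrative name. Note that `Species` comes back as a character column rather than a factor, and the result is not grouped, so this is not a drop-in match for `do()`.

```r
library(dplyr)

# Same function as in the example above: needs the Species column,
# but does not return it.
fun <- function(x) {
  if (any(x$Species == "setosa")) {
    tail(x, n = 3) |> select(Petal.Length)
  } else {
    head(x, n = 3) |> select(Petal.Length)
  }
}

# split() names each list element after its group, and each piece still
# contains Species, so fun() can see it. bind_rows(.id = ...) then turns
# the list names into a column.
res <- split(iris, ~Species) |>
  purrr::map(fun) |>
  bind_rows(.id = "Species")

res
```

This only works directly for a single splitting variable; with several, the list names would need to be separated back into columns.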

`nest_by()` also doesn't really work, because the nested dataframe has no access to the `Species` column:

```r
iris |>
  nest_by(Species) |>
  mutate(data = list(fun(data)))
#> # A tibble: 3 × 2
#> # Rowwise:  Species
#>   Species    data            
#>   <fct>      <list>          
#> 1 setosa     <tibble [3 × 1]>
#> 2 versicolor <tibble [3 × 1]>
#> 3 virginica  <tibble [3 × 1]>
#> Warning message:
#> There were 3 warnings in `mutate()`.
#> The first warning was:
#> ℹ In argument: `data = list(fun(data))`.
#> ℹ In row 1.
#> Caused by warning:
#> ! Unknown or uninitialised column: `Species`.
#> ℹ Run dplyr::last_dplyr_warnings() to see the 2 remaining warnings.
```

`group_modify()` is recommended by the `do()` documentation, but it is experimental, so I don't want to put it in packages until it's stable. It also doesn't support `.by`, which may imply that it isn't going to be taken forward. It seems to be the closest thing, however.
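For reference, this is roughly what the `group_modify()` version looks like. It is a sketch with one wrinkle: `group_modify()` passes the function the group's rows without the grouping column as the first argument, and a one-row tibble of keys as the second, so `Species` has to be put back before calling `fun()`. The argument names `rows` and `key` and the result name `res` are my own.

```r
library(dplyr)

# Same function as in the example above: needs the Species column,
# but does not return it.
fun <- function(x) {
  if (any(x$Species == "setosa")) {
    tail(x, n = 3) |> select(Petal.Length)
  } else {
    head(x, n = 3) |> select(Petal.Length)
  }
}

res <- iris |>
  group_by(Species) |>
  group_modify(
    # `rows` lacks the grouping column; `key` is a one-row tibble holding it.
    # Re-attach Species so fun() can inspect it. fun() drops it again, and
    # group_modify() adds the keys back to the result.
    \(rows, key) fun(mutate(rows, Species = key$Species))
  )

res
```

This gives the same nine rows as the `do()` call, with `Species` kept as a factor and the result still grouped.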

I believe `reframe()` can only act on columns, so I don't think it's quite right either. It is also marked "experimental".

The answer may lie in `pick()`, but I'm not quite sure how to apply it to this specific use case. It also doesn't seem to include the grouping variables in the data it returns, so it has the same issue as the `nest_by()` example above.

So I'm at a bit of a loss as to what the new `do()` actually is!

Cheers.