Description
Hi folks,
My question hinges on this sort of situation. The example is pretty artificial, but it's a situation in which you have a function that acts on a dataframe, returns a dataframe, and doesn't return the grouping variables, but might need access to them.
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
fun <- function(x) {
  if (any(x$Species == "setosa")) {
    tail(x, n = 3) |> select(Petal.Length)
  } else {
    head(x, n = 3) |> select(Petal.Length)
  }
}
iris |>
  group_by(Species) |>
  do(fun(.))
#> # A tibble: 9 × 2
#> # Groups: Species [3]
#> Species Petal.Length
#> <fct> <dbl>
#> 1 setosa 1.4
#> 2 setosa 1.5
#> 3 setosa 1.4
#> 4 versicolor 4.7
#> 5 versicolor 4.5
#> 6 versicolor 4.9
#> 7 virginica 6
#> 8 virginica 5.1
#> 9 virginica 5.9
Created on 2025-03-01 with reprex v2.1.1
do() can deal with this admirably, but I'm unsure what the modern equivalent is.
purrr::map() doesn't behave the same because it drops the group variables, so you don't know what is what:
> purrr::map(
+ split(iris, ~Species),
+ fun
+ ) |>
+ dplyr::bind_rows()
Petal.Length
1 1.4
2 1.5
3 1.4
4 4.7
5 4.5
6 4.9
7 6.0
8 5.1
9 5.9
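One way to keep track of which rows came from which group in the split() approach might be to let bind_rows() write the list names into a column via its .id argument. This is only a minimal sketch: it reuses fun from the reprex, the name out is purely illustrative, and the resulting Species column comes back as character rather than factor.

```r
library(dplyr)

# Same function as in the reprex above
fun <- function(x) {
  if (any(x$Species == "setosa")) {
    tail(x, n = 3) |> select(Petal.Length)
  } else {
    head(x, n = 3) |> select(Petal.Length)
  }
}

# split() names each piece after its Species level, and each piece
# still contains the Species column, so fun() can inspect it.
# bind_rows(.id = ) then copies the list names into a new column.
out <- split(iris, ~Species) |>
  purrr::map(fun) |>
  bind_rows(.id = "Species")

out
```

This relies on the list names carrying the group label, so it would need adapting for more than one grouping variable.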
nest() (here via nest_by()) also doesn't really work because the nested dataframe has no access to the Species column:
> iris |>
+ nest_by(Species) |>
+ mutate(data = list(fun(data)))
# A tibble: 3 × 2
# Rowwise: Species
Species data
<fct> <list>
1 setosa <tibble [3 × 1]>
2 versicolor <tibble [3 × 1]>
3 virginica <tibble [3 × 1]>
Warning message:
There were 3 warnings in `mutate()`.
The first warning was:
ℹ In argument: `data = list(fun(data))`.
ℹ In row 1.
Caused by warning:
! Unknown or uninitialised column: `Species`.
ℹ Run dplyr::last_dplyr_warnings() to see the 2 remaining warnings.
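The warning arises because nest_by() drops the grouping column from the nested data by default. A possible workaround, sketched here under the assumption that its .keep argument behaves as documented, is to ask nest_by() to keep Species inside each nested tibble and then expand the result with reframe(). The name out is illustrative only.

```r
library(dplyr)

# Same function as in the reprex above
fun <- function(x) {
  if (any(x$Species == "setosa")) {
    tail(x, n = 3) |> select(Petal.Length)
  } else {
    head(x, n = 3) |> select(Petal.Length)
  }
}

out <- iris |>
  # .keep = TRUE leaves Species inside each nested data frame
  nest_by(Species, .keep = TRUE) |>
  # on a rowwise data frame, `data` refers to the single nested tibble
  reframe(fun(data))

out
```

Note that reframe() is itself marked experimental, so this does not sidestep the stability concern.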
group_modify() is recommended by the do() documentation, but it is experimental, so I don't want to put it in packages until it's stable. It also doesn't support .by, which may imply this isn't a function that's going to be taken forward? It seems to be the closest thing, however.
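For reference, a sketch of what the group_modify() version might look like. It assumes .keep = TRUE is what makes the grouping column visible to the function, and it wraps fun in a two-argument lambda because group_modify() expects a function taking both the group data and the group key. The names x, key, and out are illustrative.

```r
library(dplyr)

# Same function as in the reprex above
fun <- function(x) {
  if (any(x$Species == "setosa")) {
    tail(x, n = 3) |> select(Petal.Length)
  } else {
    head(x, n = 3) |> select(Petal.Length)
  }
}

out <- iris |>
  group_by(Species) |>
  # x is the group's rows (including Species because .keep = TRUE),
  # key is a one-row tibble holding the group's Species value
  group_modify(\(x, key) fun(x), .keep = TRUE)

out
```

The function must not return the grouping column itself, which fun satisfies by selecting only Petal.Length.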
reframe(), I believe, can only act on columns, so I don't think that's quite right either? It is also "experimental".
The answer may lie in pick(), but I'm not quite sure how to apply it to this specific use case. It also seems not to 'nest' the grouping variables, so it has the same issue as the nest() example above.
So I'm at a bit of a loss as to what the new do() actually is!
Cheers.