Recommended way to split data by a variable, apply a function, and return a bound dataframe (i.e., a non-experimental do() replacement) #7666

Closed
jack-davison opened this issue Mar 1, 2025 · 4 comments

Comments

@jack-davison

jack-davison commented Mar 1, 2025

Hi folks,

My question hinges on this sort of situation. Obviously this example is pretty artificial, but it's a situation in which you have a function that acts on a dataframe, returns a dataframe, and doesn't return the grouping variables, but might need access to them.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

fun <- function(x) {
  if (any(x$Species == "setosa")) {
    tail(x, n = 3) |> select(Petal.Length)
  } else {
    head(x, n = 3) |> select(Petal.Length)
  }
}

iris |>
  group_by(Species) |>
  do(fun(.))
#> # A tibble: 9 × 2
#> # Groups:   Species [3]
#>   Species    Petal.Length
#>   <fct>             <dbl>
#> 1 setosa              1.4
#> 2 setosa              1.5
#> 3 setosa              1.4
#> 4 versicolor          4.7
#> 5 versicolor          4.5
#> 6 versicolor          4.9
#> 7 virginica           6  
#> 8 virginica           5.1
#> 9 virginica           5.9

Created on 2025-03-01 with reprex v2.1.1

do() can deal with this admirably, but I'm unsure what the modern equivalent is.

purrr::map() doesn't behave the same because it drops the group variables, so you don't know what is what:

> purrr::map(
+   split(iris, ~Species),
+   fun
+ ) |>
+   dplyr::bind_rows()
  Petal.Length
1          1.4
2          1.5
3          1.4
4          4.7
5          4.5
6          4.9
7          6.0
8          5.1
9          5.9
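
For reference, the group labels lost in the map() attempt above can be recovered from the list names, since split() names each element after its Species level. A minimal sketch, assuming the same fun() as in the question; note that the recovered Species column comes back as character rather than factor:

```r
library(dplyr, warn.conflicts = FALSE)

# Same helper as in the question: needs Species, returns only Petal.Length.
fun <- function(x) {
  if (any(x$Species == "setosa")) {
    tail(x, n = 3) |> select(Petal.Length)
  } else {
    head(x, n = 3) |> select(Petal.Length)
  }
}

# split() names each list element after its Species level;
# bind_rows(.id = ) copies those names into a new "Species" column.
out_map <- split(iris, ~Species) |>
  purrr::map(fun) |>
  bind_rows(.id = "Species")
```

This keeps fun() untouched, but it relies on the list names being the only grouping information needed, so it does not generalise cleanly to several grouping variables.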

nest_by() also doesn't really work because the nested dataframe has no access to the Species column:

> iris |>
+   nest_by(Species) |>
+   mutate(data = list(fun(data)))
# A tibble: 3 × 2
# Rowwise:  Species
  Species    data            
  <fct>      <list>          
1 setosa     <tibble [3 × 1]>
2 versicolor <tibble [3 × 1]>
3 virginica  <tibble [3 × 1]>
Warning message:
There were 3 warnings in `mutate()`.
The first warning was:
In argument: `data = list(fun(data))`.
In row 1.
Caused by warning:
! Unknown or uninitialised column: `Species`.
Run dplyr::last_dplyr_warnings() to see the 2 remaining warnings.

group_modify() is recommended by the do() documentation, but it is experimental, so I don't want to put it in packages until it's stable. It also doesn't use .by, which may imply this isn't a function that's going to be taken forward? It seems to be the closest thing, however.
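
To make the comparison concrete, here is a sketch of what the group_modify() version might look like, assuming the same fun() as above. The `.keep = TRUE` argument asks group_modify() to leave the grouping variables in the data passed to the function, and the result must not itself contain them, which matches what fun() returns. The experimental status mentioned above still applies:

```r
library(dplyr, warn.conflicts = FALSE)

# Same helper as in the question: needs Species, returns only Petal.Length.
fun <- function(x) {
  if (any(x$Species == "setosa")) {
    tail(x, n = 3) |> select(Petal.Length)
  } else {
    head(x, n = 3) |> select(Petal.Length)
  }
}

out_gm <- iris |>
  group_by(Species) |>
  # .keep = TRUE keeps Species inside each chunk handed to the function;
  # the second argument (the one-row group key) is unused here.
  group_modify(\(x, key) fun(x), .keep = TRUE) |>
  ungroup()
```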

reframe() I believe can only act on columns, so I don't think that's quite right either? It is also "experimental".

The answer may lie in pick(), but I'm not quite sure how to apply it to this specific use case. It also doesn't seem to include the grouping variables, so it has the same issue as the nest_by() example above.

So I'm at a bit of a loss as to what the new do() actually is!

Cheers.

@DavisVaughan
Member

I think the slight weirdness for you is that you'd like access to the metadata about the current group inside your function. You can get access to that with cur_group(). How's this?

library(dplyr, warn.conflicts = FALSE)

fun <- function(data, group) {
  if (group$Species == "setosa") {
    tail(data, n = 3) |> select(Petal.Length)
  } else {
    head(data, n = 3) |> select(Petal.Length)
  }
}

iris |>
  reframe(.by = Species, {
    fun(pick(everything()), cur_group())
  })
#>      Species Petal.Length
#> 1     setosa          1.4
#> 2     setosa          1.5
#> 3     setosa          1.4
#> 4 versicolor          4.7
#> 5 versicolor          4.5
#> 6 versicolor          4.9
#> 7  virginica          6.0
#> 8  virginica          5.1
#> 9  virginica          5.9

I think we are fairly confident that reframe() is useful and here to stay, so we can probably move it out of the experimental stage.

@alejandrohagan

Hi, a few options.

First, you can duplicate the Species column and then pass the data through your function as is. Example below:

library(tidyverse)

fun <- function(x) {
    if (any(x$Species == "setosa")) {
        tail(x, n = 3) |> select(Petal.Length)
    } else {
        head(x, n = 3) |> select(Petal.Length)
    }
}


validation_tbl <- iris |>
    group_by(Species) |>
    do(fun(.))


method1_tbl <- iris |> 
    mutate(
        Species2=Species
    ) |> 
    nest_by(Species2) |> 
    mutate(
        fn=list(fun(data))
    ) |> 
    unnest(fn) |> 
    select(
        Species=Species2
        ,Petal.Length
    )

all.equal(validation_tbl,method1_tbl)

Alternatively, you can slightly modify your function so that you pass in both the data and the nest_by() column:


fun2 <- function(data,species) {
    if (any(species== "setosa")) {
        tail(data, n = 3) |> select(Petal.Length)
    } else {
        head(data, n = 3) |> select(Petal.Length)
    }
}


method2_tbl <- iris |> 
    nest_by(Species) |> 
    mutate(
        fn=list(fun2(data,Species))
    ) |> 
    unnest(fn) |> 
    select(
        Species
        ,Petal.Length
    )


all.equal(validation_tbl,method2_tbl)

@alejandrohagan

One other method uses map(); you just need to slightly alter your function so that it also returns the Species column:


fun3 <- function(x) {
    if (any(x$Species == "setosa")) {
        tail(x, n = 3) |> select(Species,Petal.Length)
    } else {
        head(x, n = 3) |> select(Species,Petal.Length)
    }
}


split_tbl <- iris |> 
    group_split(Species)
    

method3_tbl <- map(.x = split_tbl,.f = \(.x) fun3(.x)) |> 
    purrr::list_rbind()

@jack-davison
Author

Hi folks,

Thanks very much for your replies!

I was hoping there'd be a one-to-one do() replacement that wouldn't require changing fun() in any way, as our use case is modernising a package and I was hoping to not have to go in and amend a load of our internal helper functions. Reading between the lines, it seems like that's not going to be the case.

I think Davis's suggestion may be the way to go; thanks for your help!

Best,
Jack
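
For the case described in the last comment, where fun() should stay unchanged, one possible sketch is to rebuild the full per-group data frame inside reframe() before calling it. This is not something proposed in the thread; it assumes dplyr >= 1.1.0 and combines cur_group() (a one-row tibble of the group keys) with pick(everything()) (the non-grouping columns) via cross_join():

```r
library(dplyr, warn.conflicts = FALSE)

# Unchanged helper from the question: needs Species, returns only Petal.Length.
fun <- function(x) {
  if (any(x$Species == "setosa")) {
    tail(x, n = 3) |> select(Petal.Length)
  } else {
    head(x, n = 3) |> select(Petal.Length)
  }
}

out_rf <- iris |>
  reframe(.by = Species, {
    # cross_join() repeats the one-row group key across every row of the
    # current group, so fun() sees Species as it would under do().
    fun(cross_join(cur_group(), pick(everything())))
  })
```

The grouping column is added back by reframe() itself, so fun() still must not return it.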
