Error when summarise refers to previously created variable #75

Closed
hadley opened this issue Jun 26, 2019 · 4 comments

Comments

@hadley
Member

hadley commented Jun 26, 2019

library(dtplyr)
library(dplyr, warn.conflicts = FALSE)

lz <- lazy_dt(data.frame(x = 1:10))
lz %>% summarise(x = mean(x), y = x + 1) %>% collect()
#>       x  y
#>  1: 5.5  2
#>  2: 5.5  3
#>  3: 5.5  4
#>  4: 5.5  5
#>  5: 5.5  6
#>  6: 5.5  7
#>  7: 5.5  8
#>  8: 5.5  9
#>  9: 5.5 10
#> 10: 5.5 11

Created on 2019-06-26 by the reprex package (v0.2.1.9000)

Use same technique from dbplyr.
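
The output above looks consistent with both expressions being evaluated against the original column: `y = x + 1` sees the ten input values of `x` rather than the freshly computed mean, and the mean is recycled across ten rows. One reading of the note about dbplyr is that the backend should refuse such a call instead of returning a misleading result. The base R sketch below illustrates that kind of check; `check_summarise_vars` is a hypothetical helper written for this illustration, not code taken from dtplyr or dbplyr.

```r
# Hypothetical helper (an assumption, not dtplyr's or dbplyr's actual code):
# stop if a summary expression refers to a name defined earlier in the same call.
check_summarise_vars <- function(exprs) {
  nms <- names(exprs)
  for (i in seq_along(exprs)) {
    used <- all.vars(exprs[[i]])        # variable names used by expression i
    earlier <- nms[seq_len(i - 1)]      # names defined before expression i
    bad <- intersect(used, earlier)
    if (length(bad) > 0) {
      stop(
        "`", nms[[i]], "` refers to a variable created earlier in this ",
        "summarise(): ", paste(bad, collapse = ", "),
        call. = FALSE
      )
    }
  }
  invisible(TRUE)
}

# Independent summaries pass the check.
ok <- list(avg = quote(mean(x)), top = quote(max(x)))
check_summarise_vars(ok)

# The pattern from the reprex is rejected: `y` uses the new `x`.
bad <- list(x = quote(mean(x)), y = quote(x + 1))
res <- try(check_summarise_vars(bad), silent = TRUE)
inherits(res, "try-error")
```

Failing loudly is a design choice: the user then moves the dependent calculation into a later step rather than getting silently different results from dplyr.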

@dyrland

dyrland commented Aug 22, 2019

I know this is closed, but I use summarise() all the time in the wild.

I often need to express variables in different ways, as different bosses like data presented differently. This leads to me doing multiple summaries for the same variables. For instance, below I need the number of Auto Lunches per Manager and then the percent that are Auto Lunches. I prefer the first, df.out (it's the cleanest, IMHO), but the second, dt.out1, is ok. The third, dt.out2, is also fine for this MWE, but as the calculations become more complex, repeating the code becomes unintelligible.

Of course, perhaps I could have data prep skills. :D

df.out <- df %>%
  group_by(Manager) %>%
  summarize(`Auto Lunches` = sum(auto.lunch == TRUE),
            `Gross Hours` = sum(auto.lunch.time) / -60,
            `Auto Lunch %` = `Auto Lunches` / n() * 100,
            `Auto Lunch % (Mods)` = `Auto Lunches` / sum(modified == TRUE) * 100
  )
   
dt <- lazy_dt(df)
dt.out1 <- dt %>%
  group_by(Manager) %>%
  summarize(`Auto Lunches` = sum(auto.lunch == TRUE),
            `Gross Hours` = sum(auto.lunch.time) / -60,
            n = n()) %>% 
  mutate(`Auto Lunch %` = `Auto Lunches` / n * 100,
         `Auto Lunch % (Mods)` = `Auto Lunches` / sum(modified == TRUE) * 100
  )

dt.out2 <- dt %>%
  group_by(Manager) %>%
  summarize(`Auto Lunches` = sum(auto.lunch == TRUE),
            `Gross Hours` = sum(auto.lunch.time) / -60,
            `Auto Lunch %` = sum(auto.lunch == TRUE) / n() * 100,
            `Auto Lunch % (Mods)` = sum(auto.lunch == TRUE) / 
              sum(modified == TRUE) * 100
  )

@dyrland

dyrland commented Aug 22, 2019

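Yikes! dt.out1 won't work as written: the mutate() step still uses modified, which no longer exists after summarise(), so I would have to create a summarised modified variable first. I guess that's another reason why I am hopeful that I can use previously created variables in summarise()...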
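
One possible way around this, sketched under assumptions: compute every aggregate the later ratios need inside summarise(), including a count of modified rows, and derive the percentages in a following mutate() that only touches summarised columns. The toy data, the `mods` column, and the name `dt.out3` are made up for illustration; only the column names come from the example above.

```r
library(dtplyr)
library(dplyr, warn.conflicts = FALSE)

# Toy data (values are invented; column names follow the example above).
df <- data.frame(
  Manager = c("A", "A", "B", "B"),
  auto.lunch = c(TRUE, FALSE, TRUE, TRUE),
  auto.lunch.time = c(-30, 0, -30, -30),
  modified = c(TRUE, TRUE, TRUE, FALSE)
)

dt.out3 <- lazy_dt(df) %>%
  group_by(Manager) %>%
  # Step 1: every aggregate is computed from the raw columns only.
  summarise(
    `Auto Lunches` = sum(auto.lunch == TRUE),
    `Gross Hours` = sum(auto.lunch.time) / -60,
    n = n(),
    mods = sum(modified == TRUE)   # hypothetical helper column
  ) %>%
  # Step 2: ratios are built from the summarised columns only.
  mutate(
    `Auto Lunch %` = `Auto Lunches` / n * 100,
    `Auto Lunch % (Mods)` = `Auto Lunches` / mods * 100
  ) %>%
  as_tibble()

dt.out3
```

The helper columns `n` and `mods` can be dropped afterwards with select() if they are not wanted in the final table.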

@gpierard

I had this error because my df was a data.table. Got it fixed after converting to a tibble.

@TheDohn

TheDohn commented Jun 20, 2024

Could anyone elaborate on what:

Use same technique from dbplyr.

means? I can see how to avoid the error altogether by calculating each summarize variable independently as its own table, then joining the results together, but I am wondering if there is a better way using dtplyr.
