R crashes when many variables in group_by in dplyr 0.8.0 #4090

Closed
dfalbel opened this issue Jan 8, 2019 · 6 comments
Labels: bug (an unexpected problem or unintended behavior)

dfalbel commented Jan 8, 2019

The following code crashes my R session in dplyr 0.8.0. It works fine with 0.7.8.

library(dplyr)

n <- 400000

a <- data_frame(
  x = 1:n,
  a = sample(1:10, size = n, replace = TRUE),
  b = sample(1:10, size = n, replace = TRUE),
  c = sample(1:10, size = n, replace = TRUE),
  d = sample(1:10, size = n, replace = TRUE),
  e = sample(1:10, size = n, replace = TRUE),
  f = sample(1:10, size = n, replace = TRUE),
  g = sample(1:10, size = n, replace = TRUE),
  h = sample(1:10, size = n, replace = TRUE),
  i = sample(1:10, size = n, replace = TRUE),
  y = runif(n)
)

g_1 <- a %>% group_by(x)
g_2 <- a %>% group_by_at(vars(-y))

g_1 %>% summarise(y = sum(y))
g_2 %>% summarise(y = sum(y))

The last line hangs the R session and eventually crashes it. My dplyr version is 0.8.0.9000 (2019-01-03), installed from GitHub (tidyverse/dplyr@f0993bb).

romainfrancois added the bug label (an unexpected problem or unintended behavior) Jan 9, 2019
romainfrancois added this to the 0.8.0 milestone Jan 9, 2019
@romainfrancois
Member

Thanks. I can reproduce this. It looks related to how the grouping structure is built after the summarise.

jangorecki commented Jan 11, 2019

While 0a1b62d resolved the problem in the code provided by @dfalbel, I think there is more to fix here.
Using the code below, I am trying to aggregate a ~400 KB dataset (1e4 rows), and the process consumes up to 1.2 GB of memory, around 3000 times the input size. If I scale the input up to 4 MB (1e5 rows), the R process is killed by the OS. When using master, I was getting a C stack overflow error.

N=1e4L
K=100L
set.seed(108)
DF = data.frame(
  id1 = sample(sprintf("id%03d",1:K), N, TRUE),      # large groups (char)
  id2 = sample(sprintf("id%03d",1:K), N, TRUE),      # large groups (char)
  id3 = sample(sprintf("id%010d",1:(N/K)), N, TRUE), # small groups (char)
  id4 = sample(K, N, TRUE),                          # large groups (int)
  id5 = sample(K, N, TRUE),                          # large groups (int)
  id6 = sample(N/K, N, TRUE),                        # small groups (int)
  v1 =  sample(5, N, TRUE),                          # int in range [1,5]
  v2 =  sample(5, N, TRUE),                          # int in range [1,5]
  v3 =  sample(round(runif(100,max=100),4), N, TRUE) # numeric e.g. 23.5749
)
print(object.size(DF), units="MB")

MB_used = function() as.numeric(system(sprintf("ps -o rss %s | tail -1", Sys.getpid()), intern=TRUE))/1024
suppressPackageStartupMessages(library(dplyr))
MB_used()
ans <- DF %>% group_by(id1, id2, id3, id4, id5, id6) %>% summarise(v3=sum(v3), count=n())
MB_used()

@romainfrancois
Member

With N = 1e4 this generates more than a million groups, and far fewer with .drop = TRUE.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(lobstr)

N=1e4L
K=100L
set.seed(108)
DF = data.frame(
  id1 = sample(sprintf("id%03d",1:K), N, TRUE),      # large groups (char)
  id2 = sample(sprintf("id%03d",1:K), N, TRUE),      # large groups (char)
  id3 = sample(sprintf("id%010d",1:(N/K)), N, TRUE), # small groups (char)
  id4 = sample(K, N, TRUE),                          # large groups (int)
  id5 = sample(K, N, TRUE),                          # large groups (int)
  id6 = sample(N/K, N, TRUE),                        # small groups (int)
  v1 =  sample(5, N, TRUE),                          # int in range [1,5]
  v2 =  sample(5, N, TRUE),                          # int in range [1,5]
  v3 =  sample(round(runif(100,max=100),4), N, TRUE) # numeric e.g. 23.5749
)
mem_used()
#> 42,526,976 B

grouped <- DF %>%
  group_by(id1, id2, id3, id4, id5, id6)
n_groups(grouped)
#> [1] 1000053
mem_used()
#> 131,497,112 B

grouped <- DF %>%
  group_by(id1, id2, id3, id4, id5, id6, .drop = TRUE)
n_groups(grouped)
#> [1] 10000
mem_used()
#> 44,374,392 B

A million integer vectors will indeed take up a lot of memory. I'm not sure what I can do here.
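
To illustrate the effect of empty groups on a small scale, here is a minimal sketch. It assumes a dplyr version in which group_by() has a working .drop argument; the toy data frame `df` and its column names are made up for illustration and are not part of the original report. When the grouping columns are factors, .drop = FALSE keeps one group per combination of levels, whether or not that combination occurs in the data.

```r
library(dplyr)

# Toy data (hypothetical): two factor columns, each with one unused level.
df <- data.frame(
  f1 = factor(c("a", "b"), levels = c("a", "b", "c")),
  f2 = factor(c("x", "y"), levels = c("x", "y", "z")),
  v  = c(1, 2)
)

# Number of level combinations across the factor grouping columns.
potential <- prod(vapply(df[c("f1", "f2")], nlevels, integer(1)))

# .drop = FALSE keeps a group for every combination of levels, observed or not.
n_keep <- df %>% group_by(f1, f2, .drop = FALSE) %>% n_groups()

# .drop = TRUE keeps only the combinations that occur in the data.
n_drop <- df %>% group_by(f1, f2, .drop = TRUE) %>% n_groups()

c(potential = potential, keep = n_keep, drop = n_drop)
```

This should report 9 potential combinations, 9 groups when empty groups are kept, and 2 when they are dropped. In the DF from this thread, id1, id2 and id3 would be factors with 100 levels each under the stringsAsFactors = TRUE default that data.frame() had before R 4.0.0, giving 100^3 = 1e6 level combinations, which is presumably where most of the 1000053 groups come from.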

jangorecki commented Jan 11, 2019

@romainfrancois I am running your code on the latest master and .drop = TRUE does not seem to have any effect. Any idea what is wrong? Maybe it requires a dev version of some dependency?

grouped <- DF %>% group_by(id1, id2, id3, id4, id5, id6)
n_groups(grouped)
#[1] 1000053
grouped <- DF %>% group_by(id1, id2, id3, id4, id5, id6, .drop = TRUE)
n_groups(grouped)
#[1] 1000053
─ Session info ───────────────────────────────────────────────────────────────
 setting  value                       
 version  R version 3.5.0 (2018-04-23)
 os       Ubuntu precise (12.04.5 LTS)
 system   x86_64, linux-gnu           
 ui       X11                         
 language (EN)                        
 collate  en_US.UTF-8                 
 ctype    en_US.UTF-8                 
 tz       America/Los_Angeles         
 date     2019-01-11                  

─ Packages ───────────────────────────────────────────────────────────────────
 package     * version date       lib source                          
 assertthat    0.2.0   2017-04-11 [1] CRAN (R 3.5.0)                  
 backports     1.1.2   2017-12-13 [1] CRAN (R 3.5.0)                  
 base64enc     0.1-3   2015-07-28 [1] CRAN (R 3.5.0)                  
 callr         3.0.0   2018-08-24 [1] CRAN (R 3.5.0)                  
 cli           1.0.1   2018-09-25 [1] CRAN (R 3.5.0)                  
 crayon        1.3.4   2017-09-16 [1] CRAN (R 3.5.0)                  
 desc          1.2.0   2018-05-01 [1] CRAN (R 3.5.0)                  
 devtools      2.0.1   2018-10-26 [1] CRAN (R 3.5.0)                  
 digest        0.6.18  2018-10-10 [1] CRAN (R 3.5.0)                  
 dplyr       * 0.8.0   2019-01-11 [1] Github (tidyverse/dplyr@a581466)
 fs            1.2.6   2018-08-23 [1] CRAN (R 3.5.0)                  
 glue          1.3.0   2018-07-17 [1] CRAN (R 3.5.0)                  
 lobstr      * 1.0.1   2018-12-21 [1] CRAN (R 3.5.0)                  
 magrittr      1.5     2014-11-22 [2] CRAN (R 3.1.3)                  
 memoise       1.1.0   2017-04-21 [1] CRAN (R 3.5.0)                  
 pillar        1.3.1   2018-12-15 [1] CRAN (R 3.5.0)                  
 pkgbuild      1.0.2   2018-10-16 [1] CRAN (R 3.5.0)                  
 pkgconfig     2.0.2   2018-08-16 [1] CRAN (R 3.5.0)                  
 pkgload       1.0.2   2018-10-29 [1] CRAN (R 3.5.0)                  
 prettyunits   1.0.2   2015-07-13 [1] CRAN (R 3.5.0)                  
 processx      3.2.0   2018-08-16 [1] CRAN (R 3.5.0)                  
 ps            1.2.1   2018-11-06 [1] CRAN (R 3.5.0)                  
 purrr         0.2.5   2018-05-29 [1] CRAN (R 3.5.0)                  
 R6            2.3.0   2018-10-04 [1] CRAN (R 3.5.0)                  
 Rcpp          1.0.0   2018-11-07 [1] CRAN (R 3.5.0)                  
 remotes       2.0.2   2018-10-30 [1] CRAN (R 3.5.0)                  
 rlang         0.3.1   2019-01-08 [1] CRAN (R 3.5.0)                  
 rprojroot     1.3-2   2018-01-03 [1] CRAN (R 3.5.0)                  
 sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 3.5.0)                  
 testthat      2.0.1   2018-10-13 [1] CRAN (R 3.5.0)                  
 tibble        2.0.0   2019-01-04 [1] CRAN (R 3.5.0)                  
 tidyselect    0.2.5   2018-10-11 [1] CRAN (R 3.5.0)                  
 usethis       1.4.0   2018-08-14 [1] CRAN (R 3.5.0)                  
 withr         2.1.2   2018-03-15 [1] CRAN (R 3.5.0)                  

@romainfrancois
Member

Sorry about that, it needs this PR: #4091, which I'll probably merge today.

jangorecki added a commit to h2oai/db-benchmark that referenced this issue Jan 20, 2019
lock bot commented Jul 10, 2019

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

lock bot locked and limited conversation to collaborators Jul 10, 2019
3 participants