R crashes when many variables in group_by in dplyr 0.8.0 #4090

dfalbel · 2019-01-08T14:45:52Z

The following code crashes my R session with dplyr 0.8.0. It works fine with 0.7.8.

library(dplyr)

n <- 400000

a <- data_frame(
  x = 1:n,
  a = sample(1:10, size = n, replace = TRUE),
  b = sample(1:10, size = n, replace = TRUE),
  c = sample(1:10, size = n, replace = TRUE),
  d = sample(1:10, size = n, replace = TRUE),
  e = sample(1:10, size = n, replace = TRUE),
  f = sample(1:10, size = n, replace = TRUE),
  g = sample(1:10, size = n, replace = TRUE),
  h = sample(1:10, size = n, replace = TRUE),
  i = sample(1:10, size = n, replace = TRUE),
  y = runif(n)
)

g_1 <- a %>% group_by(x)
g_2 <- a %>% group_by_at(vars(-y))

g_1 %>% summarise(y = sum(y))
g_2 %>% summarise(y = sum(y))

The last line hangs the R session, which eventually crashes. My dplyr version is 0.8.0.9000 (2019-01-03), installed from GitHub (tidyverse/dplyr@f0993bb).

romainfrancois · 2019-01-09T16:04:20Z

Thanks. I can reproduce this. It looks related to how the grouping structure is built after the summarise.

jangorecki · 2019-01-11T04:54:56Z

While 0a1b62d resolved the problem in the code provided by @dfalbel, I think there is more to fix here.
Using the code below, I am trying to aggregate a ~400 KB dataset (1e4 rows), and the process consumes up to 1.2 GB of memory, which is around 3000 times the input size. If I scale the input up to 4 MB (1e5 rows), the R process is killed by the OS. When using master I was getting a C stack overflow error.

N=1e4L
K=100L
set.seed(108)
DF = data.frame(
  id1 = sample(sprintf("id%03d",1:K), N, TRUE),      # large groups (char)
  id2 = sample(sprintf("id%03d",1:K), N, TRUE),      # large groups (char)
  id3 = sample(sprintf("id%010d",1:(N/K)), N, TRUE), # small groups (char)
  id4 = sample(K, N, TRUE),                          # large groups (int)
  id5 = sample(K, N, TRUE),                          # large groups (int)
  id6 = sample(N/K, N, TRUE),                        # small groups (int)
  v1 =  sample(5, N, TRUE),                          # int in range [1,5]
  v2 =  sample(5, N, TRUE),                          # int in range [1,5]
  v3 =  sample(round(runif(100,max=100),4), N, TRUE) # numeric e.g. 23.5749
)
print(object.size(DF), units="MB")

MB_used = function() as.numeric(system(sprintf("ps -o rss %s | tail -1", Sys.getpid()), intern=TRUE))/1024
suppressPackageStartupMessages(library(dplyr))
MB_used()
ans <- DF %>% group_by(id1, id2, id3, id4, id5, id6) %>% summarise(v3=sum(v3), count=n())
MB_used()

romainfrancois · 2019-01-11T08:46:45Z

With N = 1e4 this generates more than a million groups, and far fewer with .drop = TRUE.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(lobstr)

N=1e4L
K=100L
set.seed(108)
DF = data.frame(
  id1 = sample(sprintf("id%03d",1:K), N, TRUE),      # large groups (char)
  id2 = sample(sprintf("id%03d",1:K), N, TRUE),      # large groups (char)
  id3 = sample(sprintf("id%010d",1:(N/K)), N, TRUE), # small groups (char)
  id4 = sample(K, N, TRUE),                          # large groups (int)
  id5 = sample(K, N, TRUE),                          # large groups (int)
  id6 = sample(N/K, N, TRUE),                        # small groups (int)
  v1 =  sample(5, N, TRUE),                          # int in range [1,5]
  v2 =  sample(5, N, TRUE),                          # int in range [1,5]
  v3 =  sample(round(runif(100,max=100),4), N, TRUE) # numeric e.g. 23.5749
)
mem_used()
#> 42,526,976 B

grouped <- DF %>%
  group_by(id1, id2, id3, id4, id5, id6)
n_groups(grouped)
#> [1] 1000053
mem_used()
#> 131,497,112 B

grouped <- DF %>%
  group_by(id1, id2, id3, id4, id5, id6, .drop = TRUE)
n_groups(grouped)
#> [1] 10000
mem_used()
#> 44,374,392 B

A million integer vectors will indeed use a lot of memory. I'm not sure what I can do here.
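To put rough numbers on this, here is a minimal base R sketch. It rests on two assumptions that the thread does not confirm: that the group count is driven by the three character keys being stored as factors (the default for data.frame() before R 4.0), so empty level combinations are kept and there can be up to 100 × 100 × 100 of them, and that each group holds its row indices in its own integer vector. The names `n_combos` and `rows` are illustrative stand-ins, not dplyr internals.

```r
# Sketch in base R; names are illustrative, not dplyr internals.
N <- 1e4L
K <- 100L
set.seed(108)

# The three character keys, stored as factors
# (what data.frame() did by default before R 4.0).
id1 <- factor(sample(sprintf("id%03d", 1:K), N, TRUE))
id2 <- factor(sample(sprintf("id%03d", 1:K), N, TRUE))
id3 <- factor(sample(sprintf("id%010d", 1:(N / K)), N, TRUE))

# Number of level combinations when empty combinations are kept.
n_combos <- nlevels(id1) * nlevels(id2) * nlevels(id3)

# Stand-in for per-group row indices: one short integer vector per combination.
rows <- lapply(seq_len(n_combos), function(i) i)
size_mb <- as.numeric(object.size(rows)) / 1024^2

cat("level combinations:", n_combos, "\n")
cat("list of index vectors (MB):", round(size_mb, 1), "\n")
```

The point is only that per-vector overhead adds up: even vectors holding a single index each cost far more than the 1e4 rows of input, which would be consistent with the jump in mem_used() shown above.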

jangorecki · 2019-01-11T09:59:36Z

@romainfrancois I am running your code on the latest master and .drop = TRUE does not seem to have any effect. Any idea what is wrong? Maybe it requires the dev version of some dependency?

grouped <- DF %>% group_by(id1, id2, id3, id4, id5, id6)
n_groups(grouped)
#[1] 1000053
grouped <- DF %>% group_by(id1, id2, id3, id4, id5, id6, .drop = TRUE)
n_groups(grouped)
#[1] 1000053

─ Session info ───────────────────────────────────────────────────────────────
 setting  value                       
 version  R version 3.5.0 (2018-04-23)
 os       Ubuntu precise (12.04.5 LTS)
 system   x86_64, linux-gnu           
 ui       X11                         
 language (EN)                        
 collate  en_US.UTF-8                 
 ctype    en_US.UTF-8                 
 tz       America/Los_Angeles         
 date     2019-01-11                  

─ Packages ───────────────────────────────────────────────────────────────────
 package     * version date       lib source                          
 assertthat    0.2.0   2017-04-11 [1] CRAN (R 3.5.0)                  
 backports     1.1.2   2017-12-13 [1] CRAN (R 3.5.0)                  
 base64enc     0.1-3   2015-07-28 [1] CRAN (R 3.5.0)                  
 callr         3.0.0   2018-08-24 [1] CRAN (R 3.5.0)                  
 cli           1.0.1   2018-09-25 [1] CRAN (R 3.5.0)                  
 crayon        1.3.4   2017-09-16 [1] CRAN (R 3.5.0)                  
 desc          1.2.0   2018-05-01 [1] CRAN (R 3.5.0)                  
 devtools      2.0.1   2018-10-26 [1] CRAN (R 3.5.0)                  
 digest        0.6.18  2018-10-10 [1] CRAN (R 3.5.0)                  
 dplyr       * 0.8.0   2019-01-11 [1] Github (tidyverse/dplyr@a581466)
 fs            1.2.6   2018-08-23 [1] CRAN (R 3.5.0)                  
 glue          1.3.0   2018-07-17 [1] CRAN (R 3.5.0)                  
 lobstr      * 1.0.1   2018-12-21 [1] CRAN (R 3.5.0)                  
 magrittr      1.5     2014-11-22 [2] CRAN (R 3.1.3)                  
 memoise       1.1.0   2017-04-21 [1] CRAN (R 3.5.0)                  
 pillar        1.3.1   2018-12-15 [1] CRAN (R 3.5.0)                  
 pkgbuild      1.0.2   2018-10-16 [1] CRAN (R 3.5.0)                  
 pkgconfig     2.0.2   2018-08-16 [1] CRAN (R 3.5.0)                  
 pkgload       1.0.2   2018-10-29 [1] CRAN (R 3.5.0)                  
 prettyunits   1.0.2   2015-07-13 [1] CRAN (R 3.5.0)                  
 processx      3.2.0   2018-08-16 [1] CRAN (R 3.5.0)                  
 ps            1.2.1   2018-11-06 [1] CRAN (R 3.5.0)                  
 purrr         0.2.5   2018-05-29 [1] CRAN (R 3.5.0)                  
 R6            2.3.0   2018-10-04 [1] CRAN (R 3.5.0)                  
 Rcpp          1.0.0   2018-11-07 [1] CRAN (R 3.5.0)                  
 remotes       2.0.2   2018-10-30 [1] CRAN (R 3.5.0)                  
 rlang         0.3.1   2019-01-08 [1] CRAN (R 3.5.0)                  
 rprojroot     1.3-2   2018-01-03 [1] CRAN (R 3.5.0)                  
 sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 3.5.0)                  
 testthat      2.0.1   2018-10-13 [1] CRAN (R 3.5.0)                  
 tibble        2.0.0   2019-01-04 [1] CRAN (R 3.5.0)                  
 tidyselect    0.2.5   2018-10-11 [1] CRAN (R 3.5.0)                  
 usethis       1.4.0   2018-08-14 [1] CRAN (R 3.5.0)                  
 withr         2.1.2   2018-03-15 [1] CRAN (R 3.5.0)

romainfrancois · 2019-01-11T10:01:40Z

Sorry about that, it needs this PR: #4091, which I'll probably merge today.

lock · 2019-07-10T10:27:52Z

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

romainfrancois added the "bug" label (an unexpected problem or unintended behavior) Jan 9, 2019

romainfrancois added this to the 0.8.0 milestone Jan 9, 2019

romainfrancois closed this as completed in 3c53f5f Jan 11, 2019

jangorecki added a commit to h2oai/db-benchmark that referenced this issue Jan 20, 2019

dplyr syntax according to tidyverse/dplyr#4090

d0f256f

lock bot locked and limited conversation to collaborators Jul 10, 2019
