R crashes when many variables in group_by in dplyr 0.8.0 #4090
Thanks. I can reproduce this. This looks related to how the grouping structure is done after the summarise.
While 0a1b62d resolved the problem in the code provided by @dfalbel, I think there is more to fix here.
N=1e4L
K=100L
set.seed(108)
DF = data.frame(
id1 = sample(sprintf("id%03d",1:K), N, TRUE), # large groups (char)
id2 = sample(sprintf("id%03d",1:K), N, TRUE), # large groups (char)
id3 = sample(sprintf("id%010d",1:(N/K)), N, TRUE), # small groups (char)
id4 = sample(K, N, TRUE), # large groups (int)
id5 = sample(K, N, TRUE), # large groups (int)
id6 = sample(N/K, N, TRUE), # small groups (int)
v1 = sample(5, N, TRUE), # int in range [1,5]
v2 = sample(5, N, TRUE), # int in range [1,5]
v3 = sample(round(runif(100,max=100),4), N, TRUE) # numeric e.g. 23.5749
)
print(object.size(DF), units="MB")
MB_used = function() as.numeric(system(sprintf("ps -o rss %s | tail -1", Sys.getpid()), intern=TRUE))/1024
suppressPackageStartupMessages(library(dplyr))
MB_used()
ans <- DF %>% group_by(id1, id2, id3, id4, id5, id6) %>% summarise(v3=sum(v3), count=n())
MB_used()
With
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(lobstr)
N=1e4L
K=100L
set.seed(108)
DF = data.frame(
id1 = sample(sprintf("id%03d",1:K), N, TRUE), # large groups (char)
id2 = sample(sprintf("id%03d",1:K), N, TRUE), # large groups (char)
id3 = sample(sprintf("id%010d",1:(N/K)), N, TRUE), # small groups (char)
id4 = sample(K, N, TRUE), # large groups (int)
id5 = sample(K, N, TRUE), # large groups (int)
id6 = sample(N/K, N, TRUE), # small groups (int)
v1 = sample(5, N, TRUE), # int in range [1,5]
v2 = sample(5, N, TRUE), # int in range [1,5]
v3 = sample(round(runif(100,max=100),4), N, TRUE) # numeric e.g. 23.5749
)
mem_used()
#> 42,526,976 B
grouped <- DF %>%
group_by(id1, id2, id3, id4, id5, id6)
n_groups(grouped)
#> [1] 1000053
mem_used()
#> 131,497,112 B
grouped <- DF %>%
group_by(id1, id2, id3, id4, id5, id6, .drop = TRUE)
n_groups(grouped)
#> [1] 10000
mem_used()
#> 44,374,392 B
A million integer vectors will indeed use a lot of memory. Not sure what I can do here.
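As a rough illustration of the point about a million integer vectors, here is a minimal base R sketch. It assumes each group keeps its row indices in a separate integer vector and that empty groups keep an empty one; `make_group_rows()` and `size_mb()` are hypothetical helpers written for this example, not dplyr functions, and the sizes it reports are only an approximation of what the grouping structure costs.

```r
# Hypothetical stand-in for per-group row indices: one integer vector per
# group, empty for groups that match no rows (an assumption for illustration).
make_group_rows <- function(n_groups, n_rows) {
  rows <- lapply(seq_len(n_groups), function(i) integer(0))
  # Put one row index in each of the first n_rows groups; the rest stay empty.
  filled <- seq_len(min(n_rows, n_groups))
  rows[filled] <- as.list(filled)
  rows
}

# Size of an object in MB as reported by object.size().
size_mb <- function(x) as.numeric(object.size(x)) / 1024^2

rows_all     <- make_group_rows(1000053L, 1e4L)  # empty groups kept
rows_dropped <- make_group_rows(1e4L, 1e4L)      # empty groups dropped

cat(sprintf("all groups: %.1f MB\n", size_mb(rows_all)))
cat(sprintf("non-empty:  %.1f MB\n", size_mb(rows_dropped)))
```

Even an empty vector carries a fixed header, so the cost scales with the number of groups rather than with the number of rows.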
@romainfrancois I am running your code on the latest master and get:
grouped <- DF %>% group_by(id1, id2, id3, id4, id5, id6)
n_groups(grouped)
#[1] 1000053
grouped <- DF %>% group_by(id1, id2, id3, id4, id5, id6, .drop = TRUE)
n_groups(grouped)
#[1] 1000053
Sorry about that, it needs this PR: #4091, which I'll probably merge today.
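To check which behaviour an installed dplyr gives, a small data frame with known factor levels is easier to reason about than the full reprex. This is only a sketch: it assumes a dplyr version in which `group_by()` accepts the `.drop` argument, and the data frame `df` is made up for the example.

```r
suppressPackageStartupMessages(library(dplyr))

# Toy data: factor f has an unused level "c", so some level combinations
# of f and g never occur in the data.
df <- data.frame(
  f = factor(c("a", "b"), levels = c("a", "b", "c")),
  g = factor(c("x", "y"), levels = c("x", "y")),
  v = 1:2
)

# Keep groups for combinations of factor levels that have no rows.
n_kept <- n_groups(group_by(df, f, g, .drop = FALSE))

# Keep only the groups that actually occur in the data.
n_dropped <- n_groups(group_by(df, f, g, .drop = TRUE))

cat("groups with .drop = FALSE:", n_kept, "\n")
cat("groups with .drop = TRUE: ", n_dropped, "\n")
```

If both calls report the same number of groups, the installed build most likely does not yet honour `.drop`, which matches the situation described above.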
This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/
The following code is crashing my R session for some reason in dplyr 0.8.0. It works great with 0.7.8.
The last line blocks the R session and eventually crashes. My dplyr version is:
dplyr * 0.8.0.9000 2019-01-03 [1] Github (tidyverse/dplyr@f0993bb)