
[meta] data.table and dplyr footguns #618

Open
0 of 1 issue completed
@dshemetov


This issue is meant to collect a number of problems that come from the intersection of data.table and dplyr. It's also a place to coordinate finding a solution.

In short: (a) dplyr can mismanage data.table's internal attributes while the result still carries a valid-looking data.table class, and (b) dplyr can cause memory-ownership problems by injecting outside variables into a data.table's memory model.

Example 1 (data.table and in-place memory operations)

data.table has a number of in-place operations that side-step R's copy-on-write model (tl;dr: when one variable is assigned to another, R delays the copy until the object is written to while more than one variable references it). These are := and the set* functions.

library(data.table)
 
dt = data.table(x = 1:3, y = 4:6)
tracemem(dt)
dt2 <- dt
 
# This will change dt2 and dt.
dt2[, z := 7:9]
print(dt)
 
# This, however, does not change dt.
dt2[1, ] <- list(1, 2, 3)
print(dt)
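
A minimal sketch of how to avoid the shared-reference behaviour above, assuming the goal is an independent table: data.table's copy() makes a deep copy, so := on the original or on an alias no longer reaches it. The names alias and indep are made up for illustration.

```r
library(data.table)

dt <- data.table(x = 1:3, y = 4:6)
alias <- dt        # plain assignment: both names refer to the same table
indep <- copy(dt)  # deep copy: a separate table with its own columns

# := works by reference, so the new column is visible through dt as well.
alias[, z := 7:9]

shared_changed <- "z" %in% names(dt)         # dt sees the change made via alias
copy_untouched <- !("z" %in% names(indep))   # the deep copy does not
print(c(shared_changed, copy_untouched))
```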

Example 2 (dplyr can introduce aliasing)

dplyr can side-step data.table's memory model and place an outside variable into a data.table as a column; data.table can then modify that variable in place.

library(data.table)
library(dplyr)
library(magrittr)

DT <- data.table(a = 1:5)
vec <- 2:6
DT2 <- DT %>% mutate(b = vec)
# This will change vec
DT2[1:2, b := 1:2]
# See that vec has changed
print(vec)
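
One possible guard, sketched here under the assumption that an extra copy is acceptable: wrap the dplyr result in data.table's copy() before any by-reference update, so column b no longer shares memory with vec. This is an illustration, not a fix proposed in this issue.

```r
library(data.table)
library(dplyr)

DT <- data.table(a = 1:5)
vec <- 2:6

# copy() deep-copies the mutate() result, including column b.
DT2 <- copy(DT %>% mutate(b = vec))

# The in-place update now touches only DT2's own column.
DT2[1:2, b := 1:2]

print(vec)    # vec keeps its original values
print(DT2$b)
```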

Example 3 (dplyr can break data.table's internal promises about key-ordering)

data.table records the columns its table is sorted by in the sorted attribute and uses it to avoid unnecessary sorts. dplyr can reorder the table without updating this attribute, leaving the two inconsistent.

DT <- data.table(a = 5:1, b = 6:10)
setkeyv(DT, "a")
DT
attr(DT, "sorted") # "a"

# Overwrite the key column with data.table: the sorted attribute is dropped,
# so the next setkeyv() call re-sorts DT.
DT[,a:=5:1]
attr(DT, "sorted") # NULL
DT
setkeyv(DT, "a")
attr(DT, "sorted") # "a"

# Reverse the order of the rows with dplyr and it will not resort, because
# dplyr leaves the data.table sorted attribute unchanged.
DT <- DT %>% arrange(desc(a))
attr(DT, "sorted") # "a"
setkeyv(DT, "a") # This will not sort the data.table, since the table thinks it's already sorted.
DT
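
A defensive sketch for the stale-key case, under two loud assumptions: the stale attribute is simulated by hand with setattr() (standing in for the arrange() result above), and key_is_stale is a hypothetical helper that only handles a single key column. Clearing the key with setkey(DT, NULL) before re-keying makes data.table forget the claimed order, so setkeyv() has to sort for real.

```r
library(data.table)

# Hypothetical helper: TRUE if the table claims a key but the first key
# column is not actually in ascending order. Single-column keys only.
key_is_stale <- function(dt) {
  k <- key(dt)
  !is.null(k) && is.unsorted(dt[[k[1]]])
}

# Simulate the inconsistent state: rows are descending in a, but the
# sorted attribute claims the table is keyed (ascending) by a.
DT <- data.table(a = 5:1, b = 6:10)
setattr(DT, "sorted", "a")
stale_before <- key_is_stale(DT)

# Drop the claimed key, then re-key so the sort is actually performed.
setkey(DT, NULL)
setkeyv(DT, "a")
stale_after <- key_is_stale(DT)

print(c(stale_before, stale_after))
```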

Example 4 (a stale sorted attribute can break epix_merge)

Example 3 can break epix_merge calls: the merge operations assume that tables are ordered by their keys, but the archives passed as x and y can claim to be sorted when they are not, if the underlying DT was manipulated directly with dplyr.
