Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[meta] data.table and dplyr footguns #618

Open
dshemetov opened this issue Mar 1, 2025 · 3 comments
Open

[meta] data.table and dplyr footguns #618

dshemetov opened this issue Mar 1, 2025 · 3 comments
Assignees

Comments

@dshemetov
Copy link
Contributor

dshemetov commented Mar 1, 2025

This issue is meant to collect a number of problems that come from the intersection of data.table and dplyr. It's also a place to coordinate finding a solution.

The broad summary is that (a) dplyr can mismanage internal data.table attributes, while spoofing a valid data.table class, (b) dplyr can cause memory ownership issues by injecting variables in the data.table's memory model.

Example 1 (data.table and in-place memory operations)

Data.table has a number of in-place memory operations that side-step R's copy on write model (tl;dr: R will delay copying objects when assigning variables until those objects are written to and have more than one referencing variables). These are the := and the set* operations.

library(data.table)
 
dt = data.table(x = 1:3, y = 4:6)
tracemem(dt)
dt2 <- dt
 
# This will change dt2 and dt.
dt2[, z := 7:9]
print(dt)
 
This however does not
dt2[1, ] <- list(1, 2, 3)
print(dt)

Example 2 (dplyr can introduce aliasing)

Dplyr can side-step data.table's memory model and introduce variables into data.table that can then be modified in-place.

library(data.table)
library(dplyr)
library(magrittr)

DT <- data.table(a = 1:5)
vec <- 2:6
DT2 <- DT %>% mutate(b = vec)
# This will change vec
DT2[1:2, b := 1:2]
# See that vec has changed
print(vec)

Example 3 (dplyr can break data.table's internal promises about key-ordering)

Data.table keeps track of whether its table is sorted and avoids unnecessary sorts by using an attribute. Dplyr can alter the table without updating this attribute, breaking consistency.

DT <- data.table(a = 5:1, b = 6:10)
setkeyv(DT, "a")
DT
attr(DT, "sorted") # "a"

# Reverse the order of the rows with data.table and DT will resort.
DT[,a:=5:1]
attr(DT, "sorted") # NULL
DT
setkeyv(DT, "a")
attr(DT, "sorted") # "a"

# Reverse the order of the rows with dplyr and it will not resort, because
# dplyr leaves the data.table sorted attribute unchanged.
DT <- DT %>% arrange(desc(a))
attr(DT, "sorted") # "a"
setkeyv(DT, "a") # This will not sort the data.table, since the table thinks it's already sorted.
DT

Example 4

The previous example 3 can break epix_merge calls, since the merge operations assume tables ordered by keys, but the archives given to x and y can come in pretending to be sorted, if direct dplyr manipulation on the underlying DT was performed.

@brookslogan
Copy link
Contributor

brookslogan commented Mar 1, 2025

We discussed a few responses:

Mitigation A: distrust key in as_epi_archive() and/or new_epi_archive()

Force sorting by the intended key to occur. If we're given a data.table, do something like %>% as.data.frame() %>% setDT(key = key_vars). This prevents both us (``identical(key_vars, key(data_table))) and data.table` from trusting the sortedness attributes. Addresses Example 3.

Note: as.data.table already ensures we copy, dealing with Example 2. This probably resolves most/all soundness problems with $DT %>% <dplyr ops> %>% as_epi_archive(). Might be wary of other ops like $<-, inset ([<-) and inset2 ([[<-); haven't tested logic there.

[Throwing this into #611, as it fits naturally into some other fixes below that seemed semi-necessary in #611 tests.]

Solution B: provide dplyr verbs we need for archives

Approach 1: converting to basic data.frame or tibble, forwarding to dplyr, converting back to data.table

Might involve 1 or 2 explicit copies of the table.

  • Requires some thought to determine whether we can skip those copies.
  • We may need to add special processing for renaming/dropping of key columns and/or required columns, invalidation of versions_end, etc. We may want to restrict some operations from touching the version column since with diffed data it's harder to reason about.
  • Operations like epix_merge() shouldn't need changes.

Approach 2: using {dtplyr}

Replace archive $DT data.tables with lazy_dts.

  • dplyr verb implementations would forward to lazy_dt impls rather than data.frame ones.
  • Some of the work validating/updating metadata may already done for us, but probably the majority is still there.
  • Operations relying on data.table ops not available in dplyr+dtplyr would need to convert from lazy_dt to DT, perform the operations, then convert back to lazy_dt.

One thing to check up on: as.data.table(lazy_dt(DT)) aliases DT; that might potentially be trouble.

Another thing: check computational speed & max RAM usage. Don't expect anything too disastrous. Could potentially make more of a difference if we make a bunch of small archives, but when I've wanted to do that it also ends up being better performance-wise to just destructure into not-an-archive.


Adding some later notes here as well.

Solution C: encapsulate $DT

Don't let users access $DT directly; require them to go through some function that will properly copy and/or convert to a tibble (via %>% as.data.frame() %>% as_tibble()) that will properly copy and/or convert to a tibble.

This seems like a more proper solution than Mitigation A, which still involves temporarily violating data.table's memory model, even if it is indeed unlikely/impossible to generate errors. However, compared to Solution B, it's not solving the other issue of providing common dplyr verb implementations for epi_archives. It could be coupled with Solution B, and might be a utility we would want to use inside some dplyr verb implementations (though there may be more performant approaches).

@brookslogan
Copy link
Contributor

brookslogan commented Mar 5, 2025

We may also need to review epiprocess internals for these footguns. Caught an arrange() on data.table in apply_compactify() in an edge case that hopefully we wouldn't trigger but would be bad if we did. Folding that change into #611.

@brookslogan
Copy link
Contributor

brookslogan commented Mar 10, 2025

Another thing to check epiprocess conversions for and/or report upstream

library(dplyr)
library(data.table)
data.frame(t = 1:5, y = 1:5) %>% as.data.table(key = "t") %>% key()
#> [1] "t"
tibble(t = 1:5, y = 1:5) %>% as.data.table(key = "t") %>% key()
#> NULL

from this line in data.table:::as.data.table.data.frame:

  if (!identical(class(x), "data.frame")) return(as.data.table(as.data.frame(x)))

[Filed this one upstream here. I also need this to get #611 tests to run.]

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants