[meta] data.table and dplyr footguns #618
Comments
We discussed a few responses: Mitigation A: distrust key in
We may also need to review epiprocess internals for these footguns. Caught an
Another thing to check epiprocess conversions for and/or report upstream:
library(dplyr)
library(data.table)
data.frame(t = 1:5, y = 1:5) %>% as.data.table(key = "t") %>% key()
#> [1] "t"
tibble(t = 1:5, y = 1:5) %>% as.data.table(key = "t") %>% key()
#> NULL
from this line in
[Filed this one upstream here. I also need this to get #611 tests to run.]
This issue is meant to collect a number of problems that come from the intersection of data.table and dplyr. It's also a place to coordinate finding a solution.
The broad summary is that (a) dplyr can mismanage internal data.table attributes while still presenting an object whose class claims to be a valid data.table, and (b) dplyr can cause memory ownership issues by injecting variables into the data.table's memory model.
Example 1 (data.table and in-place memory operations)
Data.table has a number of in-place memory operations that side-step R's copy-on-write model (tl;dr: when an object is assigned to another variable, R delays copying it until the object is written to while more than one variable references it). These are the := and the set* operations.
Example 2 (dplyr can introduce aliasing)
Dplyr can side-step data.table's memory model and introduce variables into a data.table that can then be modified in place, so that a write to the table also changes the original variable.
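A minimal sketch of Examples 1 and 2, not taken from the original report. The first half uses only data.table and shows the by-reference semantics of := and set(). The second half is a hedged probe: it assumes dplyr is installed and only reports whether dplyr::mutate() stored the vector x without copying it, since that behavior may depend on package versions. All table and variable names here are made up for illustration.

```r
library(data.table)

# Example 1: `:=` and set() write into the existing table instead of copying.
dt <- data.table(t = 1:3, y = c(10, 20, 30))
alias <- dt                     # plain assignment: both names refer to one table
alias[, y := y * 2]             # modifies that shared table in place
print(dt$y)                     # dt changed as well: 20 40 60

independent <- copy(dt)         # copy() gives a table with its own memory
set(independent, i = 1L, j = "y", value = 0)
print(dt$y[1])                  # dt is unaffected: 20

# Example 2 (assumption: dplyr::mutate() may store `x` as a column without
# copying it). We only compare memory addresses and do not claim a result.
shares_memory <- NA
if (requireNamespace("dplyr", quietly = TRUE)) {
  x <- c(1, 2, 3)
  via_dplyr <- dplyr::mutate(dt, z = x)
  shares_memory <- address(via_dplyr$z) == address(x)
}
print(shares_memory)            # TRUE would mean an in-place write to z also changes x
```

If the probe reports TRUE, a later in-place write to the z column would also be visible through x, which is the aliasing described in Example 2.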
Example 3 (dplyr can break data.table's internal promises about key-ordering)
Data.table keeps track of whether its table is sorted and avoids unnecessary sorts by using an attribute. Dplyr can alter the table without updating this attribute, breaking consistency.
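A minimal sketch of how a stale key can be detected. data.table stores the key in the "sorted" attribute, which key() reads back. The helper key_matches_order() is hypothetical (it is not part of data.table or epiprocess), and the stale table is simulated with setattr() so the outcome does not depend on dplyr's behavior. The final dplyr::arrange() probe assumes dplyr is installed and prints what it finds rather than asserting a result.

```r
library(data.table)

# Hypothetical helper: TRUE when the rows really are ordered by the claimed key.
key_matches_order <- function(dt) {
  k <- key(dt)
  if (is.null(k)) return(TRUE)  # no key claimed, nothing to verify
  cols <- lapply(k, function(col) dt[[col]])
  # Radix ordering is stable and uses the C locale; NAs first, as in data.table keys.
  ord <- do.call(order, c(cols, list(na.last = FALSE, method = "radix")))
  identical(ord, seq_len(nrow(dt)))
}

keyed <- data.table(t = c(3L, 1L, 2L), y = c("c", "a", "b"), key = "t")
print(attr(keyed, "sorted"))      # the key lives in the "sorted" attribute
print(key_matches_order(keyed))   # rows were sorted when the key was set

# Simulated stale key: claim the table is sorted by t without sorting it.
stale <- data.table(t = c(3L, 1L, 2L), y = c("c", "a", "b"))
setattr(stale, "sorted", "t")
print(key(stale))                 # still reports "t"
print(key_matches_order(stale))   # the rows do not match that claim

# Probe (assumption: dplyr is installed): does arrange() keep the old key?
if (requireNamespace("dplyr", quietly = TRUE)) {
  rearranged <- dplyr::arrange(keyed, dplyr::desc(t))
  print(key(rearranged))
  print(key_matches_order(rearranged))
}
```

The check costs a full ordering pass, so it trades speed for not having to trust the attribute.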
Example 4
Example 3 can break epix_merge calls: the merge operations assume that tables are ordered by their keys, but the archives given to x and y can come in pretending to be sorted if direct dplyr manipulation was performed on the underlying DT.
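One possible defensive pattern, sketched on plain data.tables that stand in for the archives' underlying DTs. It does not call epix_merge and is not the project's chosen mitigation. The helper rebuild_key() and the column names time_value, a, and b are hypothetical. The idea is to discard the claimed key and set it again, which forces a real sort before any key-dependent merge.

```r
library(data.table)

# Hypothetical helper: drop the claimed key and rebuild it on a copy,
# so the caller's table is not modified by reference.
rebuild_key <- function(dt) {
  k <- key(dt)
  if (is.null(k)) return(dt)      # nothing to rebuild
  dt <- copy(dt)
  setattr(dt, "sorted", NULL)     # forget the possibly stale claim
  setkeyv(dt, k)                  # sorts the rows and sets the key again
  dt
}

# Stand-in for an archive table whose key is stale (simulated with setattr()).
x <- data.table(time_value = c(3L, 1L, 2L), a = c("c", "a", "b"))
setattr(x, "sorted", "time_value")

# Stand-in for a second table with a valid key.
y <- data.table(time_value = 1:3, b = c(10, 20, 30), key = "time_value")

x_fixed <- rebuild_key(x)
merged <- merge(x_fixed, rebuild_key(y), by = "time_value", all = TRUE)
print(merged)
```

Rebuilding the key on every merge adds a sort per call; verifying the order first and rebuilding only on mismatch would be a cheaper variant.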