-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Avoid triggering duckplyr's ALTREP row names #6947
Avoid triggering duckplyr's ALTREP row names #6947
Conversation
R/generics.R
Outdated
attributes <- dplyr_attributes(template) | ||
attributes$names <- names(data) | ||
attributes$row.names <- .row_names_info(data, type = 0L) | ||
dplyr_set_attributes(data, attributes) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems we have very limited needs so it might be worth looking into simplifying this operation as much as we can with a small routine that sets names with setAttrib()
and then walks a shallow-duplicated ATTRIB()
and assigns row names with SETCAR()
.
Another thing that isn't quite right about this is the ordering of the attributes that come in vs go out. {geodimension} has an unfortunate test that relies on the attribute ordering to be stable, and I think the current way Browse[2]> attributes(data)
$class
[1] "data.frame"
$names
[1] "state" "division" "nation" "region"
$row.names
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
[26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
[51] 51 52
Browse[2]> attributes
$names
[1] "state" "division" "nation" "region"
$row.names
[1] NA -52
$class
[1] "tbl_df" "tbl" "data.frame"
Browse[2]> attributes(dplyr_set_attributes(data, attributes))
$row.names
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
[26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
[51] 51 52
$names
[1] "state" "division" "nation" "region"
$class
[1] "tbl_df" "tbl" "data.frame" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good!
156b58d
to
7a50f91
Compare
Thanks! Confirming that I no longer need a reconstruct method in duckplyr. |
Closes #6927 (alternative approach)
Part of #6525 (the other half was r-lib/vctrs#1847)
In #6525 (comment), we discovered that one of the places that duckplyr's ALTREP row names are being triggered is in
attributes()
, called on thetemplate
argument ofdplyr_reconstruct()
. Callingattributes()
will try and expand internally compact row names (i.e.c(NA, -5)
) into an ALTREP integer sequence https://github.com/wch/r-source/blob/69b94f0c8ce9b2497f6d7a81922575f6c585b713/src/main/attrib.c#L176-L177. This avoids leaking this internal detail to most users. But this requires callingINTEGER()
orINTEGER_ELT()
on the row names object. For duckplyr's lazy queries, which don't know the resulting number of rows until it happens, that is enough to trigger the lazy query.While working on this, I also discovered that
attributes<-()
is also a problem. It also has special behavior where it tries to create this internal compact row names representation during the setting process https://github.com/wch/r-source/blob/69b94f0c8ce9b2497f6d7a81922575f6c585b713/src/main/attrib.c#L57. Again, this requires looking atINTEGER()
. So ifdata
had duckplyr ALTREP row names, then it would trigger the query upon reconstruction too.I think the best solution for this is to come up with our own attribute getter / setter helpers that try and mimic most of what
do_attributes()
anddo_attributesgets()
do, but with special casing for row names. We could avoid the two helpers and just manipulate the underlying pairlists directly with a "copy most attributes" style helper, but I feel like we may need something like these helpers in the future, and it seemed a little cleaner to me to be able to expose these on the R side, as you get precise control over what "most attributes" means (here, everything but names and row names).Up until now, this has been managed in duckplyr itself, but we think it should live here so duckplyr can remove its C code.