Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid triggering duckplyr's ALTREP row names #6947

Merged
merged 5 commits into from
Nov 6, 2023

Conversation

DavisVaughan
Copy link
Member

Closes #6927 (alternative approach)
Part of #6525 (the other half was r-lib/vctrs#1847)

In #6525 (comment), we discovered that one of the places that duckplyr's ALTREP row names are being triggered is in attributes(), called on the template argument of dplyr_reconstruct(). Calling attributes() will try and expand internally compact row names (i.e. c(NA, -5)) into an ALTREP integer sequence https://github.com/wch/r-source/blob/69b94f0c8ce9b2497f6d7a81922575f6c585b713/src/main/attrib.c#L176-L177. This avoids leaking this internal detail to most users. But this requires calling INTEGER() or INTEGER_ELT() on the row names object. For duckplyr's lazy queries, which don't know the resulting number of rows until it happens, that is enough to trigger the lazy query.

While working on this, I also discovered that attributes<-() is also a problem. It also has special behavior where it tries to create this internal compact row names representation during the setting process https://github.com/wch/r-source/blob/69b94f0c8ce9b2497f6d7a81922575f6c585b713/src/main/attrib.c#L57. Again, this requires looking at INTEGER(). So if data had duckplyr ALTREP row names, then it would trigger the query upon reconstruction too.

I think the best solution for this is to come up with our own attribute getter / setter helpers that try and mimic most of what do_attributes() and do_attributesgets() do, but with special casing for row names. We could avoid the two helpers and just manipulate the underlying pairlists directly with a "copy most attributes" style helper, but I feel like we may need something like these helpers in the future, and it seemed a little cleaner to me to be able to expose these on the R side, as you get precise control over what "most attributes" means (here, everything but names and row names).

Up until now, this has been managed in duckplyr itself, but we think it should live here so duckplyr can remove its C code.

src/attributes.cpp Outdated Show resolved Hide resolved
src/attributes.cpp Outdated Show resolved Hide resolved
src/attributes.cpp Outdated Show resolved Hide resolved
R/generics.R Outdated
Comment on lines 205 to 208
attributes <- dplyr_attributes(template)
attributes$names <- names(data)
attributes$row.names <- .row_names_info(data, type = 0L)
dplyr_set_attributes(data, attributes)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems we have very limited needs so it might be worth looking into simplifying this operation as much as we can with a small routine that sets names with setAttrib() and then walks a shallow-duplicated ATTRIB() and assigns row names with SETCAR().

@DavisVaughan
Copy link
Member Author

Another thing that isn't quite right about this is the ordering of the attributes that come in vs go out. {geodimension} has an unfortunate test that relies on the attribute ordering to be stable, and I think the current way r_attrib_push() works means that row.names get pushed to the front

Browse[2]> attributes(data)
$class
[1] "data.frame"

$names
[1] "state"    "division" "nation"   "region"  

$row.names
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
[26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
[51] 51 52

Browse[2]> attributes
$names
[1] "state"    "division" "nation"   "region"  

$row.names
[1]  NA -52

$class
[1] "tbl_df"     "tbl"        "data.frame"

Browse[2]> attributes(dplyr_set_attributes(data, attributes))
$row.names
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
[26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
[51] 51 52

$names
[1] "state"    "division" "nation"   "region"  

$class
[1] "tbl_df"     "tbl"        "data.frame"

@DavisVaughan
Copy link
Member Author

DavisVaughan commented Nov 2, 2023

@lionel- I rewrote this to be a pure pairlist-based approach. It is approaching the original approach from #6927, but with some simplifying assumptions, comments, and internal error checks.

I think it needs a full re-review unfortunately

Revdeps running at "006d5722-c48b-46b5-8752-a3523447ba0b"

Copy link
Member

@lionel- lionel- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

src/reconstruct.cpp Show resolved Hide resolved
@DavisVaughan DavisVaughan merged commit 3d6c047 into tidyverse:main Nov 6, 2023
11 checks passed
@DavisVaughan DavisVaughan deleted the feature/cpp-reconstruct branch November 6, 2023 14:16
@krlmlr
Copy link
Member

krlmlr commented Nov 6, 2023

Thanks! Confirming that I no longer need a reconstruct method in duckplyr.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants