Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[R] Ensure ProxyDMatrix creation keeps data until next iteration #11092

Merged
merged 4 commits into from
Dec 11, 2024
Merged
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 0 additions & 1 deletion R-package/R/xgb.DMatrix.R
Original file line number Diff line number Diff line change
Expand Up @@ -569,7 +569,6 @@ xgb.ProxyDMatrix <- function(proxy_handle, data_iterator) {
tmp <- .process.df.for.dmatrix(lst$data, lst$feature_types)
lst$feature_types <- tmp$feature_types
.Call(XGProxyDMatrixSetDataColumnar_R, proxy_handle, tmp$lst)
rm(tmp)
} else if (is.matrix(lst$data)) {
.Call(XGProxyDMatrixSetDataDense_R, proxy_handle, lst$data)
} else if (inherits(lst$data, "dgRMatrix")) {
Expand Down
3 changes: 3 additions & 0 deletions R-package/src/xgboost_R.cc
Original file line number Diff line number Diff line change
Expand Up @@ -682,6 +682,7 @@ XGB_DLL SEXP XGProxyDMatrixCreate_R() {

XGB_DLL SEXP XGProxyDMatrixSetDataDense_R(SEXP handle, SEXP R_mat) {
R_API_BEGIN();
R_SetExternalPtrProtected(handle, R_mat);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, so when is the data released?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding is that the external pointer object has only one slot for a protected SEXP object. So a call to this function would make the previous object in that slot become unprotected:
https://github.com/wch/r-source/blob/872770ededfc26429adca09fab0803e1b78b6eac/src/main/memory.c#L3855
https://github.com/wch/r-source/blob/872770ededfc26429adca09fab0803e1b78b6eac/src/main/memory.c#L1292

But I'm not 100% sure about it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Going through C macros is not my strong suit ...

Do you think it's difficult to store the data in the ProxyDMatrixWrapper you wrote in xgboost_R.cc? Currently, it's not used in the R code, but we can make the R proxy dmatrix creator return the wrapper instead of the real proxy dmatrix handler.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In Python, I store the temporary data in the iterator, but anywhere is fine.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Going through the code again, it looks like both xgb.ExtMemDMatrix and xgb.QuantileDMatrix.from_iterator will end up calling one of these 3 functions in the end, so please ignore the latter comment.

Do you think it's difficult to store the data in the ProxyDMatrixWrapper you wrote in xgboost_R.cc?

I think it could be quite tricky, because the public R C API doesn't allow protecting/unprotecting individual objects (unless by the logic of putting an unprotected object inside a protected container like list/externalptr) - it only offers UNPROTECT(<last N>).

It'd have to use a list object and be constantly assigning the same element to it in order to achieve the same effect, which wouldn't have any advantage compared to the mechanism in this PR of assigning to the "protected" slot of the externalptr, assuming that it works the way we expect it to.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assuming that it works the way we expect it to.

That's internal to R right? Even if it works today, it can break at the next release and we don't have a way to test that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assuming that it works the way we expect it to.

That's internal to R right? Even if it works today, it can break at the next release and we don't have a way to test that.

No, that's part of the public C API of R, which has pretty strong guarantees. Internal API calls in package code would disqualify packages from being hosted at CRAN.

And either way, it looks like it should end up using the macro from the earlier link that decreases the reference count, unless you manually happen to define internal-use variables before including the R header:
https://github.com/wch/r-source/blob/872770ededfc26429adca09fab0803e1b78b6eac/src/include/Rinternals.h#L191

Copy link
Member

@trivialfis trivialfis Dec 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, that's part of the public C API of R

hmm, I meant the "external pointer object has only one slot for a protected SEXP object. So a call to this function would make the previous object in that slot become unprotected" behavior. I think it can change without notice since it's not documented.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, that's part of the public C API of R

hmm, I meant the "external pointer object has only one slot for a protected SEXP object. So a call to this function would make the previous object in that slot become unprotected" behavior. I think it can change without notice since it's not documented.

I highly doubt that'd the case, because that's how it works for other objects like lists too, even if the behavior is not documented for all classes.

And even for undocumented behavior, the R team is constantly testing these sorts of things with third-package packages and only makes breaking changes after notifying affected packages, with a very long notice period given to make the change and announcements in the release notes for changes in behavior.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

got it, thank you for sharing.

DMatrixHandle proxy_dmat = R_ExternalPtrAddr(handle);
int res_code;
{
Expand All @@ -695,6 +696,7 @@ XGB_DLL SEXP XGProxyDMatrixSetDataDense_R(SEXP handle, SEXP R_mat) {

XGB_DLL SEXP XGProxyDMatrixSetDataCSR_R(SEXP handle, SEXP lst) {
R_API_BEGIN();
R_SetExternalPtrProtected(handle, lst);
DMatrixHandle proxy_dmat = R_ExternalPtrAddr(handle);
int res_code;
{
Expand All @@ -715,6 +717,7 @@ XGB_DLL SEXP XGProxyDMatrixSetDataCSR_R(SEXP handle, SEXP lst) {

XGB_DLL SEXP XGProxyDMatrixSetDataColumnar_R(SEXP handle, SEXP lst) {
R_API_BEGIN();
R_SetExternalPtrProtected(handle, lst);
DMatrixHandle proxy_dmat = R_ExternalPtrAddr(handle);
int res_code;
{
Expand Down
Loading