-
-
Notifications
You must be signed in to change notification settings - Fork 8.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[R] Ensure ProxyDMatrix
creation keeps data until next iteration
#11092
Conversation
R-package/src/xgboost_R.cc
Outdated
@@ -682,6 +682,7 @@ XGB_DLL SEXP XGProxyDMatrixCreate_R() { | |||
|
|||
XGB_DLL SEXP XGProxyDMatrixSetDataDense_R(SEXP handle, SEXP R_mat) { | |||
R_API_BEGIN(); | |||
R_SetExternalPtrProtected(handle, R_mat); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, so when is the data released?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My understanding is that the external pointer object has only one slot for a protected SEXP
object. So a call to this function would make the previous object in that slot become unprotected:
https://github.com/wch/r-source/blob/872770ededfc26429adca09fab0803e1b78b6eac/src/main/memory.c#L3855
https://github.com/wch/r-source/blob/872770ededfc26429adca09fab0803e1b78b6eac/src/main/memory.c#L1292
But I'm not 100% sure about it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Going through C macros is not my strong suit ...
Do you think it's difficult to store the data in the ProxyDMatrixWrapper
you wrote in xgboost_R.cc
? Currently, it's not used in the R code, but we can make the R proxy dmatrix creator return the wrapper instead of the real proxy dmatrix handler.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In Python, I store the temporary data in the iterator, but anywhere is fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Going through the code again, it looks like both xgb.ExtMemDMatrix
and xgb.QuantileDMatrix.from_iterator
will end up calling one of these 3 functions in the end, so please ignore the latter comment.
Do you think it's difficult to store the data in the ProxyDMatrixWrapper you wrote in xgboost_R.cc?
I think it could be quite tricky, because the public R C API doesn't allow protecting/unprotecting individual objects (unless by the logic of putting an unprotected object inside a protected container like list/externalptr) - it only offers UNPROTECT(<last N>)
.
It'd have to use a list object and be constantly assigning the same element to it in order to achieve the same effect, which wouldn't have any advantage compared to the mechanism in this PR of assigning to the "protected" slot of the externalptr, assuming that it works the way we expect it to.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
assuming that it works the way we expect it to.
That's internal to R right? Even if it works today, it can break at the next release and we don't have a way to test that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
assuming that it works the way we expect it to.
That's internal to R right? Even if it works today, it can break at the next release and we don't have a way to test that.
No, that's part of the public C API of R, which has pretty strong guarantees. Internal API calls in package code would disqualify packages from being hosted at CRAN.
And either way, it looks like it should end up using the macro from the earlier link that decreases the reference count, unless you manually happen to define internal-use variables before including the R header:
https://github.com/wch/r-source/blob/872770ededfc26429adca09fab0803e1b78b6eac/src/include/Rinternals.h#L191
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, that's part of the public C API of R
hmm, I meant the "external pointer object has only one slot for a protected SEXP object. So a call to this function would make the previous object in that slot become unprotected" behavior. I think it can change without notice since it's not documented.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, that's part of the public C API of R
hmm, I meant the "external pointer object has only one slot for a protected SEXP object. So a call to this function would make the previous object in that slot become unprotected" behavior. I think it can change without notice since it's not documented.
I highly doubt that'd the case, because that's how it works for other objects like lists too, even if the behavior is not documented for all classes.
And even for undocumented behavior, the R team is constantly testing these sorts of things with third-package packages and only makes breaking changes after notifying affected packages, with a very long notice period given to make the change and announcements in the release notes for changes in behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
got it, thank you for sharing.
Looks like |
@trivialfis Just to make sure: in this PR, the data is (was, changed with the last commit) being held right up until the next call to But it gets unprotected before the next call with the next batch of data. Is that the correct behavior or should it be the other way around? i.e. old batch kept alive until current batch setdata finished. |
The previous batch can be freed once the
The |
But when is (4) finalized? In this case, the only interactions with the DMatrix object are through |
When 5 happens. |
Meaning: inside the loop in |
No missing call. In the previous comment, both the
Inside xgboost:
|
I'm still not getting when exactly would the DMatrix be finished with the next batch, but I've added some calls to unprotect the previous batch right before the call to the R-side 'next'. It should in theory allow freeing the earlier batch before the next one is allocated in memory when using external memory. Please take a look to see if that's the correct place. |
may I ask where is the confusion? Apologies for my poor explanation. If a new batch is requested by XGBoost, then the old batch is finished, and XGBoost requests a new batch by calling the next function. Not entirely sure where it got complicated. |
Got it now. So then the current logic as of the last commit should work in such a way that it doesn't require two batches to co-exist in memory. But wouldn't it be more logical to make it so that the data is consumed during the call to |
The previous batch can be freed inside anywhere of this function: xgboost/R-package/R/xgb.DMatrix.R Line 557 in 376009c
preferably before: xgboost/R-package/R/xgb.DMatrix.R Line 569 in 376009c
since we will have two temporary batches (old and new |
It couldn't, the proxy dmatrix is just a proxy to the user input as the name suggests, and doesn't do anything other than hold the reference to the data. Consider it a "data view" that XGBoost understands, instead of a container of the data. We have to create something that wraps the high-level user input for the C++ code to consume, the proxy dmatrix is the wrapper. I should probably call it |
Then the current placement should be fine, since this function: xgboost/R-package/R/xgb.DMatrix.R Line 556 in 376009c
Is only called from the C++ side here: xgboost/R-package/src/xgboost_R.cc Line 747 in 376009c
.. through wrappers like these: xgboost/R-package/R/xgb.DMatrix.R Line 365 in 376009c
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the fix! Looks great.
Looks like this didn't work out correctly. Now it ends up with a null pointer derefence somewhere. |
Maybe we lower the data transformation into C and store the temporary data in the proxy wrapper? |
ref #11088
ref #9810
This PR modifies the logic of the ProxyDMatrix loop to ensure that whatever data it gets fed will be kept alive until the next iteration.