Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[R] Ensure ProxyDMatrix creation keeps data until next iteration #11092

Merged
merged 4 commits into from
Dec 11, 2024

Conversation

david-cortes
Copy link
Contributor

ref #11088
ref #9810

This PR modifies the logic of the ProxyDMatrix loop to ensure that whatever data it gets fed will be kept alive until the next iteration.

@@ -682,6 +682,7 @@ XGB_DLL SEXP XGProxyDMatrixCreate_R() {

XGB_DLL SEXP XGProxyDMatrixSetDataDense_R(SEXP handle, SEXP R_mat) {
R_API_BEGIN();
R_SetExternalPtrProtected(handle, R_mat);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, so when is the data released?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding is that the external pointer object has only one slot for a protected SEXP object. So a call to this function would make the previous object in that slot become unprotected:
https://github.com/wch/r-source/blob/872770ededfc26429adca09fab0803e1b78b6eac/src/main/memory.c#L3855
https://github.com/wch/r-source/blob/872770ededfc26429adca09fab0803e1b78b6eac/src/main/memory.c#L1292

But I'm not 100% sure about it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Going through C macros is not my strong suit ...

Do you think it's difficult to store the data in the ProxyDMatrixWrapper you wrote in xgboost_R.cc? Currently, it's not used in the R code, but we can make the R proxy dmatrix creator return the wrapper instead of the real proxy dmatrix handler.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In Python, I store the temporary data in the iterator, but anywhere is fine.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Going through the code again, it looks like both xgb.ExtMemDMatrix and xgb.QuantileDMatrix.from_iterator will end up calling one of these 3 functions in the end, so please ignore the latter comment.

Do you think it's difficult to store the data in the ProxyDMatrixWrapper you wrote in xgboost_R.cc?

I think it could be quite tricky, because the public R C API doesn't allow protecting/unprotecting individual objects (unless by the logic of putting an unprotected object inside a protected container like list/externalptr) - it only offers UNPROTECT(<last N>).

It'd have to use a list object and be constantly assigning the same element to it in order to achieve the same effect, which wouldn't have any advantage compared to the mechanism in this PR of assigning to the "protected" slot of the externalptr, assuming that it works the way we expect it to.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assuming that it works the way we expect it to.

That's internal to R right? Even if it works today, it can break at the next release and we don't have a way to test that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assuming that it works the way we expect it to.

That's internal to R right? Even if it works today, it can break at the next release and we don't have a way to test that.

No, that's part of the public C API of R, which has pretty strong guarantees. Internal API calls in package code would disqualify packages from being hosted at CRAN.

And either way, it looks like it should end up using the macro from the earlier link that decreases the reference count, unless you manually happen to define internal-use variables before including the R header:
https://github.com/wch/r-source/blob/872770ededfc26429adca09fab0803e1b78b6eac/src/include/Rinternals.h#L191

Copy link
Member

@trivialfis trivialfis Dec 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, that's part of the public C API of R

hmm, I meant the "external pointer object has only one slot for a protected SEXP object. So a call to this function would make the previous object in that slot become unprotected" behavior. I think it can change without notice since it's not documented.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, that's part of the public C API of R

hmm, I meant the "external pointer object has only one slot for a protected SEXP object. So a call to this function would make the previous object in that slot become unprotected" behavior. I think it can change without notice since it's not documented.

I highly doubt that'd the case, because that's how it works for other objects like lists too, even if the behavior is not documented for all classes.

And even for undocumented behavior, the R team is constantly testing these sorts of things with third-package packages and only makes breaking changes after notifying affected packages, with a very long notice period given to make the change and announcements in the release notes for changes in behavior.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

got it, thank you for sharing.

@david-cortes
Copy link
Contributor Author

Looks like xgb.ExtMemDMatrix and xgb.QuantileDMatrix.from_iterator will also need similar protection mechanisms.

@david-cortes
Copy link
Contributor Author

david-cortes commented Dec 11, 2024

@trivialfis Just to make sure: in this PR, the data is (was, changed with the last commit) being held right up until the next call to XGProxyDMatrixSetData*.

But it gets unprotected before the next call with the next batch of data.

Is that the correct behavior or should it be the other way around? i.e. old batch kept alive until current batch setdata finished.

@trivialfis
Copy link
Member

trivialfis commented Dec 11, 2024

Is that the correct behavior or should it be the other way around? i.e. old batch kept alive until current batch setdata finished.

The previous batch can be freed once the next or the reset is called. The events happen in the following order:

  1. iter->next (in R)
  2. generate temp data
  3. set_data called inside the next , exit the next call
  4. xgboost consumes the data in C
  5. iter->next (in R)
  6. generate temp data
  7. set_data called inside the next, exit next call.
  8. xgboost consumes the data in C
    ...

The free for data generated by 2 can happen between 5 and 8. If freed before 6, then we can save some memory as there's only one batch of temporary data can be in the memory. The saved memory doesn't matter much in most cases.

@david-cortes
Copy link
Contributor Author

Is that the correct behavior or should it be the other way around? i.e. old batch kept alive until current batch setdata finished.

The previous batch can be freed once the next or the reset is called. The events happen in the following order:

  1. iter->next (in R)
  2. generate temp data
  3. set_data called inside the next , exit the next call
  4. xgboost consumes the data in C
  5. iter->next (in R)
  6. generate temp data
  7. set_data called inside the next, exit next call.
  8. xgboost consumes the data in C
    ...

The free for data generated by 2 can happen between 5 and 8. If freed before 6, then we can save some memory as there's only one batch of temporary data can be in the memory. The saved memory doesn't matter much in most cases.

But when is (4) finalized? In this case, the only interactions with the DMatrix object are through setdata. Or am I missing some additional call?

@trivialfis
Copy link
Member

But when is (4) finalized?

When 5 happens.

@david-cortes
Copy link
Contributor Author

But when is (4) finalized?

When 5 happens.

Meaning: inside the loop in *FromCallback before calling XGDMatrixCallbackNext(DataIterHandle) ?

@trivialfis
Copy link
Member

trivialfis commented Dec 11, 2024

the only interactions with the DMatrix object are through setdata. Or am I missing some additional call?

No missing call.

In the previous comment, both the setdata and generation of temp data happen INSIDE the next call:

function iter.next() {
   # free here
    xgb.ProxyDMatrix(  # The R function defined in line 555 in xgb.DMatrix.R
        temp = process_df()
        set_data(temp)
    )
    # free here works as well, but we can have two temp data in the memory.
}

Inside xgboost:

while iter.next():
    consume_data_in_proxy_dmatrix

@david-cortes
Copy link
Contributor Author

I'm still not getting when exactly would the DMatrix be finished with the next batch, but I've added some calls to unprotect the previous batch right before the call to the R-side 'next'. It should in theory allow freeing the earlier batch before the next one is allocated in memory when using external memory.

Please take a look to see if that's the correct place.

@trivialfis
Copy link
Member

trivialfis commented Dec 11, 2024

I'm still not getting when exactly would the DMatrix be finished with the next batch

may I ask where is the confusion? Apologies for my poor explanation. If a new batch is requested by XGBoost, then the old batch is finished, and XGBoost requests a new batch by calling the next function. Not entirely sure where it got complicated.

@david-cortes
Copy link
Contributor Author

I'm still not getting when exactly would the DMatrix be finished with the next batch

may I ask where is the confusion? Apologies for my poor explanation. If a new batch is requested by XGBoost, then the old batch is finished, and XGBoost requests a new batch by calling the next function. Not entirely sure where it got complicated.

Got it now. So then the current logic as of the last commit should work in such a way that it doesn't require two batches to co-exist in memory.

But wouldn't it be more logical to make it so that the data is consumed during the call to setdata and not referenced after that call, considering that there aren't any further user functions running between the two?

@trivialfis
Copy link
Member

trivialfis commented Dec 11, 2024

The previous batch can be freed inside anywhere of this function:

lst <- data_iterator$f_next(data_iterator$env)

preferably before:

tmp <- .process.df.for.dmatrix(lst$data, lst$feature_types)

since we will have two temporary batches (old and new tmp) in memory if we free after this line. But it's fine.

@trivialfis
Copy link
Member

trivialfis commented Dec 11, 2024

But wouldn't it be more logical to make it so that the data is consumed during the call to setdata

It couldn't, the proxy dmatrix is just a proxy to the user input as the name suggests, and doesn't do anything other than hold the reference to the data. Consider it a "data view" that XGBoost understands, instead of a container of the data. We have to create something that wraps the high-level user input for the C++ code to consume, the proxy dmatrix is the wrapper. I should probably call it DataReference/DataWrapper or something like that instead of ProxyDMatrix.

@david-cortes
Copy link
Contributor Author

The previous batch can be freed inside anywhere of this function:

lst <- data_iterator$f_next(data_iterator$env)

preferably before:

tmp <- .process.df.for.dmatrix(lst$data, lst$feature_types)

since we will have two temporary batches (old and new tmp) in memory if we free after this line. But it's fine.

Then the current placement should be fine, since this function:

xgb.ProxyDMatrix <- function(proxy_handle, data_iterator) {

Is only called from the C++ side here:

SEXP R_res = Rf_protect(

.. through wrappers like these:

iterator_next <- function() {

Copy link
Member

@trivialfis trivialfis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the fix! Looks great.

@trivialfis trivialfis merged commit 3162e0d into dmlc:master Dec 11, 2024
57 of 58 checks passed
@david-cortes
Copy link
Contributor Author

Looks like this didn't work out correctly. Now it ends up with a null pointer derefence somewhere.

@trivialfis
Copy link
Member

trivialfis commented Dec 15, 2024

Maybe we lower the data transformation into C and store the temporary data in the proxy wrapper?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants