-
-
Notifications
You must be signed in to change notification settings - Fork 8.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[R] Add data iterator, quantile dmatrix, and missing feature_types
#9913
Conversation
@@ -32,6 +32,10 @@ export(slice) | |||
export(xgb.DMatrix) | |||
export(xgb.DMatrix.hasinfo) | |||
export(xgb.DMatrix.save) | |||
export(xgb.DataIter) | |||
export(xgb.ExternalDMatrix) | |||
export(xgb.ProxyDMatrix) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if we need to export the proxy DMatrix, users are unlikely to interact with it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's the function that needs to be called inside the data iterator. The class constructor is not exported.
@@ -32,6 +32,10 @@ export(slice) | |||
export(xgb.DMatrix) | |||
export(xgb.DMatrix.hasinfo) | |||
export(xgb.DMatrix.save) | |||
export(xgb.DataIter) | |||
export(xgb.ExternalDMatrix) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's not expose this as an independent class for now. One day we might support external storage with QuantileDMatrix.
What do you think if we make the data
parameter of DMatrix
to be compatible with the iterator?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If QuantileDMatrix
is a different class from from DMatrix
, I think it does make sense to have an ExternalDMatrix
as a different class, since just like QDM, it behaves differently from a regular DMatrix.
If implemented through xgb.DMatrix
by passing an iterator as first argument, I think the function signature would be quite unintuitive, since for example label
would be an argument to both the iterator's ProxyDMatrix
and to DMatrix
.
Would also leave the question about what happens if e.g. the user passes some features names in the iterator, and then different names in DMatrix, and so on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense, I have been struggling with this since the first review and have been stalling the decision (my bad). Will look closer in the new class.
R-package/R/xgb.DMatrix.R
Outdated
#' with a value of zero when the stream of data ends (all batches in the iterator have been consumed), | ||
#' or an integer with a positive value when there are more batches in the line to be consumed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is one of the paper cuts I made to XGBoost, I think it's better to use boolean value instead of 0 and 1. At the time I was dealing with the C callbacks and the need to create appropriate wrappers skipped my mind.
R-package/R/xgb.DMatrix.R
Outdated
#' \item If created as a regular 'DMatrix', the data is accessed on-demand as needed, multiple | ||
#' times, without being concatenated (but note that fields like 'label' \bold{will} be concatenated | ||
#' from multiple calls to the data iterator). | ||
#' \item If created as a 'QuantileDMatrix', the quantized representation of the full data will | ||
#' be created in memory, concatenated from multiple calls to the data iterator. The quantized | ||
#' version is typically lighter than the original data, so there might be cases in which this | ||
#' representation could potentially fit in memory even if the full data doesn't. | ||
#' } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I remain wary about the interface here. We currently have:
- DMatrix
- QuantileDMatrix
- DMatrix + Iter => External Memory training
- QuantileDMatrix + iter => Instance construction != External memory training.
As you are already aware, the last option doesn't fetch data from external memory during training, it's more an option for distributed frameworks that manage data by chunks. In XGBoost, when we talk about external memory, we usually specifically mean training, which is fetching data on demand.
- (Potential future) Quantile + iter => External memory training.
I think there's a need to clarify these types (and their names). I agree with the preference that external memory DMatrix can have its type, but mixing it with QDM seems a bit confusing. From a Python/C++ programmer's perspective, maybe we don't need a new type for it, but a new constructor instead? Alternatively, we can defer the decision by not supporting iterator with QDM at the moment.
Discussions are welcomed!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it does sound quite confusing, but I don't think the python interface is the best reference here.
If I understand it correctly, the issue here is that there's two routes in this PR which give the same object type:
xgb.QuantileDMatrix
+ in-memory data.xgb.ExternalDMatrix
+xgb.DataIter
+as_quantile_dmatrix=TRUE
.
But if this were to be implemented the same way as in python (accepting xgb.DataIter
as data
instead of having a dedicated class for it), then there would be:
- A function
xgb.DMatrix
which could still produce different output types depending on its inputs (regular DMatrix vs. the 'ExternalDMatrix' introduced here). - A parameter
missing
which is auto-modified for integer inputs in some cases but not others. - Multiple parameters that can contradict each other (e.g. feature names).
If the idea is to avoid inconsistencies, then I think the best option would be to have only two functions:
xgb.DMatrix
, taking only in-memory data, and accepting a parameteras_quantile_dmatrix
in order to produce classxgb.QuantileDMatrix
as output.xgb.ExternalDMatrix
, taking onlyxgb.DataIter
as input, and also accepting a parameteras_quantile_dmatrix
.
Either that, or 4 functions after an 'ExternalQuantileDMatrix' gets implemented.
But I can also guess that:
- There will be much fewer people using
xgb.ExternalDMatrix
than the others. - People using
xgb.ExternalDMatrix
will likely be the more advanced users. - People using
xgb.ExternalDMatrix
will probably be more worried about getting the data iterators right and wondering about their usage than about other DMatrix properties.
Which is why I chose this kind of separation.
If there was an ExternalQuantileDMatrix type implemented, perhaps it'd make more sense to have 4 possible constructors, but I'm not sure about what's the best way to define these constructors and classes with the current behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but I don't think the python interface is the best reference here
I agree. For background, the interface of Python is mostly caused by the use of text file input like data="path/to/csv"
, which has the same problem with the iterator with confusing parameters.
A parameter missing which is auto-modified for integer inputs in some cases but not others.
Not entirely sure what this means. I think as long as a missing value is accepted, integer will be considered as float.
Thank you for listing all the possible options, and apologies for the slow replies lately, will be more active starting tomorrow.
In general, I would still like to differentiate DMatrix
and QuantileDMatrix
since they have different functionalities and accept a different set of parameters, only QDM accepts max_bin
. Lastly, they have different data inside, one stores the raw data, while the other stores the histogram index. A single parameter as_quantile_dmatrix
might not be sufficient to reflect and clarify the difference, and my preference is still to have two different classes for them.
On the other hand, ExternalMemDMatrix
is a lot more similar to DMatrix
compared to QDM
. They have the same set of parameters, and can be used "almost" all the places where DMatrix can be used, share the same underlying data content, it's just the former stores it on disk while the latter stores it in memory. They are different for sure, and I agree with you that it should have a different class. My point is just that if ExternalMemory were to have a different class, we should definitely have a different class for QDM.
The question is how to integrate the concept of external memory into the mix. I propose the following functions and classes, ExternalMemoryDMatrix
as implemented in this PR, but without the support for being a QDM. In addition, a QDM constructor that accepts only the iterator xgb.QuantileDMatrix.from_iterator(iter) -> xgb.QuantileDMatrix
. So, one new class, and one new method to an existing class.
The reasoning behind this is that the xgb.QuantileDMatrix
and xgb.QuantileDMatrix.from_iterator
return the same type with the exact same underlying data (in c++), it's just how it's constructed in the first place and so there's no need for a new class in R. Then we use different constructor to represent the difference of construction process. As for the external memory version of DMatrix, as you have suggested, they behave differently, we will continue with your approach of creating a new class for it.
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good.
So that'd leave the following functions with the following output classes:
xgb.DMatrix
->xgb.DMatrix
xgb.QuantileDMatrix
->xgb.QuantileDMatrix
xgb.QuantileDMatrix.from_iterator
->xgb.QuantileDMatrix
xgb.ExternalDMatrix
->xgb.ExternalDMatrix
Is that a correct understanding?
I think as long as a missing value is accepted, integer will be considered as float.
Not quite - integer matrices are also accepted, and are passed as such to the array_interface parser. In such case, the missing value is auto-adjusted in the C++ code according to the type of the input, but if the user passes a data iterator, the missing value is passed before the iterator to the first C function, and thus cannot be converted later on once the data is seen and the dtype is determined (with integer missing value in R being -INT_MAX
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is that a correct understanding?
Yes.
and thus cannot be converted later on once the data is seen and the dtype is determined
That's interesting, did not think of that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Renamed.
R-package/R/xgb.DMatrix.R
Outdated
#' # Function 'xgb.ProxyDMatrix' must be called manually | ||
#' # at each batch with all the appropriate attributes, | ||
#' # such as feature names and feature types. | ||
#' xgb.ProxyDMatrix( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible that we hide this function and pass a closure to the user like the Python code? I think it's easier to use if the user doesn't need to worry about an additional function and an opaque handle. R is a high level language and it would great if we can hide these C artifacts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think a closure is the best option here.
Another option could be to have the user return the result of this xgb.ProxyDMatrix
(or NULL
to signal end of batch), but let this be just a simple list and defer processing for later instead of processing things just after xgb.ProxyDMatrix
is called.
It would however forego the possibility of doing a manual clean up of whatever environment generated the data inside the iterator if the user wishes to do so immediately before the next batch, but perhaps it's a more logical design.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think a closure is the best option here.
Could you please elaborate on this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think a closure is the best option here.
Could you please elaborate on this?
They are akward to use, hard to document, and hard to examine as a user. Functions, environments and lists are more common and easier to use. Or do you think that a closure could bring some advantage here that would not be achieved by e.g. letting the user return the result of xgb.ProxyDMatrix
? (assuming that the handle is left out of the signature)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have preference on whether a closure or a function or a class should be used (they are all the same), it's just I would like to hide/abstract many things as possible, and sometimes closure is the way to do it. If there are other approaches, I'm happy to take it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Factored out the proxy_handle
parameter. Now the user needs to call a function xgb.ProxyDMatrix
inside of the callback (which doesn't take any handle) and return its result.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Excellent work! I'm excited to have this merged for the R package.
ref #9810
closes #9864
(note that it has some overlap with #9829 - whichever is merged first will create conflicts in the other)
This PR adds R wrappers over the following:
feature_types
in DMatrix construction.Additionally, it updates the docs for
feature_types
in the python interface in order to reflect what the current code does.