[R] Add data iterator, quantile dmatrix, and missing `feature_types` #9913

david-cortes · 2023-12-20T19:51:46Z

(note that it has some overlap with #9829 - whichever is merged first will create conflicts in the other)

This PR adds R wrappers over the following:

Data iterators, for building DMatrix objects from external memory.
QuantileDMatrix, built using the same data iterator from the R side.
Parameter feature_types in DMatrix construction.

Additionally, it updates the docs for feature_types in the python interface in order to reflect what the current code does.

R-package/R/xgb.DMatrix.R

trivialfis · 2024-01-14T05:38:00Z

R-package/NAMESPACE

@@ -32,6 +32,10 @@ export(slice)
 export(xgb.DMatrix)
 export(xgb.DMatrix.hasinfo)
 export(xgb.DMatrix.save)
+export(xgb.DataIter)
+export(xgb.ExternalDMatrix)
+export(xgb.ProxyDMatrix)


I'm not sure if we need to export the proxy DMatrix, users are unlikely to interact with it.

That's the function that needs to be called inside the data iterator. The class constructor is not exported.

trivialfis · 2024-01-14T05:39:13Z

R-package/NAMESPACE

@@ -32,6 +32,10 @@ export(slice)
 export(xgb.DMatrix)
 export(xgb.DMatrix.hasinfo)
 export(xgb.DMatrix.save)
+export(xgb.DataIter)
+export(xgb.ExternalDMatrix)


Let's not expose this as an independent class for now. One day we might support external storage with QuantileDMatrix.

What do you think if we make the data parameter of DMatrix to be compatible with the iterator?

If QuantileDMatrix is a different class from from DMatrix, I think it does make sense to have an ExternalDMatrix as a different class, since just like QDM, it behaves differently from a regular DMatrix.

If implemented through xgb.DMatrix by passing an iterator as first argument, I think the function signature would be quite unintuitive, since for example label would be an argument to both the iterator's ProxyDMatrix and to DMatrix.

Would also leave the question about what happens if e.g. the user passes some features names in the iterator, and then different names in DMatrix, and so on.

That makes sense, I have been struggling with this since the first review and have been stalling the decision (my bad). Will look closer in the new class.

R-package/R/xgb.DMatrix.R

trivialfis · 2024-01-22T19:46:57Z

R-package/R/xgb.DMatrix.R

+#' with a value of zero when the stream of data ends (all batches in the iterator have been consumed),
+#' or an integer with a positive value when there are more batches in the line to be consumed.


I think this is one of the paper cuts I made to XGBoost, I think it's better to use boolean value instead of 0 and 1. At the time I was dealing with the C callbacks and the need to create appropriate wrappers skipped my mind.

trivialfis · 2024-01-22T19:57:52Z

R-package/R/xgb.DMatrix.R

+#' \item If created as a regular 'DMatrix', the data is accessed on-demand as needed, multiple
+#' times, without being concatenated (but note that fields like 'label' \bold{will} be concatenated
+#' from multiple calls to the data iterator).
+#' \item If created as a 'QuantileDMatrix', the quantized representation of the full data will
+#' be created in memory, concatenated from multiple calls to the data iterator. The quantized
+#' version is typically lighter than the original data, so there might be cases in which this
+#' representation could potentially fit in memory even if the full data doesn't.
+#' }


I remain wary about the interface here. We currently have:

DMatrix

QuantileDMatrix

DMatrix + Iter => External Memory training

QuantileDMatrix + iter => Instance construction != External memory training.

As you are already aware, the last option doesn't fetch data from external memory during training, it's more an option for distributed frameworks that manage data by chunks. In XGBoost, when we talk about external memory, we usually specifically mean training, which is fetching data on demand.

(Potential future) Quantile + iter => External memory training.

I think there's a need to clarify these types (and their names). I agree with the preference that external memory DMatrix can have its type, but mixing it with QDM seems a bit confusing. From a Python/C++ programmer's perspective, maybe we don't need a new type for it, but a new constructor instead? Alternatively, we can defer the decision by not supporting iterator with QDM at the moment.

Discussions are welcomed!

Yes, it does sound quite confusing, but I don't think the python interface is the best reference here.

If I understand it correctly, the issue here is that there's two routes in this PR which give the same object type:

xgb.QuantileDMatrix + in-memory data.

xgb.ExternalDMatrix + xgb.DataIter + as_quantile_dmatrix=TRUE.

But if this were to be implemented the same way as in python (accepting xgb.DataIter as data instead of having a dedicated class for it), then there would be:

A function xgb.DMatrix which could still produce different output types depending on its inputs (regular DMatrix vs. the 'ExternalDMatrix' introduced here).

A parameter missing which is auto-modified for integer inputs in some cases but not others.

Multiple parameters that can contradict each other (e.g. feature names).

If the idea is to avoid inconsistencies, then I think the best option would be to have only two functions:

xgb.DMatrix, taking only in-memory data, and accepting a parameter as_quantile_dmatrix in order to produce class xgb.QuantileDMatrix as output.

xgb.ExternalDMatrix, taking only xgb.DataIter as input, and also accepting a parameter as_quantile_dmatrix.

Either that, or 4 functions after an 'ExternalQuantileDMatrix' gets implemented.

But I can also guess that:

There will be much fewer people using xgb.ExternalDMatrix than the others.

People using xgb.ExternalDMatrix will likely be the more advanced users.

People using xgb.ExternalDMatrix will probably be more worried about getting the data iterators right and wondering about their usage than about other DMatrix properties.

Which is why I chose this kind of separation.

If there was an ExternalQuantileDMatrix type implemented, perhaps it'd make more sense to have 4 possible constructors, but I'm not sure about what's the best way to define these constructors and classes with the current behavior.

but I don't think the python interface is the best reference here

I agree. For background, the interface of Python is mostly caused by the use of text file input like data="path/to/csv", which has the same problem with the iterator with confusing parameters.

A parameter missing which is auto-modified for integer inputs in some cases but not others.

Not entirely sure what this means. I think as long as a missing value is accepted, integer will be considered as float.

Thank you for listing all the possible options, and apologies for the slow replies lately, will be more active starting tomorrow.

In general, I would still like to differentiate DMatrix and QuantileDMatrix since they have different functionalities and accept a different set of parameters, only QDM accepts max_bin. Lastly, they have different data inside, one stores the raw data, while the other stores the histogram index. A single parameter as_quantile_dmatrix might not be sufficient to reflect and clarify the difference, and my preference is still to have two different classes for them.

On the other hand, ExternalMemDMatrix is a lot more similar to DMatrix compared to QDM. They have the same set of parameters, and can be used "almost" all the places where DMatrix can be used, share the same underlying data content, it's just the former stores it on disk while the latter stores it in memory. They are different for sure, and I agree with you that it should have a different class. My point is just that if ExternalMemory were to have a different class, we should definitely have a different class for QDM.

The question is how to integrate the concept of external memory into the mix. I propose the following functions and classes, ExternalMemoryDMatrix as implemented in this PR, but without the support for being a QDM. In addition, a QDM constructor that accepts only the iterator xgb.QuantileDMatrix.from_iterator(iter) -> xgb.QuantileDMatrix. So, one new class, and one new method to an existing class.

The reasoning behind this is that the xgb.QuantileDMatrix and xgb.QuantileDMatrix.from_iterator return the same type with the exact same underlying data (in c++), it's just how it's constructed in the first place and so there's no need for a new class in R. Then we use different constructor to represent the difference of construction process. As for the external memory version of DMatrix, as you have suggested, they behave differently, we will continue with your approach of creating a new class for it.

What do you think?

Sounds good.

So that'd leave the following functions with the following output classes:

xgb.DMatrix -> xgb.DMatrix

xgb.QuantileDMatrix -> xgb.QuantileDMatrix

xgb.QuantileDMatrix.from_iterator -> xgb.QuantileDMatrix

xgb.ExternalDMatrix -> xgb.ExternalDMatrix

Is that a correct understanding?

I think as long as a missing value is accepted, integer will be considered as float.

Not quite - integer matrices are also accepted, and are passed as such to the array_interface parser. In such case, the missing value is auto-adjusted in the C++ code according to the type of the input, but if the user passes a data iterator, the missing value is passed before the iterator to the first C function, and thus cannot be converted later on once the data is seen and the dtype is determined (with integer missing value in R being -INT_MAX).

Is that a correct understanding?

Yes.

and thus cannot be converted later on once the data is seen and the dtype is determined

That's interesting, did not think of that.

trivialfis · 2024-01-22T20:03:06Z

R-package/R/xgb.DMatrix.R

+#'   # Function 'xgb.ProxyDMatrix' must be called manually
+#'   # at each batch with all the appropriate attributes,
+#'   # such as feature names and feature types.
+#'   xgb.ProxyDMatrix(


Is it possible that we hide this function and pass a closure to the user like the Python code? I think it's easier to use if the user doesn't need to worry about an additional function and an opaque handle. R is a high level language and it would great if we can hide these C artifacts.

I don't think a closure is the best option here.

Another option could be to have the user return the result of this xgb.ProxyDMatrix (or NULL to signal end of batch), but let this be just a simple list and defer processing for later instead of processing things just after xgb.ProxyDMatrix is called.

It would however forego the possibility of doing a manual clean up of whatever environment generated the data inside the iterator if the user wishes to do so immediately before the next batch, but perhaps it's a more logical design.

I don't think a closure is the best option here.

Could you please elaborate on this?

I don't think a closure is the best option here.

Could you please elaborate on this?

They are akward to use, hard to document, and hard to examine as a user. Functions, environments and lists are more common and easier to use. Or do you think that a closure could bring some advantage here that would not be achieved by e.g. letting the user return the result of xgb.ProxyDMatrix ? (assuming that the handle is left out of the signature)

I don't have preference on whether a closure or a function or a class should be used (they are all the same), it's just I would like to hide/abstract many things as possible, and sometimes closure is the way to do it. If there are other approaches, I'm happy to take it.

Factored out the proxy_handle parameter. Now the user needs to call a function xgb.ProxyDMatrix inside of the callback (which doesn't take any handle) and return its result.

trivialfis

Excellent work! I'm excited to have this merged for the R package.

add data iterators and quantile dmatrix

5982617

david-cortes mentioned this pull request Dec 20, 2023

[Draft] [R] Add QuantileDMatrix creation from dense matrices #9864

Closed

david-cortes added 3 commits December 20, 2023 21:17

update docs

910b359

linter

c45ad77

solve merge conflicts

bc7f5d3

trivialfis reviewed Dec 21, 2023

View reviewed changes

R-package/R/xgb.DMatrix.R Outdated Show resolved Hide resolved

david-cortes added 7 commits December 21, 2023 18:09

rename DataIterator -> DataIter

12fa3b6

print info for proxy dmatrix

0080b26

solve merge conflicts

ed972ad

use namespace instead of static

e06d7cb

solve merge conflicts

f704866

solve conflicts

96d1bdc

update roxygen

77155e1

trivialfis reviewed Jan 14, 2024

View reviewed changes

correct note about QDM support

bc1c66e

david-cortes mentioned this pull request Jan 20, 2024

[R] Use inplace predict #9829

Merged

Merge branch 'master' into dataiter

b012be6

trivialfis reviewed Jan 22, 2024

View reviewed changes

david-cortes added 5 commits January 29, 2024 19:53

Merge branch 'master' into dataiter

bf1fbba

separate functions for external data

b897b46

remove exposure of proxy matrix handle

18bd3e2

linter

a18dcfd

update roxygen

57077f4

trivialfis approved these changes Jan 30, 2024

View reviewed changes

trivialfis merged commit 3abbbe4 into dmlc:master Jan 30, 2024
26 of 27 checks passed

david-cortes mentioned this pull request Jan 30, 2024

[R] Rename ProxyDMatrix -> DataBatch #10016

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[R] Add data iterator, quantile dmatrix, and missing `feature_types` #9913

[R] Add data iterator, quantile dmatrix, and missing `feature_types` #9913

david-cortes commented Dec 20, 2023 •

edited

Loading

trivialfis Jan 14, 2024

david-cortes Jan 14, 2024

trivialfis Jan 14, 2024

david-cortes Jan 14, 2024

trivialfis Jan 22, 2024

trivialfis Jan 22, 2024

trivialfis Jan 22, 2024 •

edited

Loading

david-cortes Jan 22, 2024

trivialfis Jan 23, 2024 •

edited

Loading

david-cortes Jan 23, 2024

trivialfis Jan 24, 2024

david-cortes Jan 29, 2024

trivialfis Jan 22, 2024

david-cortes Jan 22, 2024

trivialfis Jan 23, 2024

david-cortes Jan 23, 2024

trivialfis Jan 24, 2024 •

edited

Loading

david-cortes Jan 29, 2024

trivialfis left a comment

		#' with a value of zero when the stream of data ends (all batches in the iterator have been consumed),
		#' or an integer with a positive value when there are more batches in the line to be consumed.

[R] Add data iterator, quantile dmatrix, and missing feature_types #9913

[R] Add data iterator, quantile dmatrix, and missing feature_types #9913

Conversation

david-cortes commented Dec 20, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

trivialfis Jan 22, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

trivialfis Jan 23, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

trivialfis Jan 24, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

trivialfis left a comment

Choose a reason for hiding this comment

[R] Add data iterator, quantile dmatrix, and missing `feature_types` #9913

[R] Add data iterator, quantile dmatrix, and missing `feature_types` #9913

david-cortes commented Dec 20, 2023 •

edited

Loading

trivialfis Jan 22, 2024 •

edited

Loading

trivialfis Jan 23, 2024 •

edited

Loading

trivialfis Jan 24, 2024 •

edited

Loading