
C++ design

Overview of the system

The R GBM package allows users to train gradient boosting machines, primarily for prediction and feature selection. The package is built using Rcpp and is composed of two parts: an R front-end that takes user inputs and performs pre-processing before passing a series of arguments to the second part, the C++ back-end. The main algorithmic work is done in the C++ layer, which on completion passes R objects back to the R front-end for post-processing and reporting. This design document focuses on the C++ back-end, providing descriptions of its scope, functionality and structure; for documentation on the R API, please see the R GBM package vignettes.

System Scope

The C++ back-end, from here on referred to as the "system", provides algorithms to: train a GBM model, make predictions on new data using such a fitted model, and calculate the marginal effects of variables within a fitted model by "integrating out" the other variables. The bulk of the C++ codebase focuses on training a GBM model; the tasks involved in this process include setting up datasets, calculating model residuals, fitting a tree to those residuals and determining the optimal node predictions for the fitted tree. The remaining functionality, that is prediction and the calculation of marginal effects, consists of standalone functions with few or no developer-designed classes associated with them.

Goals

The structure imposed on the C++ back-end was selected to:

  • Replace a legacy setup with one more amenable to change.

  • Make the structure and functionality of the system more transparent via appropriate OO design and encapsulation.

  • Implement design patterns where appropriate - in particular to simplify memory management through the application of the RAII idiom.

Role of the system in the GBM package

Within the GBM package the system performs the main algorithmic calculations. It undertakes all of the algorithmic work in training a gradient boosted model, making predictions based on new covariate datasets and calculating the marginal effects of covariates given a fitted model. It does not perform any formatting or post-processing beyond converting the fitted trees, errors etc. to an R list object and passing it back to the front-end.

Viewpoints of the system

Due to the algorithmic nature of the system, this description focuses on the system's behaviour and its structure, which are captured by the Functional and Development views respectively. Within the package the system is located in the /src directory. To conclude this introduction, it should be noted that the system has been developed in line with the Google C++ style guide.

Main Use Cases

The main use cases of the system are:

  • Training a gradient boosted model using data and an appropriate statistical model.
  • Predicting the response from new covariates using a trained model.
  • Calculating the marginal effects of variables using a trained model.

Functional View - Behaviour of the system

The system is composed of numerous classes used almost exclusively for training, along with an interface to the R front-end. This entry point is appropriately named gbmentry.cpp and contains three functions: gbm, gbm_pred and gbm_plot. These functions perform, respectively, the training, prediction and marginal-effects calculations described in the previous section. The functionality and dynamic behaviour of these methods will now be described.

Figure 1. Component diagram showing the interface between the R layer and the system.

Training a GBM model

The procedure to train a GBM model follows the sequence of steps shown in Figure 2. Upon entry to the gbm method, the user-specified parameters are converted to appropriate GBM configuration objects; see datadistparams.h and treeparams.h. These configuration objects are then used to initialize the GBM engine, which stores both the tree parameters, defining the tree to be fitted, and the data/distribution container. On initialization the GBM engine also constructs an instance of an appropriate dataset class, a bagged data class and a distribution object from the GBM configuration object. The dataset object contains both the training and validation data; these sets live together in vector containers, and to swap between them the pointer to the first element of the set of interest is shifted by the appropriate amount. The bagged data class defines which elements of the dataset are used in growing an individual tree; how these elements are chosen is distribution dependent. After this, the GbmFit object is initialized. This object contains the current fitted learner, the errors associated with the fit and the function estimate (an Rcpp numeric vector). The function estimate is set to the previous estimates if provided, or otherwise to an appropriate starting value for the selected distribution.

With the GBM engine and the fit object initialized, the algorithm loops over the number of trees to fit and performs the following actions:

  1. Create a tree object.
  2. Bag the data to use for fitting the tree.
  3. Calculate the residuals using the current estimate, bagged data and distribution object.
  4. Fit the tree to the calculated residuals - this is done by generating all possible splits for a subset of variables and choosing the split that minimizes the variance of the residuals within nodes. The best split for each node is cached, so if a node is not split in this round its split search does not have to be repeated in future rounds.
  5. Set the terminal node predictions which minimize the error.
  6. Adjust the tree so that non-terminal nodes also have predictions.
  7. Update the estimate, for both the training and validation sets, and calculate the training errors, the validation errors and the improvement in the out-of-bag error from the update.
  8. Wrap up the fitted tree and errors, along with a reference to the dataset, in a FittedLearner object (see FittedLearner.h).

The returned fitted learner object defines the "current fit" in the GbmFit object, which is used to update the errors. At the end of each iteration, the tree and the associated errors are converted to an R list representation, which is output to the R layer once all the trees have been fitted.

Figure 2. Activity diagram for the process of training a GBM model.
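To make the control flow above concrete, the following is a minimal sketch of a single boosting iteration. The types and signatures here (Tree, Distribution, BoostOneIteration, the shrinkage argument) are simplified stand-ins for illustration, not the package's actual classes:

```cpp
#include <cstddef>
#include <vector>

// Simplified stand-ins for the classes described above; the real types
// live in tree.cpp, distribution.h and gbm_datacontainer.cpp.
struct Tree {
    void Grow(const std::vector<double>& /*residuals*/,
              const std::vector<int>& /*bag*/) {}
    double Predict(std::size_t /*i*/) const { return 0.0; }  // placeholder
};

struct Distribution {
    std::vector<int> BagData() const { return {}; }
    std::vector<double> Residuals(const std::vector<double>& f) const {
        return std::vector<double>(f.size(), 0.0);
    }
};

// One boosting iteration, mirroring steps 1-8 of the list above.
void BoostOneIteration(const Distribution& dist,
                       std::vector<double>& estimate, double shrinkage) {
    Tree tree;                                           // 1. create a tree object
    std::vector<int> bag = dist.BagData();               // 2. bag the data
    std::vector<double> res = dist.Residuals(estimate);  // 3. residuals from current estimate
    tree.Grow(res, bag);                                 // 4.-6. fit tree, set/adjust node predictions
    for (std::size_t i = 0; i < estimate.size(); ++i)    // 7. update the estimate
        estimate[i] += shrinkage * tree.Predict(i);
    // 8. the tree and its errors would now be wrapped in a FittedLearner
}
```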

The bagging of the data from the training set and the calculation of residuals are distribution dependent, and so are incorporated in the system as methods on the distribution class and its container. The tree object is composed of several components, the principal ones being: a root node, a vector of the terminal nodes and a vector mapping the bagged data to terminal nodes. A node splitter, which generates and assigns the best potential split to each terminal node, is used in the tree-growing method. This structure is shown below in Figure 3.

Figure 3. Component diagram showing what the GBM Engine component is comprised of.

Predicting from a GBM model

Using a previously fitted GBM model, the system can predict the response for further covariate data. The R front-end passes the following variables to the gbm_pred function in gbmentry.cpp: the covariates for prediction, the number of trees to use in prediction, the initial prediction estimate, the fitted trees, the categories of the splits, the variable types and a bool indicating whether or not to return the results using a single fitted tree.

The prediction for the new covariates is then initialized, either to the previous tree's prediction or, if it is the first tree prediction to be calculated, to the initial estimate value from the model training. The algorithm then loops over the number of trees to be used and, within each iteration, over the observations in the covariate dataset. Each observation starts in the root node, so a counter tracking the observation's current node is set to 0. The observation is then moved through the tree, updating its current node, until it reaches the appropriate terminal node. Once the algorithm reaches a terminal node, the prediction for that observation and tree is set to the final split value of the terminal node's parent in the tree. This process is repeated over all observations and all trees; once completed, the prediction vector is wrapped up and passed back to the R front-end, see Figure 4.

Figure 4. Activity diagram showing how the algorithm performs predictions on new covariate data.
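A hedged sketch of this per-observation tree walk is given below. The flat-array tree layout and its field names are illustrative assumptions rather than the package's exact representation:

```cpp
#include <vector>

// Illustrative flat representation of a fitted tree, roughly in the
// spirit of the arrays the R layer passes to gbm_pred.
struct FlatTree {
    std::vector<int>    split_var;    // variable split on; -1 marks a terminal node
    std::vector<double> split_value;  // split threshold, or the node's prediction
    std::vector<int>    left, right;  // child node indices
};

double PredictOne(const FlatTree& tree, const std::vector<double>& x) {
    int node = 0;                            // every observation starts at the root
    while (tree.split_var[node] >= 0) {      // walk down until a terminal node
        double v = x[tree.split_var[node]];
        node = (v < tree.split_value[node]) ? tree.left[node] : tree.right[node];
    }
    return tree.split_value[node];           // prediction stored for the terminal node
}
```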

Calculating the marginal effects of a variable

The final piece of functionality offered by the system is calculating the marginal effects of a variable by "integrating out" the others. As well as a fitted GBM model and covariate data, the user specifies at the R layer which variables they are interested in. This method utilises a NodeStack class, defined at the beginning of gbmentry.cpp; this is a simple class defining a stack of pairs. Each pair contains a node index, that is, which node in the tree we are looking at, and its weight.

With this in mind, the method works as follows (a sketch in C++ is given after Figure 5):

  1. The initial prediction is set to the user-specified initial values.
  2. The algorithm loops over the trees of the fitted model.
  3. Within each tree, the algorithm loops over the observations.
  4. For each observation it creates a root node, puts it on the stack and sets the observation's current node to 0 with weight 1.0.
  5. It takes the top node off the stack and checks whether it has been split.
  6. (a) If it has not been split, the predicted function for this variable and tree is incremented by the node's split value times its weight.
  7. (b) If the node is non-terminal, it checks whether the split variable is among the user-specified variables of interest. If so, the observation is moved to the appropriate child node, which is then added to the stack. If not, both child nodes are added to the stack with weights averaging over the left/right splits - this "integrates out" the other variables - and the algorithm returns to step 5.
  8. When the stack is empty, the algorithm takes the next observation and repeats from step 4.
  9. With all trees processed, the predicted function is wrapped up and output.

Figure 5. Activity diagram for calculating the marginal effects of specific variables.
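The sketch below illustrates this weighted-stack traversal for a single observation and tree. The flat tree layout, the per-node weight field and the weight-splitting rule are illustrative assumptions, not the package's exact implementation:

```cpp
#include <stack>
#include <utility>
#include <vector>

// Illustrative flat tree, as in the prediction sketch, plus a per-node
// training weight used to average over splits that are integrated out.
struct PlotTree {
    std::vector<int>    split_var;     // -1 marks a terminal node
    std::vector<double> split_value;   // threshold, or terminal prediction
    std::vector<int>    left, right;   // child node indices
    std::vector<double> weight;        // training weight that reached each node
};

double MarginalPredictOne(const PlotTree& t, const std::vector<double>& x,
                          const std::vector<bool>& of_interest) {
    double f = 0.0;
    std::stack<std::pair<int, double>> nodes;   // (node index, weight) pairs
    nodes.push({0, 1.0});                       // start at the root with weight 1.0
    while (!nodes.empty()) {
        auto [node, w] = nodes.top();
        nodes.pop();
        int var = t.split_var[node];
        if (var < 0) {
            f += w * t.split_value[node];       // terminal: accumulate weighted prediction
        } else if (of_interest[var]) {
            // Split on a variable of interest: follow the observation's value.
            int child = (x[var] < t.split_value[node]) ? t.left[node] : t.right[node];
            nodes.push({child, w});
        } else {
            // "Integrate out" the variable: descend both children, weighting
            // each by the fraction of training weight that went that way.
            double total = t.weight[t.left[node]] + t.weight[t.right[node]];
            nodes.push({t.left[node],  w * t.weight[t.left[node]]  / total});
            nodes.push({t.right[node], w * t.weight[t.right[node]] / total});
        }
    }
    return f;
}
```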

Structure of the system

The near entirety of the system is devoted to training a GBM model, and so this section focuses on describing the classes and design patterns implemented to that end. Starting at a high level, the primary objects are the "GBM engine" found in gbm_engine.cpp and the fit defined in gbm_fit.h. The engine has the data and distribution container, see gbm_datacontainer.cpp, and a reference to the tree parameters (treeparams.h) as private members; this is shown in Figure 3. The engine generates the initial function estimate and, through its FitLearner method, runs the methods of gbm_datacontainer.cpp and tree.cpp to perform tasks such as growing trees, calculating errors and bagging data. The system is built on the RAII idiom: these container classes are initialized on construction of the engine object and released on its destruction. This idiom is applied across the system to simplify memory management and to encourage appropriate object design. Beyond this, encapsulation and const correctness are implemented where possible within the system.
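As an illustration of this idiom, here is a minimal sketch of the ownership layout, with hypothetical stand-in types; the real engine in gbm_engine.cpp is considerably richer:

```cpp
#include <memory>

// Hypothetical stand-ins for the configuration and container types.
struct DataDistParams {};
struct TreeParams {};
struct DataContainer { explicit DataContainer(const DataDistParams&) {} };

class GbmEngine {
 public:
  GbmEngine(const DataDistParams& data_params, const TreeParams& tree_params)
      : data_container_(std::make_unique<DataContainer>(data_params)),
        tree_params_(tree_params) {}
  // No explicit destructor or cleanup code: the unique_ptr member releases
  // the container automatically when the engine goes out of scope (RAII).
 private:
  std::unique_ptr<DataContainer> data_container_;
  TreeParams tree_params_;
};
```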

Configuration Structures

The R layer passes a number of parameters/R objects to the system. These are converted to appropriate parameters and stored in simple objects used to initialize the data/distribution container and tree objects. After construction of the GBM engine, the DataDistParams object is not used again during training, while the TreeParams object is stored within the engine and used to initialize trees for fitting on each iteration.

Dataset Class

The dataset class, dataset.h, is part of the data/distribution container and is initialized using the DataDistParams described in the previous section. This class stores both appropriate R objects, e.g. the response is stored as an Rcpp::NumericMatrix, and standard C++ datatypes describing the data, e.g. unsigned long num_traindata_. To iterate over these R vectors/arrays, the dataset object holds pointers into them which can be accessed via the corresponding getter methods. The R objects stored here contain both training and validation data; to access a particular dataset, the pointers are shifted by the length of the dataset so that they point to the beginning of the validation/training set. Most of this class is defined within the header because the number of calls to the dataset's getter functions means that inlining these methods has a significant impact on the system's overall performance. It also has a RandomOrder() method which shuffles the predictor variables; this is used during the tree-growing phase, where all splits are generated for a random subset of features in the data. Finally, an associated class is the "Bag"; this class contains a vector of integers (0/1s) called the "bag", which defines the training data to be used in growing the tree.
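The pointer-shift trick can be sketched as follows; the member and method names here are hypothetical, not those of dataset.h:

```cpp
#include <cstddef>

// One buffer holds the training rows followed by the validation rows;
// the getters read through a pointer aimed at the block of interest.
class Dataset {
 public:
  Dataset(const double* response, std::size_t num_train)
      : response_(response), num_train_(num_train), current_(response) {}

  // Shift past the training rows so the getters address the validation set.
  void shift_to_validation() { current_ = response_ + num_train_; }
  // Point the getters back at the start of the training set.
  void shift_to_train() { current_ = response_; }

  // Hot-path getter; defined in the header so it can be inlined.
  double y(std::size_t i) const { return current_[i]; }

 private:
  const double* response_;
  std::size_t num_train_;
  const double* current_;
};
```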

Distribution Classes

The distribution class, distribution.h, is an abstract class which defines the interface that any other distributions to be included in the system must implement. The distribution object performs numerous tasks using an instance of the dataset class defined in the previous section. Importantly, it constructs the bag for the data; this method, appropriately titled BagData, differs only for the pairwise distribution, where data groupings affect the bagging procedure. Before constructing the bags, the distribution object needs to be initialized with the data. This initialization takes the data rows and constructs a multi-map which maps those rows to their corresponding observation ids; often each row is a unique observation, but this is not always the case. This multi-map ensures bagging occurs on a by-observation basis as opposed to a by-row basis, as the latter would result in overly optimistic errors and potentially a failure of the training method, since data from a single observation could appear both in the bag and out of it. The distribution object is also responsible for calculating the residuals for tree growing, the errors on the training and validation sets, initializing the function estimate and calculating the out-of-bag error improvement on fitting a new tree. These methods are all accessed via the container class stored within the GBM engine object.

Figure 6. Diagram showing the distributions currently implemented within the system.
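A rough sketch of such an abstract interface is shown below. The method names are illustrative and do not reproduce the package's exact signatures:

```cpp
#include <vector>

// Hedged sketch of the abstract distribution interface described above.
class Distribution {
 public:
  virtual ~Distribution() = default;
  // Fill the bag (0/1 per observation); pairwise overrides this to bag by group.
  virtual void BagData(std::vector<int>& bag) = 0;
  // Working residuals for the next tree, given the current function estimate.
  virtual void ComputeResiduals(const std::vector<double>& estimate,
                                std::vector<double>& residuals) = 0;
  // Error on the training or validation block of the data.
  virtual double Deviance(const std::vector<double>& estimate) const = 0;
  // Starting value for the function estimate.
  virtual double InitEstimate() const = 0;
};
```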

The construction of a distribution object is done using a dynamic factory pattern, see Figure 7. In this design, the constructors of the concrete distribution classes are private and each class provides a static Create method. This method is registered to a map in the factory, distribution_factory.cpp, with the name of the distribution, a string, as its key. On creation of a GBM engine object, a distribution factory is created and the appropriate distribution is generated from the factory using the user-selected distribution name, a string which acts as the key to the same map, and the DataDistParams struct.

Figure 7. Dynamic factory pattern used to control the creation of distribution objects. The distribution factory itself is a singleton.
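A minimal sketch of this dynamic factory is given below, assuming simplified types; the real implementation in distribution_factory.cpp differs in detail:

```cpp
#include <map>
#include <memory>
#include <string>

// Minimal stand-ins so the sketch is self-contained.
struct DataDistParams {};
class Distribution { public: virtual ~Distribution() = default; };

class DistributionFactory {
 public:
  using Creator = std::unique_ptr<Distribution> (*)(const DataDistParams&);

  static DistributionFactory& Get() {          // the factory is a singleton
    static DistributionFactory instance;
    return instance;
  }
  void Register(const std::string& name, Creator create) {
    creators_[name] = create;
  }
  std::unique_ptr<Distribution> Create(const std::string& name,
                                       const DataDistParams& params) {
    return creators_.at(name)(params);         // throws if the name is unknown
  }

 private:
  DistributionFactory() = default;
  std::map<std::string, Creator> creators_;
};

// A concrete class would register its static Create method once, e.g.:
//   DistributionFactory::Get().Register("gaussian", &Gaussian::Create);
// and the engine would later call:
//   auto dist = DistributionFactory::Get().Create(user_choice, params);
```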

To conclude this subsection, there are some concrete distribution-specific details to describe. The pairwise distribution, CPairwise.h, uses several objects which specify the information retrieval (IR) measure that this distribution uses in its implementation. These classes all inherit from an abstract IRMeasure class and are defined in the CPairwise.h and CPairwise.cpp files.

Figure 8. Diagram displaying the information retrieval measure classes used by the pairwise distribution.

The other distribution with a unique design is the Cox partial hazards model. This is a survival model whose implementation depends on whether the response matrix provided to the system contains "censored" data or "start-stop" data: if the response matrix has two columns it is censored, and if it has more than two it is start-stop. To incorporate this added complexity, the CCoxPH.h class follows a state pattern, whereby the Cox PH object contains a pointer to an object implementing the correct methods for the specific scenario, and those objects contain a pointer back to the Cox PH object. This design is shown in Figure 9.

Figure 9. The Cox partial hazards distribution employs a state design pattern to accommodate the dependency of its implementation on the response matrix.

Trees and Nodes

The final component of the GBM engine to consider is the tree class defined in tree.h. As described in the "Training a GBM model" subsection, the tree class contains: a root node, of class CNode, a vector of pointers to terminal nodes, a vector assigning data to those terminal nodes, a node search object (see node_searcher.h) and various data describing the tree (such as its maximum depth). The tree class has methods for growing the tree, resetting it for another growth, predicting on validation data, adjusting the tree once predictions are assigned to the terminal nodes and outputting it in a suitable format for the R layer to use.

The most important part of this class is the method for growing the tree. A key component in growing a tree is identifying appropriate splits; this is the responsibility of the node searcher object, which stores the best split identified for each terminal node. It uses vectors of variable splitters (see vec_varsplitters.h and varsplitter.h) to determine the best splits generated for each node and for each variable. Within the GenerateAllSplits() method, a vector of NodeParams, see below, copies the current best splits stored in the node searcher object. The variables to split on are then looped over, and a vector of variable splitters, one for each terminal node, is initialized to calculate the best split for the current variable in each node. The best split for the current variable is extracted from this vector and used to update the copy of the best splits. At the end of this loop, the updated copy of the best splits is assigned to the best splits stored in the node searcher object. This design follows a MapReduce structure and allows the search process to be parallelised; the details of this parallelisation (see parallel_details.h) are also stored as a private member of the node searcher class.
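A hedged sketch of this copy-update-commit loop is shown below; NodeParams is reduced to a stub and BestSplitOfVariable stands in for the varsplitter machinery:

```cpp
#include <cstddef>
#include <vector>

// Bare stand-in for the real NodeParams (node_parameters.h), which also
// describes the proposed split itself.
struct NodeParams { double improvement = 0.0; };

// Hypothetical evaluator: best split of variable `var` within node `node`.
NodeParams BestSplitOfVariable(int /*var*/, int /*node*/) { return {}; }

void GenerateAllSplits(std::vector<NodeParams>& best_splits,  // one per terminal node
                       const std::vector<int>& vars_to_try) {
    std::vector<NodeParams> proposed = best_splits;  // copy the current bests
    for (int var : vars_to_try) {                    // "map": variables are independent,
                                                     // so this loop can be parallelised
        for (std::size_t node = 0; node < proposed.size(); ++node) {
            NodeParams candidate = BestSplitOfVariable(var, static_cast<int>(node));
            if (candidate.improvement > proposed[node].improvement)
                proposed[node] = candidate;          // "reduce": keep the better split
        }
    }
    best_splits = proposed;                          // commit the updated bests
}
```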

The node class itself contains pointers to its children, and its methods, such as calculating predictions, depend on the type of split present. A node can either be terminal, so it has no split, or have a continuous or categorical split. To account for this dependency on the split, the node class is implemented with a state design pattern, similar to the Cox PH distribution, but in this instance it possesses a SetStrategy() method so the implementation of a node can change as the tree grows. The variable splitter objects contain "node parameters", see node_parameters.h, which encapsulate the properties of the proposed split nodes. These NodeParams objects use NodeDef structs, located in node_parameters.h, to define the proposed node splits, and provide methods to evaluate the quality of the proposed splits. Finally, the varsplitter class also has a state design pattern, where the methods determining how the variable of interest should be split are set on construction, see Figure 11.

Figure 10. State pattern for the node object; it possesses a SetStrategy() method so nodes can change type as the tree grows.

Figure 11. State pattern for the variable splitter object. How it splits on a variable is set on construction.
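The following is a minimal sketch of such a state pattern with a SetStrategy() method; the class names are illustrative stand-ins for CNode and its strategies:

```cpp
#include <memory>
#include <utility>

// Behaviour interface: terminal, continuous-split and categorical-split
// nodes each supply their own implementation.
class NodeStrategy {
 public:
  virtual ~NodeStrategy() = default;
  virtual double Predict(const double* row) const = 0;
};

class TerminalStrategy : public NodeStrategy {
 public:
  explicit TerminalStrategy(double prediction) : prediction_(prediction) {}
  double Predict(const double* /*row*/) const override { return prediction_; }
 private:
  double prediction_;
};

class Node {
 public:
  // Swap the behaviour in place when the node is split or made terminal.
  void SetStrategy(std::unique_ptr<NodeStrategy> s) { strategy_ = std::move(s); }
  double Predict(const double* row) const { return strategy_->Predict(row); }
 private:
  std::unique_ptr<NodeStrategy> strategy_;
};
```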

To conclude this section, formal system class and interaction diagrams are presented in Figures 12 and 13 respectively. These provide a complete and more detailed picture of the classes and interactions which define the system, in particular the process of training a gbm.

Figure 12. System class diagram showing the classes and how they interface in the training method.

Figure 13. Sequence diagram for training a single tree when fitting a gradient boosted model.

R Layer Design

Overview of the System

The R GBM package consists of two parts: an R front-end that takes user inputs and pre-processes them before passing a set of arguments to the second part, the C++ back-end, which does the main algorithmic calculations. The latter part is described above; on completion of its calculations, the results are passed back to the R layer for post-processing, where they are wrapped into a model object and returned to the user for use in further calculations and reporting. This part of the wiki focuses on describing the R front-end, providing descriptions of its scope, functionality and structure - for further documentation on the R API and the functionality within, please see the R GBM package vignettes and in-built documentation.

System scope

The R front-end, referred to as the "system" for the remainder of this section, provides an API that allows the user to: train a GBM model, make predictions on new data using a fitted model, and calculate and plot the marginal effects of variables on the model by "integrating out" the other variables, as well as a host of other functionality such as additional fitting and ranking of the relative influence of variables within the boosted model. The primary purpose of the system is to take user inputs and pre-process them before calling the appropriate C++ functionality. The output from the back-end is then post-processed and returned to the user for further manipulation and calculations. The primary task of fitting a GBM model has informed the design of the system and encompasses the most complex manipulations it performs. The other pieces of functionality provided by the system are either purely functional in nature, with no object-oriented design, or rely on the OO design tailored for fitting a GBM model.

Goals

The structure imposed on the R front-end was selected to:

  • Replace a legacy setup with one more amenable to change and maintenance.
  • Make the functionality more modular and transparent via appropriately designed S3 classes, methods and helper functions.
  • Implement input checks where possible to constrain the behaviour of the system to be well-defined and explicit.

Role of the system in the GBM package

The system acts as the interface between the user and the algorithmic calculator which is the back-end. It provides an API to access the C++ back-end and performs all of the necessary pre-processing and post-processing associated with training a gradient boosted model, making predictions based on new predictor datasets and calculating the marginal effects of covariates within a fitted model. Beyond this, the system provides a suite of additional functionality including visualisations of the fitted models and trees, summaries of the fitted model object, model performance evaluation and calculation of the loss of the response given the predictions.

Viewpoints of the system

The description of the system will focus on the Functional and Development views. This is due to the functional nature of the R programming language and the S3 OO system employed within the system. The system is located in the /R directory within the GBM package and was developed in line with Hadley Wickham's R good practice style.

Main Use Cases under consideration

The main use cases of the system that will be considered are:

  • Pre-/Post-processing when training a gbm model using data and an appropriate statistical model.
  • Processing for predicting responses from new covariates using a trained model.
  • Evaluating the performance of a model.
  • Pre- and post- processing associated with calculating the marginal effects.

Functional view

The system is composed of several classes, created to keep functionality simple and to encapsulate data for the pre- and post-processing stages of model fitting. The main classes are as follows:

  • GBMDist - a distribution object which describes the model used in the algorithmic work.
  • GBMTrainParams - an object capturing the user specified training parameters passed to the R API.
  • GBMData - a data object constructed from the input data and model to enable manipulation of the data in preparation for model fitting or in other calculations.
  • GBMVarCont - an object storing the properties of the predictor variables passed as arguments to the model fitting routine.
  • GBMFit - a fitted gbm model.

The functionality and dynamic behaviour of the main use cases will now be described.

Training a gbm model - the R layer

The user can train a gbm model through the functions gbmt and gbmt_fit; the former is a general-purpose function that does some preprocessing before calling gbmt_fit, a more direct call to the C++ layer suitable for advanced users. The processes of these functions are shown in Figures 14 and 15 respectively. These functions take a number of arguments as inputs. In particular, the training parameters for fitting the model, e.g. the number of iterations, and the distribution are provided via GBMTrainParams and GBMDist objects respectively. These objects are created by calls to the training_params() and gbm_dist() constructors; please see the vignettes within the package for details and examples. Upon entry into gbmt, the appropriate model parameters and predictor/response variables are extracted from the user inputs, and the arguments are checked to confirm their typing and validity. If possible, the names of the predictor variables are extracted and stored to be passed later to gbmt_fit as an argument. The data provided may have specific groupings, which are determined using the appropriately named S3 method determine_groups. The training parameters are then updated to reflect any groupings within the data; specifically, the number of training observations/groups to be used in the model fitting is updated via the method update_num_train_groups(). After this update, the extracted model data and processed user inputs are passed to gbmt_fit.

Figure 14. Activity diagram for the pre- and post-processing associated with training a gbm model using gbmt.

Figure 15. Activity diagram for training a gbm model through gbmt_fit.

The first step in the gbmt_fit process is to check the inputs; this is because it is directly accessible to advanced users and not just through gbmt, although these checks do not overlap with those in gbmt. From the predictor and response data provided, a GBMData object is created via the constructor gbm_data(). The predictor variable properties, such as the monotonicity of their relationship with the response, are then passed to gbm_var_container to construct the GBMVarCont object which stores these properties. If the distribution object defines a CoxPH model, the appropriate strata and sorting matrices are created and stored in the GBMDist object. The GBMData object then has its categorical factors converted to appropriate numerics and is ordered according to any groupings and the observation ids, and the order in which the predictor observations should be examined when growing trees in the C++ layer is determined. The observation ids are also appropriately ordered, and cross-validation fold ids are assigned to each observation via the GBMDist S3 method create_cv_groups.

With the pre-processing complete, a GBMFit model is built using all of the available training data with a call to the back-end. This object contains the GBMTrainParams, GBMDist and GBMVarCont objects used to train it, as well as the outputs from the back-end algorithm such as the fitted trees. Afterwards, if the user specified that the model was to be cross-validated, a series of models is built in which the training data is split into training and testing folds before the GBMDist and GBMTrainParams objects are updated and the C++ layer is called (see gbm in the C++ design section of the wiki). The outputs of this process, along with the full model, are stored in a GBMCVFit object, which is effectively a list of GBMFit objects. The cross-validation error and fits are then calculated; the latter is simply an average of the outcomes of calls to the predict method applied to each individual cross-validated GBMFit model. The full model is then augmented with these errors and predictions (along with parallelisation details) if they were calculated, and with the full GBMData object if the user so specified. The full model fit is reordered into the appropriate order before being passed back to the caller.

Back at the gbmt layer, the full GBMFit model plus cross-validated data is further augmented with details of the model call before the fitted model object is returned to the user.

Processing for predictions

Using a previously constructed GBMFit object, the system provides the user with an API to predict the response of additional covariate data. This functionality is accessible through the S3 method predict, which prepares the inputs before calling the C++ back-end to calculate the responses. The sequence of steps in this function is straightforward and is displayed in Figure 16. Firstly, the user inputs are checked before the variable names are extracted from the fitted model, if available, and applied to the new data. The named covariate data is then examined, and any categorical data/factors are converted to appropriate numerics for the back-end. This function allows predictions to be calculated from multiple iterations of the boosted fit, and so the next step is to ensure each requested iteration is within the scope of the fitted model. If a requested prediction iteration exceeds the number of previously fitted iterations, the request is set to the maximum number of iterations used in constructing the boosted model. The requested iterations are then ordered before the back-end function gbm_pred is called to calculate the predictions. The resulting predictions are converted to a matrix and scaled appropriately according to the GBMDist object within the model fit before being returned to the user.

Figure 16. Activity diagram for making predictions using a GBMFit S3 object.

Evaluation of the performance of a model

The optimal iteration of a model produced by gbmt or gbmt_fit can be evaluated using the gbm_perf function provided by the system; this function is defined in gbm-perf.r. It takes a GBMFit object along with the performance evaluation method as its primary inputs. The evaluation method specified can be "test", "cv" or "OOB", indicating that the test set error, cross-validation error or out-of-bag improvement respectively will be used to determine the optimal iteration. Initially, gbm_perf checks its inputs before identifying the best iteration according to the metric specified by the user. If requested, a graph of the various model errors against iteration number is then produced, before the method ends by returning the identified "best iteration" to the user.

Figure 17. Activity diagram for the evaluation of the performance of a trained gbm model.

Calculating marginal effects

Using a previously fitted GBMFit object, the system provides an API through the gbmt_plot function to calculate and plot the marginal influences of the predictor variables on the fitted model. When called, the system again checks that the user has provided the correct inputs. If the iteration at which the marginal influence is to be calculated exceeds the number in the original model fit, the requested iteration is set to the final iteration of the fit. The system only supports plotting the marginal influences of up to three predictor variables; if more are examined, the return_grid parameter is set to TRUE and no plots are displayed. After these checks, the grid of points at which the marginal influence of the selected predictors will be evaluated is constructed, and the appropriate back-end function (in this case gbm_plot in gbmentry.cpp) is called. The output from this call is the marginal influence of each predictor, which is then joined to the grid of evaluation points constructed earlier. If the return_grid parameter is set to TRUE, this new grid structure is returned to the user and the function terminates. If it is FALSE, the categorical/factor predictor values in the grid are converted to appropriate numerics and the names of the predictor variables are set. The marginal influences are then plotted, with the specific plot functionality determined by the form of the predictors, i.e. originally continuous or a factor, and the number of predictors selected. The whole procedure undertaken by gbmt_plot is displayed in Figure 18.

Figure 18. Activity diagram for the process of calculating the marginal effects of predictors - from the R layer perspective.

System structure

As alluded to in the previous section, the system is structured around five main S3 classes that define the distribution, data, training parameters and variable properties associated with the model fit, as well as the model fit itself. The first objects required for fitting a GBM model define the training parameters and the model distribution to be used in the fit; the public constructors for these objects are defined in training-params.r and gbm-dist.r respectively. During the fitting process, appropriate container classes for the data and variable properties are constructed and are ordered/validated using the training parameters and distribution objects, before all of these objects are wrapped up in a single GBMFit object at the end of the process. Beyond encapsulating the appropriate data, this design allows much of the package's functionality to be implemented as S3 methods, and where this is not necessary the inputs are more homogeneous and thus simpler to validate. These classes will be discussed individually with reference to the full system class diagram shown in Figure 19.

Figure 19. Full system class diagram for the R layer of the GBM package.

GBMDist Class

Much of the system's behaviour and functionality depends on the distribution selected by the user when training a gbm model. To reflect this, the system contains a GBMDist class around which much of its functionality is built. The GBMDist class itself is an abstract class that cannot be constructed directly within the system. Instead, a call to the gbm_dist constructor will create a specific distribution object that inherits from the GBMDist class, see Figure 20. This constructor is a wrapper that first creates an empty distribution object of the appropriate derived class and then uses an S3 method, see create-distributions.r, to construct the full distribution object containing the associated distribution-specific data. This data is provided by the user through the call to gbm_dist and, along with it, the constructed object will contain an appropriately populated name field.

Figure 20. Class diagram of the S3 distribution classes used in the R layer.

The majority of functionality within the system depends on the distribution selected, and this is reflected in the S3 nature of the system as well as in switch statements on the object's name field that appear in simpler control flows. Notable examples of this dependence include the implementation of strata (only for "CoxPH") and the validation of the response and ordering of the data; furthermore, the Bernoulli distribution offers stratification of responses during cross-validation. The distribution object also determines how groupings in the data are handled, if at all (this is only relevant if "Pairwise" is selected), and the extraction of distribution-specific data is likewise distribution dependent (this is handled by the internal method get_misc()). Beyond this, many helper functions within the package depend on the constructed distribution object; examples include determining the labels of various performance plots and calculating the appropriate scaling of the fitted model predictions. The primary S3 methods associated with this class are shown, along with the distribution-dependent data, in Figure 19.

GBMTrainParams Class

Model parameters that define how the model will be trained are stored within the GBMTrainParams object, which can be constructed by calling the training_params function (see training-params.r). This is a simple data object primarily used to store important GBM model-fitting parameters such as the number of boosted trees to fit, the maximum depth of each fitted tree, the number of training observations and the observation ids. The constructor takes a list of named parameter inputs and simply stores them in a named argument list. During construction, the number of rows per observation and the id order are determined and stored within the object. The GBMData object has its predictors and response ordered according to the observation id order as part of the pre-processing in gbmt_fit. Post model fitting, this object is stored as a field in GBMFit and is used to check inputs to other functions, to control process flow and to provide arguments for further calculation; methods such as gbm_more() illustrate the latter use.

GBMData Class

The GBMData class is another simple object, created by passing the covariate data and response data, along with the response weights and offsets, to the constructor defined in gbm-data.r. This constructor is not accessible to the user and is instead called during the gbmt_fit pre-processing. Once created, the pre-processing in gbmt_fit converts the categorical data stored within the object before ordering it according to any groups present and the observation ids. The order of the predictors is then determined and stored in the x_order field within the object. The GBMData object then plays a key role in the construction of the appropriate strata and the ordering of observations for CoxPH boosted models. If cross-validation is undertaken, the data within the object is split and re-joined. The object is unpacked to supply appropriate arguments to the call to the back-end. It can optionally be stored within the fitted model for later access, so the user does not need to re-supply the data, but its role within the system is primarily as a container during the fitting process.

GBMVarCont Class

This class is also a simple container class, defined in variable-properties.r, that stores data associated with the covariates used for fitting. It cannot be constructed directly by the user and is instead produced through the model-fitting process: it is initialized during the pre-processing of gbmt_fit, using the variable names, the monotonicity and the GBMData object. During construction, the variable type and the number of levels (0 if continuous and a positive integer otherwise) associated with each covariate are determined and stored in the object. The object is not used further in the process until its properties are unpacked as arguments to the back-end C++ call. After fitting, it is stored in the GBMFit object and is accessed by other functionality to extract predictor variable information; typical information required by other functions is the total number of variables used in the fit, see gbm-summary.r, or the type of each covariate. This class has no S3 methods associated directly with it, as shown in Figure 19.

GBMFit Class

The GBMFit class contains the output from the call to the gbm fitting functionality in the C++ back-end, along with the training parameters, variable container and distribution objects used in the model-fitting process. It also contains details of the user call, cross-validation details (such as the number of folds used) and the parallelization parameters used in its construction. As the output of a call to gbmt_fit, this object is primarily used in further calculations by the user. The user may evaluate the model's performance via gbm_perf and make predictions on new predictor data through its predict method. Additional boosted iterations may be added to the model through the gbm_more method, and the marginal influences of the various predictors can be calculated by passing the object to gbmt_plot. Finally, it has an interact method that calculates the relevance of interactions between predictors in the fit, as well as print and summary methods to display the model object. Outside of these cases, the object can be unpacked and its constituent data objects used in other functions within the system, e.g. the loss method works with the fit's GBMDist object and the model predictions.

Parallelization of the algorithm

When fitting a model, gbmt and gbmt_fit provide an API for parallelizing the underlying tree-boosting code. This parallelization is achieved using OpenMP and can be activated by passing a gbmParallel object, defined in gbm-parallel.r, to the par_details argument of gbmt and gbmt_fit. This object stores data in appropriately named fields that set the number of threads to be used and the size of the "chunks" used in array scans by the back-end during fitting. Some experimentation may be required to find the optimal configuration for a given set-up.
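As an illustration, the sketch below shows how such settings typically map onto an OpenMP loop; the function and variable names are hypothetical, not the package's internals:

```cpp
#include <omp.h>

// Hypothetical array scan parallelised with OpenMP; num_threads and
// chunk_size play the roles of the gbmParallel fields described above.
void ScanArray(const double* x, double* out, int n,
               int num_threads, int chunk_size) {
    omp_set_num_threads(num_threads);
    // Static schedule with the user-supplied chunk size for the array scan.
    #pragma omp parallel for schedule(static, chunk_size)
    for (int i = 0; i < n; ++i) {
        out[i] = 2.0 * x[i];  // stand-in for the real per-element work
    }
}
```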

Deployment

To use the GBM package it is necessary to install a version of R, preferably 3.2 or higher, on your system. With R installed, the GBM package may be installed by opening an R terminal and running the install.packages("gbm") command.

Testing

The system is subject to two types of testing: black-box system/acceptance tests and unit tests. These tests are designed to run with the testthat package and ship with the GBM package in the /tests/ folder. The package can be checked using R CMD check <built package>, which will automatically run all of the tests within it and report any warnings or errors.

Current Coverage

The package has a series of unit and black-box system tests which check that the package is operating as expected. The R layer has tests for nearly every piece of functionality; these check each piece of functionality's response to undesired inputs as well as its outputs when correct inputs are provided. This ensures that all of the "work" and functionality provided by the R layer, for every distribution available, behaves as expected. The black-box system tests, on the other hand, check the overall behaviour of the package using a variety of 'sanity test cases'. These tests primarily focus on the Gaussian, Bernoulli and CoxPH distributions, but additional tests, such as checking the effects of the offset parameter, touch all distributions implemented in the package. Using the covr package, the Travis CI server automatically collects code coverage data and sends it to coveralls.io. As of 11/08/2016 this coverage is at 91.57%, with high coverage levels for both the R and C++ layers. This coverage statistic was taken directly from the Travis CI build and may not have transferred correctly to the badge on the repository's main page.

It should be noted that this coverage tool is not perfect and tends to underestimate the code coverage. In particular, when working on the underlying C++ layer, it tends to miss header files and inlined code that are exercised. The tests do miss some node functionality, such as PrintSubTree and GetVarRelativeInfluence, and they do not touch many unhappy/error-throwing paths in the underlying C++. That being said, even without direct tests on the underlying C++, most of the critical paths are exercised by the tests provided.

Finally, thanks to a recent update, the C++ layer itself can now be tested using R's testthat package. These unit tests have yet to be written, due to the high coverage obtained through appropriate R API calls, but they would provide further reassurance that each component in the C++ layer is behaving as expected.