
Fast Random Forest

Overview


A straight random forest implementation, created for when I just want a classifier or a regressor. In other words it's not designed for flexibility like my other two implementations, though it's still plenty powerful. It was actually created in frustration at the scikit-learn one - I found myself spending more time loading the model from disk (because scikit-learn requires the use, at least at the time of writing, of the awful Python pickle system) than actually using it. 'Fast' actually refers to its ability to load a model really quickly - I designed it so the in-memory and on-disk data layouts are identical, so loading is a straight copy into memory with no clever logic required (each Tree in the model provides the memoryview interface, with full read/write support!).

Explore the test files to see use cases. Typical usage is to create a Forest() object, then call the configure method. The configure method is probably the most fiddly bit - it defines the inputs and outputs (you can have multiple outputs, though that's generally not useful) using three strings of codes (one character per code), where the codes are listed in the documentation and provided by the info.py script. The first string specifies the summary type, which is what is being learnt for each output. For instance 'C' means one categorical output, which would typically be used for a classification forest. The second string specifies what it is greedily optimising when learning, one code per output (the first and second strings must be the same length). 'C' for this string would mean one output, categorical, for which the system has an entropy based objective. This separation is so you can have different objectives with the same output type, though only entropy ones are provided at this time. The final string tells the system how it can use the inputs to the random forest - effectively the kinds of test to generate for each input feature when deciding which branch to go down. 'OSS' would mean a length three feature vector where the first feature is categorical, for which it uses one vs all tests, and the second and third are both real, for which it generates split tests based on a comparison. The Forest object also has a load of variables, which control things like maximum tree depth.
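As a concrete sketch of that configure call, mirroring the 'C'/'C'/'OSS' example above (everything about the data itself is an assumption for illustration):

    import frf

    forest = frf.Forest()
    forest.configure('C',   # summary: one categorical output distribution
                     'C',   # information: entropy based categorical objective
                     'OSS') # inputs: one vs all tests on feature 0, split tests on features 1 and 2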

After the Forest is set up the train(x, y, # of trees to add) method will add trees. Be aware that Tree objects can be moved from one Forest object to another and serialised - this is so learning using multiple cores is trivial (you can serialise the Forest object as well, so you only have to configure it once!). This method can be called repeatedly, to keep adding trees. The data set does not have to be the same each time - usually that would be used for incremental learning, where you train new trees with the extra data, then cull trees with poor OOB performance. The train method returns the out-of-bag (OOB) error. Finally, once a Forest is trained the predict(x) method will return the predictions for the given data matrix. Note that the entire system supports passing in tuples/lists of data matrices (each of which is a 2D numpy array), so you can have both discrete (int) and real (float) features at the same time. You can also weight the exemplars. The Forest and Tree objects additionally have loads of extra methods for diagnostics, configuration and i/o - see the documentation for details.
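Continuing the sketch above, a minimal train/predict round trip with a mixed discrete/real input passed as a list of matrices (the shapes and values are made up, and the exact dtypes the module expects are an assumption):

    import numpy

    x_cat  = numpy.random.randint(0, 4, size=100)             # 1D: indexed by exemplar, one discrete feature
    x_real = numpy.random.rand(100, 2).astype(numpy.float32)  # 2D: exemplars X 2 real features
    y      = numpy.random.randint(0, 3, size=100)             # one categorical output

    oob = forest.train([x_cat, x_real], [y], 8)  # add 8 trees; returns per-output OOB error
    res = forest.predict([x_cat, x_real])        # list indexed by output feature
    print(res[0]['prob'])                        # per-exemplar class probabilities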

I/O is one of the strong points of the system - see the save_forest and load_forest functions in frf.py for examples of how it works.
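In practice it's just a couple of calls (the filename is arbitrary; '.rf' is the suggested extension):

    frf.save_forest('model.rf', forest)    # bzip2 compressed: Forest header followed by each Tree
    forest = frf.load_forest('model.rf')   # load it straight back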

If you are reading readme.txt then you can generate documentation by running make_doc.py

Contains the following key files:

frf.py - The file a user imports - provides the Forest class, the Tree class (can be ignored) and two methods for file i/o.

info.py - Dynamically generated information about the summary, information and learner types available to the system.

test_*.py - Some test scripts.

readme.txt - This file, which is included in the html documentation.

make_doc.py - Builds the html documentation.


Summary types

Nothing code = N A summary type that does nothing - it does not in any way summarise the feature index it is assigned to. Use it if you either have a multi-index summary type on an earlier feature, and hence don't need to summarise this feature index twice, or have some excess feature in your data structure that you just want to ignore.

Categorical code = C A standard categorical distribution for discrete features. The indices are taken to go from 0 to the maximum given by the data matrix, inclusive - any value outside this is ignored, effectively being treated as unknown. Output when converted to a python object is a dictionary - the key 'count' gets the number of samples that went into the distribution (float, due to weighting), whilst 'prob' gets an array, indexed by category, of the probabilities of each. For the array case 'count' becomes a 1D array and 'prob' becomes 2D, indexed [exemplar, category]. The error calculation is simply zero if the most probable value matches the true one, the feature weight if it does not.

Gaussian code = G Expects continuous values, which it models with a Gaussian distribution. For output it dumps a dictionary - indexed by 'count' for the number of samples that went into the calculation (float, as weighted), 'mean' for the mean and 'var' for the variance. For a single sample these will be standard python floats, for an array evaluation numpy arrays. When returning errors it returns the absolute difference between the mean and the actual value, weighted.

BiGaussian code = B A bivariate version of Gaussian - uses the given feature index and the next one as well. The output format is much the same, except 'mean' is a length 2 array and a 2x2 array indexed by 'covar' replaces the 'var' entry; for the array version both gain the extra exemplar dimension. Error is the Euclidean distance from the mean.

Information types

Nothing code = N Does nothing - always returns an entropy of zero, to say we are not encoding this information. Can be used to hide an output channel, either because you don't want it included or another channel is already using it.

Categorical code = C Provides the entropy of a categorical distribution for the optimisation subsystem - the default for discrete distributions.

Gaussian code = G Gives the entropy of a Gaussian distribution over the data - default for continuous data.

BiGaussian code = B Gaussian of two variables - the first is the feature index it's assigned to, the second is the next one along - don't put it in the last slot, as there is then no next feature and it will go wrong.

Learner types

Idiot code = I A learner that can't learn - it will never return a test and hence the system will ignore the feature it's assigned to. Included for completeness - actual use cases all involve laziness.

Split code = S The standard feature splitting approach - finds the split point that maximises the information gain - default for continuous variables.

OneCat code = O For discrete features - one category gets to pass, all others fail, optimised to maximise information. Default for discrete features.

Functions

save_forest(fn, forest) Saves a forest - the actual Forest interface is fairly flexible, so this is just one way of doing it - streams the Forest header followed by each Tree into a bzip2 compressed file. Suggested extension is '.rf'. The code for this is in Python, and forms a good reference if you need to write your own i/o for this module.

load_forest(fn) Loads a forest that was previously saved using the save_forest function. The code for this is in Python, and forms a good reference if you need to write your own i/o for this module.

Classes

Forest(object)

A random forest implementation, designed with speed in mind, as well as I/O that doesn't suck (almost - it should be 32/64 bit safe, but not safe against endian changes) so it can actually be saved/loaded to disk. Remains fairly modular so it can be customised to specific use cases. Supports both classification and regression, as well as multivariate output (including mixed classification/regression!). Input feature vectors can contain both discrete and continuous variables; kinda supports unknown values for discrete features. Provides the sequence interface, to access the individual Tree objects contained within. Note that it is not thread safe - use multiprocessing if parallelism is required, for which it has good support.

append(?) Appends a Tree to the Forest - one that has presumably been trained in another Forest and is now to be merged. Note that the Forests must be compatible (identical type codes given to configure), and this is not checked. Break this and expect the fan to get very brown.

clear(?) Removes all trees from the forest. Note that because you can't remove individual trees, if you want to be selective you can use the sequence interface to get all of them, clear the forest, then append each tree that you want to keep.
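For instance, a minimal sketch of that selective approach (the keep_tree predicate is hypothetical - e.g. some diagnostic you have recorded per tree):

    trees = list(forest)          # sequence interface: grab every Tree
    forest.clear()
    for tree in trees:
        if keep_tree(tree):       # hypothetical test, e.g. recorded OOB performance
            forest.append(tree)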

clone(?) Returns a new Forest with the exact same setup as this one, but no trees. Be warned that it also duplicates the key for the random number generator, so any new trees trained with the same data in both will be identical - you may want to change the key after cloning.

configure(?) Configures the object - must be called before any learning, unless load is used instead. Takes three tag strings - first for summary, next for info, final for learn. Summary and info must have the same length, being the number of output features, whilst learn is the length of the number of input features. See the various _list static methods for lists of possible codes. Will throw an error if any are wrong. It can optionally have two further parameters - an array of category counts for x/input and an array of category counts for y/output - these are used when building categorical distributions over a feature to decide on the range, which will be [0,category). Values outside the range will be treated as unknown. Negative values in these arrays are ignored, and the system reverts to calculating them automatically.
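A sketch of those optional category counts (the counts and the integer dtype are assumptions about your data):

    import numpy

    forest = frf.Forest()
    forest.configure('C', 'C', 'OS',
                     numpy.array([5, -1]),   # x: feature 0 has 5 categories, feature 1 is continuous
                     numpy.array([3]))       # y: output 0 has 3 categories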

error(?) Given an x/input data matrix and a y/output data matrix of true answers (same as train) this returns an array, indexed by output feature, of how much error exists in that channel. Same as the oob calculation, but using all trees, and therefore for a hold out set etc. If you want a weighted output then it should be provided in the y data matrix - any weights in x will be ignored.

importance(?) Returns the importance of each feature as calculated during training for every tree currently in the forest. This is a new numpy vector indexed by feature that gives the information gain obtained from splits on that feature, weighted by the number of training exemplars that went through that split. Note that this is different from the tree version of this method, as it divides through by the number of exemplars, so the weighting is one only for the very first split, and then averages the vectors provided by all of the trees. This gives a metric which is the average information gain (in nats, or whatever the training objective uses) provided by the feature per exemplar, though most people then normalise the entire vector to get a relative feature weighting.
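For example, to get that relative weighting (a minimal sketch):

    imp = forest.importance()   # average information gain per exemplar, per feature
    rel = imp / imp.sum()       # normalise so the relative weights sum to one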

info_list(?) A static method that returns a list of information (as in entropy) types, as dictionaries. The information types define the goal of a split optimisation procedure - for any split they give a number for performing that split, typically entropy, that is to be minimised. This is on a per output feature basis. Each dictionary contains 'code', one character string, for requesting it, a long form 'name' and a 'description'.

initial_size(?) Returns the size of a Forest's initial header, so you can load that to get basic information about the forest, then load the rest of the header, then the trees.

learner_list(?) A static method that returns a list of split learner types, as dictionaries. The learner types optimise and select a split for an input feature. Each dictionary contains 'code', one character string, for requesting it, a long form 'name' and a 'description'. Also contains 'test', a one character string, of the kind of test it generates, though this is of little use as it's internal only.

load(?) Given an entire header (see initial_size and size_from_initial for how to do this) as a read-only buffer compatible object this initialises this object to those settings. If there are any trees they will be terminated. Can raise a whole litany of errors. Returns how many trees follow the header - it is up to the user to then load them from whatever stream is providing the information.

max_x(?) Returns an array of maximum values (integers) indexed by x/input feature - these are used for building categorical distributions over that feature, which will go from 0..max_x inclusive. The array can include negative values, which indicate unknown (calculate when needed from the data)/irrelevant (continuous).

max_y(?) Returns an array of maximum values (integers) indexed by y/output feature - these are used for building categorical distributions over that feature, which will go from 0..max_y inclusive. The array can include negative values, which indicate unknown (calculate when needed from the data)/irrelevant (continuous).

predict(?) Given an x/input data matrix (with support for a tuple/list of matrices, identical to train) returns what it knows about the output data matrix. The return will be a list indexed by feature, with the contents defined by the summary codes (typically a dictionary of arrays, often of things like 'prob' or 'mean'). You can provide a second parameter, an exemplar index, if you want to just do one item from the data matrix, but note that this is very inefficient compared to doing everything at once in a single data matrix (or several large data matrices if that is unreasonable).
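For a regression output the access pattern is the same, just with different keys (a sketch assuming output feature 0 was configured with the Gaussian summary and x_test is your held out data matrix):

    res = forest.predict(x_test)
    mean = res[0]['mean']   # per-exemplar predicted mean for output feature 0
    var = res[0]['var']     # and the matching predictive variance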

save(?) Returns the header for this Forest, such that it can be saved to disk/stream etc. and later loaded. The return value is a bytearray.

set_ratios(?) Sets the ratios to use when balancing the priority of learning each output feature - must be a 2D numpy array with the first dimension indexed by depth, the second indexed by output feature. The depth is indexed modulus its size, so you can have a repeating structure. Can only be called on a ready Forest, and it keeps a pointer to the array, so you can change its values afterwards if you want. Note that the ratios affect the feature importance measure - if you are using that it's probably wise to normalise each set of depth ratios, so the contributions from each level are comparable.
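A sketch of a repeating two-deep ratio structure over two output features (the numbers and the dtype are arbitrary assumptions):

    import numpy

    ratios = numpy.array([[0.7, 0.3],    # even depths: favour output feature 0
                          [0.3, 0.7]],   # odd depths: favour output feature 1
                         dtype=numpy.float32)
    forest.set_ratios(ratios)            # keeps a pointer, so later edits to ratios take effect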

size_from_initial(?) Given the initial header, as a read-only buffer compatible object (string, return value of read(), numpy array), this returns the size of the entire header, or throws an error if there is something wrong.

summary_list(?) A static method that returns a list of summary types, as dictionaries. The summary types define how a particular target variable leaf summarises the exemplars that land in it, per output feature, and in what form that is returned to the user. Each dictionary contains 'code', one character string, for requesting it, a long form 'name' and a 'description'.

train(?) Trains and appends more trees to this Forest - the first parameter is the x/input data matrix, the second is the y/output data matrix, the third is the number of trees, which defaults to 1. Data matrices can be either a numpy array (exemplars X features) or a list of numpy arrays that are implicitly joined to make the final data matrix - good when you want both continuous and discrete types. When a list contains 1D arrays they are assumed to be indexed by exemplar. The list can also contain a tuple, ('w', 1D vector), which will contain a weight for each exemplar, as in how many exemplars it counts as - good for imbalanced data. Note that only a weight in y matters - a weighted x is silently ignored. If bootstrap is true this returns the out of bag error - an array indexed by output feature of how much error exists in that channel - note that these are independent calculations and it's up to the user to combine them as desired if an overall error measure is required. A fourth optional parameter is a callback function, used to report progress - it will be called as func(# of work units done, total # of work units). Note that any errors it throws will be silently ignored, including not accepting those parameters.
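A sketch of training with per-exemplar weights and a progress callback, reusing the arrays from the earlier sketch (the weight values and the float32 cast are illustrative assumptions):

    weights = numpy.where(y == 0, 5.0, 1.0).astype(numpy.float32)  # hypothetical: up-weight class 0

    def progress(done, total):
        print('%i / %i work units' % (done, total))

    oob = forest.train([x_cat, x_real], [y, ('w', weights)], 16, progress)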

bootstrap = True to train trees on bootstrap draws of the training data (The default), False to just train on everything.

info_codes = Returns the codes used for creating information gain objects, a string indexed by y feature that gives the code of the info calculator to be used for that feature.

info_ratios = Returns the information ratios numpy array, if it has been set. A 2D array, indexed by depth in the first dimension, by y-feature in the second (the first dimension is accessed modulus its size). Gives the weight of the entropy from that feature when summing them together, so you can control the objective of the tree.

learn_codes = Returns the codes used for creating learners, a string indexed by x feature that gives the code of the learner to suggest splits for that feature.

max_splits = Maximum number of splits when building a new tree. Defaults so high you will run out of memory first.

min_exemplars = Minimum number of exemplars to allow in a node - no node should ever have less than this count in it. Defaults to 1, making it irrelevant.

opt_features = Number of features to randomly select to try optimising for each split in the forest. Defaults so high as to be irrelevant. The recommended value to set this to is the sqrt of the number of features - a good tradeoff between tree independence and tree performance.

ready = True if it's safe to train trees, False if the forest has not been setup - i.e. neither has a header been loaded nor has it been configured. Note that safe to train is not the same as safe to predict - it could contain 0 trees (check with len(forest)).

seed0 = One of the 4 seeds that drives the random number generator used during tree construction. Will change as it is moved along by the need for more pseudo-random data.

seed1 = One of the 4 seeds that drives the random number generator used during tree construction. Will change as it is moved along by the need for more pseudo-random data.

seed2 = One of the 4 seeds that drives the random number generator used during tree construction. Will change as it is moved along by the need for more pseudo-random data.

seed3 = One of the 4 seeds that drives the random number generator used during tree construction. Will change as it is moved along by the need for more pseudo-random data.

summary_codes = Returns the codes used for creating summary objects, a string indexed by y feature that gives the code of the summary to be used for that feature.

trees = Number of trees in the forest.

x_feat = Number of features it expects of the input x data matrix, which is the data matrix that is known.

y_feat = Number of features it expects of the output y data matrix, which is the data matrix that it is learning to predict from the x matrix.

Tree(object)

A tree within the Forest, but it provides no functionality by itself - it's only useful when attached to a Forest. Exists for loading Trees from a separate source, such as a file or another Forest object (with a compatible configuration). Constructed with a size, in bytes, and implements the memoryview(tree) interface, so you can extract/set the data within. The contents are entirely streamable, and therefore safe to be dumped to disk/socket etc. - file.write(my_tree) works, as does file.readinto(my_tree). It does have some static methods that provide useful information for loading a tree from a stream - size of the header, and header to total size.
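Putting those static methods (head_size and size_from_head, described below) together, a rough sketch of pulling a single Tree off a binary stream - the exact byte bookkeeping here is an assumption, and the save_forest/load_forest code in frf.py is the authoritative reference:

    # f is a binary file object positioned at the start of a Tree record.
    head = f.read(frf.Tree.head_size())    # fixed size Tree header
    total = frf.Tree.size_from_head(head)  # total bytes taken by this Tree

    tree = frf.Tree(total)                 # allocate an empty Tree of the right size
    view = memoryview(tree)
    view[:len(head)] = head                # copy the header in...
    f.readinto(view[len(head):])           # ...then stream the rest straight after it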

head_size(?) Returns how many bytes are in the header of a Tree, so you can read the entire header from a stream.

human(?) Returns a human understandable representation of the tree - horribly inefficient data structure and of no use beyond human consumption, so for testing/diagnostics/curiosity only. Each leaf node is represented by a string giving the distributions assigned to the output variables, each test by a string. Non-leaf nodes are then represented by dictionaries, containing 'test', 'pass' and 'fail'.

importance(?) Returns a new numpy vector of the importance of each feature as inferred from the training of this tree. The vector contains the sum from each split of how much information gain the feature of the split provided, multiplied by the number of training exemplars that went through the split.

nodes(?) Returns how many nodes are in the tree.

size_from_head(?) Given the head, as a read-only buffer compatible object (string, return value of read(), numpy array), this returns the size of the associated tree, or throws an error if there is something wrong.

trained(?) Returns how many exemplars were used to train this tree.

size = How many bytes are contained within.