
Stochastic Woodland (Deprecated)

Overview

Stochastic Woodlands

Note - this module has been superseded by the 'df' module. It should not be used by new code, and porting old code to the 'df' module might prove advantageous.

An implementation of the method popularised by the paper 'Random Forests' by L. Breiman which is a refinement, with a decent helping of rigour, of the papers 'Random Decision Forests' by T. K. Ho and 'Shape Quantization and Recognition with Randomized Trees' by Y. Amit and D. Geman.

Alternatively, and mildly comically, named on account of a trademark. (If someone could explain to me the financial basis for trademarking a name that describes a mathematical algorithm, for which no IP protection theoretically exists - obviously people abuse the patent system all the time - I might find this a lot less comical.)

A fairly basic implementation - it just builds the trees and works out error estimates using the out-of-bag technique, so the parameters can be tuned. It does support the weighting of samples, for when getting some right is more important than others. The trees constructed are ID3, with the C4.5 feature of supporting continuous attributes in addition to discrete ones. Previously unseen attribute values are handled by storing the distribution over categories at every node, so that distribution can be returned when an unknown value is encountered. The final output is always a probability distribution, effectively the normalised number of votes for each category.

Implemented in pure Python with numpy, so the implementation is not very efficient, but because the algorithm itself is so efficient it is still fast enough for real-world use on decently sized data sets. It is obviously quite a simple algorithm, such that anyone with a basic understanding of machine learning should be able to implement it, but it is well commented and the code should be easy to understand.
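For orientation, a minimal usage sketch might look like the following. The SWood constructor arguments shown (training matrices, category labels, tree_count) are assumptions - check swood.py or the test_*.py files for the real signature - but the classify call matches the interface documented below.

```python
import numpy as np
from swood import SWood

# Toy training data: one discrete attribute and one continuous attribute per example.
int_dm = np.array([[0], [1], [0], [1]])
real_dm = np.array([[0.2], [0.9], [0.3], [0.8]])
cats = np.array([0, 1, 0, 1])  # category label for each example

# Hypothetical constructor call - argument names and order are an assumption.
sw = SWood(int_dm, real_dm, cats, tree_count=32)

# Classify a new example; the result is a category -> probability dictionary.
dist = sw.classify(np.array([1]), np.array([0.85]))
print(dist)
```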

Contains the following files:

swood.py - Contains the SWood object, which is basically all you need.

dec_tree.py - Contains the DecTree object, which implements a decision tree in case that is all you want. This is where most of the system's functionality actually lives.

test_*.py - Various test files. Also serve as examples of how to use the system.

make_doc.py - Makes the html documentation.

readme.txt - This file, which is also copied into the html documentation.


Classes

SWood()

A stochastic woodland implementation (usually called a random forest :-P). Nothing that fancy - it does classification and calculates/provides the out-of-bag error estimate so you can tune the parameters.

classify(self, int_vec, real_vec) Classifies an example, given the discrete and continuous feature vectors. Returns a dictionary indexed by category that goes to the probability of that category being assigned; categories can be excluded, implying they have a value of zero, and the returned value is actually a defaultdict set up to return 0.0 when you request an unrecognised key. The probabilities will of course sum to 1.
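As a sketch of consuming that dictionary (continuing from the earlier example, so sw is a trained SWood instance):

```python
import numpy as np

int_vec = np.array([1])      # discrete attributes of the example to classify
real_vec = np.array([0.85])  # continuous attributes of the same example

dist = sw.classify(int_vec, real_vec)
best = max(dist, key=dist.get)  # category with the highest probability
print(best, dist[best])

# A category that never appeared still reads as 0.0, thanks to the defaultdict.
print(dist[999])
```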

multi_classify(self, int_dm, real_dm, callback) Identical to classify, except you give it a data matrix and it classifies each entry in turn, returning a list of distributions. Note that when using the compressed version of this class, using this method is essential for it to be computationally reasonable.

oob_success(self) Returns the success rate ([0.0, 1.0], more is better) for the woodland that has been trained. Calculated using the out-of-bag technique, and primarily exists so you can run with multiple values of option_count to find the best parameter, or see the effect of tree_count.
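A hedged sketch of such a parameter sweep, reusing the assumed constructor from the earlier example:

```python
# Try several values of option_count and keep the one with the best
# out-of-bag success rate; tree_count could be swept the same way.
best_rate, best_oc = -1.0, None
for oc in (1, 2, 3, 4, 6, 8):
    sw = SWood(int_dm, real_dm, cats, tree_count=64, option_count=oc)
    rate = sw.oob_success()
    if rate > best_rate:
        best_rate, best_oc = rate, oc

print('best option_count =', best_oc, 'with oob success rate', best_rate)
```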

tree_list(self) Returns a list of all the decision trees in the woodland. Note that when in compressed mode these will be strings of bzip2 compressed and pickled trees, which can be resurrected using pickle.loads(bz2.decompress(<>)).
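Following that description, restoring the trees from compressed mode would look something like:

```python
import bz2
import pickle

# In compressed mode each entry of tree_list() is a bzip2-compressed pickle.
trees = [pickle.loads(bz2.decompress(blob)) for blob in sw.tree_list()]
```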

DecTree()

A decision tree; uses ID3 with the C4.5 extension for continuous attributes. Fairly basic - it always grows fully and stores a distribution over categories at every node so it can fall back for previously unseen attribute values. Allows the training vectors to be weighted, and can be pickled. An effort has been made to keep this small, due to the fact that it's not unusual to have millions of them in memory.

classify(self, int_vec, real_vec) Given a pair of vectors, one for discrete attributes and another for continuous attributes, this returns the tree's estimated distribution for the example. This distribution takes the form of a dictionary, which you must not modify, that is indexed by category and maps to a count of how many examples with that category were in that leaf node. 99% of the time only one category should exist, though various scenarios can result in there being more than one.
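Because the values are raw counts (and the dictionary must not be modified), normalising a copy into a probability distribution is straightforward; tree, int_vec and real_vec are assumed to be a trained DecTree and the feature vectors of one example:

```python
counts = tree.classify(int_vec, real_vec)             # category -> count, read-only
total = float(sum(counts.values()))
dist = {cat: n / total for cat, n in counts.items()}  # normalised copy
```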

entropy(self) Returns the entropy of the data that was used to train this node. Really an internal method, exposed in case of rampant curiosity. Note that it is in nats, not bits.
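If bits are wanted the conversion is just a division by ln 2:

```python
import math

bits = tree.entropy() / math.log(2)  # nats -> bits
```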

getChildren(self) If it makes a discrete decision this returns a dictionary of child nodes, indexed by the value of the attribute the decision is being made on; otherwise None. Note that any unseen attribute value will not be included.

getHigh(self) If it is a continuous decision node this returns the branch down which samples with the attribute higher than or equal to the threshold go; otherwise None.

getIndex(self) Returns the index of either the discrete column or continuous column which it decides on, or None if it is a leaf.

getLow(self) If it is a continuous decision node this returns the branch down which samples with the attribute less than the threshold go; otherwise None.

getThreshold(self) If it is a continuous node this returns the threshold between going down the low and high branches of the decision tree, otherwise returns None.

isContinuous(self) Returns True if it makes a decision by splitting on a continuous attribute, False if it is either discrete or a leaf.

isDiscrete(self) Returns True if it makes its decision based on a discrete attribute, False if it is continuous or a leaf.

isLeaf(self) Returns True if it is a leaf node, False otherwise.

prob(self) Returns the distribution over the categories of the training examples that went through this node - if this is a leaf it's likely to be non-zero for just one category. Represented as a dictionary, category -> weight, that only includes entries if they are not 0. The weights are the sums of the weights of the inputs, and are not normalised.

size(self) Returns how many nodes make up the tree.
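To show how the accessors above fit together, here is a sketch of descending a tree by hand - presumably close to what classify does internally. int_vec and real_vec are the discrete and continuous feature vectors of a single example.

```python
def descend(node, int_vec, real_vec):
    """Walk from node towards a leaf and return its category -> weight dictionary."""
    while not node.isLeaf():
        if node.isContinuous():
            # Continuous decision: below the threshold goes low, >= goes high.
            if real_vec[node.getIndex()] < node.getThreshold():
                node = node.getLow()
            else:
                node = node.getHigh()
        else:
            # Discrete decision: follow the child for this attribute value; unseen
            # values have no child, so fall back to this node's stored distribution.
            child = node.getChildren().get(int_vec[node.getIndex()])
            if child is None:
                break
            node = child
    return node.prob()
```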
