Skip to content

Latest commit

 

History

History
154 lines (116 loc) · 5.9 KB

README.md

File metadata and controls

154 lines (116 loc) · 5.9 KB

Perceptronix: Linear model template library for C++

This library provides C++ classes for linear classifiers for binary-valued (i.e., nominal, "one-hot") features, such as encountered in natural language processing. Models are trained from labeled examples using the perceptron learning algorithm with or without weight averaging. The former produces an informal form of L1 normalization (with all else held equal, it favors zero weights over non-zero weights, as weights are initialized to zero and a large margin is not part of the objective) whereas the latter produces an informal form of L2 normalization (with all else held equal, it prefers weights with smaller magnitudes, as weights are initialized to zero). Both binomial and multinomial classifiers are supported. Weights may be stored either in sparse (string-keyed hash table) or dense (integer-keyed array) containers. Models may be serialized and deserialized using protocol buffers. Users may specify a large margin constant "C".

API

C++

The normal workflow is to create a ...Model object, train it, then average it, creating an immutable, serializable object, E.g.:

  ::perceptronix::SparseDenseMultinomialModel model(nfeats, nlabels, c);
  // ... training ...
  model.Average();
  // ... inference ...

The major classes are:

  • DenseBinomialModel: A binomial classifier using a dense weight table; users must set the maximum number of unique features at construction time.
  • SparseBinomialModel: A binomial classifier using a dynamically-sized sparse weight table.
  • DenseMultinomialModel: A multinomial classifier using dense weight tables; users must set the maximum number of unique features and labels at construction time.
  • SparseDenseMultinomialModel: A multinomial classifier using a dynamically-sized hash table for the outer table and dense arrays for the inner table.
  • SparseMultinomialModel: A multinomial classifier using dynamically-sized hash tables for both outer and inner tables.

There are also HMM-like classes for greedy sequential decoding:

  • SparseBinomialSequentialModel
  • SparseDenseMultinomialSequentialModel
  • SparseMultinomialSequentialModel

Python

Each of the above classifiers has a correspondent type in the Python API. Take a look at apps/sentence_tokenizer/model.py for a worked example.

Design

Weights

Weight averaging is done using a lazy strategy. An averaged weight consists of:

  • The true weight (weight_), used for inference during training, and updated using the perceptron learning rule
  • The summed weight (summed_weight_), which may or may not be "fresh"
  • A timestamp (time_) indicating when the averaged weight was last "freshened"

At time, the time elapsed since the summed weight was last freshened is given by time - time_.

The summed weight is freshened by adding to it the true weight multiplied by the time elapsed since timestamp.

At averaging time, weights are freshened and then the summed weight is divided by the current time.

Tables

The header table.h wraps the two types of weight containers so that they provide a common interface for access, mutation, and iteration.

The resulting classes are then used as template template parameters (no, that's not a typo) to the classifiers themselves.

Virtual dispatch is not used anywhere, so there is no performance penalty for this polymorphism.

Labels and features

Labels and features are to either either unsigned integers (for dense tables) or strings (for sparse tables), and therefore are passed by value in both cases. In the latter case, it is assumed that the strings are short so that we can take advantage of "short string" optimizations.

There is a built-in bias term, so users do not need to provide one.

For dense tables, passing a feature integer greater than or equal to nfeats is Undefined Behavior.

For sparse tables, the empty string is reserved for internal use, and attempting to use it as either a label or feature is Undefined Behavior.

Performance

Instead of the naïve strategy of treating labels as the outer indices and features as the inner indices, this library uses outer feature indices and inner label indices in the multinomial case. This allow sparse multinomial models to perform particularly well in the very common case that most features have zero weights for all labels; the cost of inference for an "irrelevant" feature---one which has zero weights for all labels, or one which has a non-zero weight for some labels but not the one under consideration---is merely that of a hash table miss.

When you have sparse features, but you know the maximum number of labels at construction time (and this number is small, say, less than 64), consider using SparseDenseMultinomialModel rather than SparseMultinomialModel. This model uses std::valarray for inner weight tables, which should allow the "dot product" during scoring to use fast vectorized arithmetic operations.

If you do use SparseMultinomialModel, consider tuning the nlabels constructor argument, which is used to reserve memory for inner tables.

Dependencies

The library depends on C++14 and protobuf 3.0 or greater. It should compile with g++ or clang++. Users may also need the protobuf compilation tool protoc, which is sometimes distributed separately from protobuf itself.

The Python wrapper and applications are written for Python 3.7+. Compiling the wrapper requires Cython, and the applications require the nlup and regex Python libraries, all of which are available from PyPI.

Author

Kyle Gorman

License

See LICENSE.