LogonModeling

StephanOepen edited this page Dec 1, 2008 · 25 revisions

Overview

This page contains various code examples showing how to estimate and apply statistical models within LOGON. For more detailed information on feature types, estimation parameters, or the experimentation environment in general, see [http://www.velldal.net/erik/pubs/Velldal08.pdf Velldal (2008)].

Discriminative Modeling

In the following, we assume that 'generation treebanks' for the LOGON JHPSTG and Rondane corpora are available. For the HandOn release (of November 2008) of the LOGON system, these treebanks can be installed into the lingo/redwoods/ directory from SVN; see the LogonExtras page for instructions on how to install add-on LOGON components. In principle, though, the instructions below should apply equally to other Redwoods-style treebanks.

We further assume that the complete LOGON system and the correct grammar (the ERG from lingo/redwoods/erg/, in our case) are already loaded.

Set the feature parameters. The system defaults correspond to:

  (let ((*feature-grandparenting* 4)
        (*feature-active-edges-p* t)
        (*feature-ngram-size* 4)
        (*feature-ngram-back-off-p* t)
        (*feature-ngram-tag* :type)
        (*feature-use-preterminal-types-p* t)
        (*feature-lexicalization-p* t)
        (*feature-constituent-weight* 2)
        (*feature-lm-p* 10)
        (*feature-frequency-threshold* nil))

    ...)
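
The let form above works because, as the earmuff naming suggests, the feature parameters are presumably special (dynamically scoped) variables: a binding made by let is visible to every function called inside its body and is undone on exit. The following is a minimal, self-contained sketch of that mechanism only; *demo-ngram-size* and demo-current-ngram-size are hypothetical names and not part of LOGON.

```lisp
;; Hypothetical stand-in for a LOGON feature parameter (not a real LOGON variable).
(defvar *demo-ngram-size* 4)

(defun demo-current-ngram-size ()
  "Return whatever value the parameter has at the time of the call."
  *demo-ngram-size*)

;; Inside the LET body, called functions see the rebound value ...
(let ((*demo-ngram-size* 2))
  (demo-current-ngram-size))   ; => 2

;; ... and outside the body the global value is back in effect.
(demo-current-ngram-size)      ; => 4
```

In practice this means that calls such as the training and ranking functions shown further down need to sit inside the body of the let (the ... above) in order to see non-default settings.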

Create a feature cache for the (virtual) profile jhpstg.g (we typically use the .g suffix for generation treebanks):

  (setq gold "jhpstg.g")
  (operate-on-profiles (list gold) :task :fc)

Intended as a one-time operation, the feature caching extracts all the features from the treebank and stores them in a (Berkeley DB) database within the respective profile directory (named fc.bdb). When running experiments later, this means that we simply look up the features in the DB, saving us the cost of extraction. A symbol table named fc.mlm (also created within the jhpstg.g profile for the example above) records the mapping from symbolic feature representations to numerical indexes (as used for model estimation and DB storage). The symbol table is only referenced when exporting or applying a model to new data (see the example below), but it can also be useful to inspect manually, e.g. to confirm that features have the correct form, plausible counts, plausible value ranges, etc.
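
To make the role of the symbol table more concrete, here is a rough sketch of the general idea of mapping symbolic features to numerical indexes. It is not the LOGON implementation and says nothing about the actual fc.mlm file format; the demo- names and the feature shapes used below are made up for illustration.

```lisp
;; Illustration only: a tiny symbol table, not the LOGON one.
(defstruct demo-symbol-table
  (forward (make-hash-table :test #'equal)) ; symbolic feature -> index
  (count 0))                                ; next free index

(defun demo-intern-feature (table feature)
  "Return the numeric index of FEATURE, assigning the next free index if it is new."
  (let ((forward (demo-symbol-table-forward table)))
    (or (gethash feature forward)
        (let ((index (demo-symbol-table-count table)))
          (setf (gethash feature forward) index)
          (incf (demo-symbol-table-count table))
          index))))

;; The same symbolic feature always maps to the same index.
(let ((table (make-demo-symbol-table)))
  (list (demo-intern-feature table '(:demo "a" "b"))
        (demo-intern-feature table '(:demo "c"))
        (demo-intern-feature table '(:demo "a" "b"))))
;; => (0 1 0)
```

Estimation then only needs to deal with the integer indexes; the symbolic side of the mapping is needed again when a model is exported or applied to new data.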

Example of how to run a single experiment using 5-fold cross-validation:

  (setq test "jhpstg.t")
  (tsdb :create test :skeleton "jhpstg")
  (rank-profile gold test :nfold 5)
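
For readers unfamiliar with n-fold cross-validation, the sketch below shows one simple way of splitting a list of items into training and test portions for a given fold. It assumes a round-robin assignment of items to folds, which may well differ from how rank-profile() actually partitions the data; demo-nfold-split is a hypothetical helper, not a LOGON function.

```lisp
;; Illustration only: round-robin fold assignment (an assumption, see above).
(defun demo-nfold-split (items nfold fold)
  "Return two values: the training items and the test items for fold FOLD
(0-based) out of NFOLD folds."
  (loop for item in items
        for i from 0
        if (= (mod i nfold) fold)
          collect item into test
        else
          collect item into train
        finally (return (values train test))))

;; Fold 0 of 5 over ten items: every fifth item is held out for testing.
(multiple-value-list (demo-nfold-split '(1 2 3 4 5 6 7 8 9 10) 5 0))
;; => ((2 3 4 5 7 8 9 10) (1 6))
```

Each item ends up in the test portion of exactly one fold, so with :nfold 5 every model is trained on roughly four fifths of the data and tested on the remaining fifth.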

Running a batch of 10-fold MaxEnt experiments on jhpstg.g, iterating over different configurations of features and estimation parameters (the top-level function batch-experiment() performs an exhaustive 'grid search' over all combinations of specified parameter values):

  (batch-experiment
   :type :mem
   :variance '(nil 1000 100 10 1 1.0e-1 1.0e-2)
   :absolute-tolerance 1.0e-10
   :source "jhpstg.g"
   :skeleton "jhpstg"
   :random-sample-size nil
   :ngram-size '(0 1 2 3)
   :active-edges-p nil
   :grandparenting '(0 1 2 3)
   :lm-p 10
   :counts-relevant 1
   :nfold 10
   :compact nil)
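
As a rough picture of what an exhaustive grid search amounts to, the sketch below enumerates all combinations of a set of parameter value lists. It is a generic illustration and not the code behind batch-experiment(); demo-grid is a hypothetical helper.

```lisp
;; Illustration only: enumerate the Cartesian product of parameter values.
(defun demo-grid (parameters)
  "PARAMETERS is a list of (keyword value ...) lists.
Return every combination of values as a property list."
  (if (null parameters)
      (list '())
      (destructuring-bind (key &rest choices) (first parameters)
        (let ((tails (demo-grid (rest parameters))))
          (loop for choice in choices
                append (loop for tail in tails
                             collect (list* key choice tail)))))))

(demo-grid '((:ngram-size 0 1) (:grandparenting 0 1)))
;; => ((:ngram-size 0 :grandparenting 0) (:ngram-size 0 :grandparenting 1)
;;     (:ngram-size 1 :grandparenting 0) (:ngram-size 1 :grandparenting 1))
```

If the three list-valued arguments in the call above are combined in this way, the batch covers 7 x 4 x 4 = 112 configurations (seven :variance values, four :ngram-size values, four :grandparenting values), each evaluated with 10-fold cross-validation, so the total running time can be substantial.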

The following gives a brief explanation of the various keyword arguments. The :variance parameter governs the Gaussian prior on feature weights; :absolute-tolerance governs the convergence threshold. Specifying a non-nil (integer) value n for :random-sample-size means that only a random selection of (maximally) n non-preferred candidates for each item is included in the training data. The parameter :counts-relevant governs a frequency-based cutoff on feature values. The keywords :ngram-size, :active-edges-p, and :grandparenting allow iteration over feature parameters. Note that specifying :lm-p 10 means that the value of the language model feature is divided by 10; this is basically a hack to avoid numerical problems during estimation. To leave out the LM feature, call with :lm-p nil instead. Specifying :type :mem means that we are training a conditional maximum entropy model (aka log-linear model). The value of :type could also be :svm if you have SVMlight installed (it is currently not part of the LOGON distribution). The boolean-valued :compact governs the naming convention when creating target profiles: if the profile names for the 10-fold cross-validation experiments look excessively long (or even cause issues with OS-imposed limits on the total length of pathnames), try t as the :compact value.
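
For reference, the textbook form of a conditional maximum entropy model with a Gaussian prior is sketched below. The notation is ours and not taken from the LOGON sources: s is an input item, r a candidate realization, Y(s) the set of candidates for s, f_i the feature functions, and lambda_i their weights. Reading the :variance value as sigma^2, and nil as 'no prior', is an assumption; see Velldal (2008) for the authoritative description.

```latex
% Conditional model over the candidates Y(s) of an item s
p_\lambda(r \mid s) =
  \frac{\exp \sum_i \lambda_i f_i(s, r)}
       {\sum_{r' \in Y(s)} \exp \sum_i \lambda_i f_i(s, r')}

% Penalized log-likelihood over training pairs (s_j, r_j);
% the second term is the (log of the) zero-mean Gaussian prior
L(\lambda) =
  \sum_j \log p_\lambda(r_j \mid s_j)
  \; - \; \sum_i \frac{\lambda_i^2}{2\sigma^2}
```

Under this reading, a smaller variance pulls the weights more strongly towards zero (more smoothing), while a very large variance approaches unregularized maximum likelihood estimation.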

Example of how to estimate and export a MaxEnt model:

  (let ((*feature-grandparenting* 3)
        (*feature-ngram-size* 3)
        (*feature-lm-p* nil)
        (*maxent-variance* 8e-4)
        (*feature-frequency-threshold* (make-counts :relevant 1)))
    (train "jhpstg.g" "jhpstg.g.mem" :fcp nil :type :mem))

This writes the estimated model to jhpstg.g.mem (for more information on the format, see Chapter 6 of [http://www.velldal.net/erik/pubs/Velldal08.pdf Velldal (2008)]). The keyword argument :fcp nil means that we do not want to create a feature cache, but rather use the one we already have.

Applying the model trained above to the generation treebank rondane.g:

  (tsdb :create "rondane.t" :skeleton "rondane")

  (operate-on-profiles
    (list "rondane.g") :model (read-model "jhpstg.g.mem")
    :target "rondane.t" :task :rank)
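
Conceptually, ranking with a log-linear model comes down to scoring each candidate by the weighted sum of its feature values and sorting the candidates by score. The sketch below illustrates just that; it is not what operate-on-profiles() does internally, and demo-score and demo-rank are hypothetical helpers operating on made-up data structures.

```lisp
;; Illustration only: scoring and ranking candidates with a linear model.
(defun demo-score (weights features)
  "WEIGHTS is a hash table from feature index to weight.
FEATURES is an alist of (index . value). Unknown features contribute 0."
  (loop for (index . value) in features
        sum (* (gethash index weights 0.0) value)))

(defun demo-rank (weights candidates)
  "CANDIDATES is an alist of (id . features).
Return the candidate ids, best-scoring first."
  (mapcar #'car
          (sort (loop for (id . features) in candidates
                      collect (cons id (demo-score weights features)))
                #'> :key #'cdr)))

(let ((weights (make-hash-table)))
  (setf (gethash 0 weights) 1.5
        (gethash 1 weights) -0.5)
  (demo-rank weights '((:a . ((0 . 1) (1 . 2)))   ; score 0.5
                       (:b . ((0 . 2))))))        ; score 3.0
;; => (:B :A)
```

Since the normalization term of the model is the same for all candidates of a given item, comparing the unnormalized scores is sufficient for ranking.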