
Tokenizing Documents

Introduction

Word space algorithms are fundamentally dependent on the tokens they process. The S-Space Package provides several methods to automatically change the tokens in a corpus before they are processed by a semantic space algorithm. This removes the need to retain multiple versions of the same corpus, each tokenized differently, and it allows the semantic space algorithms to be used off-the-shelf, with no code changes needed to control how tokens are processed.

In general, all options are supported by way of the [IteratorFactory](http://fozziethebeat.github.com/S-Space/apidocs/edu/ucla/sspace/text/IteratorFactory.html) class, which is used by all of the semantic space algorithms to tokenize the input documents.

This class exposes several properties that affect how documents are tokenized. Note that many of the executable .jar files for the algorithms provide shortened versions of these properties on the command line, e.g. -F for token filtering, which internally set these properties.
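As a minimal sketch of the entry point (assuming the static IteratorFactory.tokenize(BufferedReader) method described in the Javadoc linked above; the class name here is only illustrative), a document can be tokenized directly:

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.Iterator;

import edu.ucla.sspace.text.IteratorFactory;

public class TokenizeExample {
    public static void main(String[] args) throws Exception {
        // With no properties set, the document is split on whitespace.
        BufferedReader doc = new BufferedReader(
            new StringReader("The white house issued a statement."));
        Iterator<String> tokens = IteratorFactory.tokenize(doc);
        while (tokens.hasNext())
            System.out.println(tokens.next());
    }
}
```

The sketches in the sections below all follow the same pattern: set the relevant property, pass the properties to IteratorFactory.setProperties (also assumed from the Javadoc), and then tokenize as usual.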

Compound Tokens

A compound token exists when multiple tokens are joined as a single unit. Examples of common compound tokens are [collocations](http://en.wikipedia.org/wiki/Collocation), [clichés](http://en.wikipedia.org/wiki/Clich%C3%A9), or [compound nouns](http://en.wikipedia.org/wiki/Compound_noun,_adjective_and_verb). In the normal processing of a document, tokens are broken up by white space. However, users may specify that certain token sequences, e.g. "white house", should be grouped as a single token. This functionality is provided by setting the edu.ucla.sspace.text.IteratorFactory.compoundTokens property to point to a file of compound tokens, one per line.
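For instance, the following sketch (the file contents are illustrative, and it assumes the setProperties/tokenize methods from the Javadoc) makes "white house" come through as a single token:

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.Iterator;
import java.util.Properties;

import edu.ucla.sspace.text.IteratorFactory;

public class CompoundTokenExample {
    public static void main(String[] args) throws Exception {
        // The compound-token file lists one compound per line.
        Path compounds = Files.createTempFile("compounds", ".txt");
        Files.write(compounds, Arrays.asList("white house", "ice cream"));

        Properties props = System.getProperties();
        props.setProperty(
            "edu.ucla.sspace.text.IteratorFactory.compoundTokens",
            compounds.toString());
        IteratorFactory.setProperties(props);  // assumed API; see Javadoc

        BufferedReader doc = new BufferedReader(
            new StringReader("the white house announced a plan"));
        Iterator<String> tokens = IteratorFactory.tokenize(doc);
        while (tokens.hasNext())
            System.out.println(tokens.next());  // "white house" as one token
    }
}
```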

Filtering

Often when processing documents, certain tokens may need to be excluded. For example, a user may want to remove frequent [closed classes](http://en.wikipedia.org/wiki/Closed_class) of words such as articles and prepositions (commonly known as stop words), or possibly all punctuation. The S-Space Package supports removing these tokens via the TokenFilter class and the related FilteredIterator class.

Token filter configurations are specified as a comma-separated list of file names, where each file name has an optional prefix, include or exclude, which specifies whether the file's tokens form an inclusive or an exclusive filter. The default is include. An example configuration might look like:

--tokenFilter=include=english-dictionary.txt,exclude=stop-list.txt

When running an algorithm, this filter may be set using the edu.ucla.sspace.text.IteratorFactory.tokenFilter property.
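Programmatically, the same configuration as the command-line example above could be set as follows (a sketch; the file names are placeholders and setProperties is assumed as before):

```java
import java.util.Properties;

import edu.ucla.sspace.text.IteratorFactory;

public class TokenFilterExample {
    public static void main(String[] args) {
        // Keep only tokens listed in english-dictionary.txt, then drop
        // any that also appear in stop-list.txt (placeholder file names).
        Properties props = System.getProperties();
        props.setProperty(
            "edu.ucla.sspace.text.IteratorFactory.tokenFilter",
            "include=english-dictionary.txt,exclude=stop-list.txt");
        IteratorFactory.setProperties(props);  // assumed API; see Javadoc
    }
}
```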

Stemming

Tokens may also be [stemmed](http://en.wikipedia.org/wiki/Stemming). This process attempts to remove morphological variations of a word to reduce it to a base form. For example, "fishing," "fishes," and "fished" might all be reduced to the base form "fish." Stemming is often used to simplify term co-occurrence statistics and has other applications in information retrieval. However, not all algorithms perform better when stemming is enabled, since it conflates multiple word meanings into a single token (i.e. increases [polysemy](http://en.wikipedia.org/wiki/Polysemy)). The S-Space Package provides wrappers for the [Snowball stemmers](http://snowball.tartarus.org/).

Users may enable stemming by setting the edu.ucla.sspace.text.IteratorFactory.stem property to a non-null value. The value passed via this property is expected to be a fully qualified class name, such as edu.ucla.sspace.text.EnglishStemmer, which will be dynamically loaded.
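A sketch of both uses follows; it assumes EnglishStemmer implements a Stemmer interface with a stem(String) method (check the Javadoc for the exact signature):

```java
import java.util.Properties;

import edu.ucla.sspace.text.EnglishStemmer;
import edu.ucla.sspace.text.IteratorFactory;
import edu.ucla.sspace.text.Stemmer;

public class StemmingExample {
    public static void main(String[] args) {
        // Have IteratorFactory stem every token it produces.
        Properties props = System.getProperties();
        props.setProperty(
            "edu.ucla.sspace.text.IteratorFactory.stem",
            "edu.ucla.sspace.text.EnglishStemmer");
        IteratorFactory.setProperties(props);  // assumed API; see Javadoc

        // The stemmer can also be applied directly (assumed interface).
        Stemmer stemmer = new EnglishStemmer();
        System.out.println(stemmer.stem("fishing"));  // expected: "fish"
    }
}
```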

Document Token Limiting

Documents often contain hundreds or thousands of tokens, and for some applications it is not practical to process an entire document. A user may therefore artificially limit the number of tokens processed by setting the edu.ucla.sspace.text.IteratorFactory.tokenCountLimit property to the desired maximum number of tokens per document.
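For example, to cap processing at the first 500 tokens of each document (a sketch using the same assumed setProperties pattern as above):

```java
import java.util.Properties;

import edu.ucla.sspace.text.IteratorFactory;

public class TokenLimitExample {
    public static void main(String[] args) {
        // Emit at most the first 500 tokens of each document.
        Properties props = System.getProperties();
        props.setProperty(
            "edu.ucla.sspace.text.IteratorFactory.tokenCountLimit", "500");
        IteratorFactory.setProperties(props);  // assumed API; see Javadoc
    }
}
```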