fozziethebeat edited this page Oct 26, 2011 · 5 revisions

Correlated Occurrence Analogue to Lexical Semantics

Introduction

Coals is an algorithm that uses a collection of documents to construct a semantic space. The algorithm constructs a word-by-word matrix where each element represents how frequently word_i occurs with word_j. The matrix is then normalized by correlation: any negative values are set to zero, and all other values are replaced by their square roots. Then, optionally, the word co-occurrence matrix M is reduced using the [Singular Value Decomposition](/fozziethebeat/S-Space/wiki/SingularValueDecomposition), and the U matrix is retained as the definitive word space.
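The normalization step above can be sketched in a few lines of Java. This is only an illustration, not the code in Coals.java: the class name `CoalsNormalizationSketch` is hypothetical, the matrix is a plain dense array, and the correlation formula is an assumption based on the Pearson-style form given in the Rohde et al. paper, where T is the sum of all counts and R_i and C_j are the row and column sums.

```java
import java.util.Arrays;

/** Hypothetical helper, not part of S-Space: Coals-style normalization of a dense count matrix. */
final class CoalsNormalizationSketch {

    /**
     * Returns a new matrix in which each count is replaced by its correlation,
     * negative correlations are set to 0, and positive correlations are
     * replaced by their square root.
     */
    static double[][] normalize(double[][] counts) {
        int rows = counts.length;
        int cols = rows == 0 ? 0 : counts[0].length;
        double[] rowSums = new double[rows];
        double[] colSums = new double[cols];
        double total = 0;
        for (int i = 0; i < rows; i++) {
            for (int j = 0; j < cols; j++) {
                rowSums[i] += counts[i][j];
                colSums[j] += counts[i][j];
                total += counts[i][j];
            }
        }

        double[][] result = new double[rows][cols];
        for (int i = 0; i < rows; i++) {
            for (int j = 0; j < cols; j++) {
                // Assumed formula: (T*w - R_i*C_j) / sqrt(R_i*(T - R_i)*C_j*(T - C_j))
                double numerator = total * counts[i][j] - rowSums[i] * colSums[j];
                double denominator = Math.sqrt(rowSums[i] * (total - rowSums[i])
                        * colSums[j] * (total - colSums[j]));
                // An undefined correlation (zero denominator) is treated as 0 in this sketch.
                double correlation = denominator == 0 ? 0 : numerator / denominator;
                result[i][j] = correlation > 0 ? Math.sqrt(correlation) : 0;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        double[][] counts = {{4, 0}, {0, 4}};
        System.out.println(Arrays.deepToString(normalize(counts)));
    }
}
```

Words that co-occur less often than chance would predict end up with a value of zero, which keeps the resulting matrix sparse before any optional reduction.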

For more information on Coals, the following paper is the central resource:

  • D. L. T. Rohde, L. M. Gonnerman, D. C. Plaut, "An Improved Model of Semantic Similarity Based on Lexical Co-Occurrence." Cognitive Science

S-Space Implementation

The current S-Space implementation of Coals is captured in two files. Coals.java contains all of the algorithmic implementation and is suitable for use in other code as a library. CoalsMain.java is a command-line invokable version of Coals that uses the Coals class. The command-line version is provided as coals.jar in the release packages.

Software Requirements

Coals requires that a [Singular Value Decomposition](/fozziethebeat/S-Space/wiki/SingularValueDecomposition) method be installed.

Running Coals from the command-line

Coals can be invoked either using java edu.ucla.sspace.mains.CoalsMain or through the jar release java -jar coals.jar. Both ways are equivalent.

We provide the following options for changing the behavior of Coals. Standard options can be found here.

  • -s | --reducedDimension <int> Set the number of dimensions to reduce to using the Singular Value Decomposition. This is used only if --reduce is set.
  • -n | --dimensions <int> Set the number of columns to keep in the raw co-occurrence matrix.
  • -m | --maxWords <int> Set the maximum number of words to keep in the space, ordered by frequency.
  • -r | --reduce Set to true if the co-occurrence matrix should be reduced using the Singular Value Decomposition.
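To make the effect of --maxWords more concrete, the following sketch shows one way to keep only the most frequent words. It is not taken from Coals.java: the class `FrequentWordsSketch`, its method name, and the alphabetical tie-breaking rule are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical helper, not part of S-Space: selects the most frequent words. */
final class FrequentWordsSketch {

    /** Returns at most maxWords words, most frequent first; ties are broken alphabetically. */
    static List<String> mostFrequent(Map<String, Integer> counts, int maxWords) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });

        List<String> kept = new ArrayList<>();
        for (int i = 0; i < entries.size() && i < maxWords; i++) {
            kept.add(entries.get(i).getKey());
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = Map.of("the", 10, "cat", 3, "dog", 3, "zebra", 1);
        System.out.println(mostFrequent(counts, 2));
    }
}
```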

The program will then produce a file that contains the entire semantic space. Each line in the file is formatted as follows:

word name|value-1 value-2 ... value-N

where N is the number of dimensions in the semantic space.
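A line in this format can be read back with a small parser. The sketch below is an illustration rather than part of S-Space: the class `SemanticSpaceLineSketch` is hypothetical, and splitting on the last `|` (so that a word containing `|` would still parse) is an assumption about how best to handle the separator.

```java
import java.util.AbstractMap;
import java.util.Arrays;
import java.util.Map;

/** Hypothetical helper, not part of S-Space: parses one "word|v1 v2 ... vN" line. */
final class SemanticSpaceLineSketch {

    static Map.Entry<String, double[]> parseLine(String line) {
        // Assumption: the last '|' separates the word from its vector.
        int bar = line.lastIndexOf('|');
        if (bar < 0) {
            throw new IllegalArgumentException("Missing '|' separator: " + line);
        }
        String word = line.substring(0, bar);
        String rest = line.substring(bar + 1).trim();

        // Values are separated by whitespace; an empty vector is allowed here.
        String[] tokens = rest.isEmpty() ? new String[0] : rest.split("\\s+");
        double[] values = new double[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            values[i] = Double.parseDouble(tokens[i]);
        }
        return new AbstractMap.SimpleImmutableEntry<>(word, values);
    }

    public static void main(String[] args) {
        Map.Entry<String, double[]> entry = parseLine("word name|0.5 1.0 -2");
        System.out.println(entry.getKey() + " -> " + Arrays.toString(entry.getValue()));
    }
}
```

The length of the returned array corresponds to N, the number of dimensions in the semantic space.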

Acknowledgments

  • We are grateful to Doug Rohde for making the SVDLIBC program freely available.