Coals

Correlated Occurrence Analogue to Lexical Semantics

Introduction

Coals is an algorithm that uses a collection of documents to construct a semantic space. The algorithm builds a word-by-word matrix in which each element represents how frequently word_i occurs with word_j. The matrix is then normalized by correlation; afterwards, any negative values are set to zero and every other value is replaced by its square root. Then, optionally, the word co-occurrence matrix M is reduced using the [Singular Value Decomposition](/fozziethebeat/S-Space/wiki/SingularValueDecomposition), and the resulting U matrix is retained as the definitive word space.

For more information on Coals, the following paper is the central resource:

D. L. T. Rohde, L. M. Gonnerman, D. C. Plaut, "An Improved Model of Semantic Similarity Based on Lexical Co-Occurrence." Cognitive Science
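The normalization steps described in the introduction can be sketched in a few lines of Java. This is only an illustration, not the code in Coals.java: the class name `CoalsNormalizationSketch` and the method `normalize` are hypothetical, the matrix is a plain dense `double[][]`, and the correlation formula is the one given in the Rohde et al. paper as best we can reproduce it (cell count times the grand total, minus the product of the row and column sums, divided by the square root of the product of those sums and their complements). Cells whose denominator is zero are assumed to map to zero.

```java
// Hypothetical sketch of the Coals normalization; not the S-Space implementation.
final class CoalsNormalizationSketch {

    /**
     * Returns a new matrix in which each raw co-occurrence count has been
     * replaced by its correlation value, negative correlations have been set
     * to zero, and positive correlations have been replaced by their square root.
     */
    static double[][] normalize(double[][] counts) {
        int rows = counts.length;
        int cols = counts[0].length;
        double[] rowSums = new double[rows];
        double[] colSums = new double[cols];
        double total = 0;

        // Accumulate the row sums, column sums, and grand total.
        for (int i = 0; i < rows; i++) {
            for (int j = 0; j < cols; j++) {
                rowSums[i] += counts[i][j];
                colSums[j] += counts[i][j];
                total += counts[i][j];
            }
        }

        double[][] result = new double[rows][cols];
        for (int i = 0; i < rows; i++) {
            for (int j = 0; j < cols; j++) {
                double numerator = total * counts[i][j] - rowSums[i] * colSums[j];
                double denominator = Math.sqrt(
                        rowSums[i] * (total - rowSums[i])
                        * colSums[j] * (total - colSums[j]));
                // Assumption: an undefined correlation (zero denominator) is treated as 0.
                double correlation = denominator == 0 ? 0 : numerator / denominator;
                // Negative values become zero; all others become their square root.
                result[i][j] = correlation > 0 ? Math.sqrt(correlation) : 0;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        double[][] normalized = normalize(new double[][] {{2, 0}, {0, 2}});
        for (double[] row : normalized) {
            System.out.println(java.util.Arrays.toString(row));
        }
    }
}
```

A dense array keeps the sketch short; a real co-occurrence matrix over a large vocabulary would normally need a sparse representation.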

S-Space Implementation

The current S-Space implementation of Coals is captured in two files. Coals.java contains all of the algorithmic implementation and is suitable for use in other code as a library. CoalsMain.java is a command-line invokable version of Coals that uses the Coals class. CoalsMain is provided as coals.jar in the release packages.

Software Requirements

Coals requires use of the SingularValueDecomposition interface.

Running Coals from the command-line

Coals can be invoked either using java edu.ucla.sspace.mains.CoalsMain or through the jar release java -jar coals.jar. Both ways are equivalent.

We provide the following options for changing the behavior of Coals. Standard options can be found here

-s | --reducedDimension <int> Set the number of dimensions to reduce to using the Singular Value Decomposition. This is used only if --reduce is set.
-n | --dimensions <int> Set the number of columns to keep in the raw co-occurrence matrix.
-m | --maxWords <int> Set the maximum number of words to keep in the space, ordered by frequency.
-r | --reduce Set to true if the co-occurrence matrix should be reduced using the Singular Value Decomposition.

The program will then produce a file that contains the entire semantic space. Each line in the file is formatted as follows:

word name|value-1 value-2 ... value-N where N is the number of dimensions in the semantic space.
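As a rough illustration of how a line in this format could be read back, the following Java sketch splits one line into its word and its vector. The class `SemanticSpaceLineSketch` and its methods are hypothetical and are not part of S-Space. The sketch assumes the line follows the format above exactly, with the values separated by whitespace; any header or metadata lines the output file might contain would need to be skipped before calling it.

```java
// Hypothetical helper for reading one "word|value-1 value-2 ... value-N" line.
final class SemanticSpaceLineSketch {

    /** Returns the word, i.e. everything before the last '|' on the line. */
    static String wordOf(String line) {
        // The vector part never contains '|', so the last '|' is the separator
        // even if the word itself happens to contain one.
        return line.substring(0, line.lastIndexOf('|'));
    }

    /** Returns the N values after the '|' as a double array. */
    static double[] vectorOf(String line) {
        String[] tokens = line.substring(line.lastIndexOf('|') + 1)
                              .trim()
                              .split("\\s+");
        double[] vector = new double[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            vector[i] = Double.parseDouble(tokens[i]);
        }
        return vector;
    }

    public static void main(String[] args) {
        String line = "cat|0.5 0.0 1.25";
        System.out.println(wordOf(line) + " has "
                + vectorOf(line).length + " dimensions");
    }
}
```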

Acknowledgments

We are grateful to Doug Rohde for making the SVDLIBC program freely available.
