Coals
Coals is an algorithm that uses a collection of documents to construct a semantic space. The algorithm constructs a word-by-word matrix where each element represents how frequently word_i occurs with word_j. The matrix is then normalized by correlation; any negative values are set to zero, and all other values are replaced by their square roots. Then, optionally, the word co-occurrence matrix M is reduced using the [Singular Value Decomposition](/fozziethebeat/S-Space/wiki/SingularValueDecomposition), and the resulting U matrix is retained as the definitive word space.
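The normalization step described above can be sketched in a few lines of Java. This is a minimal illustration on a small dense matrix, not the code in Coals.java: the class name `CoalsNormalizationSketch` and the method `normalize` are hypothetical. The correlation formula is an assumption based on the Pearson-style normalization in the Rohde et al. paper: with T the sum of all counts, each entry w_ab becomes (T * w_ab - row_a * col_b) / sqrt(row_a * (T - row_a) * col_b * (T - col_b)).

```java
import java.util.Arrays;

/** Hypothetical sketch of the Coals normalization step; not the S-Space implementation. */
final class CoalsNormalizationSketch {

    /**
     * Returns a new matrix in which each count is replaced by its correlation
     * with the row and column totals. Negative correlations become 0 and
     * positive correlations are replaced by their square root.
     */
    static double[][] normalize(double[][] counts) {
        int rows = counts.length;
        int cols = rows == 0 ? 0 : counts[0].length;

        // Accumulate row totals, column totals, and the grand total T.
        double[] rowSums = new double[rows];
        double[] colSums = new double[cols];
        double total = 0;
        for (int i = 0; i < rows; i++) {
            for (int j = 0; j < cols; j++) {
                rowSums[i] += counts[i][j];
                colSums[j] += counts[i][j];
                total += counts[i][j];
            }
        }

        double[][] result = new double[rows][cols];
        for (int i = 0; i < rows; i++) {
            for (int j = 0; j < cols; j++) {
                // Assumed correlation formula (see the lead-in above).
                double numerator = total * counts[i][j] - rowSums[i] * colSums[j];
                double denominator = Math.sqrt(
                        rowSums[i] * (total - rowSums[i])
                        * colSums[j] * (total - colSums[j]));
                double correlation = denominator == 0 ? 0 : numerator / denominator;
                // Negative values are set to zero; the rest are square-rooted.
                result[i][j] = correlation > 0 ? Math.sqrt(correlation) : 0;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        double[][] counts = {{4, 0}, {0, 4}};
        // prints [[1.0, 0.0], [0.0, 1.0]]
        System.out.println(Arrays.deepToString(normalize(counts)));
    }
}
```

In this toy input the two words never co-occur with each other, so the off-diagonal correlations are negative and are clamped to zero. A real implementation would work on a sparse matrix rather than a dense array.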
For more information on Coals, the following paper is the central resource:
- D. L. T. Rohde, L. M. Gonnerman, D. C. Plaut, "An Improved Model of Semantic Similarity Based on Lexical Co-Occurrence." Cognitive Science
The current S-Space implementation of Coals is captured in two files. `Coals.java` contains all of the algorithmic implementation and is suitable for use in other code as a library. `CoalsMain.java` is a command-line invokable version of Coals that uses the `Coals` class. This class is provided as `coals.jar` in the release packages.
Coals requires use of the SingularValueDecomposition interface.
Coals can be invoked either using `java edu.ucla.sspace.mains.CoalsMain` or through the jar release with `java -jar coals.jar`. Both ways are equivalent.
We provide the following options for changing the behavior of Coals. Standard options can be found here.
- `-s | --reducedDimension <int>`
  Set the number of dimensions to reduce to using the Singular Value Decomposition. This is used only if `--reduce` is set.
- `-n | --dimensions <int>`
  Set the number of columns to keep in the raw co-occurrence matrix.
- `-m | --maxWords <int>`
  Set the maximum number of words to keep in the space, ordered by frequency.
- `-r | --reduce`
  Set if the co-occurrence matrix should be reduced using the Singular Value Decomposition.
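As an illustration of what "keep the maximum number of words, ordered by frequency" means for `--maxWords`, the following sketch selects the most frequent words from a map of counts. It is not taken from CoalsMain: the class `MaxWordsSketch` and the method `mostFrequent` are hypothetical, and breaking ties alphabetically is an assumption made only to keep the result deterministic.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of frequency-based vocabulary truncation. */
final class MaxWordsSketch {

    /** Returns up to maxWords words, most frequent first (ties broken alphabetically). */
    static List<String> mostFrequent(Map<String, Integer> counts, int maxWords) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            // Higher counts come first.
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });

        int limit = Math.max(0, Math.min(maxWords, entries.size()));
        List<String> kept = new ArrayList<>(limit);
        for (int i = 0; i < limit; i++) {
            kept.add(entries.get(i).getKey());
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("mat", 1);
        counts.put("the", 10);
        counts.put("sat", 3);
        counts.put("cat", 3);
        // prints [the, cat, sat]
        System.out.println(mostFrequent(counts, 3));
    }
}
```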
The program will then produce a file that contains the entire semantic space. Each line in the file is formatted as follows:
`word name|value-1 value-2 ... value-N`
where N is the number of dimensions in the semantic space.
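A line in this format can be read back with a few lines of Java. The sketch below is not part of S-Space: the class `SemanticSpaceLineSketch` and its `parse` method are hypothetical. It assumes that the word is separated from its values by the last `|` on the line and that the values are whitespace-separated decimal numbers.

```java
/** Hypothetical holder for one parsed line of the output file. */
final class SemanticSpaceLineSketch {
    final String word;
    final double[] vector;

    SemanticSpaceLineSketch(String word, double[] vector) {
        this.word = word;
        this.vector = vector;
    }

    /** Parses one "word|v1 v2 ... vN" line into a word and its vector. */
    static SemanticSpaceLineSketch parse(String line) {
        // Assumption: the last '|' separates the word from its values.
        int separator = line.lastIndexOf('|');
        if (separator < 0) {
            throw new IllegalArgumentException("missing '|' separator: " + line);
        }
        String word = line.substring(0, separator);
        String values = line.substring(separator + 1).trim();

        // An empty value list yields a zero-length vector.
        String[] tokens = values.isEmpty() ? new String[0] : values.split("\\s+");
        double[] vector = new double[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            vector[i] = Double.parseDouble(tokens[i]);
        }
        return new SemanticSpaceLineSketch(word, vector);
    }

    public static void main(String[] args) {
        SemanticSpaceLineSketch entry = parse("cat|0.5 0.25 0");
        // prints cat has 3 dimensions
        System.out.println(entry.word + " has " + entry.vector.length + " dimensions");
    }
}
```

Here N is simply the length of the parsed vector, so every line of a well-formed file should produce vectors of the same length.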
- We are grateful to Doug Rohde for making the SVDLIBC program freely available.