
LatentSemanticAnalysis

Brice Thomas edited this page Mar 31, 2015 · 4 revisions

Latent Semantic Analysis

Introduction

Latent Semantic Analysis (LSA) is an algorithm that uses a collection of documents to construct a semantic space. The algorithm constructs a word-by-document matrix where each row corresponds to a unique word in the document corpus and each column corresponds to a document. The value at each position is the number of times the row's word occurs in the column's document. The [Singular Value Decomposition](/fozziethebeat/S-Space/wiki/SingularValueDecomposition) of the word-document matrix is then calculated, producing three matrices (UΣVᵀ): U, the word space; Σ, the singular values; and V, the document space. U is then truncated to a small number of columns (typically 300), which produces the final semantic vectors, one row per word.
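The counting step described above can be sketched in plain Java. This is only an illustration and not the S-Space implementation: the class name `TermDocumentCounts`, its methods, and the simple lowercase, letters-only tokenizer are all assumptions made for this example. The SVD and truncation steps are left out, since they rely on an external SVD method.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper for illustration; not part of S-Space.
public class TermDocumentCounts {

    /** Splits text into lowercase tokens made of the letters a-z (an assumed tokenizer). */
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Assigns each unique word a row index, in order of first appearance. */
    public static Map<String, Integer> buildVocabulary(List<String> documents) {
        Map<String, Integer> rows = new LinkedHashMap<>();
        for (String doc : documents) {
            for (String token : tokenize(doc)) {
                rows.putIfAbsent(token, rows.size());
            }
        }
        return rows;
    }

    /** Builds the word-by-document matrix: counts[row][col] = occurrences of word in document. */
    public static int[][] countMatrix(List<String> documents, Map<String, Integer> rows) {
        int[][] counts = new int[rows.size()][documents.size()];
        for (int col = 0; col < documents.size(); col++) {
            for (String token : tokenize(documents.get(col))) {
                counts[rows.get(token)][col]++;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("the cat sat on the mat", "the dog sat");
        Map<String, Integer> rows = buildVocabulary(docs);
        int[][] counts = countMatrix(docs, rows);
        // Print one line per word: the word followed by its count in each document.
        for (Map.Entry<String, Integer> e : rows.entrySet()) {
            System.out.println(e.getKey() + " " + Arrays.toString(counts[e.getValue()]));
        }
    }
}
```

Each row of `counts` is the raw count vector for one word. In LSA, this matrix (optionally reweighted) is what gets passed to the SVD.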

For more information on LSA, see the [Wikipedia page](http://en.wikipedia.org/wiki/Latent_semantic_analysis) on LSA. Also, the following papers give a good introduction to the uses of LSA:

S-Space Implementation

The current S-Space implementation of LSA is captured in two files. LatentSemanticAnalysis.java contains all of the algorithmic implementation, and is suitable for use in other code as a library. LSAMain.java is a command-line invokable version of LSA that uses the LatentSemanticAnalysis class.

Software Requirements

Our LSA implementation requires installation of a [Singular Value Decomposition](/fozziethebeat/S-Space/wiki/SingularValueDecomposition) method.

Running LSA from the command-line

LSA can be invoked either with java edu.ucla.sspace.mains.LSAMain or through the jar release with java -jar lsa.jar. The two are equivalent.

We provide the following options for changing the behavior of LSA. For standard options, see Mains.

  • LSA Options

    • -n, --dimensions <int> how many dimensions to use for the LSA vectors. See LatentSemanticAnalysis for the default value.
    • -p, --preprocess <class name> specifies an instance of a Transform to use in preprocessing the word-document matrix compiled by LSA prior to computing the SVD.
  • Advanced Options

    • -S, --svdAlgorithm The --svdAlgorithm option provides a way to manually specify which SVD algorithm should be used internally. This option should not normally be used, as LSA will select the fastest algorithm available. However, in the event that it is needed, valid options are: SVDLIBC, SVDLIBJ, MATLAB, OCTAVE, JAMA and COLT.
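To illustrate the kind of preprocessing a transform given with -p might apply to the word-document matrix before the SVD, here is a minimal sketch of log-entropy weighting, a scheme commonly used in the LSA literature. The class `LogEntropyWeighting` and its `transform` method are hypothetical names for this example; they do not implement the S-Space Transform interface, and nothing here should be read as the transform that S-Space applies by default.

```java
// Hypothetical standalone example; not an S-Space Transform implementation.
public class LogEntropyWeighting {

    /**
     * Returns a new matrix in which each count is replaced by
     *   log(1 + count_ij) * (1 + sum_j(p_ij * log(p_ij)) / log(numDocs)),
     * where p_ij = count_ij / (total occurrences of word i in the corpus).
     * Words spread evenly over all documents get a weight near 0;
     * words concentrated in a single document keep a global weight of 1.
     */
    public static double[][] transform(int[][] counts) {
        int numWords = counts.length;
        int numDocs = numWords == 0 ? 0 : counts[0].length;
        double[][] weighted = new double[numWords][numDocs];

        for (int i = 0; i < numWords; i++) {
            // Total occurrences of word i across all documents.
            double total = 0;
            for (int c : counts[i]) {
                total += c;
            }

            // Sum of p * log(p) over the documents that contain the word.
            double entropySum = 0;
            for (int c : counts[i]) {
                if (c > 0) {
                    double p = c / total;
                    entropySum += p * Math.log(p);
                }
            }

            // With a single document the entropy term is undefined, so use 1.
            double global = numDocs > 1 ? 1 + entropySum / Math.log(numDocs) : 1;

            for (int j = 0; j < numDocs; j++) {
                weighted[i][j] = Math.log(1 + counts[i][j]) * global;
            }
        }
        return weighted;
    }

    public static void main(String[] args) {
        int[][] counts = { {2, 0}, {1, 1} };
        double[][] w = transform(counts);
        for (double[] row : w) {
            System.out.println(java.util.Arrays.toString(row));
        }
    }
}
```

The effect is to dampen very frequent words and to discount words that appear uniformly across documents, which carry little information about any one document.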

Important Note

The LSA program is the definitive authority on the current set of options and their configurations. If you find an option is incorrectly specified on this page, please [let us know] (mailto:[email protected]). Full documentation may be found on the command line by running the lsa.jar program without any options.

Example Command Lines

Generates a simple .sspace file with the default 300 dimensions.

java -jar lsa.jar -d corpus.txt my-lsa-output.sspace

Has the JVM use 8GB of RAM when performing LSA (more RAM is almost always better)

java -Xmx8g -jar lsa.jar -d corpus.txt my-lsa-output.sspace

Removes stop words from the corpus while processing. (Note: LSA doesn't do this in the original papers)

java -Xmx8g -jar lsa.jar -d corpus.txt -F exclude=stopwords.txt my-lsa-output-no-stopwords.sspace

Generates an LSA space with 500 dimensions

java -Xmx8g -jar lsa.jar -d corpus.txt -n 500 my-lsa-output-500dim.sspace

Generates an LSA space with known compound words

java -Xmx8g -jar lsa.jar -d corpus.txt -C my-list-of-ngrams.txt my-lsa-output-with-ngrams.sspace

Runs LSA with SVDLIBJ specifically (Note: the algorithm choice shouldn't affect the final vector values - only the runtime of LSA)

java -Xmx8g -jar lsa.jar -d corpus.txt -S SVDLIBJ my-lsa-output.sspace

Acknowledgments

  • We are grateful for the advice and assistance of Tom Landauer, Walter Kintsch and Praful Mangalath of the Latent Semantic Analysis group at the University of Colorado, Boulder.

  • We are grateful to Doug Rohde for making the SVDLIBC program freely available.

  • We are very grateful to Adrian Kuhn and David Erni for creating SVDLIBJ by porting SVDLIBC to Java.