
Frequently Asked Questions

Code

How do I get the code?

You can download a pre-compiled jar, the javadoc, and the source files here. Alternatively, you can check out the code using git:

git clone git://github.com/fozziethebeat/S-Space.git

If you would like to depend on the S-Space package as a Maven artifact, we try to regularly deploy all of our jars to Maven Central. Our package information, as a pom.xml dependency, is:

<dependency>
  <groupId>edu.ucla.sspace</groupId>
  <artifactId>sspace</artifactId>
  <version>2.0</version>
</dependency>

How stable is the master branch?

We are still actively developing the code base. In general, the interfaces are safe to program against: we test and review them internally, and they are designed to be stable at commit time. However, the utility classes are much more mutable and may be changed without notice based on our future plans or current needs.

We also have a general rule that the trunk will always compile, even though it may contain bugs. When in doubt, check the commit logs to see if some code was committed in a partial state. (We rarely do this, but it is often practical when teaming up on a specific piece of code.)

You changed some feature in the trunk that I was using! Can I get it back?

Working from the trunk is often exciting, but dangerous. That said, if we have removed something you needed, [let us know](mailto:[email protected]). Include a use case and we will actively look into either replacing the code or providing alternate functionality that meets your needs.

I found a bug!

Please file it in our [tracker](https://github.com/fozziethebeat/S-Space/issues). Any information you can provide will help us fix it more quickly. We aim to have a quick turn-around time on publicly reported bugs.

The code doesn't compile!

Check that you're using Java 6. (This is especially important for OS X, where the default Java version is currently Java 5.) If you're building from the master branch and using Java 6, [let us know](mailto:[email protected]). We maintain that the master should always compile and pass all unit tests, so if it doesn't for some reason, we will fix it right away.

Can I get feature X?

Probably, but how soon is going to be the limiting factor. We often have unfinished code in our branches that may be what you're looking for. Due to changes in focus, we never get around to properly testing and commenting this code, so it sits uncommitted.

If we haven't implemented it yet, and the idea makes sense, then we'll move it to the top of our to-do list.

As always, the easiest way to get something implemented is to email us at [[email protected]](mailto:[email protected]) and let us know what you want to see.

Why do you support external SVD programs?

Before Adrian Kuhn and David Erni released SVDLIBJ, there was no fast pure-Java implementation of a sparse SVD. Therefore, external programs were required to scale algorithms such as [LSA](/fozziethebeat/S-Space/wiki/LatentSemanticAnalysis) to acceptable corpus sizes. Furthermore, since SVDLIBJ has correctness bugs, we also rely on fast and correct external SVD implementations.

We continue to support external programs (and the COLT and JAMA libraries) as legacy code, and for those developers who find that the external programs work faster than SVDLIBJ. In our own testing, only SVDLIBC performed faster. However, the performance difference was less than 10%, even when SVDLIBC was compiled with [icc](http://en.wikipedia.org/wiki/Intel_C%2B%2B_Compiler), Intel's compiler, using all optimizations. (We think this speaks quite nicely of the JVM's compiler capabilities.)

Running Algorithms

How do I run one of the algorithms?

The primary way to run an algorithm is through our main classes, which are command-line executable programs for the different algorithms. All of the fully implemented algorithms should have an associated class in the edu.ucla.sspace.mains package. The wiki page for a specific algorithm will also have further details.

Is it possible to run them through a graphical interface?

This is not currently supported, but it is possible to do yourself. Our resources are currently focused on getting the various algorithms working, so we have not had time to build a nice-looking graphical front end. If you want to build one, you can simply instantiate one of the algorithm classes. For example:

import edu.ucla.sspace.lsa.LatentSemanticAnalysis;

public class MyGUI {

    // ....

    public void myFunc() {
        LatentSemanticAnalysis lsa = new LatentSemanticAnalysis();
        lsa.processDocument(...);
    }
}

The algorithms are fully self-contained, and can easily be used as libraries.
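To make that more concrete, here is a minimal sketch of driving LSA as a library. The corpus file name and the one-document-per-line format are placeholder assumptions; the general lifecycle is processDocument for each document, followed by processSpace:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.StringReader;
import java.util.Properties;

import edu.ucla.sspace.common.SemanticSpace;
import edu.ucla.sspace.lsa.LatentSemanticAnalysis;
import edu.ucla.sspace.vector.Vector;

public class LsaExample {
    public static void main(String[] args) throws Exception {
        SemanticSpace lsa = new LatentSemanticAnalysis();

        // "corpus.txt" is a hypothetical file with one document per line
        BufferedReader corpus = new BufferedReader(new FileReader("corpus.txt"));
        for (String doc; (doc = corpus.readLine()) != null; )
            lsa.processDocument(new BufferedReader(new StringReader(doc)));
        corpus.close();

        // Run the algorithm's global processing (for LSA, the SVD step);
        // this requires an SVD implementation to be available (see below)
        lsa.processSpace(new Properties());

        // Look up the semantic vector for a word that occurred in the corpus
        Vector fox = lsa.getVector("fox");
        System.out.println(fox);
    }
}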

I keep getting OutOfMemory errors!

This could be due to several issues.

First, consider manually setting the maximum memory for the JVM with -Xmx<size> (e.g. -Xmx2g for a 2 GB heap). See the JVM documentation for further details.
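If you are unsure whether the -Xmx setting is actually taking effect, a quick sanity check (plain JVM API, nothing S-Space-specific) is:

public class MaxHeap {
    public static void main(String[] args) {
        // Prints the maximum heap size the running JVM will try to use
        long maxBytes = Runtime.getRuntime().maxMemory();
        System.out.println("Max heap: " + (maxBytes / (1024 * 1024)) + " MB");
    }
}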

Second, note that many of the algorithms scale based on the number of terms in the corpus. If no pre-processing is done to the corpus, it may contain seemingly duplicate tokens such as:

  • fox
  • fox,
  • "fox
  • fox"
  • 'fox
  • fox.
  • fox?
  • FOX
  • F.O.X.

and so on. The best way to assess whether this is the root issue is to count how many unique token types are in the input corpus (see the sketch below).
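Here is a minimal sketch of such a count. It splits on whitespace only, so tokens like fox and fox, are intentionally counted as different types, and the corpus file name is a placeholder:

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.HashSet;
import java.util.Set;

public class CountTokenTypes {
    public static void main(String[] args) throws Exception {
        // "corpus.txt" is a hypothetical one-document-per-line corpus file
        BufferedReader br = new BufferedReader(new FileReader("corpus.txt"));
        Set<String> types = new HashSet<String>();
        for (String line; (line = br.readLine()) != null; )
            for (String token : line.split("\\s+"))
                if (!token.isEmpty())
                    types.add(token);
        br.close();
        System.out.println("Unique token types: " + types.size());
    }
}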

If you think the algorithm should scale to your number of unique words but it is still throwing errors, please let us know.

I keep getting an Exception in thread "main" java.lang.UnsupportedOperationException: No SVD algorithms are available

It looks like our SVD.java code can't find the backing SVD program that actually does the computation.

For SVDLIBC, Matlab, and Octave, check whether you can call the program from the command line. For example, if you have Matlab installed, you should be able to type matlab at the command line and have the Matlab program start.

For JAMA and COLT, check that the .jar files are specified on the classpath. If you are running an algorithm from a .jar, ensure that you specify the -Djama.path or -Dcolt.path properties as necessary (e.g. -Djama.path=/path/to/jama.jar).

If you can't get any of these to work, let us know so we can help strategize. The SVD is a particular pain point for us as well, so we want to make sure "it just works."

When I run with Octave, I see something like "error: svds undefined near line 5 column 14 error: near line 5 of file /tmp/octave-svds8621483587696991940.m"

It looks like your Octave installation doesn't have the optional ARPACK bindings installed, which is why it can't find the svds function. You can get the bindings [here](http://octave.sourceforge.net/arpack/index.html).

How can I convert a .sspace file to a matrix?

This question also comes up when a user wants to write the converted matrix to a file to interface with an external program, such as Matlab (e.g. to plot the vectors). An .sspace file can be converted as follows:

import java.io.File;
import java.util.ArrayList;
import java.util.List;

import edu.ucla.sspace.common.SemanticSpace;
import edu.ucla.sspace.matrix.Matrices;
import edu.ucla.sspace.matrix.Matrix;
import edu.ucla.sspace.matrix.MatrixIO;
import edu.ucla.sspace.vector.DoubleVector;
import edu.ucla.sspace.vector.Vectors;

SemanticSpace sspace = ...; // Wherever you get it from
int numVectors = sspace.getWords().size();
List<DoubleVector> vectors = new ArrayList<DoubleVector>(numVectors);
for (String word : sspace.getWords())
    vectors.add(Vectors.asDouble(sspace.getVector(word)));

Matrix m = Matrices.asMatrix(vectors);

// If you then want to write the matrix out to disk, do the following
MatrixIO.Format fmt = ...;   // specify here
File outputMatrixFile = ...; // also specify
MatrixIO.writeMatrix(m, outputMatrixFile, fmt);
// Reminder: if you still want the word-to-row mapping, write out the words too
// (see the sketch below)
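For that word-to-row mapping, one simple approach (a sketch, not the only way) is to write the words out one per line in the same order their vectors were added, so that line i labels row i of the matrix:

// Also needs: import java.io.PrintWriter;
// Safest is to record each word inside the vector-building loop above so the
// order is guaranteed to match; copying getWords() afterwards usually works,
// but the iteration-order guarantee depends on the Set implementation.
List<String> rowToWord = new ArrayList<String>(sspace.getWords());

File wordFile = ...;             // specify where the row labels should go
PrintWriter pw = new PrintWriter(wordFile);
for (String word : rowToWord)
    pw.println(word);            // line i labels row i of the matrix
pw.close();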

What would be a useful classpath?

For Linux users using bash, one can do:

export CLASSPATH=bin/:classes/:.

For csh, one can do:

setenv CLASSPATH "bin/:classes/:."

Corpus Questions

I want to filter out certain words in my corpus (i.e. stop words)

This is possible in any of the algorithms; see the documentation for specifics. The current code supports both excluding words, which is useful for stop lists, and being strictly inclusive, i.e. keeping only a recognized set of words.

How big should my corpus be?

That largely depends on the algorithm. If your corpus is too small, the words may not have sufficient co-occurrence statistics to form semantic vectors that are actually representative. Furthermore, some algorithms, such as Latent Relational Analysis, require significantly more documents to produce good semantics. Also note that even with a large corpus, some words may not occur frequently enough to generate accurate vectors.

Having too large a corpus can also be an issue. LSA is much more sensitive to the number of documents in the corpus, whereas other word co-occurrence algorithms such as HAL and Random Indexing depend primarily on the number of words. Some algorithms provide additional options to calculate semantics for only a specific number of words, which saves a large amount of space.

Other Questions

I want to use your code in a research paper. Do I need to do anything?

Please cite our ACL Systems paper: Jurgens and Stevens (2010). The S-Space Package: An Open Source Package for Word Space Models. In System Papers of the Association of Computational Linguistics.

We'd also love to hear from you. Please send us an email and let us know if you have any feedback on the package (e.g. features you would have liked, problems with documentation, or any other suggestions).

Who do I contact with questions about the project?

If it is a general question, contact [[email protected]](mailto:[email protected]). If you need a private question answered, please email David Jurgens or Keith Stevens.

Why does the project activity tend to drop off in the summer?

Two possible reasons. First, we often work on much larger-scale projects during the summer. We intend to port these to the S-Space package once they are finished, but we don't want to check intermediate files into the repository until they're verified as working.

Second, during the summer many of the graduate students and undergraduates working on the project leave on vacation or for work. This puts a time constraint on how much work can be done. However, work is still being done, even if it isn't committed.

Why the name Airhead Research?

Airhead Research stands for AI-Head Research. This is a throwback to an older name for the lab, which we have taken as our own. We're still working on a nice-sounding acronym for "head" to fill out the title.
