
Frequently Asked Questions

Code

How do I get the code?

You can download a pre-compiled jar, the javadoc, and the source files here. Alternatively, you can check out the code using git:

git clone git://github.com/fozziethebeat/S-Space.git

If you would like to depend on the S-Space package as a Maven artifact, we try to regularly deploy all of our jars to Maven Central. Our package information, as a pom.xml dependency, is:

<dependency>
  <groupId>edu.ucla.sspace</groupId>
  <artifactId>sspace</artifactId>
  <version>2.0</version>
</dependency>

How stable is the master branch?

We are still actively developing the code base. In general, the interfaces are safe to program against: we test and review them internally, and they are designed to be stable at commit time. However, the utility classes are much more mutable and may be changed without notice based on our future plans or current needs.

We also have a general rule that the trunk will always compile, even though it may contain bugs. When in doubt, check the commit logs to see if some code was committed in a partial state. (We rarely do this, but it is often practical when teaming up on a specific piece of code.)

You changed some feature in the trunk that I was using! Can I get it back?

Working from the trunk is often exciting, but dangerous. That said, if we have removed something you needed, [let us know](mailto:[email protected]). Include a use case and we will actively look into either replacing the code or providing alternate functionality that meets your needs.

I found a bug!

Please file it in our [tracker](https://github.com/fozziethebeat/S-Space/issues). Any information you can provide will help us fix it more quickly. We aim to have a quick turn-around time on publicly reported bugs.

The code doesn't compile!

Check that you're using Java 6. (This is especially important for OS X, where the default Java version is currently Java 5.) If you're building from the master branch and using Java 6, [let us know](mailto:[email protected]). We maintain that the master should always compile and pass all unit tests, so if it doesn't for some reason, we will fix it right away.

Can I get feature X?

Probably, but how soon is going to be the limiting factor. We often have unfinished code in our branches that may be what you're looking for. Due to changes in focus, we never get around to properly testing and commenting this code, so it sits uncommitted.

If we haven't implemented it yet, and the idea makes sense, then we'll move it to the top of our to-do list.

As always, the easiest way to get something implemented is to email us at [[email protected]](mailto:[email protected]) and let us know what you want to see.

Why do you support external SVD programs?

Before Adrian Kuhn and David Erni released SVDLIBJ, there was no fast pure-Java implementation of a sparse SVD. Therefore, external programs were required to scale algorithms such as [LSA](/fozziethebeat/S-Space/wiki/LatentSemanticAnalysis) to acceptable corpus sizes. Furthermore, since SVDLIBJ has correctness bugs, we also rely on fast and correct external SVD implementations.

We continue to support external programs (and the COLT and JAMA libraries) as legacy code, and for those developers who find that the external programs work faster than SVDLIBJ. In our own testing, only SVDLIBC performed faster. However, the performance difference was less than 10%, even when SVDLIBC was compiled with [icc](http://en.wikipedia.org/wiki/Intel_C%2B%2B_Compiler), Intel's compiler, using all optimizations. (We think this speaks quite nicely of the JVM's compiler capabilities.)

Running Algorithms

How do I run one of the algorithms?

The primary way to run an algorithm is through our main classes, which are command-line executable programs for the different algorithms. All of the fully implemented algorithms should have an associated class in the edu.ucla.sspace.mains package. The wiki page for a specific algorithm will also have further details.

Is it possible to run them through a graphical interface?

This is not currently supported, but it is possible to do yourself. Our resources are currently focused on getting the various algorithms working, so we have not had time to build a nice-looking graphical front end. If you want to build one, you can simply instantiate one of the algorithm classes. For example:

import edu.ucla.sspace.lsa.LatentSemanticAnalysis;

public class MyGUI {

    // ....

    public void myFunc() {
        LatentSemanticAnalysis lsa = new LatentSemanticAnalysis();
        lsa.processDocument(...);
    }
}

The algorithms are fully self-contained, and can easily be used as libraries.
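To make that more concrete, here is a minimal sketch of driving LSA as a library. The corpus file name and the one-document-per-line format are placeholder assumptions; the general lifecycle is processDocument for each document, followed by processSpace:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.StringReader;
import java.util.Properties;

import edu.ucla.sspace.common.SemanticSpace;
import edu.ucla.sspace.lsa.LatentSemanticAnalysis;
import edu.ucla.sspace.vector.Vector;

public class LsaExample {
    public static void main(String[] args) throws Exception {
        SemanticSpace lsa = new LatentSemanticAnalysis();

        // "corpus.txt" is a hypothetical file with one document per line
        BufferedReader corpus = new BufferedReader(new FileReader("corpus.txt"));
        for (String doc; (doc = corpus.readLine()) != null; )
            lsa.processDocument(new BufferedReader(new StringReader(doc)));
        corpus.close();

        // Run the algorithm's global processing (for LSA, the SVD step);
        // this requires an SVD implementation to be available (see below)
        lsa.processSpace(new Properties());

        // Look up the semantic vector for a word that occurred in the corpus
        Vector fox = lsa.getVector("fox");
        System.out.println(fox);
    }
}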

I keep getting OutOfMemory errors!

This could be due to several issues.

First, consider manually setting the maximum memory for the JVM with -Xmx<size> (e.g. -Xmx2g for a 2 GB heap). See the JVM documentation for further details.
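If you are unsure whether the -Xmx setting is actually taking effect, a quick sanity check (plain JVM API, nothing S-Space-specific) is:

public class MaxHeap {
    public static void main(String[] args) {
        // Prints the maximum heap size the running JVM will try to use
        long maxBytes = Runtime.getRuntime().maxMemory();
        System.out.println("Max heap: " + (maxBytes / (1024 * 1024)) + " MB");
    }
}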

Second, note that many of the algorithms scale based on the number of terms in the corpus. If no pre-processing is done to the corpus, it may contain seemingly duplicate tokens such as:

  • fox
  • fox,
  • "fox
  • fox"
  • 'fox
  • fox.
  • fox?
  • FOX
  • F.O.X.

and so on. The best way to assess whether this is the root issue is to count how many unique token types are in the input corpus (see the sketch below).
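Here is a minimal sketch of such a count. It splits on whitespace only, so tokens like fox and fox, are intentionally counted as different types, and the corpus file name is a placeholder:

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.HashSet;
import java.util.Set;

public class CountTokenTypes {
    public static void main(String[] args) throws Exception {
        // "corpus.txt" is a hypothetical one-document-per-line corpus file
        BufferedReader br = new BufferedReader(new FileReader("corpus.txt"));
        Set<String> types = new HashSet<String>();
        for (String line; (line = br.readLine()) != null; )
            for (String token : line.split("\\s+"))
                if (!token.isEmpty())
                    types.add(token);
        br.close();
        System.out.println("Unique token types: " + types.size());
    }
}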

If you think the algorithm should scale to your number of unique words but it is still throwing errors, please let us know.

I keep getting an Exception in thread "main" java.lang.UnsupportedOperationException: No SVD algorithms are available

It looks like our SVD.java code can't find the backing SVD program that actually does the computation.

For SVDLIBC, Matlab, and Octave, check whether you can call the program from the command line. For example, if you have Matlab installed, you should be able to type matlab at the command line and have the Matlab program start.

For JAMA and COLT, check that the .jar files are specified on the classpath. If you are running an algorithm from a .jar, ensure that you specify the -Djama.path or -Dcolt.path properties as necessary (e.g. -Djama.path=/path/to/jama.jar).

If you can't get any of these to work, let us know so we can help strategize. The SVD is a particular pain point for us as well, so we want to make sure "it just works."

When I run with Octave, I see something like "error: svds undefined near line 5 column 14 error: near line 5 of file /tmp/octave-svds8621483587696991940.m"

It looks like your Octave installation doesn't have the optional ARPACK bindings installed, which is why it can't find the svds function. You can get the bindings [here](http://octave.sourceforge.net/arpack/index.html).

How can I convert a .sspace file to a matrix?

This question also comes up when a user wants to write the converted matrix to a file to interface with an external program, such as Matlab (e.g. to plot the vectors). An .sspace file can be converted as follows:

import java.io.File;
import java.util.ArrayList;
import java.util.List;

import edu.ucla.sspace.common.SemanticSpace;
import edu.ucla.sspace.matrix.Matrices;
import edu.ucla.sspace.matrix.Matrix;
import edu.ucla.sspace.matrix.MatrixIO;
import edu.ucla.sspace.vector.DoubleVector;
import edu.ucla.sspace.vector.Vectors;

SemanticSpace sspace = ...; // Wherever you get it from
int numVectors = sspace.getWords().size();
List<DoubleVector> vectors = new ArrayList<DoubleVector>(numVectors);
for (String word : sspace.getWords())
    vectors.add(Vectors.asDouble(sspace.getVector(word)));

Matrix m = Matrices.asMatrix(vectors);

// If you then want to write the matrix out to disk, do the following
MatrixIO.Format fmt = ...;   // specify here
File outputMatrixFile = ...; // also specify
MatrixIO.writeMatrix(m, outputMatrixFile, fmt);
// Reminder: if you still want the word-to-row mapping, write out the words too
// (see the sketch below)
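For that word-to-row mapping, one simple approach (a sketch, not the only way) is to write the words out one per line in the same order their vectors were added, so that line i labels row i of the matrix:

// Also needs: import java.io.PrintWriter;
// Safest is to record each word inside the vector-building loop above so the
// order is guaranteed to match; copying getWords() afterwards usually works,
// but the iteration-order guarantee depends on the Set implementation.
List<String> rowToWord = new ArrayList<String>(sspace.getWords());

File wordFile = ...;             // specify where the row labels should go
PrintWriter pw = new PrintWriter(wordFile);
for (String word : rowToWord)
    pw.println(word);            // line i labels row i of the matrix
pw.close();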

What would be a useful classpath?

For Linux users using bash, one can do:

export CLASSPATH=bin/:classes/:.

For csh, one can do:

setenv CLASSPATH "bin/:classes/:."

Corpus Questions

I want to filter out certain words in my corpus (i.e. stop words)

This is possible in any of the algorithms; see the documentation for specifics. The current code supports both excluding words, which is useful for stop lists, and being strictly inclusive, i.e. keeping only a recognized set of words.

How big should my corpus be?

That largely depends on the algorithm. If your corpus is too small, the words may not have sufficient co-occurrence statistics to form semantic vectors that are actually representative. Furthermore, some algorithms, such as Latent Relational Analysis, require significantly more documents to produce good semantics. Also note that even with a large corpus, some words may not occur frequently enough to generate accurate vectors.

Having too large a corpus can also be an issue. LSA is much more sensitive to the number of documents in the corpus, whereas other word co-occurrence algorithms such as HAL and Random Indexing depend primarily on the number of words. Some algorithms provide additional options to calculate semantics for only a specific number of words, which saves a large amount of space.

Other Questions

I want to use your code in a research paper. Do I need to do anything?

Please cite our ACL Systems paper: Jurgens and Stevens (2010). The S-Space Package: An Open Source Package for Word Space Models. In System Papers of the Association of Computational Linguistics.

We'd also love to hear from you. Please send us an email and let us know if you have any feedback on the package (e.g. features you would have liked, problems with documentation, or any other suggestions).

Who do I contact with questions about the project?

If it is a general question, contact [[email protected]](mailto:[email protected]). If you need a private question answered, please email David Jurgens or Keith Stevens.

Why does the project activity tend to drop off in the summer?

Two possible reasons. First, we often work on much larger-scale projects during the summer. We intend to port these to the S-Space package once they are finished, but we don't want to check intermediate files into the repository until they're verified as working.

Second, during the summer many of the graduate students and undergraduates working on the project leave on vacation or for work. This puts a time constraint on how much work can be done. However, work is still being done, even if it isn't committed.

Why the name Airhead Research?

Airhead Research stands for AI-Head Research. This is a throwback to an older name for the lab, which we have taken as our own. We're still working on a nice-sounding acronym for "head" to fill out the title.
