GettingStarted
The S-Space package is a software library and suite of tools for building semantic spaces. Semantic space algorithms capture the statistical regularities of words in a text corpus and map each word to a high-dimensional vector that represents its semantics. For example, if we observe the sentences:
- He drank the foobar at the game.
- Foobar is the number three beverage.
- A case of foobar is cheap compared to other sodas.
- Foobar tastes better when cold.
We notice that foobar occurs with words that give strong indications of what the word means, e.g. "drink," "beverage," or "sodas." The surrounding words form a context that a semantic space algorithm can use to determine the semantics. As we see more contexts for "foobar," certain regularities will show up in the surrounding words. The simplest algorithms build a matrix (often referred to as a co-occurrence matrix) that keeps track of which words show up in another word's context.
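As a concrete illustration of this simple model, the sketch below (hypothetical code, not part of the S-Space API) counts, for each word in a tokenized sentence, how often every other word appears in its context:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the simplest co-occurrence model: every other
// word in the same sentence counts as part of a word's context.
public class CooccurrenceExample {

    // Maps each word to the words it co-occurs with and their counts.
    private final Map<String, Map<String, Integer>> cooccurrences = new HashMap<>();

    public void processSentence(String[] tokens) {
        for (int i = 0; i < tokens.length; ++i) {
            Map<String, Integer> counts =
                cooccurrences.computeIfAbsent(tokens[i], k -> new HashMap<>());
            for (int j = 0; j < tokens.length; ++j) {
                if (i != j)
                    counts.merge(tokens[j], 1, Integer::sum);
            }
        }
    }

    // Returns how many times context appeared in word's contexts.
    public int getCount(String word, String context) {
        Map<String, Integer> counts = cooccurrences.get(word);
        return (counts == null) ? 0 : counts.getOrDefault(context, 0);
    }
}
```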
There are many types of semantic space algorithms that move beyond the simple co-occurrence model described above. Popular variations include:
- limiting the context to a fixed-size window around a word
- using syntactic relationships, e.g. adjective-noun, as co-occurrence
- using semantic relationships, e.g. subject-object, as co-occurrence
- performing dimensionality reduction on the co-occurrence matrix, e.g. [SVD](/fozziethebeat/S-Space/wiki/SingularValueDecomposition)
- identifying patterns between pairs of words
The S-Space package provides a uniform framework in which to implement all of these algorithms and their specific parameters, easily and correctly. Furthermore, the package also provides facilities for evaluating a semantic space against commonly used benchmarks such as the [ESL synonym questions](http://www.aclweb.org/aclwiki/index.php?title=ESL_Synonym_Questions_(State_of_the_art)) or comparing similarity with the human [word similarity](http://www.cs.technion.ac.il/~gabr/resources/data/wordsim353/) judgments gathered by Finkelstein et al. (2002).
The first step is downloading the code. The preferred method at the time of this writing is to get the latest build from the [git repository](git://github.com/fozziethebeat/S-Space.git):
git clone git://github.com/fozziethebeat/S-Space.git
If successfully checked out, you should see a directory layout as follows:
user@machine$ ls
bin  build.xml  classes  data  javadoc  lib  licenses  src  test
The directories contain the following information:
- `data` - commonly used data files such as stop word lists
- `lib` - libraries and dependencies
- `opt` - optional code that extends the S-Space package
- `src` - all of the source code for the S-Space package
- `test` - all of the unit-test code
The S-Space package uses [Apache Maven](http://maven.apache.org/) to build the project. All of the classes can be compiled with:
user@machine$ mvn compile
[INFO] Scanning for projects...
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] Building S-Space Package 2.0
[INFO] ------------------------------------------------------------------------
[INFO] Compiling 495 source files to /home/stevens35/devel/S-Space/target/classes
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 12.309s
[INFO] Finished at: Thu Oct 27 08:56:28 PDT 2011
[INFO] Final Memory: 24M/361M
[INFO] ------------------------------------------------------------------------
You can also run the unit tests with:
user@machine$ mvn test
This runs all unit tests and reports any failures. After running, you should see output that ends with:
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 13.665s
[INFO] Finished at: Thu Oct 27 08:58:20 PDT 2011
[INFO] Final Memory: 16M/455M
[INFO] ------------------------------------------------------------------------
If the build fails, then at least one unit test has failed. Reports of all unit tests will be stored under `target/surefire-reports/TEST-path.of.unit.test.xml`.
Each algorithm in the S-Space package implements the [SemanticSpace](http://fozziethebeat.github.com/S-Space/apidocs/edu/ucla/sspace/common/SemanticSpace.html) interface, which provides a common API for interacting with the algorithm. The interface defines six methods:
- `getSpaceName()` - Returns a unique string describing the name and configuration of this algorithm. Any configurable parameters that would affect the resulting semantic space should be expressed as a part of this name.
- `getVector(String word)` - Returns the semantic vector for the provided word.
- `getVectorLength()` - Returns the length of the semantic vectors.
- `getWords()` - Returns the set of words that are represented in this semantic space.
- `processDocument(BufferedReader document)` - Processes the contents of the provided reader as a document. The algorithm is free to interpret the contents of the reader as it sees fit. For example, if the algorithm is specifically designed for [Wikipedia](http://en.wikipedia.org/), then it may assume that the reader contains the expected Wikipedia content.
- `processSpace(Properties properties)` - Once all the documents have been processed, performs any post-processing steps on the data. For example, in [LSA](/fozziethebeat/S-Space/wiki/LatentSemanticAnalysis), after all of the documents have been processed, this method is called to perform the matrix normalization and [SVD](/fozziethebeat/S-Space/wiki/SingularValueDecomposition).
The `getWords()` and `getVector()` methods are designed with the intent that an algorithm can generate vectors for words on the fly. Instead of keeping all of the vectors in memory at once (or on disk), the algorithm may instead generate a full vector from an internal, possibly compressed or sparse, representation.
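As a rough sketch of this idea (hypothetical code that only mirrors, rather than implements, the interface; the real method signatures, such as the return type of getVector, may differ), the map below is the compact internal form and a dense array is built only when a vector is requested:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical example of generating dense vectors on the fly from a
// sparse internal representation; not an actual S-Space class.
public class OnTheFlyVectorExample {

    // Each word maps only to the non-zero entries of its semantic vector.
    private final Map<String, Map<Integer, Double>> sparseVectors = new HashMap<>();

    // Assumed, fixed dimensionality of the space.
    private static final int VECTOR_LENGTH = 1000;

    public Set<String> getWords() {
        return sparseVectors.keySet();
    }

    public int getVectorLength() {
        return VECTOR_LENGTH;
    }

    // Materializes the full, dense vector for a word only when requested.
    public double[] getVector(String word) {
        Map<Integer, Double> nonZeros = sparseVectors.get(word);
        if (nonZeros == null)
            return null;
        double[] dense = new double[VECTOR_LENGTH];
        for (Map.Entry<Integer, Double> e : nonZeros.entrySet())
            dense[e.getKey()] = e.getValue();
        return dense;
    }
}
```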
We follow the convention that an algorithm should be designed as a library, rather than as an executable. For example, we would expect that any algorithm could be used as follows:
public void myFunction(Collection<BufferedReader> documents) throws IOException {
    // Create the algorithm, which should initialize any of its
    // required resources
    MySSpaceAlgorithm algo = new MySSpaceAlgorithm();

    // Process each of the documents in the corpus. The algorithm
    // should incrementally build its semantic space
    for (BufferedReader document : documents)
        algo.processDocument(document);

    // The algorithm needs to do any post-processing once the
    // entire corpus has been seen. The Properties argument
    // allows us to specify any optional configuration/processing
    // parameters that the algorithm defines.
    algo.processSpace(System.getProperties());

    // This loop simply prints out the vectors, which isn't that
    // informative, but still shows how the vectors are used. A
    // better use of the algorithm would be to call SemanticSpaceUtils
    // and write the semantic space to disk. Alternately, we could
    // use Similarity.java or a class from the evaluation package
    // to compare vectors at this point.
    for (String word : algo.getWords())
        System.out.printf("%s maps to %s%n", word,
                          VectorIO.toString(algo.getVector(word)));
}
For this reason, we have two files for each algorithm: one that contains all of the algorithm's code for creating the semantic space, and a second file that contains code for running that algorithm as a stand-alone program. For example, [LatentSemanticAnalysis.java](http://fozziethebeat.github.com/S-Space/apidocs/edu/ucla/sspace/lsa/LatentSemanticAnalysis.html) is paired with [LSAMain.java](http://fozziethebeat.github.com/S-Space/apidocs/edu/ucla/sspace/mains/LSAMain.html).
For an introduction to the various libraries the S-Space package offers, see the [code overview](/fozziethebeat/S-Space/wiki/PackageLayout) page.
Several of the existing algorithms have stand-alone executable programs in the `edu.ucla.sspace.mains` package. For example, to run [LSA](/fozziethebeat/S-Space/wiki/LatentSemanticAnalysis), you could do:
user@machine$ java -cp classes edu.ucla.sspace.mains.LSAMain
usage: java LSAMain [options] <output-dir>
Options:
-n, --dimensions=INT the number of dimensions in the semantic space
-p, --preprocess=CLASSNAME a MatrixTransform class to use for preprocessing
Required (at least one of):
-d, --docFile=FILE[,FILE...] a file where each line is a document
-f, --fileList=FILE[,FILE...] a list of document files
Program Options:
-o, --outputFormat={text|binary} the .sspace format to use
-t, --threads=INT the number of threads to use
-v, --verbose prints verbose output
-w, --overwrite=BOOL specifies whether to overwrite the existing output
This will print a usage menu providing further instructions for the expected command line arguments.
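For example, assuming a corpus stored in a single file, my-corpus.txt, with one document per line (the file and output-directory names here are just placeholders), a 300-dimensional LSA space could be built with:

user@machine$ java -cp classes edu.ucla.sspace.mains.LSAMain -d my-corpus.txt -n 300 lsa-output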
The algorithm you use should be based on the semantics you are trying to capture. For example, LSA is well established at capturing synonymous relationships between words; in the LSA space, words with similar meanings have similar vector representations.
In general, many semantic spaces require significant time or space resources. Users should be aware of these when running with the Java virtual machine. We recommend running with the `-server` option to use the [Server JVM](http://java.sun.com/products/hotspot/whitepaper.html). In addition, if you have the memory, we recommend running with the maximum available memory on the system. For example, if you have 8 gigabytes, you would add `-Xmx8g` to the `java` command. See your system's `java` documentation for more information on manually setting the heap size.
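Combining these JVM options with the LSAMain example above (again with placeholder file names):

user@machine$ java -server -Xmx8g -cp classes edu.ucla.sspace.mains.LSAMain -d my-corpus.txt -n 300 lsa-output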
One of the most important aspects of generating a semantic space (besides the algorithm) is the corpus itself. In general, the more words you can provide, the more contexts you will have for each word. The additional contexts create a more representative picture of which words commonly co-occur. Formally speaking, the additional data better approximates the contextual distribution of a word.
Many algorithms have certain constraints on how they process a corpus. For example, [LSA](/fozziethebeat/S-Space/wiki/LatentSemanticAnalysis) requires that the corpus be segmented into documents. On the other hand, [Random Indexing](/fozziethebeat/S-Space/wiki/RandomIndexing) and [COALS](/fozziethebeat/S-Space/wiki/Coals) have no such restriction and could process the entire corpus at once as a giant stream of words.
There are many commonly used corpora, some free and others commercial. If you don't have access to a corpus, we recommend the following for getting started immediately.
Free
- [The WaCKy Corpora](http://wacky.sslmit.unibo.it/doku.php?id=corpora) - A large collection of blog postings that is available in several languages besides English.
- [American National Corpus](http://americannationalcorpus.org/) - 15 million words of American English with part-of-speech tags
- [The USENET Corpus](http://www.psych.ualberta.ca/~westburylab/downloads/usenetcorpus.download.html) - a corpus of 20 billion mostly English words from USENET posts, provided by Westbury Lab (22 gigabytes, compressed)
- [Project Gutenberg](http://www.gutenberg.org/) - A collection of public domain books and other written works. (Note that you will have to compile the corpus yourself by downloading the books.)
- [Wikipedia snapshots](http://en.wikipedia.org/wiki/Wikipedia_database) - text and Wiki markup for all Wikipedia pages.
Non-Free
- [British National Corpus](http://www.natcorp.ox.ac.uk/) - This is an excellent quality corpus and has accompanying [word-sense disambiguation](http://en.wikipedia.org/wiki/Word_sense_disambiguation) information.
- [GOV2](http://ir.dcs.gla.ac.uk/test_collections/access_to_data.html) - One of the largest corpora (426 gigabytes)
This list is in no way comprehensive and is just meant as a starting point for your search. Many other topical corpora exist (e.g. medical abstracts, technical reports), which may be more suited to certain semantic space algorithms. You can also build your own corpus by crawling the web (tools exist to do this for you as well). If you have other corpora you think we should include in this list, feel free to [let us know](mailto:[email protected]).
Many corpora will need some form of preprocessing before the text data is fit to be used with a semantic space. Commonly used steps are:
- Removing punctuation
- Separating punctuation from words, e.g. `Oh, I'm hungry.` becomes `Oh , I ' m hungry .`
- Removing HTML, email addresses, and other non-word text
- Converting all words to lower (or upper) case
- [Stemming] (http://en.wikipedia.org/wiki/Stemming) all words
- Breaking a corpus up into documents
- Removing misspelled words
- Removing words not in the target language (e.g. all non-English words)
- Spelling correction
Not all of these steps will be used at once. We encourage users to look at their corpus and see what processing will be necessary. For example, it is a good idea to remove all of the Wiki markup in the Wikipedia snapshots.
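As a rough sketch of two of these steps, lower-casing and separating punctuation from words, the snippet below uses a simple regular expression; it is only illustrative and will not handle every case (abbreviations, URLs, etc.):

```java
// Illustrative preprocessing only: lower-cases the text and pads every
// punctuation character with spaces so it becomes a separate token.
public class PreprocessExample {

    public static String preprocess(String line) {
        return line.toLowerCase().replaceAll("(\\p{Punct})", " $1 ").trim();
    }

    public static void main(String[] args) {
        // Separates the punctuation in the example sentence from above.
        System.out.println(preprocess("Oh, I'm hungry."));
    }
}
```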
Last, the corpus needs to be converted into a format that the S-Space package can read in. We currently support three methods for reading corpora:
- a single corpus file, the entire corpus is contained in a single file with one document per line
- one file per document, a list is then generated for each file (document) in the corpus
- a custom corpus reader, for each file specified, the reader will be asked to generate an iterator over documents.
See the [file formats](/fozziethebeat/S-Space/wiki/FileFormats) wiki page for full details. Briefly, the first option is preferred for improved I/O efficiency. However, if your corpus is already split into one file per document, the second input format is easy to generate, but may have worse performance for large corpora.
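For the first format, each line of the corpus file can be wrapped as its own document and handed directly to the algorithm. A minimal sketch, reusing the hypothetical MySSpaceAlgorithm from the earlier example (the file name is a placeholder; the stand-alone mains already handle this for you via --docFile):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.StringReader;

public class OneDocumentPerLineExample {

    // Feeds each line of a single-file corpus to the algorithm as a document.
    public static void processCorpus(MySSpaceAlgorithm algo, String corpusFile)
            throws IOException {
        try (BufferedReader corpus = new BufferedReader(new FileReader(corpusFile))) {
            for (String line = null; (line = corpus.readLine()) != null; )
                algo.processDocument(new BufferedReader(new StringReader(line)));
        }
    }
}
```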
Once an algorithm has processed all of the documents, the semantic space can be serialized to a file using `SemanticSpaceUtils.printSemanticSpace()`. This creates an `.sspace` file containing all the words and their associated vectors. Creating a serialized form of the semantic space has several advantages:
- the semantic space only needs to be generated once
- different vector comparison methods (e.g. cosine similarity, correlation) can be tested on the vectors
- the serialized form can be used in evaluation benchmarks
- the same algorithm can be used to generate multiple serialized semantic spaces by adjusting the available parameters or using different corpora

The last advantage is most notable, as it lets the user compare how the semantics change for a single algorithm as other factors are varied.
Serialized semantic spaces can be read back in as `SemanticSpace` objects using the `FileBasedSemanticSpace` class or the `SemanticSpaceUtils.loadSemanticSpace()` method. The resulting spaces are read-only and are intended only for testing purposes.
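For example, the sketch below saves a space and reads it back in. It assumes the (space, file) overloads of these utility methods; check the SemanticSpaceUtils javadoc for the exact signatures available in your version.

```java
import java.io.File;
import java.io.IOException;

import edu.ucla.sspace.common.SemanticSpace;
import edu.ucla.sspace.common.SemanticSpaceUtils;

public class SerializationExample {

    public static void saveAndReload(SemanticSpace sspace) throws IOException {
        File output = new File("my-space.sspace");  // placeholder file name

        // Write all of the words and their vectors to an .sspace file.
        SemanticSpaceUtils.printSemanticSpace(sspace, output);

        // Read the space back in as a read-only SemanticSpace.
        SemanticSpace reloaded = SemanticSpaceUtils.loadSemanticSpace(output);
        System.out.println("Reloaded " + reloaded.getWords().size() + " words");
    }
}
```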
The S-Space package supports both human-readable and binary `.sspace` formats. See the [file formats](/fozziethebeat/S-Space/wiki/FileFormats) wiki page for full details.
A modified semantic space can be compared against the original. This may be done to ensure that dimensionality reduction or other modifications have not greatly altered the meaning of words. One way to accomplish this is to measure the average word similarity between the original and modified vectors. Several [similarity measures](/fozziethebeat/S-Space/wiki/SimilarityFunctions) have been implemented in the [Similarity](http://fozziethebeat.github.com/S-Space/apidocs/edu/ucla/sspace/common/Similarity.html) class. Iterating through the `SemanticSpace` word collection (obtained using `getWords()`) and measuring each word's similarity:
double measureSimilarity(SemanticSpace before, SemanticSpace after) {
    Set<String> words = after.getWords();
    double sum = 0.0;
    for (String word : words) {
        sum += Similarity.getSimilarity(Similarity.SimType.COSINE,
                                        before.getVector(word),
                                        after.getVector(word));
    }
    return sum / words.size();
}
Generally you are not concerned with the entire collection of words in the semantic space, but only the well-defined ones (frequently used, non-stop words). To manually inspect an .sspace file, try using the Semantic Space Explorer.
- When testing out an algorithm, use at least 100 test documents with a variety of words
- For algorithms that reduce the dimensionality, like LSA, set the number of dimensions to be less than the number of words and the number of documents in your test corpus.