David Jurgens edited this page Apr 8, 2017 · 3 revisions

File formats

Introduction

The command-line executable programs produce semantic spaces and save them as .sspace files. We use four file formats: binary and text, and their sparse equivalents. The default for all command-line programs is to produce sparse binary output.

The sparse versions should be used if the algorithm produces semantic vectors in which more than half of the values are 0 (e.g. RandomIndexing). For these types of semantic spaces, the sparse versions are faster to read and write and much smaller on disk.
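The size difference can be estimated with a little arithmetic. The sketch below is illustrative only: it uses the per-value byte sizes given in the binary and sparse binary sections of this page (an eight-byte double per value, plus a four-byte index and a four-byte count in the sparse case) and ignores the header and the encoded word, which are the same in both formats. The function names are hypothetical and are not part of the S-Space package.

```python
def dense_vector_bytes(dimensions):
    """Bytes used by one vector's values in the dense binary format:
    one eight-byte double per dimension."""
    return 8 * dimensions


def sparse_vector_bytes(non_zero):
    """Bytes used by one vector's values in the sparse binary format:
    a four-byte count, then a four-byte index and an eight-byte double
    for each non-zero value."""
    return 4 + 12 * non_zero


def sparse_is_smaller(vector):
    """Return True if the sparse encoding of `vector` needs fewer bytes."""
    non_zero = sum(1 for value in vector if value != 0)
    return sparse_vector_bytes(non_zero) < dense_vector_bytes(len(vector))
```

Under these assumptions, a 300-dimensional vector with two non-zero values takes 28 bytes in the sparse encoding and 2400 bytes in the dense one. The two encodings break even when roughly two thirds of the values are non-zero, so the "more than half zero" rule leaves a comfortable margin.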

text format

Each word and its vector are written on their own line. A word is separated from its vector by a | character, and the vector values are space-delimited. The first line gives the number of words and the number of dimensions. For example, a semantic space with 4 words and 4 dimensions would look like:

4 4
red|0 1 0 0
blue|1 0 0 1
green|0 0 1 0
purple|1 1 0 1

Any whitespace in the original word, e.g. "white house", is preserved in the .sspace output. The vector values are written as double-precision floating point numbers.

The text format is convenient for processing small spaces and verifying that the vectors are what is expected (or at the very least, non-zero). This format is also very portable and is easily loaded by other programming languages, should that be required.
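As an illustration of loading the text format from another language, here is a minimal Python sketch. It assumes that the first line holds the word count followed by the dimension count, and that the last | on a line separates the word from its values. The function name `read_text_sspace` is hypothetical and is not part of the S-Space package.

```python
import io


def read_text_sspace(stream):
    """Parse a text-format semantic space into a dict of word -> list of floats."""
    header = stream.readline().split()
    num_words, num_dims = int(header[0]), int(header[1])

    vectors = {}
    lines_read = 0
    for line in stream:
        line = line.rstrip("\r\n")
        if not line:
            continue
        # Split on the last '|' so that whitespace inside the word is kept.
        word, separator, values = line.rpartition("|")
        if not separator:
            raise ValueError("missing '|' separator: %r" % line)
        vector = [float(value) for value in values.split()]
        if len(vector) != num_dims:
            raise ValueError("expected %d values for %r" % (num_dims, word))
        vectors[word] = vector
        lines_read += 1

    if lines_read != num_words:
        raise ValueError("header promised %d words, found %d" % (num_words, lines_read))
    return vectors


# Usage with the example space shown above.
example = "4 4\nred|0 1 0 0\nblue|1 0 0 1\ngreen|0 0 1 0\npurple|1 1 0 1\n"
space = read_text_sspace(io.StringIO(example))
```

The consistency checks against the header are a design choice for catching truncated files early; they can be dropped if a lenient reader is preferred.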

binary format

In the binary format, the entire semantic space is written in one continuous stream. The format is specified as:

  1. An eight-byte header consisting of two ints (four bytes each; high byte first): the number of words in the space, followed by the number of dimensions in the space.
  2. Each word in the space is then appended in the following format:
     * The word is encoded in Modified UTF-8 (see [here](http://java.sun.com/javase/6/docs/api/java/io/DataInput.html#modified-utf-8) for more information on the encoding format).
     * Each of the values for the vector dimensions is appended as an eight-byte double (high byte first).

This format is faster to serialize and deserialize than the text format. Furthermore, .sspace files in this format are smaller than their full-text equivalents, so this format is preferred if the semantic space is large or if disk space is an issue.

We recommend that large, dense semantic spaces use this format.
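The layout described above can be read with a few lines of Python. This is a sketch under stated assumptions rather than a reference reader: it assumes each word is stored the way Java's `DataOutput.writeUTF` stores strings (a two-byte unsigned length followed by the Modified UTF-8 bytes), and it decodes those bytes as standard UTF-8, which matches Modified UTF-8 only for words without U+0000 or characters outside the Basic Multilingual Plane. The name `read_binary_sspace` is hypothetical.

```python
import io
import struct


def _read_exact(stream, size):
    """Read exactly `size` bytes or raise EOFError on a truncated stream."""
    data = stream.read(size)
    if len(data) != size:
        raise EOFError("expected %d bytes, got %d" % (size, len(data)))
    return data


def read_binary_sspace(stream):
    """Parse a dense binary semantic space into a dict of word -> list of floats."""
    # Header: two big-endian four-byte ints (word count, dimension count).
    num_words, num_dims = struct.unpack(">ii", _read_exact(stream, 8))

    vectors = {}
    for _ in range(num_words):
        # Assumption: two-byte length prefix, as written by DataOutput.writeUTF.
        (length,) = struct.unpack(">H", _read_exact(stream, 2))
        word = _read_exact(stream, length).decode("utf-8")
        # One big-endian eight-byte double per dimension.
        values = struct.unpack(">%dd" % num_dims, _read_exact(stream, 8 * num_dims))
        vectors[word] = list(values)
    return vectors
```

A file opened with `open(path, "rb")` can be passed directly as `stream`. If a real .sspace file carries additional leading bytes not described on this page, they would need to be skipped before calling the function.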

sparse text

Each word and its vector are written on their own line. A word is separated from its vector by a | character. In the sparse format, only the non-zero vector values are printed. Each value is printed as a pair: its index first, then the value. For example:

4 300
red|40 1 60 4
blue|1 3 299 1
green|3 -1 10 90
purple|200 1 201 1

red has two non-zero values: 1 at index 40 and 4 at index 60. Any whitespace in the original word, e.g. "white house", is preserved in the .sspace output. The vector values are written as double-precision floating point numbers.
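A minimal Python sketch of a parser for this layout follows. It assumes the first line holds the word count and the dimension count, that indexes are zero-based (the example is consistent with this but does not prove it), and that the last | on a line ends the word. The name `read_sparse_text_sspace` is hypothetical.

```python
import io


def read_sparse_text_sspace(stream):
    """Parse a sparse text semantic space.

    Returns (num_dims, vectors), where vectors maps each word to a dict of
    index -> non-zero value.
    """
    header = stream.readline().split()
    num_words, num_dims = int(header[0]), int(header[1])

    vectors = {}
    for line in stream:
        line = line.rstrip("\r\n")
        if not line:
            continue
        word, separator, rest = line.rpartition("|")
        if not separator:
            raise ValueError("missing '|' separator: %r" % line)
        tokens = rest.split()
        if len(tokens) % 2 != 0:
            raise ValueError("unpaired index/value for %r" % word)
        entries = {}
        # Tokens alternate: index, value, index, value, ...
        for position in range(0, len(tokens), 2):
            index = int(tokens[position])
            if not 0 <= index < num_dims:  # assumes zero-based indexes
                raise ValueError("index %d out of range for %r" % (index, word))
            entries[index] = float(tokens[position + 1])
        vectors[word] = entries

    if len(vectors) != num_words:
        raise ValueError("header promised %d words, found %d" % (num_words, len(vectors)))
    return num_dims, vectors


# Usage with the example space shown above.
example = "4 300\nred|40 1 60 4\nblue|1 3 299 1\ngreen|3 -1 10 90\npurple|200 1 201 1\n"
num_dims, space = read_sparse_text_sspace(io.StringIO(example))
```

Keeping each vector as an index-to-value dict avoids materializing 300 mostly-zero entries per word; a dense list can be built on demand with `[entries.get(i, 0.0) for i in range(num_dims)]`.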

sparse binary

In the sparse binary format, the entire semantic space is written in one continuous stream. The format is specified as:

  1. An eight-byte header consisting of two ints (four bytes each; high byte first): the number of words in the space, followed by the number of dimensions in the space.
  2. Each word in the space is then appended in the following format:
     * The word is encoded in Modified UTF-8 (see [here](http://java.sun.com/javase/6/docs/api/java/io/DataInput.html#modified-utf-8) for more information on the encoding format).
     * A four-byte integer indicating the number of non-zero values in the vector.
     * Each of the non-zero values for the vector dimensions is appended as a four-byte integer indicating the index of the value and an eight-byte double (high byte first).

This format is preferable for sparse semantic spaces because it is faster to serialize and deserialize. Furthermore, .sspace files in this format are much smaller than their dense equivalents, so this format is preferred if the semantic space is large or if disk space is an issue.

We recommend that large, sparse semantic spaces use this format.
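The sparse binary layout can be sketched as a writer and reader pair in Python. As with any reader built only from this description, several details are assumptions: each word is taken to be stored like Java's `DataOutput.writeUTF` output (a two-byte unsigned length followed by the Modified UTF-8 bytes), standard UTF-8 is used as a stand-in for Modified UTF-8 (they differ for U+0000 and characters outside the Basic Multilingual Plane), and the function names are hypothetical.

```python
import io
import struct


def write_sparse_binary_sspace(stream, vectors, num_dims):
    """Write `vectors` (word -> {index: value}) in the sparse binary layout."""
    stream.write(struct.pack(">ii", len(vectors), num_dims))
    for word, entries in vectors.items():
        encoded = word.encode("utf-8")
        # Assumption: two-byte length prefix, as in DataOutput.writeUTF.
        stream.write(struct.pack(">H", len(encoded)))
        stream.write(encoded)
        non_zero = {i: v for i, v in entries.items() if v != 0}
        stream.write(struct.pack(">i", len(non_zero)))
        for index in sorted(non_zero):
            # Four-byte index followed by an eight-byte double, big-endian.
            stream.write(struct.pack(">id", index, non_zero[index]))


def read_sparse_binary_sspace(stream):
    """Read the layout written above; returns (num_dims, vectors)."""
    def read_exact(size):
        data = stream.read(size)
        if len(data) != size:
            raise EOFError("expected %d bytes, got %d" % (size, len(data)))
        return data

    num_words, num_dims = struct.unpack(">ii", read_exact(8))
    vectors = {}
    for _ in range(num_words):
        (length,) = struct.unpack(">H", read_exact(2))
        word = read_exact(length).decode("utf-8")
        (non_zero,) = struct.unpack(">i", read_exact(4))
        entries = {}
        for _ in range(non_zero):
            index, value = struct.unpack(">id", read_exact(12))
            entries[index] = value
        vectors[word] = entries
    return num_dims, vectors
```

Under these assumptions, the word red with two non-zero values occupies 2 + 3 + 4 + 2 × 12 = 33 bytes after the eight-byte header, regardless of how many dimensions the space has.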

In the future we may also support additional formats, as demand or available time allows.

Input Document File Formats

All of the command-line programs use one of two input file formats for creating the semantic space: the document file format and the file list format. In general, each document is treated as a whitespace-separated sequence of tokens. In the default configuration of many algorithms, no attempt is made to perform n-gram grouping or to separate punctuation from words (e.g. "this," is kept as a single token rather than split into "this" and ","). If such processing is desired, see the various Tokenizing options.
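The default whitespace-only behavior can be pictured with a one-line Python sketch. This only illustrates the behavior described above and is not the S-Space tokenizer; the function name `whitespace_tokens` is hypothetical.

```python
def whitespace_tokens(document):
    """Split a document on runs of whitespace, leaving punctuation attached."""
    return document.split()


tokens = whitespace_tokens("Add this, then   stop.")
# "this," and "stop." remain single tokens; punctuation is not separated.
```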

Document File Format

The document file format is a condensed format where an entire corpus is represented in a single file. Each document in the corpus is represented on a single line, and the entire line is treated as a document regardless of the tokens it contains. Furthermore, the format does not support labeling a specific line as a document. If the document file contains special label tokens that are not intended to be included, these can be filtered out using one of the Tokenizing filtering options.

We prefer this format because it greatly reduces the I/O needed to process an entire corpus. When the file system needs to access each document as a separate file, the cost of opening, seeking to, and closing each file can significantly outweigh the actual semantic space processing time.

As an added benefit, this file format also lets semantic space users experiment with different levels of "document" granularity for co-occurrence models that don't normally work with documents. Specifically, the "document" in this case defines a boundary across which co-occurrences are not counted. For example, in the sentences "The software is written in Java. The programs are portable," if each sentence were counted as a separate document, then "Java" and "programs" would not count as co-occurring.
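The effect of the document boundary can be shown with a small Python sketch. It is a deliberately simplified model, not how any S-Space algorithm counts: it treats every pair of distinct words in the same document as co-occurring (real models typically use a sliding window), lowercases tokens, and strips trailing periods and commas so the example sentences compare cleanly. The function name is hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(documents):
    """Count, for each unordered word pair, the documents containing both words."""
    counts = Counter()
    for document in documents:
        # Simplification: lowercase and strip trailing '.' and ',' from tokens.
        words = {token.strip(".,").lower() for token in document.split()}
        words.discard("")
        # Pairs are stored in alphabetical order, e.g. ("java", "programs").
        for pair in combinations(sorted(words), 2):
            counts[pair] += 1
    return counts


# One line per document: both sentences together, then each sentence alone.
as_one_document = cooccurrence_counts(
    ["The software is written in Java. The programs are portable."]
)
as_two_documents = cooccurrence_counts(
    ["The software is written in Java.", "The programs are portable."]
)
```

Here `as_one_document[("java", "programs")]` is 1, while `as_two_documents[("java", "programs")]` is 0, because the pair never appears on the same line once each sentence is its own document.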

File List Format

In the file list format, each document is stored in a separate file. The paths to these files are then listed in a single file, one per line. For example:

$ cat my-file-list.txt
file-1.txt
/tmp/downloaded-file.txt
corpus/my-doc.txt

my-file-list.txt contains three documents.

This format is convenient if the documents are already stored as separate files and you just want to try generating a space (e.g. by using ls -1 to generate the file list). However, on most systems the execution time will grow significantly as the corpus size grows.
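If the per-file overhead becomes a problem, a file list can be folded into a single document file. The Python sketch below shows one way to do this. It assumes UTF-8 input, resolves relative paths against the current working directory, and collapses all whitespace inside a document (including newlines) to single spaces so that each document fits on one line. The function name `file_list_to_document_file` is hypothetical and is not a tool shipped with S-Space.

```python
def file_list_to_document_file(file_list_path, output_path, encoding="utf-8"):
    """Write one line per document listed in `file_list_path`.

    Returns the number of documents written.
    """
    count = 0
    with open(file_list_path, encoding=encoding) as file_list, \
            open(output_path, "w", encoding=encoding) as output:
        for entry in file_list:
            path = entry.strip()
            if not path:
                continue  # skip blank lines in the list
            with open(path, encoding=encoding) as document:
                # Collapse newlines and repeated spaces so the whole
                # document occupies a single line of the output.
                text = " ".join(document.read().split())
            output.write(text + "\n")
            count += 1
    return count
```

For example, `file_list_to_document_file("my-file-list.txt", "corpus.txt")` would produce a three-line `corpus.txt` for the list shown above, where `corpus.txt` is an arbitrary output name chosen for illustration.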

Matrix File Formats

The S-Space package supports reading and writing several matrix file formats. Among those supported are