quip.1

.TH QUIP 1 "June 17, 2012" "Quip"
.SH NAME
quip, unquip, quipcat \- compress and decompress sequencing data
.SH SYNOPSIS
.PP
quip [\f[I]option\f[]]... [\f[I]file\f[]]...
.SH DESCRIPTION
.PP
.I Quip
compresses next-generation sequencing data with extreme prejudice. It supports
input and output in the FASTQ and SAM/BAM formats.
.PP
When invoked, \f[I]quip\f[] will compress a file and write the compressed data to a
new file with the suffix \f[I].qp\f[] added, leaving the original file intact
(contrary to the default gzip/bzi2 behavior). Similarly, \f[I]uquip\f[] will take a
file compressed with quip and write uncompressed output to a new file with the
\f[I].qp\f[] suffix removed.
.PP
Input format is automatically detected if a file is given. If no input file is
provided, data is read from standard input, but in this case the input format
will not be guessed and is assumed to be FASTQ unless otherwise specified (i.e.
with the \f[I]--input\f[I] option).
.PP
Input and output formats can be provided explicitly with the \f[I]--input\f[]
and \f[I]--output\f[] options.
.PP
Quip will not automatically clobber output files unless the \f[I]--force\f[]
option is supplied.
.SH OPTIONS
.TP
.B \-i, --input=FORMAT, --from=FORMAT
Format of the input file. By default the file is examined and the format is
guessed. If input in stdin, the format will not be guessed and should be
provided.
.TP
.B \-o, --output=FORMAT, --to=FORMAT
Output format. If not provided, this is guess from the filename. For example,
a file named \f[I]reads.fastq.qp\f[] will be output in FASTQ format.
.TP
.B \-d, --decompress, --uncompress
Decompress a compressed FASTQ file. Compression is the default, unless the
program is invoked as \f[I]unquip\f[]. This is equivalent to
.B --input=quip
.TP
.B \-r, --reference=genome.fasta
Perform reference-based compression of aligned reads, with the given reference
sequence in FASTA format.
.TP
.B \-a, --assembly
Perform assembly-based compression of unaligned reads.
.TP
.B \-n, --assembly-n=N
Assemble the first N reads. This implies \f[B]--assembly\f[]. (default:
2500000)
.TP
.B \-t, --test
Test the integrity of the archive by performing a dry-run decompression and
verifying checksums along the way.
.TP
.B \-l, --list
Print some summary information about the archive, including number of reads,
number of bases, and compression ratio.
.TP
.B \-c, --stdout
Print output to standard out.
.TP
.B \-f, --force
Force compression or decompression even if this would involve writing binary
data to a terminal, clobbering an existing file, or decompressing a file
that does not have a \f[I].qp\f[] suffix.
.TP
.B \-v, --verbose
While running, print a great deal of information that is of no use unless you
are debugging Quip or just curious about how it works.
.TP
.B \-V, --version
Print some information about the program version.
.TP
.B \-h, --help
Print brief usage instructions.

.SH SUPPORTED FORMATS
In the current version of Quip, FORMAT can be any one of \f[B]quip\f[],
\f[B]fastq\f[], \f[B]sam\f[], \f[B]bam\f[].
.PP
Input in FASTQ format may also be compressed with gzip or bzip2. Running quip on
a file \f[I]reads.fastq.gz\f[I] will produce a new file \f[I]reads.fastq.qp\f[I]
that's about half the size, and similarly for bzip2.

.SH OPTIMIZING COMPRESSION
.PP
Quip will work on all sequencing data in the FASTQ format, but for those concerned
about maximizing the level of compression, some simple advice can be followed when
in comes to the input data.
.PP
Most importantly, data from separate sources should never be concatenated and
compressed into one file. Quip assembles reads into contigs and builds a
statistical model that tightly fits the data. Mixing data from multiple
samples or sequencing protocols could drastically diminish the effectiveness
of both of these strategies. If a single archive file is desired, we recommend
running quip separately on each of the files then using the Unix \f[I]tar\f[]
command to combine these compressed files into one archive. (Note: There is no
added benefit to subsequently compressing this tar file with gzip/bzip2.)
.PP
Often with data in the FASTQ format, a significant amount of space is used by
highly redundant read identifiers. Quip works hard to locate and compress out
this redundancy, drastically reducing the size of IDs, so much so that it is
not necessary or advisable to remove IDs, but their formatting can make a
significant difference.
.PP
If numbers are used in the read IDs, compression will benefit greatly if they
are sequentially increasing in a predictable way. Moreover, it is best not to
pad numbers with leading zeros or spaces. Separate pieces of information in
the ID (e.g., machine name, tile coordinates) should ideally be delimited by a
non-alphanumeric character, such as a space or comma, so that they can be
compressed using separate models.
.PP
Some have proposed transformations to FASTQ data to increase the compression
ratio. These are largely targeted towards optimizing compression in programs
using the Lempel-Ziv algorithm, such as gzip. Quip is not one these
programs, and will not benefit from these transformations. The algorithms
used in Quip are highly tuned to work with unadulterated, untransformed
FASTQ data.
.PP
Along similar lines, sorting or reordering reads is unlikely to improve
the compression ratio, and is not recommended.

.SH BUGS AND LIMITATIONS
.PP
Quip will not compress AB SOLiD colorspace data.
.PP
Quip will deduce the quality score scale used, but does not support quality
score schemes that span a range larger than 64. At the time of writing, such
systems are quite rare.
.PP
In certain versions of FASTQ, the read ID is printed twice: before the sequence
and before the quality scores. Quip will read data in this format, but will
ignore the second listing of the read ID.
.PP
The order of optional fields in SAM/BAM files will not be preserved.

.SH AUTHOR
Quip written by Daniel C. Jones <dcjones@cs.washington.edu>.
Bugs should preferably be reported at: https://github.com/dcjones/quip/issues