Scallop is an accurate reference-based transcript assembler. It is highly accurate in assembling both multi-exon transcripts and lowly expressed transcripts. Scallop achieves this improvement through a novel algorithm that provably preserves all phasing paths from reads and paired-end reads, while also achieving both transcript parsimony and coverage deviation minimization.
The Scallop paper has been published in Nature Biotechnology. The datasets and scripts used in the paper to compare the performance of Scallop with other assemblers are available at scalloptest.
Please also check out the podcast about Scallop (thanks to Roman Cheplyaka for the interview). It is available at both the bioinformatics chat and iTunes.
The latest release of Scallop is v0.10.5, which includes binaries (for both Linux and Mac) and source code.
The table below lists the systems that have been tested and the Scallop binary, if any, that runs on each.
Operating System | Version | Code Name | Scallop |
---|---|---|---|
Debian | 9 | Stretch | linux |
Ubuntu | 14.04 | Trusty Tahr | linux |
Ubuntu | 16.04 | Xenial Xerus | linux |
CentOS | 6.9 | | N/A |
CentOS | 7 | | linux |
Fedora | 24 | | linux |
Mac OS | 10.10 | Yosemite | mac |
Mac OS | 10.11 | El Capitan | mac |
Mac OS | 10.12 | Sierra | mac |
Scallop is, and will continue to be, freely and actively supported on a best-effort basis.
If you need industrial-grade technical support, please consider the options at oceangenomics.com/support.
Download the source code of Scallop from here.
Scallop depends on the Boost and htslib libraries (NOTE: from v0.10.4, the dependency on Clp is optional). If they are not installed on your system, you first need to download and install them. You might also need to export the runtime library path to a suitable environment variable (for example, `LD_LIBRARY_PATH` on most Linux distributions).
After installing these dependencies, compile the source code of Scallop. If some of the above dependencies are not installed in the default system directories (for example, `/usr/local` on most Linux distributions), their installation paths should be passed to Scallop's `configure`.
If Boost has not been downloaded/installed, download Boost (license) from (http://www.boost.org). Uncompress it somewhere (compiling and installing are not necessary).
If htslib has not been installed, download htslib (license) from (http://www.htslib.org/) with version 1.5 or higher. Note that htslib relies on zlib. So if zlib has not been installed in your system, you need to install zlib first. To do so, download zlib (license) at (https://zlib.net/). Use the following commands to install zlib:
./configure
make
make install
After installing zlib, use the following commands to build htslib:
./configure --disable-bz2 --disable-lzma --disable-gcs --disable-s3 --enable-libcurl=no
make
make install
By default, htslib is installed under `/usr/local` (the library itself goes to `/usr/local/lib`). To install it to a different location, replace the above `configure` line with the following (adding `--prefix=/path/to/your/htslib` to the end):
./configure --disable-bz2 --disable-lzma --disable-gcs --disable-s3 --enable-libcurl=no --prefix=/path/to/your/htslib
In this case, you also need to export the runtime library path (note that there is an additional `lib` following the installation path):
export LD_LIBRARY_PATH=/path/to/your/htslib/lib:$LD_LIBRARY_PATH
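Running the `export` line from several scripts or shell sessions can add the same directory more than once. The sketch below shows one way to guard against that. `prepend_ld_library_path` is a hypothetical helper name, not part of Scallop or htslib, and `/path/to/your/htslib` is the same placeholder used above.

```shell
# Hypothetical helper (not part of Scallop or htslib): prepend a directory to
# LD_LIBRARY_PATH unless it is already listed there.
prepend_ld_library_path() {
  dir=$1
  case ":${LD_LIBRARY_PATH:-}:" in
    *":$dir:"*) ;;  # already present, nothing to do
    *) LD_LIBRARY_PATH="$dir${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}" ;;
  esac
  export LD_LIBRARY_PATH
}

# Placeholder path from the instructions above; note the trailing /lib.
prepend_ld_library_path /path/to/your/htslib/lib
prepend_ld_library_path /path/to/your/htslib/lib   # second call changes nothing
echo "$LD_LIBRARY_PATH"
```

The `${LD_LIBRARY_PATH:+:...}` expansion avoids leaving a trailing colon when the variable was previously empty.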
NOTE: Clp is used to solve the linear programming instances created when decomposing unsplittable vertices. An alternative algorithm has been provided in Scallop since v0.10.4, so installing Clp has been optional since then. Our testing shows that the two algorithms give very similar results.
If Clp is not installed on your system, download Clp (license) from (https://projects.coin-or.org/Clp). Use the following commands to install Clp:
./configure --disable-bzlib --disable-zlib
make
make install
By default, Clp is installed into the current directory rather than `/usr/lib`. To install it to a different location, replace the above `configure` line with the following (adding `--prefix=/path/to/your/Clp` to the end):
./configure --disable-bzlib --disable-zlib --prefix=/path/to/your/Clp
You also need to export the runtime library path (note that there is an additional `lib` following the installation path):
export LD_LIBRARY_PATH=/path/to/your/Clp/lib:$LD_LIBRARY_PATH
Use the following to compile Scallop without Clp (the alternative algorithm for decomposing unsplittable vertices will be used; available from v0.10.4 onward):
./configure --with-htslib=/path/to/your/htslib --with-boost=/path/to/your/boost
make
Use the following to compile Scallop with Clp (a linear programming formulation will be used to decompose unsplittable vertices):
./configure --with-htslib=/path/to/your/htslib --with-boost=/path/to/your/boost --enable-useclp --with-clp=/path/to/your/Clp
make
If some of the dependencies are installed in the default system directory (for example, `/usr/lib`), then the corresponding `--with-` option might not be necessary.
The executable file `scallop` will appear at `src/scallop`.
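The choice between the `configure` invocations above can be sketched as a small helper that only adds a `--with-` option when a non-default path is given. The function name and the variables `HTSLIB_DIR`, `BOOST_DIR`, `CLP_DIR`, and `USE_CLP` are illustrative assumptions; only the `configure` options themselves come from the instructions above.

```shell
# Hypothetical helper: print the configure command for Scallop.
# Leave a *_DIR variable empty when that library lives in a default system
# directory; set USE_CLP=1 to request the Clp-based algorithm.
scallop_configure_cmd() {
  cmd="./configure"
  if [ -n "${HTSLIB_DIR:-}" ]; then cmd="$cmd --with-htslib=$HTSLIB_DIR"; fi
  if [ -n "${BOOST_DIR:-}" ]; then cmd="$cmd --with-boost=$BOOST_DIR"; fi
  if [ "${USE_CLP:-0}" = "1" ]; then
    cmd="$cmd --enable-useclp"
    if [ -n "${CLP_DIR:-}" ]; then cmd="$cmd --with-clp=$CLP_DIR"; fi
  fi
  echo "$cmd"
}

# Placeholder paths, as in the instructions above.
HTSLIB_DIR=/path/to/your/htslib
BOOST_DIR=/path/to/your/boost
scallop_configure_cmd
```

The helper only prints the command, so it can be reviewed before being run from the Scallop source directory.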
The usage of `scallop` is:
./scallop -i <input.bam> -o <output.gtf> [options]
The `input.bam` is the read alignment file generated by an RNA-seq aligner (for example, TopHat2, STAR, or HISAT2). Make sure that it is sorted; otherwise run `samtools` to sort it:
samtools sort input.bam > input.sort.bam
The reconstructed transcripts will be written in GTF format to `output.gtf`.
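One way to check whether a BAM file is already sorted is to inspect the `@HD` line of its header, which carries an `SO:` field in the SAM format. The sketch below assumes that "sorted" here means coordinate-sorted (the default output of `samtools sort`). `is_coordinate_sorted` is a hypothetical helper name, and the header used in the demonstration is a toy example rather than real data.

```shell
# Hypothetical helper: succeeds if the SAM header text on stdin declares
# coordinate sorting (an "SO:coordinate" field on the @HD line).
is_coordinate_sorted() {
  awk -F'\t' '
    $1 == "@HD" { for (i = 2; i <= NF; i++) if ($i == "SO:coordinate") found = 1 }
    END { exit !found }'
}

# Toy header for illustration. With a real file, the header would come from:
#   samtools view -H input.bam | is_coordinate_sorted
if printf '@HD\tVN:1.6\tSO:coordinate\n@SQ\tSN:chr1\tLN:1000\n' | is_coordinate_sorted; then
  echo "coordinate-sorted"
else
  echo "needs sorting"
fi
```

A header without an `@HD` line, or with `SO:unsorted`, makes the helper fail, which is the case where the `samtools sort` command above applies.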
Scallop supports the following parameters. Please refer to the additional explanation below the table.
Parameters | Default Value | Description |
---|---|---|
--help | | print usage of Scallop and exit |
--version | | print version of Scallop and exit |
--preview | | show the inferred library_type and exit |
--verbose | 1 | chosen from {0, 1, 2} |
--library_type | empty | chosen from {empty, unstranded, first, second} |
--min_transcript_coverage | 1 | the minimum coverage required to output a multi-exon transcript |
--min_single_exon_coverage | 20 | the minimum coverage required to output a single-exon transcript |
--min_transcript_length_base | 150 | the minimum base length of a transcript |
--min_transcript_length_increase | 50 | the minimum increased length of a transcript with each additional exon |
--min_mapping_quality | 1 | ignore reads with mapping quality less than this value |
--max_num_cigar | 7 | ignore reads with CIGAR size larger than this value |
--min_bundle_gap | 50 | the minimum distances required to start a new bundle |
--min_num_hits_in_bundle | 20 | the minimum number of reads required in a bundle |
--min_flank_length | 3 | the minimum match length required in each side for a spliced read |
--min_splice_bundary_hits | 1 | the minimum number of spliced reads required to support a junction |
- For `--verbose`, 0: quiet; 1: one line for each splice graph; 2: details of graph decomposition.
- Providing `--library_type` is highly recommended. The values `unstranded`, `first`, and `second` correspond to `fr-unstranded`, `fr-firststrand`, and `fr-secondstrand` used in standard Illumina sequencing libraries. If none of them is given (that is, the value is `empty` by default), Scallop will try to infer the `library_type` by itself (see `--preview`). Note that this inference is based on the `XS` tag stored in the input `bam` file. If the input `bam` file does not contain the `XS` tag, it is essential to provide the `library_type` to Scallop. You can try `--preview` to see the inferred `library_type`.
- `--min_transcript_coverage` is used to filter lowly expressed transcripts: Scallop will filter out transcripts whose (predicted) raw count (number of molecules) is less than this number.
- `--min_transcript_length_base` and `--min_transcript_length_increase` are combined to filter short transcripts: the minimum length of a transcript is given by `--min_transcript_length_base` + `--min_transcript_length_increase` * num-of-exons-in-this-transcript. Transcripts shorter than this number will be filtered out.
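As a worked example of the length filter above, the sketch below computes the threshold for a given number of exons, using the defaults from the table (150 and 50) unless other values are passed. The function name `min_transcript_length` is illustrative and is not part of Scallop.

```shell
# Illustrative helper: minimum transcript length for a given exon count.
# $1: number of exons
# $2: --min_transcript_length_base (default 150)
# $3: --min_transcript_length_increase (default 50)
min_transcript_length() {
  num_exons=$1
  base=${2:-150}
  increase=${3:-50}
  echo $(( base + increase * num_exons ))
}

min_transcript_length 1   # 150 + 50 * 1 = 200
min_transcript_length 3   # 150 + 50 * 3 = 300
```

With the defaults, a single-exon transcript therefore has to be at least 200 bases long, and each additional exon raises the threshold by 50.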
We recommend performing RNA-seq quantification with a combination of Scallop and Salmon. This pipeline involves the following steps:
Step 1: Align the reads to a reference genome (for example, with TopHat2, STAR, or HISAT2) to obtain the (sorted) read alignment file `sort.bam`.
Step 2: Assemble the expressed transcripts with Scallop:
scallop -i sort.bam -o scallop.gtf
The assembled transcripts will be written to `scallop.gtf`.
Step 3: Use gffcompare to evaluate the assembled transcripts using a reference annotation:
gffcompare -o gffall -r reference.gtf scallop.gtf
where `reference.gtf` is the reference annotation file (for example, the Ensembl annotation). This command will generate a file `gffall.scallop.gtf.tmap` describing which transcripts in `scallop.gtf` can be found in `reference.gtf`.
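For a quick look at the gffcompare result, the class codes in the `.tmap` file can be tallied. The sketch below assumes the usual tab-separated layout with a header line and the class code in the third column (`=` marking a complete intron-chain match to a reference transcript). `count_class_codes` is a hypothetical helper name, and the toy file is made-up illustration data, truncated to six columns.

```shell
# Hypothetical helper: count query transcripts per class code (column 3),
# skipping the header line of the tmap file given as $1.
count_class_codes() {
  awk -F'\t' 'NR > 1 { n[$3]++ } END { for (c in n) print c "\t" n[c] }' "$1" |
    LC_ALL=C sort
}

# Toy tmap file (made-up rows, for illustration only).
tmap=$(mktemp)
printf '%b\n' \
  'ref_gene_id\tref_id\tclass_code\tqry_gene_id\tqry_id\tnum_exons' \
  'G1\tT1\t=\tgene.1\tgene.1.1\t4' \
  'G2\tT2\t=\tgene.2\tgene.2.1\t2' \
  'G2\tT2\tj\tgene.2\tgene.2.2\t3' \
  '-\t-\tu\tgene.3\tgene.3.1\t1' > "$tmap"

count_class_codes "$tmap"
```

On a real run, the argument would be `gffall.scallop.gtf.tmap`.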
Step 4: Take the union of the assembled transcripts and the reference transcriptome. Specifically, first use our tool gtfcuff to fetch the transcripts that are only in `scallop.gtf`:
gtfcuff puniq gffall.scallop.gtf.tmap scallop.gtf reference.gtf unique.gtf
The uniquely expressed transcripts (i.e., those that are in `scallop.gtf` but not in `reference.gtf`) will be written to `unique.gtf`.
Second, extract the cDNA sequences of the transcripts in `unique.gtf` from a reference genome using the tool gffread:
gffread unique.gtf -g genome -w unique.fa
where `genome` is the reference genome (for example, the Ensembl reference genome). The cDNA sequences of the uniquely assembled transcripts (i.e., those in `unique.gtf`) will be written to `unique.fa`.
Finally, merge `unique.fa` and the reference transcriptome to obtain the extended transcriptome:
cat unique.fa reference.fa > union.fa
where `reference.fa` is the reference transcriptome (i.e., the cDNA sequences of the transcripts in `reference.gtf`), for example, the Ensembl cDNA sequences. The extended transcriptome will be written to `union.fa`.
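A simple sanity check after the merge is to confirm that `union.fa` contains as many records as the two inputs combined. A lower count can occur, for example, when the first file lacks a trailing newline and `cat` fuses its last sequence line with the next header. The sketch below uses toy FASTA files in a temporary directory; `count_fasta_records` is a hypothetical helper name.

```shell
# Hypothetical helper: count FASTA records, i.e. lines starting with '>'.
count_fasta_records() {
  grep -c '^>' "$1" || true   # grep exits non-zero when the count is 0
}

# Toy inputs (made-up sequences, for illustration only).
workdir=$(mktemp -d)
printf '>novel.1\nACGT\n>novel.2\nGGCC\n' > "$workdir/unique.fa"
printf '>ref.1\nTTAA\n' > "$workdir/reference.fa"
cat "$workdir/unique.fa" "$workdir/reference.fa" > "$workdir/union.fa"

expected=$(( $(count_fasta_records "$workdir/unique.fa") + $(count_fasta_records "$workdir/reference.fa") ))
actual=$(count_fasta_records "$workdir/union.fa")
if [ "$actual" -eq "$expected" ]; then
  echo "union.fa has $actual records, as expected"
else
  echo "record count mismatch: $actual vs $expected" >&2
fi
```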
Step 5: Run Salmon to quantify with respect to the above extended transcriptome. First, create the Salmon index:
salmon index -t union.fa -i salmon.index -p 4
After that we can quantify:
salmon quant -i salmon.index -l A -1 fastq-file1 -2 fastq-file2 -p 4 -o salmon.quant
The main quantification file will appear as `salmon.quant/quant.sf`.
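To inspect the result, `quant.sf` can be filtered with standard tools. The sketch below assumes the usual tab-separated layout with a header line and the columns `Name`, `Length`, `EffectiveLength`, `TPM`, and `NumReads`. `expressed_transcripts` is a hypothetical helper name, and the toy file contains made-up values.

```shell
# Hypothetical helper: list transcripts whose TPM (column 4) is at least a
# threshold. $1: path to quant.sf, $2: minimum TPM (default 1).
expressed_transcripts() {
  awk -F'\t' -v min="${2:-1}" 'NR > 1 && $4 + 0 >= min + 0 { print $1 "\t" $4 }' "$1"
}

# Toy quant.sf (made-up values, for illustration only).
quant=$(mktemp)
printf '%b\n' \
  'Name\tLength\tEffectiveLength\tTPM\tNumReads' \
  'tx1\t1000\t850.0\t12.5\t100' \
  'tx2\t500\t350.0\t0.2\t1' \
  'tx3\t2000\t1850.0\t3.0\t50' > "$quant"

expressed_transcripts "$quant" 1
```

On a real run, the first argument would be `salmon.quant/quant.sf`.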
Please refer to the Salmon documentation for more advanced usage.