GitHub - shahsam/ShortStack: ShortStack: Comprehensive annotation and quantification of small RNA genes

forked from MikeAxtell/ShortStack
ShortStack: Comprehensive annotation and quantification of small RNA genes

LICENSE
    ShortStack

    Copyright (C) 2012-2014 Michael J. Axtell

    This program is free software: you can redistribute it and/or modify it
    under the terms of the GNU General Public License as published by the
    Free Software Foundation, either version 3 of the License, or (at your
    option) any later version.

    This program is distributed in the hope that it will be useful, but
    WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General
    Public License for more details.

    You should have received a copy of the GNU General Public License along
    with this program. If not, see <http://www.gnu.org/licenses/>.

SYNOPSIS
    Annotation and quantification of small RNA genes based upon
    reference-aligned small RNA sequences

CITATION
    If you use ShortStack in your work, please cite

    Axtell MJ. (2013) ShortStack: Comprehensive annotation and
    quantification of small RNA genes. RNA 19:740-751.
    doi:10.1261/rna.035279.112

    Shahid S., Axtell MJ. (2013) Identification and annotation of small RNA
    genes using ShortStack. Methods doi:10.1016/j.ymeth.2013.10.004

AUTHOR
    Michael J. Axtell, Penn State University, [email protected]

DEPENDENCIES
    perl, samtools, RNAfold, RNAplot, butter, bowtie, bowtie-build, maple

    ShortStack is a perl5 script, so it needs perl5 to run. It expects to
    find perl5 at /usr/bin/perl. If this is not where your perl is, modify
    line 1 of ShortStack (the hashbang) accordingly. It also requires the
    package Getopt::Long, which I think is standard in most Perl
    distributions. If this package is not installed, get it from CPAN.

    samtools <http://samtools.sourceforge.net/> needs to be installed in
    your PATH. ShortStack was developed using samtools 0.1.18. Other
    versions should be OK as far as I know, but let me know if not!

    RNAfold and RNAplot are from the ViennaRNA package. See
    <http://www.tbi.univie.ac.at/~ronny/RNA/vrna2.html>. Both need to be
    installed in your PATH.

    butter (bowtie using iterative placement of repetitive small RNAs) ships
    with ShortStack, and is also at https://github.com/MikeAxtell/butter .
    It must be installed in your PATH.

    bowtie and bowtie-build must be version "1" .. either 0.12.x or 1.x.
    These are required ONLY if you are aligning reads to the genome. Bowtie
    can be found at http://bowtie-bio.sourceforge.net/index.shtml. Like the
    other dependencies, they must be in your PATH.

    maple (microRNA analysis program leveraging expression) ships with
    ShortStack, and is also at https://github.com/MikeAxtell/maple . It must
    be installed in your PATH.
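
    A quick way to confirm that the dependencies listed above are visible
    in your PATH is sketched below in Python. This is only an illustrative
    helper and is not part of ShortStack; the function name missing_tools is
    made up for this example, and the tool names are simply those listed in
    this section.

```python
import shutil

# Executable names copied from the DEPENDENCIES list above.
REQUIRED = ["perl", "samtools", "RNAfold", "RNAplot", "butter", "maple"]
# Needed only when ShortStack is asked to align reads.
ALIGNMENT_ONLY = ["bowtie", "bowtie-build"]


def missing_tools(names):
    """Return the names that cannot be found on the current PATH."""
    return [name for name in names if shutil.which(name) is None]


# Example: report anything that still needs to be installed.
for tool in missing_tools(REQUIRED + ALIGNMENT_ONLY):
    print("not found in PATH:", tool)
```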

OPTIONAL DEPENDENCIES
    bam2wig, wigToBigWig

    If using ShortStack to align your small RNA-seq data, wiggle and bigwig
    files summarizing the coverage of the reads can be automatically created
    if bam2wig and wigToBigWig are installed on your system.

    bam2wig is a perl script included with the ShortStack package, and is
    also available at https://github.com/MikeAxtell/bam2wig . It must be
    installed in your PATH.

    wigToBigWig is available from UCSC at
    http://genome.ucsc.edu/goldenPath/help/bigWig.html . It must be
    installed in your PATH.

    ShortStack will run just fine without bam2wig and/or wigToBigWig. But,
    if you are aligning reads, it will give you a warning.

    If a wiggle and/or bigwig file are created, the track will represent
    total depth of coverage on both strands. You can use bam2wig to make
    different types of tracks with more flexible options.

INSTALL
    There is no 'real' installation. After installing the dependencies (see
    above), you should check to make sure that ShortStack is executable. It
    should be, but if not you can:

        chmod +x ShortStack
                                                                                                 
    Then add it to your PATH .. for instance

        sudo cp ShortStack /usr/bin/

    Depending on your system, you may need administrative privileges to
    install the dependencies and/or to add ShortStack to your PATH.

USAGE
    ShortStack [options] [genome.fasta]

    There are three modes which differ in the types of pre-analysis that
    are performed. Each of the modes has a different set of REQUIRED
    options:

    Mode 1: Trim small RNA-seq reads to remove 3' adapter sequence, align
    them, and then analyze. Required options:

    --reads

    --adapter

    Mode 2: Align pre-trimmed small RNA-seq reads, and then analyze.
    Required option:

    --reads

    Mode 3: Analyze a pre-existing BAM alignment of small RNA-seq reads.
    Required option:

    --bamfile

    Additionally, in modes 1 or 2, the option --align_only will terminate
    analysis after making the alignment file.
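
    The relationship between the required options and the three modes can
    be summarized with a small Python sketch. This is an illustration only,
    not ShortStack's own option handling; the function infer_mode is
    hypothetical, and the order in which the options are tested when more
    than one is supplied is an assumption.

```python
def infer_mode(reads=None, adapter=None, bamfile=None):
    """Return 1, 2, or 3 following the mode descriptions above.

    Assumption for this sketch: --reads is tested before --bamfile.
    """
    if reads and adapter:
        return 1  # trim the 3' adapter, align, then analyze
    if reads:
        return 2  # align pre-trimmed reads, then analyze
    if bamfile:
        return 3  # analyze an existing BAM alignment
    raise ValueError("either --reads or --bamfile must be provided")
```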

OPTIONS
    --help : Print a help message and then quit.

    --version : Print the version number and then quit.

    --outdir [string] : Name of directory to be created to receive results
    of the run. Defaults to "ShortStack_[time]", where time is "UNIX time"
    (the number of non-leap seconds since Jan 1, 1970 UTC), if not
    provided.

    --adapter [string] : Sequence of 3' adapter to search for during adapter
    trimming. Must be at least 8 nts in length, and all ATGC characters. If
    provided, reads will be trimmed.

    --reads [string] : Path to reads file in fasta (.fa or .fasta
    extension), fastq (.fastq or .fq extension), or colorspace-fasta
    (.csfasta extension). Can be multiple files, separated by commas.
    ShortStack knows the format only through the file extensions.

    --bowtie_cores [integer] : Number of processor cores to use during
    bowtie / butter alignment. Default: 1.

    --mismatches [integer] : Number of allowable mismatches for butter
    alignment. Must be 0 or 1. Default: 0.

    --max_rep [integer] : Reads with more than this number of possible
    alignment positions will be reported as unmapped regardless of butter
    density placement probabilities. Default: 1000.

    --ranmax [integer] : Reads with more than this number of possible
    alignment positions where the choice can't be guided by
    butter-calculated probabilities will be reported as unmapped. Default:
    3.

    --HPscore : Minimum maple-derived score in order to keep a locus that
    failed as a MIRNA as an HP locus. Default: 0.9

    --align_only : Exits program after completion of small RNA-seq data
    alignment, creating BAM file.

    --bamfile [string] : Path to properly formatted and sorted BAM alignment
    file of small RNA-seq data. Files require custom tags provided by the
    butter aligner.

    --read_group [string] : Analyze only the indicated read-group.
    Read-group must be specified in the bam alignment file header. Default =
    [not active -- all reads analyzed]

    --flag_file [string] : PATH to a simple file of genomic loci of
    interest. The ShortStack-analyzed small RNA clusters will be analyzed
    for overlap with the loci in the flag_file .. if there is any overlap
    (as little as one nt), it will be reported. Format for this file is
    described below.

    --mindepth [integer] : Minimum depth of mapping coverage to define an
    'island'. Default = 20. Must be at least 2, more than 5 preferred.

    --pad [integer] : Number of nucleotides upstream and downstream to
    extend initial islands during cluster definition. Default = 100

    --dicermin [integer] : Smallest size in the Dicer size range (or size
    range of interest). Default = 20. Must be between 15 and 35, and less
    than or equal to --dicermax

    --dicermax [integer] : Largest size in the Dicer size range (or size
    range of interest). Default = 24. Must be between 15 and 35, and more
    than or equal to --dicermin

    --miRType [string] : Either "plant" or "animal". Defaults to "plant".

    --minstrandfrac [float] : Minimum fraction of mappings to one or the
    other strand required to call a polarity for non-hairpin clusters. Also
    the minimum fraction of "non-dyad" mappings to the sense strand within
    potential hairpins/miRNAs to keep the locus annotated as a hp or miRNA.
    See below for details. Default = 0.8. Allowed values between 0.5 and 1.

    --mindicerfrac [float] : Minimum fraction of mappings within Dicer size
    range to annotate a locus as Dicer-derived. Default = 0.85. Allowed
    values between 0 and 1.

    --phasesize [integer] : Examine phasing only for clusters dominated by
    the indicated size range. Size must be within the bounds described by
    --dicermin and --dicermax. Set to 'all' to examine p-values of each
    locus within the Dicer range, in its dominant size. Set to 'none' to
    suppress all phasing analysis. Default = 21. Allowed values between
    --dicermin and --dicermax.

    --count [string] : Invokes count mode, in which user-provided clusters
    are annotated and quantified instead of being defined de novo. When
    invoked, the file provided with --count is assumed to contain a simple
    list of clusters. Count mode also forces nohp mode. Formatting details
    below. Default : Not invoked.

    --nohp : If "--nohp" appears on the command line, it invokes running in
    "no hairpin" mode. RNA folding, hairpin annotation, and MIRNA
    annotation will be skipped (likely saving significant time). Note that
    --count mode forces --nohp mode as well. Default: Not invoked.
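
    The numeric limits stated for several options above can be checked
    before launching a run. The Python sketch below restates only the limits
    given in this section; check_options is a hypothetical helper, not a
    part of ShortStack, and it assumes the adapter is given in upper case.

```python
def check_options(mindepth=20, dicermin=20, dicermax=24, minstrandfrac=0.8,
                  mindicerfrac=0.85, mismatches=0, adapter=None):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    if mindepth < 2:
        problems.append("--mindepth must be at least 2")
    if not 15 <= dicermin <= 35:
        problems.append("--dicermin must be between 15 and 35")
    if not 15 <= dicermax <= 35:
        problems.append("--dicermax must be between 15 and 35")
    if dicermin > dicermax:
        problems.append("--dicermin must not exceed --dicermax")
    if not 0.5 <= minstrandfrac <= 1:
        problems.append("--minstrandfrac must be between 0.5 and 1")
    if not 0 <= mindicerfrac <= 1:
        problems.append("--mindicerfrac must be between 0 and 1")
    if mismatches not in (0, 1):
        problems.append("--mismatches must be 0 or 1")
    if adapter is not None and (len(adapter) < 8 or set(adapter) - set("ATGC")):
        problems.append("--adapter must be at least 8 nts, all ATGC")
    return problems
```

    With the defaults listed above, check_options() returns an empty list.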

KEY FORMATTING REQUIREMENTS AND ASSUMPTIONS
  Input genome.fasta file
    It is critical that this be the precise genome to which the reads in the
    input .bam file were mapped. If it isn't, validation of the BAM
    alignment file will fail and the run will be aborted.

    Chromosome names that contain whitespace will be truncated; only the
    first string of non-white-space characters will be maintained. This
    applies both during alignment (where bowtie does this) and during the
    rest of the analysis (where samtools does this, when creating the .fai
    index file).

    If not already present, a .fai index file for the genome will be created
    using samtools faidx at the beginning of the run. As above, chromosome
    names will be automatically trimmed starting at the first white-space
    character, if present.
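
    To preview how the chromosome names in a genome file will look after
    this truncation, something like the following Python sketch can be
    used. It only mimics the rule described above (keep the first run of
    non-whitespace characters); the function short_name is made up for this
    example.

```python
def short_name(header_line):
    """Return the part of a FASTA header before the first whitespace."""
    fields = header_line.lstrip(">").split()
    return fields[0] if fields else ""
```

    For example, a header line of ">Chr1 some description" would be
    shortened to "Chr1".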

  Small RNA-seq reads.
    FASTA or FASTQ data should be devoid of comment lines and conform to
    FASTA or FASTQ specs. In addition, ShortStack assumes that each read
    will occupy a single line in the file. There is no support for
    paired-end reads. Colorspace-FASTA formatted data (from SOLiD) can have
    comment lines. Format is assumed to conform to colorspace-FASTA
    specifications (beginning with a nucleotide, followed by a string of
    colors [0,1,2,3] or ambiguity codes [.]). For SOLiD data, the quality
    values (_QV.qual files) are not usable as valid inputs.

  Input .bam file
    As of ShortStack 2.0.0, bam alignment files must have been created with
    the program butter. If you do make your own BAM alignments, outside of
    ShortStack, they must pass the following validation steps that are
    performed by ShortStack:

    1. The header must be present

    2. The sort order of the file must be 'coordinate', as indicated by the
    SO: tag in the header

    3. All of the chromosome names found in the header MUST also be found in
    the genome.fasta file

    4. The data lines must contain the custom XX, XY, and XZ tags added by
    butter.

    5. If the option --read_group is being used, the specified read_group
    must be mentioned in the header in an @RG line.
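
    The checks listed above can be approximated ahead of time on the text
    of a header (for instance, as printed by "samtools view -H"). The
    Python sketch below is not ShortStack's validation code; check_header
    and has_butter_tags are hypothetical helpers, and only the tag names XX,
    XY, and XZ are tested, not their types or values.

```python
def check_header(header_text, genome_names, read_group=None):
    """Return a list of problems found in SAM/BAM header text."""
    lines = [ln for ln in header_text.splitlines() if ln.startswith("@")]
    if not lines:
        return ["header is missing"]
    sort_order, seq_names, groups = None, [], set()
    for ln in lines:
        fields = ln.split("\t")
        tags = dict(f.split(":", 1) for f in fields[1:] if ":" in f)
        if fields[0] == "@HD":
            sort_order = tags.get("SO")
        elif fields[0] == "@SQ":
            seq_names.append(tags.get("SN"))
        elif fields[0] == "@RG":
            groups.add(tags.get("ID"))
    problems = []
    if sort_order != "coordinate":
        problems.append("sort order is not 'coordinate'")
    unknown = [name for name in seq_names if name not in genome_names]
    if unknown:
        problems.append("chromosomes not in genome: " + ", ".join(map(str, unknown)))
    if read_group is not None and read_group not in groups:
        problems.append("read group not in header: " + read_group)
    return problems


def has_butter_tags(sam_line):
    """True if an alignment line carries XX, XY and XZ optional fields."""
    optional = sam_line.rstrip("\n").split("\t")[11:]
    present = {field[:2] for field in optional}
    return {"XX", "XY", "XZ"} <= present
```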

    The BAM file should be indexed and have the corresponding .bam.bai index
    file in the same path as the bamfile. However, this is not required to
    pass validation .. if the index is not found, it will be created during
    the run.

    Each mapped read must have the CIGAR string set (column 6 in the SAM
    specification) -- ShortStack determines the small RNA lengths by parsing
    the CIGAR string .. if any mappings (except unmapped reads, which are
    ignored) have "*" entered instead of a valid CIGAR string ShortStack
    will exit and complain.
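
    As an illustration of how a read length can be recovered from a CIGAR
    string, here is a minimal Python sketch. It follows the SAM
    specification, in which the M, I, S, =, and X operations consume query
    bases; whether ShortStack counts exactly these operations is an
    assumption, and read_length_from_cigar is a made-up name.

```python
import re

_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def read_length_from_cigar(cigar):
    """Sum the lengths of the CIGAR operations that consume query bases."""
    ops = _CIGAR_OP.findall(cigar)
    # Reject "*", empty strings, and anything with unparsed characters.
    if not ops or "".join(n + op for n, op in ops) != cigar:
        raise ValueError("not a usable CIGAR string: %r" % cigar)
    return sum(int(n) for n, op in ops if op in "MIS=X")
```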

  --count file
    If running in --count mode, the user-provided file is expected to be a
    simple text file containing a list of coordinates in the format :
    [Chr]:[start]-[stop], where Chr is defined in the genome file AND in the
    .bam file, and start and stop are one-based, inclusive. The same
    requirement for short, non-whitespaced chromosome names as discussed
    above holds true for input --count files. Comment lines, that begin with
    '#', are ignored. Tab-delimited files are also accepted, provided the
    first column has the coordinates. The second column in tab-delimited
    files is assumed to be the names of the clusters, and will be used
    accordingly. Any other columns in a tab-delimited input file are
    ignored.

    Importantly, the 'Results.txt' file produced by a previous ShortStack
    run can be used directly in subsequent runs in --count mode. This is
    useful when comparing identical intervals across multiple samples.

    Note that count mode also forces nohp mode.
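
    The file format described above is simple enough to read with a few
    lines of Python. The sketch below is only an illustration of the
    format, not ShortStack's own parser; parse_locus_lines is a hypothetical
    name. Because lines starting with '#' are skipped, the header line of a
    Results.txt file is ignored as well.

```python
import re

_COORD = re.compile(r"^(\S+):(\d+)-(\d+)$")


def parse_locus_lines(lines):
    """Return (chrom, start, stop, name) tuples; name is None if absent."""
    loci = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip() or line.startswith("#"):
            continue  # blank line or comment
        columns = line.split("\t")
        match = _COORD.match(columns[0].strip())
        if match is None:
            raise ValueError("bad coordinates: %r" % columns[0])
        name = None
        if len(columns) > 1 and columns[1].strip():
            name = columns[1].strip()  # second column holds the cluster name
        loci.append((match.group(1), int(match.group(2)),
                     int(match.group(3)), name))
    return loci
```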

  --flag_file
    Optional. This is a list of genomic loci to scan for overlap with one or
    more of the small RNA loci found/analyzed by ShortStack. Overlap of any
    length is reported. The format of the file is similar to that of the
    --count file: A tab-delimited text file with coordinates in the first
    column, and names in the second column. Unlike for --count files, names
    are required to be present in the second column for --flag_file.
    Coordinates must be in the format [Chr]:[start]-[stop], where Chr is
    defined in the genome file AND the .bam file, and start and stop are
    one-based, inclusive.

OUTPUT
  Results.txt
    This is a simple tab-delimited text file. The first line begins with a
    "#" (comment) sign, and then lists column headers. Each subsequent line
    describes the key traits of a single cluster.

    To import this into R, here's a tip to deal with the first line, which
    has the headers but begins with a "#" character.

        results <- read.table("Results.txt", header=TRUE, sep="\t", comment.char="")

    Column 1: Locus : The genome-browser-friendly coordinates of the
    clusters. Coordinates are one-based, inclusive (e.g. Chr1:1-100 refers
    to a 100 nt interval beginning with nt 1 and ending with nt 100).

    Column 2: Name : Name of cluster. Unless the run was in --count mode and
    the input file of a priori clusters already had names, the names are
    arbitrarily designated as "Cluster_1", "Cluster_2", etc.

    Column 3: FlagOverlap : Name(s) of any loci from the flag_file that
    overlap with the cluster are listed. If there are two or more, they are
    comma-separated. If there were none, or no flag_file was provided, then
    a "." is present in this column instead.

    Column 4: Size : Size in nts of the locus.

    Column 5: MIRNA : Whether this cluster appears to be a MIRNA or not. If
    not, a "." is present. If it is a hairpin, but NOT qualified as a MIRNA,
    "HP" is indicated. MIRNAs are indicated by "MIRNA". If the run was in
    "--nohp" mode, then all entries in the column will be ".".

    Column 6: MIRNA_Score : Score of the locus via MIRNA analysis by maple.
    Ranges from 0-1, with 1 being the best. "NA" if locus was pre-excluded
    from maple analysis (because of excessive length, not coming from a
    clear single-strand, or DicerCall of N).

    Column 7: Strand : The predominant genomic strand from which the small
    RNA emanate. If ".", no strand was called.

    Column 8: Frac_Wat : Fraction of aligned reads to the Watson (e.g. +)
    strand of the cluster. 1 means all were from Watson Strand (e.g. +), 0
    means all were from Crick (e.g. -) strand.

    Column 9: Total : Total aligned reads within the cluster.

    Column 10: Uniques : Total aligned reads derived from uniquely mapped
    reads .. e.g., those with XX:i:1.

    Column 11: DicerCall : If "N", the cluster was not annotated as
    dicer-derived, per options --dicermin, --dicermax, and --mindicerfrac.
    Otherwise this is a number, within the --dicermin to --dicermax size
    range, which indicates the most abundant small RNA size within the
    mappings at that cluster.

    Column 12: PhaseOffset : If "NA", phasing p-value was not calculated for
    this cluster. Otherwise, the offset is the one-based genomic position
    with which the cluster appears to be "in-phase" (based on the 5' nt of a
    sense-mapped small RNA). Phasing is always in increments identical to
    the Dicer size call in column 11.

    Column 13: Phase_pval : If "NA", phasing p-value was not calculated for
    this cluster. Otherwise, the p-value is derived from a modified
    hypergeometric distribution, as described below.

    Column 14: Short : The total mappings from reads with lengths less than
    --dicermin.

    Column 15: Long : The total mappings from reads with lengths more than
    --dicermax.

    Columns 16 - the end : The total mappings from reads with the indicated
    lengths. These are the sizes within the Dicer range.
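
    For those who prefer Python to R, the same file can be read with the
    standard csv module. This is a minimal sketch and assumes only what is
    stated above: a tab-delimited file whose first line holds the column
    headers and begins with "#". The function name read_results is made up
    for this example.

```python
import csv


def read_results(handle):
    """Return one dict per cluster, keyed by the column headers."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    # Drop the leading "#" so the first key is the bare column name.
    header[0] = header[0].lstrip("#").strip()
    return [dict(zip(header, row)) for row in reader if row]


# Typical use:
# with open("Results.txt", newline="") as handle:
#     clusters = read_results(handle)
```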

  Log.txt
    This is a simple log file which records the information that is also
    sent to STDERR during the run.

  gff3 files
    Two gff3-formatted files are created, one for the 'DCL' loci (those with
    a DicerCall that is NOT N), and one for the 'N' loci (those with a
    DicerCall of 'N'). These are NOT produced in a --count mode run.

  Hairpin and MIRNA detail files
    Text-based and graphical files for each MIRNA and HP locus are created.
    See the maple documentation for details.

KEY METHODS
  Adapter trimming and alignments
    Adapter trimming and alignments are handled by butter. See the butter
    documentation for details.

  Multiple libraries
    ShortStack supports input of multiple small RNA-seq libraries through
    the option --reads. Files are provided as a comma-delimited list. When
    multiple small RNA-seq libraries are input, each one is first aligned
    separately, creating temporary .bam files. When this is complete, they
    are merged into a single alignment, which is given the name "outdir.bam"
    in the working directory, where "outdir" is the string given in option
    --outdir. During the merging, the read group information is stored, so
    all alignments can be de-convoluted back to their parent libraries if
    desired. The individual .bam alignments created initially are deleted.

  Read groups
    As of version 1.1.0, ShortStack incorporates the option --read_group.
    When specified, only the alignments from the read group specified by the
    option will be used for analysis. Use of this option demands that the
    indicated read group is specified in the header of the relevant bam
    alignment file.

    When an analysis uses a bam alignment file that contains more than one
    read group (based on the bam header), and the --read_group option was
    NOT used in the run, the analysis will conclude with a --count mode
    analysis of each read group separately. This is meant to be convenient
    for analyses in which a de-novo small RNA gene annotation is performed
    using a merger of multiple libraries, followed by quantification of each
    locus for each small RNA-seq library separately .. this should
    facilitate the downstream analysis of differential expression, for
    instance.

  de novo Cluster Discovery
    Cluster discovery proceeds in two simple steps:

    1. The total depth of small RNA coverage at each occupied nucleotide in
    the genome is examined, and initial 'islands' of coverage are defined as
    continuous stretches of non-zero coverage where the read depth, at one
    or more points, is greater than or equal to the threshold depth
    specified by option --mindepth. Note that this definition of islands is
    different, and more inclusive, than that used by ShortStack versions
    prior to 2.0.0.

    2. The initial islands are then temporarily extended on both sides by
    the distance specified by option --pad. Islands that overlap after
    extension are merged. The "dangling pads" at the ends of the merged
    clusters are then removed. After all extensions, resultant mergers, and
    end trimmings are performed, the final result is the initial clusters.
    If the run is performed in --nohp mode, these are the final clusters. If
    hairpins and MIRNAs are being examined, some of the clusters may be
    adjusted in position to reflect the extent of the apparent hairpin
    precursor.
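
    The two steps above can be sketched in Python as follows. This is an
    illustration of the described logic for a single chromosome, not
    ShortStack's actual code. The function names are hypothetical, depth is
    assumed to be a list in which depth[i] is the coverage at one-based
    position i+1, and padded islands that merely touch are treated as
    overlapping (an assumption).

```python
def find_islands(depth, mindepth=20):
    """Step 1: runs of non-zero coverage whose peak reaches mindepth."""
    islands, start, peak = [], None, 0
    for pos, cover in enumerate(depth, start=1):
        if cover > 0:
            if start is None:
                start, peak = pos, cover
            else:
                peak = max(peak, cover)
        elif start is not None:
            if peak >= mindepth:
                islands.append((start, pos - 1))
            start = None
    if start is not None and peak >= mindepth:
        islands.append((start, len(depth)))
    return islands


def merge_islands(islands, pad=100):
    """Step 2: merge islands whose padded extents overlap; drop the pads."""
    clusters = []
    for start, stop in sorted(islands):
        if clusters and start - pad <= clusters[-1][1] + pad:
            # Overlap after padding: extend the cluster to the island's end.
            clusters[-1] = (clusters[-1][0], max(clusters[-1][1], stop))
        else:
            clusters.append((start, stop))
    return clusters
```

    Note that the merged coordinates run from the first island's start to
    the last island's stop, which is the effect of removing the "dangling
    pads".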

  Hairpin and MIRNA analysis
    MIRNA analysis is performed by maple. For speed, not all clusters are
    subject to analysis by maple. Clusters that exceed 1kb in length, have a
    DicerCall of "N", or that don't have a clear single-stranded pattern of
    read alignments are not sent to maple for analysis. The --miRType option
    also limits queries based on kingdom-specific requirements
    (specifically, if the miRType is 'animal', then the query can't be
    longer than 250 nts).

    See the maple documentation for details of how it works.

  Quantification of clusters
    All mappings with at least one nt of overlap within the cluster are
    tallied as being within the cluster. Thus, for a cluster located at
    Chr1:1000-2000, reads mapped to 980-1000, 1100-1123, and 2000-2021 are
    all counted as being within the cluster during quantification. Note that
    it's possible to count the same mapping within non-overlapping clusters.
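
    The overlap rule can be written as a one-line test on one-based,
    inclusive coordinates. The Python sketch below is an illustration of
    that rule, not ShortStack's code; count_in_cluster is a made-up name.

```python
def count_in_cluster(cluster, mappings):
    """Count mappings that share at least one nt with the cluster.

    Both the cluster and each mapping are one-based, inclusive
    (start, stop) pairs on the same chromosome.
    """
    c_start, c_stop = cluster
    return sum(1 for start, stop in mappings
               if start <= c_stop and stop >= c_start)
```

    Applied to the example above, the mappings 980-1000, 1100-1123, and
    2000-2021 are all counted for the cluster 1000-2000, while a mapping at
    950-999 would not be.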

  Analysis of Phasing
    'Phasing' describes the periodic mapping of small RNAs to repeating
    intervals equal to their size. It occurs when helical RNA is Diced
    processively from a defined terminus; often the terminus is defined by a
    prior small RNA slicing event followed by RDRP activity, although some
    MIRNA hairpins are also phased. Nearly all documented examples of phased
    small RNA production (in plants) occur for 21nt small RNAs in 21nt
    increments, hence the default settings of ShortStack to examine only
    21-dominated clusters. This can be changed with option --phasesize.

    ShortStack's basic method to identify phased small RNAs involves
    calculation of a p-value based on the hypergeometric distribution --
    this approach was inspired by Chen et al. (2007) PNAS 104: 3318-3323
    PMID: 17360645. However, ShortStack's method modifies the Chen et al.
    approach to make it more robust at detecting phasing in highly expressed
    clusters with a background of non-phased noise; the method also allows
    phasing analysis in any register within the dicer size range (controlled
    by option --phasesize), and analyzes regions of arbitrary length.
    Finally, ShortStack's analysis of phasing is "fuzzy" -- that is, exactly
    phased reads, and those +1 and -1 nt out of phase, are all counted as
    "phased".

    Phasing analysis proceeds as follows:

    1. Clusters to be analyzed must be annotated as Dicer-derived and be
    dominated by the size class indicated by option --phasesize. If
    --phasesize is set to 'all', all clusters within the Dicer size range
    will be analyzed. Conversely, phasing analysis is suppressed for all
    clusters if option --phasesize is set to 'none'.

    2. Cluster must also have a length of more than 4 x the phase size in
    question .. so, more than 84nts under the default --phasesize 21
    setting. Clusters that are too short are never examined.

    3. Phasing is only analyzed with respect to the dominant size of the
    cluster. So, for a cluster dominated by 21mers, only phasing in 21nt
    increments will be examined.

    4. The 5' positions of all sense-mapped small RNAs are tallied as a
    function of genomic position. The 3' positions of all antisense-mapped
    small RNAs are also tallied, after adding 2nts to account for the 2nt,
    3' overhangs left by Dicer processing. After this process, each genomic
    position within the cluster has a number reflecting the number of small
    RNA termini at that position. If the cluster is longer than 20 times the
    phase (e.g. 20 x 21 for the default settings), reads mapped beyond the
    20 x 21 mark are allocated to the beginning of the cluster, keeping it
    in phase. For instance, assuming --phasesize of 21, reads in position
    420 are assigned at 420, those at 421 get flipped back to 1, 422 back to
    2, and so on. This is necessary because p-value calculation involves
    calculation of binomial coefficients, which grow too large to calculate
    (easily) with inputs of more than 500 or so.

    5. The average abundance of termini across the locus is calculated from
    the above representation of the reads.

    6. The total abundance in each of the possible phasing registers (there
    are 21 registers in the default mode of --phasesize 21) is calculated.
    The register with the maximum total abundance is then used in p-value
    determination. The offset of this register is also noted; the offset is
    the 1st genomic position representing the 5'-sense position of a phased
    small RNA.

    7. The p-value within the chosen register is then calculated using the
    cumulative distribution function (CDF) for the hypergeometric
    distribution. Sorry, hard to show equations in plain-text -- see
    Wikipedia's Hypergeometric distribution entry, under CDF. N (the
    population size) is the number of nt positions in the locus. m (the
    number of success states in the population) is the number of possible
    positions in the phasing register of interest, INCLUDING POSITIONS +1
    AND -1 RELATIVE TO THE REGISTER OF INTEREST. This means phasing is
    "fuzzy", which is often seen in the known examples of this phenomenon. n
    (the number of draws) is defined as the total number of positions with
    ABOVE AVERAGE abundance. k (the number of successes) is the number of
    phased positions (including the fuzzy +1 and -1 positions) with ABOVE
    AVERAGE abundance. The p-value is then calculated per the hypergeometric
    distribution CDF. NOTE: The restriction of n and k to only above-average
    abundance works well to eliminate low-level noise and focus on the
    dominant small RNA pattern within the locus.
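
    Steps 4 through 7 can be sketched in Python as shown below. This is a
    rough illustration under several stated assumptions, not ShortStack's
    implementation. It starts from counts, a list holding the number of
    small RNA termini at each position of the cluster (the tally from step
    4, with antisense reads already shifted by 2 nts). It assumes that the
    p-value is the upper tail P(X >= k) of the hypergeometric distribution,
    that N is the cluster length after folding, and that the register is
    chosen from exactly phased positions only. All function names are
    hypothetical.

```python
from math import comb


def fold_positions(counts, size, max_cycles=20):
    """Step 4: wrap positions beyond max_cycles * size back to the start."""
    span = max_cycles * size
    folded = [0] * min(len(counts), span)
    for index, count in enumerate(counts):
        folded[index % span] += count
    return folded


def hypergeom_upper_tail(N, m, n, k):
    """P(X >= k) for N positions, m in register, n draws."""
    total = comb(N, n)
    top = min(m, n)
    return sum(comb(m, i) * comb(N - m, n - i) for i in range(k, top + 1)) / total


def phasing_pvalue(counts, size=21):
    """Return (offset, p_value); offset is one-based from the cluster start."""
    counts = fold_positions(counts, size)
    N = len(counts)
    mean = sum(counts) / N  # step 5: average abundance of termini
    # Step 6: choose the register with the greatest total abundance.
    totals = [sum(counts[r::size]) for r in range(size)]
    best = max(range(size), key=totals.__getitem__)
    # Step 7: the "fuzzy" register includes the -1 and +1 neighbours.
    fuzzy = {(best - 1) % size, best, (best + 1) % size}
    in_register = [i % size in fuzzy for i in range(N)]
    above = [c > mean for c in counts]
    m = sum(in_register)
    n = sum(above)
    k = sum(1 for reg, high in zip(in_register, above) if reg and high)
    return best + 1, hypergeom_upper_tail(N, m, n, k)
```

    Python integers do not overflow, so the folding step is not strictly
    needed here; it is kept so that the sketch mirrors the description
    above.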

    Note: P-values are not corrected for multiple-testing. Consider
    adjustment of p-values to control for multiple testing (e.g. Bonferroni,
    Benjamini-Hochberg FDR, etc) if you want a defensible set of phased loci
    from a genome-wide analysis.
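
    As one example of such an adjustment, the Benjamini-Hochberg procedure
    can be applied to the Phase_pval column after removing the "NA"
    entries. The Python sketch below is a generic implementation of that
    procedure and is not something ShortStack does for you;
    benjamini_hochberg is a name chosen for this example.

```python
def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the same order as the input."""
    count = len(pvalues)
    order = sorted(range(count), key=lambda i: pvalues[i])
    adjusted = [0.0] * count
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(count, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * count / rank)
        adjusted[index] = running_min
    return adjusted
```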