ShortStack: Comprehensive annotation and quantification of small RNA genes
LICENSE
    ShortStack

    Copyright (C) 2012-2014 Michael J. Axtell

    This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

    This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

    You should have received a copy of the GNU General Public License along with this program. If not, see <http://www.gnu.org/licenses/>.

SYNOPSIS
    Annotation and quantification of small RNA genes based upon reference-aligned small RNA sequences

CITATION
    If you use ShortStack in your work, please cite

    Axtell MJ. (2013) ShortStack: Comprehensive annotation and quantification of small RNA genes. RNA 19:740-751. doi:10.1261/rna.035279.112

    Shahid S., Axtell MJ. (2013) Identification and annotation of small RNA genes using ShortStack. Methods doi:10.1016/j.ymeth.2013.10.004

AUTHOR
    Michael J. Axtell, Penn State University, [email protected]

DEPENDENCIES
    perl, samtools, RNAfold, RNAplot, butter, bowtie, bowtie-build, maple

    ShortStack is a perl5 script, so it needs perl5 to run. It expects to find perl5 at /usr/bin/perl. If this is not where your perl is, modify line 1 of ShortStack (the hashbang) accordingly. It also requires the package Getopt::Long, which is standard in most Perl distributions. If this package is not installed, get it from CPAN.

    samtools <http://samtools.sourceforge.net/> needs to be installed in your PATH. ShortStack was developed using samtools 0.1.18. Other versions should be OK as far as I know, but let me know if not!

    RNAfold and RNAplot are from the ViennaRNA package. See <http://www.tbi.univie.ac.at/~ronny/RNA/vrna2.html>. Both need to be installed in your PATH.
    butter (bowtie using iterative placement of repetitive small RNAs) ships with ShortStack, and is also at https://github.com/MikeAxtell/butter . It must be installed in your PATH.

    bowtie and bowtie-build must be version "1" .. either 0.12.x or 1.x. These are required ONLY if you are aligning reads to the genome. Bowtie can be found at http://bowtie-bio.sourceforge.net/index.shtml. Like the other dependencies, they must be in your PATH.

    maple (microRNA analysis program leveraging expression) ships with ShortStack, and is also at https://github.com/MikeAxtell/maple . It must be installed in your PATH.

OPTIONAL DEPENDENCIES
    bam2wig, wigToBigWig

    If you use ShortStack to align your small RNA-seq data, wiggle and bigwig files summarizing the coverage of the reads can be created automatically, provided bam2wig and wigToBigWig are installed on your system.

    bam2wig is a perl script included with the ShortStack package, and is also available at https://github.com/MikeAxtell/bam2wig . It must be installed in your PATH.

    wigToBigWig is available from UCSC at http://genome.ucsc.edu/goldenPath/help/bigWig.html . It must be installed in your PATH.

    ShortStack will run just fine without bam2wig and/or wigToBigWig, but if you are aligning reads, it will give you a warning.

    If a wiggle and/or bigwig file is created, the track will represent total depth of coverage on both strands. You can use bam2wig to make different types of tracks with more flexible options.

INSTALL
    There is no 'real' installation. After installing the dependencies (see above), check that ShortStack is executable. It should be, but if not you can:

        chmod +x ShortStack

    Then add it to your PATH .. for instance

        sudo cp ShortStack /usr/bin/

    Depending on your system, you may need administrative privileges to install the dependencies and/or to add ShortStack to your PATH.

USAGE
    ShortStack [options] [genome.fasta]

    There are three modes, which differ in the types of pre-analysis that are performed.
    Each of the modes has a different set of REQUIRED options:

    Mode 1: Trim small RNA-seq reads to remove 3' adapter sequence, align them, and then analyze. Required options: --reads --adapter

    Mode 2: Align pre-trimmed small RNA-seq reads, and then analyze. Required option: --reads

    Mode 3: Analyze a pre-existing BAM alignment of small RNA-seq reads. Required option: --bamfile

    Additionally, in modes 1 or 2, the option --align_only will terminate analysis after making the alignment file.

OPTIONS
    --help : Print a help message and then quit.

    --version : Print the version number and then quit.

    --outdir [string] : Name of directory to be created to receive results of the run. Defaults to "ShortStack_[time]", where time is "UNIX time" (the number of non-leap seconds since Jan 1, 1970 UTC), if not provided.

    --adapter [string] : Sequence of 3' adapter to search for during adapter trimming. Must be at least 8 nts in length, and all ATGC characters. If provided, reads will be trimmed.

    --reads [string] : Path to reads file in fasta (.fa or .fasta extension), fastq (.fastq or .fq extension), or colorspace-fasta (.csfasta extension). Can be multiple files, separated by commas. ShortStack knows the format only through the file extensions.

    --bowtie_cores [integer] : Number of processor cores to use during bowtie / butter alignment. Default: 1.

    --mismatches [integer] : Number of allowable mismatches for butter alignment. Must be 0 or 1. Default: 0.

    --max_rep [integer] : Reads with more than this number of possible alignment positions will be reported as unmapped regardless of butter density placement probabilities. Default: 1000.

    --ranmax [integer] : Reads with more than this number of possible alignment positions where the choice can't be guided by butter-calculated probabilities will be reported as unmapped. Default: 3.

    --HPscore : Minimum maple-derived score in order to keep a locus that failed as a MIRNA as an HP locus.
    Default: 0.9

    --align_only : Exits program after completion of small RNA-seq data alignment, creating BAM file.

    --bamfile [string] : Path to properly formatted and sorted BAM alignment file of small RNA-seq data. Files require custom tags provided by the butter aligner.

    --read_group [string] : Analyze only the indicated read-group. Read-group must be specified in the bam alignment file header. Default = [not active -- all reads analyzed]

    --flag_file [string] : Path to a simple file of genomic loci of interest. The ShortStack-analyzed small RNA clusters will be analyzed for overlap with the loci in the flag_file .. if there is any overlap (as little as one nt), it will be reported. Format for this file is described below.

    --mindepth [integer] : Minimum depth of mapping coverage to define an 'island'. Default = 20. Must be at least 2, more than 5 preferred.

    --pad [integer] : Number of nucleotides upstream and downstream to extend initial islands during cluster definition. Default = 100

    --dicermin [integer] : Smallest size in the Dicer size range (or size range of interest). Default = 20. Must be between 15 and 35, and less than or equal to --dicermax

    --dicermax [integer] : Largest size in the Dicer size range (or size range of interest). Default = 24. Must be between 15 and 35, and more than or equal to --dicermin

    --miRType [string] : Either "plant" or "animal". Defaults to "plant".

    --minstrandfrac [float] : Minimum fraction of mappings to one or the other strand to call a polarity for non-hairpin clusters. Also the minimum fraction of "non-dyad" mappings to the sense strand within potential hairpins/miRNAs to keep the locus annotated as a hp or miRNA. See below for details. Default = 0.8. Allowed values between 0.5 and 1.

    --mindicerfrac [float] : Minimum fraction of mappings within Dicer size range to annotate a locus as Dicer-derived. Default = 0.85. Allowed values between 0 and 1.
    --phasesize [integer] : Examine phasing only for clusters dominated by the indicated size range. Size must be within the bounds described by --dicermin and --dicermax. Set to 'all' to examine p-values of each locus within the Dicer range, in its dominant size. Set to 'none' to suppress all phasing analysis. Default = 21. Allowed values between --dicermin and --dicermax.

    --count [string] : Invokes count mode, in which user-provided clusters are annotated and quantified instead of being defined de novo. When invoked, the file provided with --count is assumed to contain a simple list of clusters. Count mode also forces nohp mode. Formatting details below. Default : Not invoked.

    --nohp : If "--nohp" appears on the command line, it invokes running in "no hairpin" mode. RNA folding, hairpin annotation, and MIRNA annotation will be skipped (likely saving significant time). Note that --count mode forces --nohp mode as well. Default: Not invoked.

KEY FORMATTING REQUIREMENTS AND ASSUMPTIONS
  Input genome.fasta file
    It is critical that this be the precise genome to which the reads in the input .bam file were mapped. If it isn't, validation of the BAM alignment file will fail and the run will be aborted.

    Chromosome names that contain whitespace will be truncated; only the first string of non-whitespace characters will be maintained. This applies both during alignment (where bowtie does this) and during the rest of the analysis (where samtools does this, when creating the .fai index file).

    If not already present, a .fai index file for the genome will be created using samtools faidx at the beginning of the run. As above, chromosome names will be automatically trimmed starting at the first whitespace character, if present.

  Small RNA-seq reads
    FASTA or FASTQ data should be devoid of comment lines and conform to FASTA or FASTQ specs. In addition, ShortStack assumes that each read will occupy a single line in the file. There is no support for paired-end reads.
    Colorspace-FASTA formatted data (from SOLiD) can have comment lines. Format is assumed to conform to colorspace-FASTA specifications (beginning with a nucleotide, followed by a string of colors [0,1,2,3] or ambiguity codes [.]). For SOLiD data, the quality values (_QV.qual files) are not usable as valid inputs.

  Input .bam file
    As of ShortStack 2.0.0, bam alignment files must have been created with the program butter. If you make your own BAM alignments outside of ShortStack, they must pass the following validation steps that are performed by ShortStack:

    1. The header must be present

    2. The sort order of the file must be 'coordinate', as indicated by the SO: tag in the header

    3. All of the chromosome names found in the header MUST also be found in the genome.fasta file

    4. The data lines must contain the custom XX, XY, and XZ tags added by butter.

    5. If the option --read_group is being used, the specified read_group must be mentioned in the header in an @RG line.

    The BAM file should be indexed and have the corresponding .bam.bai index file in the same path as the bamfile. However, this is not required to pass validation .. if the index is not found, it will be created during the run.

    Each mapped read must have the CIGAR string set (column 6 in the SAM specification) -- ShortStack determines the small RNA lengths by parsing the CIGAR string .. if any mappings (except unmapped reads, which are ignored) have "*" entered instead of a valid CIGAR string, ShortStack will exit and complain.

  --count file
    If running in --count mode, the user-provided file is expected to be a simple text file containing a list of coordinates in the format : [Chr]:[start]-[stop], where Chr is defined in the genome file AND in the .bam file, and start and stop are one-based, inclusive. The same requirement for short, non-whitespaced chromosome names as discussed above holds true for input --count files. Comment lines, which begin with '#', are ignored.
    Tab-delimited files are also accepted, provided the first column has the coordinates. The second column in tab-delimited files is assumed to be the names of the clusters, and will be used accordingly. Any other columns in a tab-delimited input file are ignored.

    Importantly, the 'Results.txt' file produced by a previous ShortStack run can be used directly in subsequent runs in --count mode. This is useful when comparing identical intervals across multiple samples.

    Note that count mode also forces nohp mode.

  --flag_file
    Optional. This is a list of genomic loci to scan for overlap with one or more of the small RNA loci found/analyzed by ShortStack. Overlap of any length is reported. The format of the file is similar to that of the --count file: A tab-delimited text file with coordinates in the first column, and names in the second column. Unlike for --count files, names are required to be present in the second column for --flag_file. Coordinates must be in the format [Chr]:[start]-[stop], where Chr is defined in the genome file AND the .bam file, and start and stop are one-based, inclusive.

OUTPUT
  Results.txt
    This is a simple tab-delimited text file. The first line begins with a "#" (comment) sign, and then lists column headers. Each subsequent line describes the key traits of a single cluster.

    To import this into R, here's a tip to deal with the first line, which has the headers but begins with a "#" character.

        >results <- read.table("Results.txt", header=TRUE, sep="\t", comment.char="")

    Column 1: Locus : The genome-browser-friendly coordinates of the clusters. Coordinates are one-based, inclusive (e.g. Chr1:1-100 refers to a 100 nt interval beginning with nt 1 and ending with nt 100).

    Column 2: Name : Name of cluster. Unless the run was in --count mode and the input file of a priori clusters already had names, the names are arbitrarily designated as "Cluster_1", "Cluster_2", etc.
    Column 3: FlagOverlap : Name(s) of any loci from the flag_file that overlap with the cluster are listed. If there are two or more, they are comma-separated. If there were none, or no flag_file was provided, then a "." is present in this column instead.

    Column 4: Size : Size in nts of the locus.

    Column 5: MIRNA : Whether this cluster appears to be a MIRNA or not. If not, a "." is present. If it is a hairpin, but NOT qualified as a MIRNA, "HP" is indicated. MIRNAs are indicated by "MIRNA". If the run was in "--nohp" mode, then all entries in the column will be ".".

    Column 6: MIRNA_Score : Score of the locus via MIRNA analysis by maple. Ranges from 0-1, with 1 being the best. "NA" if locus was pre-excluded from maple analysis (because of excessive length, not coming from a clear single-strand, or DicerCall of N).

    Column 7: Strand : The predominant genomic strand from which the small RNAs emanate. If ".", no strand was called.

    Column 8: Frac_Wat : Fraction of aligned reads to the Watson (i.e. +) strand of the cluster. 1 means all were from the Watson (+) strand, 0 means all were from the Crick (-) strand.

    Column 9: Total : Total aligned reads within the cluster.

    Column 10: Uniques : Total aligned reads derived from uniquely mapped reads .. i.e., those with XX:i:1.

    Column 11: DicerCall : If "N", the cluster was not annotated as dicer-derived, per options --dicermin, --dicermax, and --mindicerfrac. Otherwise this is a number, within the --dicermin to --dicermax size range, which indicates the most abundant small RNA size within the mappings at that cluster.

    Column 12: PhaseOffset : If "NA", phasing p-value was not calculated for this cluster. Otherwise, the offset is the one-based genomic position with which the cluster appears to be "in-phase" (based on the 5' nt of a sense-mapped small RNA). Phasing is always in increments identical to the Dicer size call in column 11.

    Column 13: Phase_pval : If "NA", phasing p-value was not calculated for this cluster.
    Otherwise, the p-value is derived from a modified hypergeometric distribution, as described below.

    Column 14: Short : The total mappings from reads with lengths less than --dicermin.

    Column 15: Long : The total mappings from reads with lengths more than --dicermax.

    Columns 16 - the end : The total mappings from reads with the indicated lengths. These are the sizes within the Dicer range.

  Log.txt
    This is a simple log file which records the information that is also sent to STDERR during the run.

  gff3 files
    Two gff3-formatted files are created, one for the 'DCL' loci (those with a DicerCall that is NOT N), and one for the 'N' loci (those with a DicerCall of 'N'). These are NOT produced in a --count mode run.

  Hairpin and MIRNA detail files
    Text-based and graphical files for each MIRNA and HP locus are created. See the maple documentation for details.

KEY METHODS
  Adapter trimming and alignments
    Adapter trimming and alignments are handled by butter. See the butter documentation for details.

  Multiple libraries
    ShortStack supports input of multiple small RNA-seq libraries through the option --reads. Files are provided as a comma-delimited list. When multiple small RNA-seq libraries are input, each one is first aligned separately, creating temporary .bam files. When this is complete, they are merged into a single alignment, which is given the name "outdir.bam" in the working directory, where "outdir" is the string given in option --outdir. During the merging, the read group information is stored, so all alignments can be de-convoluted back to their parent libraries if desired. The individual .bam alignments created initially are deleted.

  Read groups
    As of version 1.1.0, ShortStack incorporates the option --read_group. When specified, only the alignments from the read group specified by the option will be used for analysis. Use of this option demands that the indicated read group is specified in the header of the relevant bam alignment file.
    When an analysis uses a bam alignment file that contains more than one read group (based on the bam header), and the --read_group option was NOT used in the run, the analysis will conclude with a --count mode analysis of each read group separately. This is meant to be convenient for analyses in which a de novo small RNA gene annotation is performed using a merger of multiple libraries, followed by quantification of each locus for each small RNA-seq library separately .. this should facilitate the downstream analysis of differential expression, for instance.

  de novo Cluster Discovery
    Cluster discovery proceeds in two simple steps:

    1. The total depth of small RNA coverage at each occupied nucleotide in the genome is examined, and initial 'islands' of coverage are defined as continuous stretches of non-zero coverage where the read depth, at at least one point, is greater than or equal to the threshold depth specified by option --mindepth. Note that this definition of islands is different, and more inclusive, than that used by ShortStack versions prior to 2.0.0.

    2. The initial islands are then temporarily extended on both sides by the distance specified by option --pad. Islands that overlap after extension are merged. The "dangling pads" at the ends of the merged clusters are then removed. After all extensions, resultant mergers, and end trimmings are performed, the final result is the initial clusters. If the run is performed in --nohp mode, these are the final clusters. If hairpins and MIRNAs are being examined, some of the clusters may be adjusted in position to reflect the extent of the apparent hairpin precursor.

  Hairpin and MIRNA analysis
    MIRNA analysis is performed by maple. For speed, not all clusters are subject to analysis by maple. Clusters that exceed 1kb in length, have a DicerCall of "N", or that don't have a clear single-stranded pattern of read alignments are not sent to maple for analysis.
    The miRType option also limits queries based on kingdom-specific requirements (specifically, if the miRType is 'animal', then the query can't be longer than 250 nts). See the maple documentation for details of how it works.

  Quantification of clusters
    All mappings with at least one nt of overlap within the cluster are tallied as being within the cluster. Thus, for a cluster located at Chr1:1000-2000, reads mapped to 980-1000, 1100-1123, and 2000-2021 are all counted as being within the cluster during quantification. Note that a single mapping can therefore be counted in more than one non-overlapping cluster.

  Analysis of Phasing
    'Phasing' describes the periodic mapping of small RNAs to repeating intervals equal to their size. It occurs when helical RNA is Diced processively from a defined terminus; often the terminus is defined by a prior small RNA slicing event followed by RDRP activity, although some MIRNA hairpins are also phased. Nearly all documented examples of phased small RNA production (in plants) occur for 21nt small RNAs in 21nt increments, hence the default settings of ShortStack to examine only 21-dominated clusters. This can be changed with option --phasesize.

    ShortStack's basic method to identify phased small RNAs involves calculation of a p-value based on the hypergeometric distribution -- this approach was inspired by Chen et al. (2007) PNAS 104: 3318-3323 PMID: 17360645. However, ShortStack's method modifies the Chen et al. approach to make it more robust at detecting phasing in highly expressed clusters with a background of non-phased noise; the method also allows phasing analysis in any register within the dicer size range (controlled by option --phasesize), and analyzes regions of arbitrary length. Finally, ShortStack's analysis of phasing is "fuzzy" -- that is, exactly phased reads, and those at +1 and -1 relative to the phase, are all counted as "phased".

    Phasing analysis proceeds as follows:

    1. Clusters to be analyzed must be annotated as Dicer-derived and be dominated by the size class indicated by option --phasesize. If --phasesize is set to 'all', all clusters within the Dicer size range will be analyzed. Conversely, phasing analysis is suppressed for all clusters if option --phasesize is set to 'none'.

    2. Clusters must also have a length of more than 4 x the phase size in question .. so, more than 84nts under the default --phasesize 21 setting. Clusters that are too short are never examined.

    3. Phasing is only analyzed with respect to the dominant size of the cluster. So, for a cluster dominated by 21mers, only phasing in 21nt increments will be examined.

    4. The 5' positions of all sense-mapped small RNAs are tallied as a function of genomic position. The 3' positions of all antisense-mapped small RNAs are also tallied, after adding 2nts to account for the 2nt, 3' overhangs left by Dicer processing. After this process, each genomic position within the cluster has a number reflecting the number of small RNA termini at that position. If the cluster is longer than 20 times the phase (e.g. 20 x 21 for the default settings), reads mapped beyond the 20 x 21 mark are allocated to the beginning of the cluster, keeping it in phase. For instance, assuming --phasesize of 21, reads in position 420 are assigned at 420, those at 421 get flipped back to 1, 422 back to 2, and so on. This is necessary because p-value calculation involves calculation of binomial coefficients, which grow too large to calculate (easily) with inputs of more than 500 or so.

    5. The average abundance of termini across the locus is calculated from the above representation of the reads.

    6. The total abundance in each of the possible phasing registers (there are 21 registers in the default mode of --phasesize 21) is calculated. The register with the maximum total abundance is then used in p-value determination.
    The offset of this register is also noted; the offset is the 1st genomic position representing the 5'-sense position of a phased small RNA.

    7. The p-value within the chosen register is then calculated using the cumulative distribution function (CDF) for the hypergeometric distribution. Sorry, hard to show equations in plain-text -- see Wikipedia's Hypergeometric distribution entry, under CDF.

    N (the population size) is the number of nt positions in the locus.

    m (the number of success states in the population) is the number of possible positions in the phasing register of interest, INCLUDING POSITIONS +1 AND -1 RELATIVE TO THE REGISTER OF INTEREST. This means phasing is "fuzzy", which is often seen in the known examples of this phenomenon.

    n (the number of draws) is defined as the total number of positions with ABOVE AVERAGE abundance.

    k (the number of successes) is the number of phased positions (including the fuzzy +1 and -1 positions) with ABOVE AVERAGE abundance.

    The p-value is then calculated per the hypergeometric distribution CDF.

    NOTE: The restriction of n and k to only above-average abundance works well to eliminate low-level noise and focus on the dominant small RNA pattern within the locus.

    NOTE: P-values are not corrected for multiple testing. Consider adjustment of p-values to control for multiple testing (e.g. Bonferroni, Benjamini-Hochberg FDR, etc) if you want a defensible set of phased loci from a genome-wide analysis.
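
    As an illustration of steps 5 to 7 above, here is a minimal Python sketch. It is not ShortStack's own code (ShortStack is a perl script). The function names, the 0-based positions, and the input list "termini" are hypothetical, and the sketch assumes the p-value is the upper tail of the hypergeometric distribution, P(X >= k), which the text above does not state explicitly. It also assumes the termini have already been tallied and folded as described in step 4.

```python
from math import comb


def hypergeom_upper_tail(N, m, n, k):
    """P(X >= k) for a hypergeometric draw (ASSUMPTION: upper tail is the p-value).

    N: population size, m: success states, n: draws, k: observed successes.
    """
    top = min(m, n)
    total = comb(N, n)
    return sum(comb(m, i) * comb(N - m, n - i) for i in range(k, top + 1)) / total


def phasing_pvalue(termini, phase=21):
    """Sketch of steps 5-7 for one cluster.

    termini[i] is the number of small RNA termini at cluster position i
    (0-based here; the README reports a one-based genomic offset).
    Returns (register offset within the cluster, p-value).
    """
    N = len(termini)
    mean = sum(termini) / N  # step 5: average abundance

    # Step 6: total abundance per register; keep the most abundant one.
    totals = [sum(termini[r::phase]) for r in range(phase)]
    best = max(range(phase), key=totals.__getitem__)

    # "Fuzzy" phased positions: the register plus its -1 and +1 neighbours.
    phased = {
        p + d
        for p in range(best, N, phase)
        for d in (-1, 0, 1)
        if 0 <= p + d < N
    }
    above = {i for i, count in enumerate(termini) if count > mean}

    # Step 7: m, n, k as defined in the text above.
    m, n, k = len(phased), len(above), len(phased & above)
    return best, hypergeom_upper_tail(N, m, n, k)


# Toy cluster of 105 nt with termini every 21 nt starting at position 3.
termini = [0] * 105
for pos in range(3, 105, 21):
    termini[pos] = 10
offset, pval = phasing_pvalue(termini)
print(offset, pval)
```

    Using exact integer binomial coefficients (math.comb) sidesteps the overflow concern mentioned in step 4; the folding of long clusters is therefore not reproduced in this sketch.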