Skip to content

Command line interface

Damien Farrell edited this page Jul 30, 2017 · 13 revisions

Installing the package provides the command smallrnaseq in your path. This allows users is a command line interface to the library without the need for any Python coding at all. It provides a set of pre-defined functions with parameters specified in a text configuration file.

Table of Contents

Usage

Usage largely involves setting up the config file and having your input files prepared. Running the command smallrnaseq -c default.conf will create a new config file for you to work from if none exists. Just edit this with a text editor and then to execute:

smallrnaseq -c default.conf -r

Several other functions are available from the command line without the config file, i.e. to collapse or trim reads. Type smallrnaseq -h to get a list of options.

Configuration file settings

The advantage of configuration files is in avoiding long commands that have to be remembered or are prone to mistakes. Also the config files can be kept to recall what setting we used or to copy them for another set of files. The current options available in the file are shown below. The meaning of each option is explained explained below. If you are unsure or don't require an option value, leave it at the default. Options can be excluded from the file completely and defaults will be used but it's recommended to just leave unused options blank. You can also comment out lines with a '#' at the start if you want it to be ignored. The [base] heading should always be present an indicates a section of the file. The [aligner] section is for setting alignment parameters on a per library basis if you need to. The [novel] section is for a few novel mirna prediction settings. The [de] section is used for differential expression which can be ignored if you don't need that.

[base]
filenames = 
path = 
overwrite = 0
index_path = indexes
libraries = 
ref_fasta = 
features = 
output = results
add_labels = 1
aligner = bowtie
mirna = 0
species = hsa
pad5 = 3
pad3 = 5

[aligner]
default_params = -v 1 --best
mirna_params = -v 1 -a --best --strata --norc

[novel]
score_cutoff = 0.8
read_cutoff = 50

[de]
sample_labels = 
sep = ,
sample_col = 
factors_col = 
conditions = 
logfc_cutoff =

Settings explained:

name example value meaning
filenames test.fastq input fastq file(s) with reads
path testfiles folder containing fastq files instead of listing files
index_path indexes location of bowtie or subread indexes
aligner bowtie which aligner to use, bowtie or subread
ref_fasta hg19 reference genome fasta file, optional
libraries RFAM_human,mirbase-hsa names of annotated library indexes to map to
features Homo_sapiens.GRCh37.75.gtf genome annotation file
output smrna_results output folder for temp files
add_labels 1 whether to add labels to replace the file names in the results (0 or 1)
mirna 0 run mirna counting workflow (0 or 1)
species bta mirbase species to use
pad5 3 3' flanking bases to add when generating mature mirbase sequences
pad3 5 5' flanking bases to add
sample_labels samplefile.txt csv file with sample labels
default_params -v 1 --best default alignment parameters
mirna_params -v 1 -a --best --strata --norc default miRNA alignment parameters

Examples

Mapping to one or more libraries of known RNAs

This will simply consist of setting the libraries option to match the names of files you have created aligner indexes for. You can download RNA annotations in fasta format from http://rnacentral.org/. The index files should be placed in the index_path folder. To create an index using bowtie you supply a fasta file and the indexes are placed in a folder called 'indexes'. You can move them to another folder if needed:

smallrnaseq -b myrnas.fa 

By default file names are replaced with short ids of the form s01, s02, etc. This also writes out a file called sample_labels.csv in the output folder. Set add_labels = 0 if you don't want this behaviour.

Mapping to known miRNAs and finding novel miRs

Say we have a set of fastq files in the folder 'testfiles' that we want to count miRNAs in. We would set the options mirna = 1 and path = testfiles. Note if just mapping to mirbase mature/precursor sequence you don't have to create an index file since it is generated automatically. If you are using any other libraries you should create the aligner index first. For novel miRNA discovery we need the reference genome and index for that.

Mapping to a reference genome with small RNA features (advanced)

There are some advantages to using a full reference genome in that it allows us to see of the reads align outside the target transcripts and we might therefore exclude them. Also we can normalise the read counts by the sum of all mapped reads. This depends on what features you have in the gtf file. To count our reads using features in a reference genome, provide the ref_genome name which should correspond to the index name of the reference sequence. We also need to set features, a set of features stored in a gtf, gff or bed file. Remember that the coordinates in this file must correspond to the reference sequence. See Counting features for more information.

Outputs

The main outputs are csv files with the counts for each sample in a column, also produced is a 'long form' csv file with a separate row for every sample. These csv files can be opened in a spreadsheet. Temporary files such as collapsed read files are placed in a 'temp' folder inside the output folder.

Differential expression

This workflow is done using a separate command and via some extra options not shown above for clarity. To execute this type of analysis you must have done gene counting (e.g. miRNAs) and have the results in a csv file. The analysis is then run using:

smallrnaseq -c default.conf -d

In the default file the additional options needed are in a separate [de] section. Most are blank by default. If you want to do differential expression analysis of the genes from your results the other main thing you need to provide is a sample label file that matches the filenames you have analysed to the conditions. You then choose which conditions you want to compare. The options are explained below.

name example value meaning
count_file mirna_counts.csv csv file with gene counts
sample_labels labels.txt csv/text file with sample labels
sep , separator for sample labels text file
sample_col file_name column for the sample file names
factors_col status column for factor labels
conditions control,infected conditions to compare from factor column (each replicate will have the same condition)
logfc_cutoff 1.5 cutoff for log fold changes (lower are ignored)

This analysis is run using the R edgeR package, so R must be installed. See Installing R

Example

The sample label file below shows how we use the above options. Our filenames are in the Run_s column. We want to compare the conditions 3 and 6 months samples in the age_s (the 'factor') column. mirna_mature_counts.csv is the file where counts from our mirna analysis are stored. This should have a column for each sample. Note that you could use counts output from any program as long as they are csv and have the right column names, see Result file format. It is assumed you are using replicates. Note that column names are case sensitive.

Run_s	age_s	isolate_s
SRR3457948	3 months	animal1
SRR3457949	6 months	animal1
SRR3457950	6 months	animal4
SRR3457951	15 months	animal4
SRR3457952	3 months	animal5
SRR3457953	6 months	animal5
SRR3457954	15 months	animal5
SRR3457955	3 months	animal6
SRR3457956	6 months	animal6
...

So the config file will look like this:

[de]
sample_labels = SraRunTable.txt
sep = tab #tab delimiter
count_file = mirna_mature_counts.csv
sample_col = Run_s
factors_col = age_s
conditions = 3 months,6 months
logfc_cutoff = 1.5