Sequence Analysis Tutorial in R and Bioconductor.
Here I work through how to analyse NGS data in R without using the Linux terminal.
- Hundreds of reusable NGS packages are available
- Invent new things rather than reinventing existing ones
- Many NGS methods require advanced statistical methods
- Many NGS applications share similar analysis needs. Most of them have existing solutions.
- Access to advanced and reproducible genome graphics
Base R offers some basic string handling utilities and a wide spectrum of numeric data analysis tools.
Bioconductor packages provide much more sophisticated string handling utilities for sequence analysis (Lawrence et al. 2013; Huber et al. 2015).
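As a rough illustration of the basic string handling mentioned above, the following sketch uses only base R functions on a made-up example sequence (the sequence itself is an assumption chosen for illustration). It shows that simple tasks are possible, but that something like a reverse complement already takes several nested calls.

```r
# Hypothetical example sequence, not taken from any real data set
dna <- "GCATATTAC"

# Length and substring extraction
seq_len_dna <- nchar(dna)
first_codon <- substring(dna, 1, 3)

# Reverse complement: complement with chartr(), then reverse the characters
complement <- chartr("ACGT", "TGCA", dna)
revcomp <- paste(rev(strsplit(complement, "")[[1]]), collapse = "")

print(seq_len_dna)
print(first_codon)
print(revcomp)
```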
To install Bioconductor packages, execute the following lines in the R console. Please also make sure that you have a recent R version installed on your system; the BiocManager installer used below requires R 3.5.0 or higher.
# BiocManager replaces the legacy biocLite.R installer script
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install(c("Biostrings", "GenomicRanges", "rtracklayer", "systemPipeR", "seqLogo", "ShortRead"))
- Biostrings: general sequence analysis environment
- ShortRead: pipeline for short read data
- IRanges: low-level infrastructure for range data
- GenomicRanges: high-level infrastructure for range data
- GenomicFeatures: managing transcript centric annotations
- GenomicAlignments: handling short genomic alignments
- systemPipeR: NGS workflow and report generation environment
- Rsamtools: interface to samtools, bcftools and tabix
- BSgenome: genome annotation data
- biomaRt: interface to BioMart annotations
- rtracklayer: annotation imports, interface to online genome browsers
- HelloRanges: Bedtools semantics in Bioc’s Ranges infrastructure
XString for a single sequence:
- DNAString: for DNA
- RNAString: for RNA
- AAString: for amino acid
- BString: for any string
XStringSet for many sequences:
- DNAStringSet: for DNA
- RNAStringSet: for RNA
- AAStringSet: for amino acid
- BStringSet: for any string
QualityScaledXStringSet for sequences with quality data:
- QualityScaledDNAStringSet: for DNA
- QualityScaledRNAStringSet: for RNA
- QualityScaledAAStringSet: for amino acid
- QualityScaledBStringSet: for any string
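The containers listed above can be tried out with a few constructor calls. This is a minimal sketch that assumes the Biostrings package is installed; the short sequences and the names read1 and read2 are made up for illustration, while the 36-base read and its quality string are the FASTQ example used later in this tutorial.

```r
library(Biostrings)

# Single sequences (XString)
dna <- DNAString("GCATATTAC")
rna <- RNAString(dna)            # converts T to U
rc  <- reverseComplement(dna)

# Many sequences in one container (XStringSet)
dset <- DNAStringSet(c(read1 = "ACGT", read2 = "GGCATT"))
width(dset)                      # length of each sequence

# Sequences paired with base call qualities (QualityScaledXStringSet)
qset <- QualityScaledDNAStringSet(
  DNAStringSet("CAACGAGTTCACACCTTGGCCGACAGGCCCGGGTAA"),
  PhredQuality("BA@7>B=>:>>7@7@>>9=BAA?;>52;>:9=8.=A")
)
quality(qset)                    # the stored quality strings
```

The same pattern applies to the RNA, amino acid, and generic string variants by swapping the constructor name.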
Download the following sequences to your current working directory and then import them into R:
Sequence and Quality Data: FASTQ Format
Four lines per sequence:
- ID
- Sequence
- ID
- Base call qualities (Phred scores) as ASCII characters
The following gives an example of an Illumina read in a FASTQ file.
- @SRR038845.3 HWI-EAS038:6:1:0:1938 length=36
- CAACGAGTTCACACCTTGGCCGACAGGCCCGGGTAA
- +SRR038845.3 HWI-EAS038:6:1:0:1938 length=36
- BA@7>B=>:>>7@7@>>9=BAA?;>52;>:9=8.=A
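One way to get such a record into R is the ShortRead package. The sketch below is an assumption about workflow rather than a prescribed method: it writes the example read shown above to a temporary file (the temporary path is only for illustration) and reads it back with readFastq.

```r
library(ShortRead)

# Write the example read to a temporary FASTQ file (illustrative path)
fastq_lines <- c(
  "@SRR038845.3 HWI-EAS038:6:1:0:1938 length=36",
  "CAACGAGTTCACACCTTGGCCGACAGGCCCGGGTAA",
  "+SRR038845.3 HWI-EAS038:6:1:0:1938 length=36",
  "BA@7>B=>:>>7@7@>>9=BAA?;>52;>:9=8.=A"
)
fq_path <- tempfile(fileext = ".fastq")
writeLines(fastq_lines, fq_path)

# Import the file; the result holds IDs, sequences and qualities together
fq <- readFastq(fq_path)

id(fq)       # read identifiers
sread(fq)    # sequences as a DNAStringSet
quality(fq)  # base call qualities
```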
Phred quality scores are integers from 0-50 that are stored as ASCII characters after adding 33. The basic R functions rawToChar and charToRaw can be used to interconvert among their representations.
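The conversion described above can be sketched in base R as follows, using the quality string from the example read. The variable names are chosen for illustration only.

```r
# Quality string from the example FASTQ read
qual <- "BA@7>B=>:>>7@7@>>9=BAA?;>52;>:9=8.=A"

# ASCII characters -> Phred scores: take the byte values and subtract 33
phred <- as.integer(charToRaw(qual)) - 33L
print(phred)

# Phred scores -> ASCII characters: add 33 and convert the bytes back
qual_roundtrip <- rawToChar(as.raw(phred + 33L))
print(identical(qual, qual_roundtrip))
```

Here the first character, B, has ASCII code 66 and therefore encodes a Phred score of 33.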
Important Data Objects for Range Operations.
- IRanges: stores range data only (IRanges library)
- GRanges: stores ranges and annotations (GenomicRanges library)
- GRangesList: list version of GRanges container (GenomicRanges library)
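The three containers above build on each other, which a small example may make clearer. This is a minimal sketch assuming the GenomicRanges package is installed; the coordinates, chromosome names, and the score column are invented for illustration.

```r
library(GenomicRanges)  # also attaches IRanges

# IRanges: plain start/end intervals without any genomic context
ir <- IRanges(start = c(1, 10, 20), end = c(5, 15, 30))
width(ir)

# GRanges: the same ranges plus chromosome, strand and a metadata column
gr <- GRanges(
  seqnames = c("chr1", "chr1", "chr2"),
  ranges   = ir,
  strand   = c("+", "-", "+"),
  score    = c(1.5, 2.0, 3.2)
)

# GRangesList: GRanges grouped into list elements, here one per chromosome
grl <- split(gr, seqnames(gr))
names(grl)
```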