Convert Biostrings2classes.Rnw to .Rmd #83
Open: villafup wants to merge 8 commits into Bioconductor:devel from villafup:biostrings2classes-rmd
7adafad Initial conversion to Rmd (villafup)
0326e70 Biostrings2Classes Initial conversion from Rnw to Rmd (villafup)
c344433 Biostrings2classes Rnw to Rmd (villafup)
ddd00b1 Biostrings2classes Rnw to Rmd #2 (villafup)
45d0d6e Biostrings2classes Rnw to Rmd #2 (villafup)
eb7562c Biostrings2Classes.Rnw to .Rmd, remove .Rnw (villafup)
10d4296 Biostrings2Classes.Rnw to .Rmd (villafup)
19e93fb Merge branch 'Bioconductor:devel' into biostrings2classes-rmd (villafup)
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,262 @@ | ||
--- | ||
title: "The *Biostrings* 2 classes (work in progress)" | ||
author: | ||
- name: "Hervé Pagès" | ||
- name: "Paul Villafuerte" | ||
affiliation: "Vignette translation from Sweave to Rmarkdown / HTML" | ||
date: "`r format(Sys.time(), '%B %d, %Y')`" | ||
vignette: > | ||
%\VignetteIndexEntry{A short presentation of the basic classes defined in Biostrings 2} | ||
%\VignetteKeywords{DNA, RNA, Sequence, Biostrings, Sequence alignment} | ||
%\VignettePackage{Biostrings} | ||
%\VignetteEncoding{UTF-8} | ||
%\VignetteEngine{knitr::rmarkdown} | ||
output: | ||
BiocStyle::html_document: | ||
number_sections: true | ||
toc: yes | ||
toc_depth: 4 | ||
editor_options: | ||
markdown: | ||
wrap: 72 | ||
--- | ||
|
||
# Introduction | ||
|
||
This document briefly presents the new set of classes implemented in the | ||
`r Biocpkg('Biostrings')` 2 package. Like the `r Biocpkg('Biostrings')` | ||
1 classes (found in `r Biocpkg('Biostrings')` v 1.4.x), they were designed to make manipulation of big | ||
strings (like DNA or RNA sequences) easy and fast. This is achieved by | ||
keeping the 3 following ideas from the `r Biocpkg('Biostrings')` 1 package: (1) use R | ||
external pointers to store the string data, (2) use bit patterns to | ||
encode the string data, (3) provide the user with a convenient class of | ||
objects where each instance can store a set of views *on the same* big | ||
string (these views being typically the matches returned by a search | ||
algorithm). | ||
|
||
However, there is a flaw in the *BioString* class design that prevents | ||
the search algorithms from returning correct information about the matches | ||
(i.e. the views) that they found. The new classes address this issue by | ||
replacing the *BioString* class (implemented in `r Biocpkg('Biostrings')` 1) with 2 new | ||
Comment on lines +37 to +40:
I think lines 37 and 40 (the first reference only) reference an old
with
|
||
classes: (1) the *XString* class used to represent a *single* string, | ||
and (2) the *XStringViews* class used to represent a set of views *on | ||
the same* *XString* object, and by introducing new implementations and | ||
new interfaces for these 2 classes. | ||
|
||
# The *XString* class and its subsetting operator `[` | ||
|
||
The *XString* is in fact a virtual class and therefore cannot be | ||
instantiated. Only subclasses (or subtypes) *BString*, *DNAString*, | ||
*RNAString* and *AAString* can. These classes are direct extensions of | ||
the *XString* class (no additional slot). | ||
|
||
A first *BString* object: | ||
|
||
```{r a1a, message=FALSE} | ||
library(Biostrings) | ||
``` | ||
|
||
```{r a1b} | ||
b <- BString("I am a BString object") | ||
b | ||
length(b) | ||
``` | ||
|
||
A *DNAString* object: | ||
|
||
```{r a2} | ||
d <- DNAString("TTGAAAA-CTC-N") | ||
d | ||
length(d) | ||
``` | ||
|
||
The differences with a *BString* object are: (1) only letters from the | ||
*IUPAC extended genetic alphabet* + the gap letter (`-`) are allowed and | ||
(2) each letter in the argument passed to the `DNAString` function is | ||
encoded in a special way before it's stored in the *DNAString* object. | ||
|
||
Access to the individual letters: | ||
|
||
```{r a3} | ||
d[3] | ||
d[7:12] | ||
d[] | ||
b[length(b):1] | ||
``` | ||
|
||
Only *in bounds* positive numeric subscripts are supported. | ||
|
||
In fact the subsetting operator for *XString* objects is not efficient | ||
and one should always use the `subseq` method to extract a substring | ||
from a big string: | ||
|
||
```{r a4} | ||
bb <- subseq(b, 3, 6) | ||
dd1 <- subseq(d, end=7) | ||
dd2 <- subseq(d, start=8) | ||
``` | ||
|
||
To *dump* an *XString* object as a character vector (of length 1), use | ||
the `toString` method: | ||
|
||
```{r a5} | ||
toString(dd2) | ||
``` | ||
|
||
Note that `length(dd2)` is equivalent to `nchar(toString(dd2))` but the | ||
latter would be very inefficient on a big *DNAString* object. | ||
|
||
_**[TODO: Make a generic of the substr() function to work with XString | ||
objects. It will be essentially doing toString(subseq()).]**_ | ||
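
The helper described in the TODO above could be sketched roughly as follows. The name `xsubstr` and its argument names are hypothetical and are not part of Biostrings; only `subseq()` and `toString()` come from the package.

```r
library(Biostrings)

# Hypothetical helper (not a Biostrings function): a substr()-like
# wrapper that extracts a substring and returns it as a character string.
xsubstr <- function(x, start, stop) {
  toString(subseq(x, start = start, end = stop))
}

b <- BString("I am a BString object")
xsubstr(b, 3, 6)
```

Because `subseq()` runs before `toString()`, only the extracted substring is converted to a character vector, not the whole string.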
|
||
# The `==` binary operator for *XString* objects | ||
|
||
The 2 following comparisons are `TRUE`: | ||
|
||
```{r b1, results="hide"} | ||
bb == "am a" | ||
dd2 != DNAString("TG") | ||
``` | ||
|
||
When the 2 sides of `==` don't belong to the same class then the side | ||
belonging to the "lowest" class is first converted to an object | ||
belonging to the class of the other side (the "highest" class). The | ||
class (pseudo-)order is *character* \< *BString* \< *DNAString*. When | ||
both sides are *XString* objects of the same subtype (e.g. both are | ||
*DNAString* objects) then the comparison is very fast because it only | ||
has to call the C standard function `memcmp()` and no memory allocation | ||
or string encoding/decoding is required. | ||
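
As a small illustration of the conversion rule, the sketch below compares the same *BString* object first against a plain character string and then against an explicitly constructed *BString*. The variable names `res_mixed` and `res_same` are only for illustration.

```r
library(Biostrings)

b  <- BString("I am a BString object")
bb <- subseq(b, 3, 6)

# Mixed comparison: the character string (the "lowest" class) is
# converted before the comparison takes place.
res_mixed <- bb == "am a"

# Doing the conversion by hand gives a same-class comparison.
res_same <- bb == BString("am a")

c(res_mixed, res_same)
```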
|
||
The 2 following expressions provoke an error because the right member | ||
can't be "upgraded" (converted) to an object of the same class as the | ||
left member: | ||
|
||
```{r b2, echo=FALSE} | ||
cat('> bb == ""\n') | ||
cat('> d == bb\n') | ||
``` | ||
|
||
When comparing an *RNAString* object with a *DNAString* object, U and T | ||
are considered equal: | ||
|
||
```{r b3} | ||
r <- RNAString(d) | ||
r | ||
r == d | ||
``` | ||
|
||
# The *XStringViews* class and its subsetting operators `[` and `[[` | ||
|
||
An *XStringViews* object contains a set of views *on the same* *XString* | ||
object called the *subject* string. Here is an *XStringViews* object | ||
with 4 views: | ||
|
||
```{r c1} | ||
v4 <- Views(dd2, start=3:0, end=5:8) | ||
v4 | ||
length(v4) | ||
``` | ||
|
||
Note that the 2 last views are *out of limits*. | ||
|
||
You can select a subset of views from an *XStringViews* object: | ||
|
||
```{r c3} | ||
v4[4:2] | ||
``` | ||
|
||
The returned object is still an *XStringViews* object, even if we select | ||
only one element. You need to use double-brackets to extract a given | ||
view as an *XString* object: | ||
|
||
```{r c4} | ||
v4[[2]] | ||
``` | ||
|
||
You can't extract a view that is *out of limits*: | ||
|
||
```{r c6, echo=FALSE} | ||
cat('> v4[[3]]\n') | ||
cat(try(v4[[3]], silent=TRUE)) | ||
``` | ||
|
||
Note that, when `start` and `end` are numeric vectors and `i` is a | ||
*single* integer, `Views(b, start, end)[[i]]` is equivalent to | ||
`subseq(b, start[i], end[i])`. | ||
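
The equivalence can be checked on a small example. This is a minimal sketch: the `start` and `end` vectors are made up for illustration and all three views are within the limits of the subject.

```r
library(Biostrings)

b <- BString("I am a BString object")

# Illustrative coordinates, all within the limits of b.
start <- c(1L, 3L, 8L)
end   <- c(1L, 6L, 14L)
v <- Views(b, start = start, end = end)

i <- 2
# Extracting view i and calling subseq() with the same coordinates
# give the same string.
same <- v[[i]] == subseq(b, start[i], end[i])
same
toString(v[[i]])
```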
|
||
Subsetting also works with negative or logical values with the expected | ||
semantics (the same as for R built-in vectors): | ||
|
||
```{r c7} | ||
v4[-3] | ||
v4[c(TRUE, FALSE)] | ||
``` | ||
|
||
Note that the logical vector is recycled to the length of `v4`. | ||
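
To make the recycling explicit, the following sketch rebuilds `v4` and compares the start positions selected by the short logical subscript with those selected by the same subscript expanded by hand. The names `idx` and `recycled` are only for illustration.

```r
library(Biostrings)

d   <- DNAString("TTGAAAA-CTC-N")
dd2 <- subseq(d, start = 8)
v4  <- Views(dd2, start = 3:0, end = 5:8)

idx <- c(TRUE, FALSE)
# The same subscript, expanded by hand to the length of v4.
recycled <- rep(idx, length.out = length(v4))

start(v4[idx])        # views 1 and 3 are kept
start(v4)[recycled]   # same positions, selected on the plain integer vector
start(v4[-3])         # everything except view 3
```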
|
||
# A few more *XStringViews* objects | ||
|
||
12 views (all of the same width): | ||
|
||
```{r d1} | ||
v12 <- Views(DNAString("TAATAATG"), start=-2:9, end=0:11) | ||
``` | ||
|
||
This is the same as doing `Views(d, start=1, end=length(d))`: | ||
|
||
```{r d2, results="hide"} | ||
as(d, "Views") | ||
``` | ||
|
||
Hence the following will always return the `d` object itself: | ||
|
||
```{r d3, results="hide"} | ||
as(d, "Views")[[1]] | ||
``` | ||
|
||
3 *XStringViews* objects with no view: | ||
|
||
```{r d4, results="hide"} | ||
v12[0] | ||
v12[FALSE] | ||
Views(d) | ||
``` | ||
|
||
# The `==` binary operator for *XStringViews* objects | ||
|
||
This operator is the vectorized version of the `==` operator defined | ||
previously for *XString* objects: | ||
|
||
```{r e1} | ||
v12 == DNAString("TAA") | ||
``` | ||
|
||
To display all the views in `v12` that are equal to a given view, you | ||
can type R cuties like: | ||
|
||
```{r e2} | ||
v12[v12 == v12[4]] | ||
v12[v12 == v12[1]] | ||
``` | ||
|
||
This is `TRUE`: | ||
|
||
```{r e3, results="hide"} | ||
v12[3] == Views(RNAString("AU"), start=0, end=2) | ||
``` | ||
|
||
# The `start`, `end` and `width` methods | ||
|
||
```{r f1} | ||
start(v4) | ||
end(v4) | ||
width(v4) | ||
``` | ||
|
||
Note that `start(v4)[i]` is equivalent to `start(v4[i])`, except that | ||
the former will not issue an error if `i` is out of bounds (same for | ||
`end` and `width` methods). | ||
|
||
Also, when `i` is a *single* integer, `width(v4)[i]` is equivalent to | ||
`length(v4[[i]])` except that the former will not issue an error if `i` | ||
is out of bounds or if view `v4[i]` is *out of limits*. |
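
These relationships can be checked on `v4` directly. The sketch below rebuilds `v4` so that it is self-contained; the out-of-bounds index `10` is an arbitrary choice for illustration.

```r
library(Biostrings)

d   <- DNAString("TTGAAAA-CTC-N")
dd2 <- subseq(d, start = 8)
v4  <- Views(dd2, start = 3:0, end = 5:8)

# Widths follow from the starts and ends, even for out-of-limits views.
all(width(v4) == end(v4) - start(v4) + 1L)

# For a view that is within limits, the width equals the length of
# the extracted XString.
i <- 2
width(v4)[i] == length(v4[[i]])

# Indexing the plain integer vector past its end gives NA, not an error.
start(v4)[10]
```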
Comment: space between typicallythe