Convert Biostrings2classes.Rnw to .Rmd #83

Open
wants to merge 8 commits into
base: devel
Changes from 7 commits
262 changes: 262 additions & 0 deletions vignettes/Biostrings2Classes.Rmd
@@ -0,0 +1,262 @@
---
title: "The *Biostrings* 2 classes (work in progress)"
author:
- name: "Hervé Pagès"
- name: "Paul Villafuerte"
  affiliation: "Vignette translation from Sweave to Rmarkdown / HTML"
date: "`r format(Sys.time(), '%B %d, %Y')`"
vignette: >
  %\VignetteIndexEntry{A short presentation of the basic classes defined in Biostrings 2}
  %\VignetteKeywords{DNA, RNA, Sequence, Biostrings, Sequence alignment}
  %\VignettePackage{Biostrings}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
output:
  BiocStyle::html_document:
    number_sections: true
    toc: yes
    toc_depth: 4
editor_options:
  markdown:
    wrap: 72
---

# Introduction

This document briefly presents the new set of classes implemented in the
`r Biocpkg('Biostrings')` 2 package. Like the `r Biocpkg('Biostrings')`
1 classes (found in `r Biocpkg('Biostrings')` v 1.4.x), they were designed to make manipulation of big
strings (like DNA or RNA sequences) easy and fast. This is achieved by
keeping the 3 following ideas from the `r Biocpkg('Biostrings')` 1 package: (1) use R
external pointers to store the string data, (2) use bit patterns to
encode the string data, (3) provide the user with a convenient class of
objects where each instance can store a set of views *on the same* big
string (these views being typicallythe matches returned by a search

Contributor

space between typicallythe

algorithm).

However, there is a flaw in the `r Biocpkg('Biostring')` class design that prevents
the search algorithms from returning correct information about the matches
(i.e. the views) that they found. The new classes address this issue by
replacing the `r Biocpkg('Biostrings')` class (implemented in `r Biocpkg('Biostrings')` 1) by 2 new
Comment on lines +37 to +40
Contributor

I think lines 37 and 40 (the first reference only) reference an old BioString class, so replace

`r Biocpkg('Biostrings')`

with

BioString

classes: (1) the *XString* class used to represent a *single* string,
and (2) the *XStringViews* class used to represent a set of views *on
the same* *XString* object, and by introducing new implementations and
new interfaces for these 2 classes.

# The *XString* class and its subsetting operator `[`

The *XString* class is in fact a virtual class and therefore cannot be
instantiated. Only its subclasses (or subtypes) *BString*, *DNAString*,
*RNAString* and *AAString* can. These classes are direct extensions of
the *XString* class (no additional slot).
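
As a minimal sketch of the statement above, the following block asks the
S4 system whether *XString* is virtual and checks that a concrete subtype
is still recognized as an *XString*. It assumes a current
`r Biocpkg('Biostrings')` installation; the object name `x` is made up for
illustration.

```r
library(Biostrings)

# XString is expected to be virtual, so it has no instances of its own.
isVirtualClass("XString")

# A concrete subtype can be instantiated and is still an XString.
x <- DNAString("ACGT")
is(x, "XString")
class(x)
```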

A first *BString* object:

```{r a1a, message=FALSE}
library(Biostrings)
```

```{r a1b}
b <- BString("I am a BString object")
b
length(b)
```

A *DNAString* object:

```{r a2}
d <- DNAString("TTGAAAA-CTC-N")
d
length(d)
```

The differences with a *BString* object are: (1) only letters from the
*IUPAC extended genetic alphabet* + the gap letter (`-`) are allowed and
(2) each letter in the argument passed to the `DNAString` function is
encoded in a special way before it's stored in the *DNAString* object.
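
Point (1) can be illustrated with a short sketch. It assumes the
`DNA_ALPHABET` constant exported by `r Biocpkg('Biostrings')`; the input
string `"TTGAXZ"` is invented for the example and contains two letters
that are not nucleotide codes.

```r
library(Biostrings)

# The gap letter belongs to the alphabet accepted by DNAString().
"-" %in% DNA_ALPHABET

# "X" and "Z" are not in that alphabet, so the constructor is expected
# to fail; tryCatch() turns the error into its message string.
res <- tryCatch(DNAString("TTGAXZ"),
                error = function(e) conditionMessage(e))
is.character(res)

# A BString accepts the same input without complaint.
BString("TTGAXZ")
```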

Access to the individual letters:

```{r a3}
d[3]
d[7:12]
d[]
b[length(b):1]
```

Only *in bounds* positive numeric subscripts are supported.

In fact the subsetting operator for *XString* objects is not efficient
and one should always use the `subseq` method to extract a substring
from a big string:

```{r a4}
bb <- subseq(b, 3, 6)
dd1 <- subseq(d, end=7)
dd2 <- subseq(d, start=8)
```
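
A small sketch comparing the two extraction styles on the same range. Both
return the same letters; only the extraction strategy differs. The
variable names `by_bracket` and `by_subseq` are illustrative.

```r
library(Biostrings)

d <- DNAString("TTGAAAA-CTC-N")

# Letters 7 to 12, extracted with `[` and with subseq().
by_bracket <- d[7:12]
by_subseq  <- subseq(d, start=7, end=12)
by_bracket == by_subseq

# When 'start' or 'end' is omitted, subseq() extends to the
# corresponding extremity of the string.
toString(subseq(d, end=7))
toString(subseq(d, start=8))
```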

To *dump* an *XString* object as a character vector (of length 1), use
the `toString` method:

```{r a5}
toString(dd2)
```

Note that `length(dd2)` is equivalent to `nchar(toString(dd2))` but the
latter would be very inefficient on a big *DNAString* object.

_**[TODO: Make a generic of the substr() function to work with XString
objects. It will be essentially doing toString(subseq()).]**_
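
One possible shape for the helper described in the TODO, as a sketch only.
The function name `substrXString` is hypothetical and not part of
`r Biocpkg('Biostrings')`. Unlike base `substr()`, it does not clamp
out-of-range positions, since `subseq()` rejects them.

```r
library(Biostrings)

# Hypothetical helper: substr()-like extraction that returns a plain
# character string by combining subseq() and toString().
substrXString <- function(x, start, stop) {
    toString(subseq(x, start=start, end=stop))
}

b <- BString("I am a BString object")
substrXString(b, 3, 6)
```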

# The `==` binary operator for *XString* objects

The 2 following comparisons are `TRUE`:

```{r b1, results="hide"}
bb == "am a"
dd2 != DNAString("TG")
```

When the 2 sides of `==` don't belong to the same class then the side
belonging to the "lowest" class is first converted to an object
belonging to the class of the other side (the "highest" class). The
class (pseudo-)order is *character* \< *BString* \< *DNAString*. When
both sides are *XString* objects of the same subtype (e.g. both are
*DNAString* objects) then the comparison is very fast because it only
has to call the C standard function `memcmp()` and no memory allocation
or string encoding/decoding is required.
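
The two situations described above can be put side by side in a short
sketch: a comparison that needs a conversion (character against
*BString*) and comparisons between two objects of the same subtype. The
names `d1` and `d2` are illustrative.

```r
library(Biostrings)

b  <- BString("I am a BString object")
bb <- subseq(b, 3, 6)
d1 <- DNAString("TTGAAAA-CTC-N")
d2 <- DNAString("TTGAAAA-CTC-N")

# character vs BString: the character side is converted first.
bb == "am a"

# Same subtype on both sides: no conversion is needed.
d1 == d2
d1 != subseq(d1, start=8)
```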

The 2 following expressions provoke an error because the right member
can't be "upgraded" (converted) to an object of the same class as the
left member:

```{r b2, echo=FALSE}
cat('> bb == ""\n')
cat('> d == bb\n')
```

When comparing an *RNAString* object with a *DNAString* object, U and T
are considered equal:

```{r b3}
r <- RNAString(d)
r
r == d
```

# The *XStringViews* class and its subsetting operators `[` and `[[`

An *XStringViews* object contains a set of views *on the same* *XString*
object called the *subject* string. Here is an *XStringViews* object
with 4 views:

```{r c1}
v4 <- Views(dd2, start=3:0, end=5:8)
v4
length(v4)
```

Note that the last 2 views are *out of limits*.

You can select a subset of views from an *XStringViews* object:

```{r c3}
v4[4:2]
```

The returned object is still an *XStringViews* object, even if we select
only one element. You need to use double-brackets to extract a given
view as an *XString* object:

```{r c4}
v4[[2]]
```

You can't extract a view that is *out of limits*:

```{r c6,echo=FALSE}
cat('> v4[[3]]\n')
cat(try(v4[[3]], silent=TRUE))
```

Note that, when `start` and `end` are numeric vectors and `i` is a
*single* integer, `Views(b, start, end)[[i]]` is equivalent to
`subseq(b, start[i], end[i])`.
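
The equivalence can be checked view by view, as in the sketch below. The
`start` and `end` vectors are chosen for illustration so that every view
stays within the limits of `b`.

```r
library(Biostrings)

b <- BString("I am a BString object")
start <- c(1, 3, 8)
end   <- c(1, 6, 14)
v <- Views(b, start=start, end=end)

# Compare each extracted view with the corresponding subseq() call.
same <- vapply(seq_along(start), function(i) {
    toString(v[[i]]) == toString(subseq(b, start[i], end[i]))
}, logical(1))
same
```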

Subsetting also works with negative or logical values with the expected
semantics (the same as for R built-in vectors):

```{r c7}
v4[-3]
v4[c(TRUE, FALSE)]
```

Note that the logical vector is recycled to the length of `v4`.

# A few more *XStringViews* objects

12 views (all of the same width):

```{r d1}
v12 <- Views(DNAString("TAATAATG"), start=-2:9, end=0:11)
```

This is the same as doing `Views(d, start=1, end=length(d))`:

```{r d2, results="hide"}
as(d, "Views")
```

Hence the following will always return the `d` object itself:

```{r d3, results="hide"}
as(d, "Views")[[1]]
```

3 *XStringViews* objects with no view:

```{r d4, results="hide"}
v12[0]
v12[FALSE]
Views(d)
```

# The `==` binary operator for *XStringViews* objects

This operator is the vectorized version of the `==` operator defined
previously for *XString* objects:

```{r e1}
v12 == DNAString("TAA")
```

To display all the views in `v12` that are equal to a given view, you
can type R cuties like:

```{r e2}
v12[v12 == v12[4]]
v12[v12 == v12[1]]
```

This is `TRUE`:

```{r e3, results="hide"}
v12[3] == Views(RNAString("AU"), start=0, end=2)
```

# The `start`, `end` and `width` methods

```{r f1}
start(v4)
end(v4)
width(v4)
```

Note that `start(v4)[i]` is equivalent to `start(v4[i])`, except that
the former will not issue an error if `i` is out of bounds (same for
`end` and `width` methods).

Also, when `i` is a *single* integer, `width(v4)[i]` is equivalent to
`length(v4[[i]])` except that the former will not issue an error if `i`
is out of bounds or if view `v4[i]` is *out of limits*.
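
The two notes above can be illustrated with a sketch that rebuilds `v4`
and wraps the failing expressions in `try()`, so that the block runs to
completion.

```r
library(Biostrings)

dd2 <- subseq(DNAString("TTGAAAA-CTC-N"), start=8)
v4 <- Views(dd2, start=3:0, end=5:8)

# In-bounds index and in-limits view: both expressions agree.
width(v4)[2] == length(v4[[2]])

# View 3 is out of limits: width() still reports its width,
# whereas extracting it with [[ raises an error.
width(v4)[3]
inherits(try(v4[[3]], silent=TRUE), "try-error")

# Index 10 is out of bounds: subsetting the integer vector yields NA,
# whereas subsetting the views object itself raises an error.
start(v4)[10]
inherits(try(start(v4[10]), silent=TRUE), "try-error")
```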