Bioconductor · villafup · Sep 4, 2022 · Sep 7, 2022 · Nov 22, 2022 · Nov 22, 2022
diff --git a/vignettes/Biostrings2Classes.Rmd b/vignettes/Biostrings2Classes.Rmd
@@ -0,0 +1,262 @@
+---
+title: "The *Biostrings* 2 classes (work in progress)"
+author: 
+- name: "Hervé Pagès"
+- name: "Paul Villafuerte"
+  affiliation: "Vignette translation from Sweave to Rmarkdown / HTML"
+date: "`r format(Sys.time(), '%B %d, %Y')`"
+vignette: >
+  %\VignetteIndexEntry{A short presentation of the basic classes defined in Biostrings 2}
+  %\VignetteKeywords{DNA, RNA, Sequence, Biostrings, Sequence alignment}
+  %\VignettePackage{Biostrings}
+  %\VignetteEncoding{UTF-8}
+  %\VignetteEngine{knitr::rmarkdown}
+output:
+  BiocStyle::html_document:
+    number_sections: true
+    toc: yes
+    toc_depth: 4
+editor_options: 
+  markdown:   
+    wrap: 72 
+---
+
+# Introduction
+
+This document briefly presents the new set of classes implemented in the
+`r Biocpkg('Biostrings')` 2 package. Like the `r Biocpkg('Biostrings')` 
+1 classes (found in `r Biocpkg('Biostrings')` v 1.4.x), they were designed to make manipulation of big
+strings (like DNA or RNA sequences) easy and fast. This is achieved by
+keeping the 3 following ideas from the `r Biocpkg('Biostrings')` 1 package: (1) use R
+external pointers to store the string data, (2) use bit patterns to
+encode the string data, (3) provide the user with a convenient class of
+objects where each instance can store a set of views *on the same* big
+string (these views being typically the matches returned by a search
+algorithm).
+
+However, there is a flaw in the *BioString* class design that prevents
+the search algorithms from returning correct information about the matches
+(i.e. the views) that they found. The new classes address this issue by
+replacing the *BioString* class (implemented in `r Biocpkg('Biostrings')` 1) with 2 new
+classes: (1) the *XString* class used to represent a *single* string,
+and (2) the *XStringViews* class used to represent a set of views *on
+the same* *XString* object, and by introducing new implementations and
+new interfaces for these 2 classes.
+
+# The *XString* class and its subsetting operator `[`
+
+The *XString* class is in fact a virtual class and therefore cannot be
+instantiated. Only subclasses (or subtypes) *BString*, *DNAString*,
+*RNAString* and *AAString* can. These classes are direct extensions of
+the *XString* class (no additional slot).
+
+A first *BString* object:
+
+```{r a1a, message=FALSE}
+library(Biostrings)
+```
+
+```{r a1b}
+b <- BString("I am a BString object")
+b
+length(b)
+```
+
+A *DNAString* object:
+
+```{r a2}
+d <- DNAString("TTGAAAA-CTC-N")
+d
+length(d)
+```
+
+The differences from a *BString* object are: (1) only letters from the
+*IUPAC extended genetic alphabet* + the gap letter (`-`) are allowed and
+(2) each letter in the argument passed to the `DNAString` function is
+encoded in a special way before it's stored in the *DNAString* object.
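+
+As a small sketch of point (1), the chunk below lists the accepted
+letters with `alphabet()` and checks that a string containing letters
+outside that alphabet is rejected. The input `"TTGAXZ"` is an
+illustrative value chosen for this sketch:
+
+```{r a2x}
+library(Biostrings)
+d <- DNAString("TTGAAAA-CTC-N")
+# Letters accepted by a DNAString (IUPAC codes plus gap-like symbols)
+alphabet(d)
+# 'X' and 'Z' are not in that alphabet, so the constructor should fail
+bad <- try(DNAString("TTGAXZ"), silent=TRUE)
+inherits(bad, "try-error")
+```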
+
+Access to the individual letters:
+
+```{r a3}
+d[3]
+d[7:12]
+d[]
+b[length(b):1]
+```
+
+Only *in bounds* positive numeric subscripts are supported.
+
+In fact the subsetting operator for *XString* objects is not efficient
+and one should always use the `subseq` method to extract a substring
+from a big string:
+
+```{r a4}
+bb <- subseq(b, 3, 6)
+dd1 <- subseq(d, end=7)
+dd2 <- subseq(d, start=8)
+```
+
+To *dump* an *XString* object as a character vector (of length 1), use
+the `toString` method:
+
+```{r a5}
+toString(dd2)
+```
+
+Note that `length(dd2)` is equivalent to `nchar(toString(dd2))` but the
+latter would be very inefficient on a big *DNAString* object.
+
+_**[TODO: Make a generic of the substr() function to work with XString
+objects. It will be essentially doing toString(subseq()).]**_
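+
+Until such a generic exists, the idea in the TODO above can be
+approximated with an ordinary function. The name `xsubstr` is
+hypothetical and not part of the package; this is only a sketch of the
+`toString(subseq())` combination:
+
+```{r a6}
+library(Biostrings)
+b <- BString("I am a BString object")
+# Hypothetical helper: return positions start..stop as a character string
+xsubstr <- function(x, start, stop) toString(subseq(x, start=start, end=stop))
+xsubstr(b, 3, 6)
+# Agrees with base substr() applied to the dumped character string
+xsubstr(b, 3, 6) == substr(toString(b), 3, 6)
+```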
+
+# The `==` binary operator for *XString* objects
+
+The 2 following comparisons are `TRUE`:
+
+```{r b1, results="hide"}
+bb == "am a"
+dd2 != DNAString("TG")
+```
+
+When the 2 sides of `==` don't belong to the same class then the side
+belonging to the "lowest" class is first converted to an object
+belonging to the class of the other side (the "highest" class). The
+class (pseudo-)order is *character* \< *BString* \< *DNAString*. When
+both sides are *XString* objects of the same subtype (e.g. both are
+*DNAString* objects) then the comparison is very fast because it only
+has to call the C standard function `memcmp()` and no memory allocation
+or string encoding/decoding is required.
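+
+A minimal sketch of two of the cases described above: a *BString*
+compared with a character string, and two *DNAString* objects compared
+with each other. The second DNA literal (ending in `A`) is an
+illustrative value chosen for this sketch:
+
+```{r b1x}
+library(Biostrings)
+b <- BString("I am a BString object")
+d <- DNAString("TTGAAAA-CTC-N")
+# character vs BString: the character side is converted to a BString first
+b == "I am a BString object"
+# Same subtype on both sides: compared directly, no conversion needed
+d == DNAString("TTGAAAA-CTC-N")
+d == DNAString("TTGAAAA-CTC-A")
+```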
+
+The 2 following expressions provoke an error because the right member
+can't be "upgraded" (converted) to an object of the same class as the
+left member:
+
+```{r b2, echo=FALSE}
+cat('> bb == ""\n')
+cat('> d == bb\n')
+```
+
+When comparing an *RNAString* object with a *DNAString* object, U and T
+are considered equal:
+
+```{r b3}
+r <- RNAString(d)
+r
+r == d
+```
+
+# The *XStringViews* class and its subsetting operators `[` and `[[`
+
+An *XStringViews* object contains a set of views *on the same* *XString*
+object called the *subject* string. Here is an *XStringViews* object
+with 4 views:
+
+```{r c1}
+v4 <- Views(dd2, start=3:0, end=5:8)
+v4
+length(v4)
+```
+
+Note that the last 2 views are *out of limits*.
+
+You can select a subset of views from an *XStringViews* object:
+
+```{r c3}
+v4[4:2]
+```
+
+The returned object is still an *XStringViews* object, even if we select
+only one element. You need to use double-brackets to extract a given
+view as an *XString* object:
+
+```{r c4}
+v4[[2]]
+```
+
+You can't extract a view that is *out of limits*:
+
+```{r c6,echo=FALSE}
+cat('> v4[[3]]\n')
+cat(try(v4[[3]], silent=TRUE))
+```
+
+Note that, when `start` and `end` are numeric vectors and `i` is a
+*single* integer, `Views(b, start, end)[[i]]` is equivalent to
+`subseq(b, start[i], end[i])`.
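+
+The equivalence stated above can be checked directly. The coordinates
+in `starts` and `ends` are illustrative in-limits values chosen for
+this sketch:
+
+```{r c6x}
+library(Biostrings)
+b <- BString("I am a BString object")
+# Illustrative view coordinates, all within the bounds of b
+starts <- c(1, 3, 8)
+ends <- c(4, 6, 14)
+vb <- Views(b, start=starts, end=ends)
+i <- 2
+# Both expressions extract the same substring of b
+vb[[i]]
+subseq(b, starts[i], ends[i])
+vb[[i]] == subseq(b, starts[i], ends[i])
+```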
+
+Subsetting also works with negative or logical values with the expected
+semantics (the same as for R built-in vectors):
+
+```{r c7}
+v4[-3]
+v4[c(TRUE, FALSE)]
+```
+
+Note that the logical vector is recycled to the length of `v4`.
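+
+To make the recycling explicit, the following sketch compares the start
+positions selected by the logical and negative subscripts with the
+equivalent positive subscripts. It rebuilds `v4` so that the chunk can
+be run on its own:
+
+```{r c8}
+library(Biostrings)
+dd2 <- subseq(DNAString("TTGAAAA-CTC-N"), start=8)
+v4 <- Views(dd2, start=3:0, end=5:8)
+# c(TRUE, FALSE) is recycled to c(TRUE, FALSE, TRUE, FALSE): views 1 and 3
+start(v4[c(TRUE, FALSE)])
+start(v4[c(1, 3)])
+# A negative subscript drops view 3 and keeps views 1, 2 and 4
+start(v4[-3])
+```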
+
+# A few more *XStringViews* objects
+
+12 views (all of the same width):
+
+```{r d1}
+v12 <- Views(DNAString("TAATAATG"), start=-2:9, end=0:11)
+```
+
+This is the same as doing `Views(d, start=1, end=length(d))`:
+
+```{r d2, results="hide"}
+as(d, "Views")
+```
+
+Hence the following will always return the `d` object itself:
+
+```{r d3, results="hide"}
+as(d, "Views")[[1]]
+```
+
+3 *XStringViews* objects with no view:
+
+```{r d4, results="hide"}
+v12[0]
+v12[FALSE]
+Views(d)
+```
+
+# The `==` binary operator for *XStringViews* objects
+
+This operator is the vectorized version of the `==` operator defined
+previously for *XString* objects:
+
+```{r e1}
+v12 == DNAString("TAA")
+```
+
+To display all the views in `v12` that are equal to a given view, you
+can type R cuties like:
+
+```{r e2}
+v12[v12 == v12[4]]
+v12[v12 == v12[1]]
+```
+
+This is `TRUE`:
+
+```{r e3, results="hide"}
+v12[3] == Views(RNAString("AU"), start=0, end=2)
+```
+
+# The `start`, `end` and `width` methods
+
+```{r f1}
+start(v4)
+end(v4)
+width(v4)
+```
+
+Note that `start(v4)[i]` is equivalent to `start(v4[i])`, except that
+the former will not issue an error if `i` is out of bounds (the same
+holds for the `end` and `width` methods).
+
+Also, when `i` is a *single* integer, `width(v4)[i]` is equivalent to
+`length(v4[[i]])` except that the former will not issue an error if `i`
+is out of bounds or if view `v4[i]` is *out of limits*.
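+
+The two notes above can be illustrated with a short sketch. It rebuilds
+`v4` so that the chunk can be run on its own, and wraps the failing
+calls in `try()`. The subscript `10` is an arbitrary out-of-bounds
+value chosen for this sketch:
+
+```{r f2}
+library(Biostrings)
+dd2 <- subseq(DNAString("TTGAAAA-CTC-N"), start=8)
+v4 <- Views(dd2, start=3:0, end=5:8)
+# In bounds: both forms give the same value
+start(v4)[2] == start(v4[2])
+# Out of bounds: plain vector indexing returns NA, view subsetting fails
+start(v4)[10]
+inherits(try(v4[10], silent=TRUE), "try-error")
+# View 3 is out of limits: its width is available, its content is not
+width(v4)[3]
+inherits(try(v4[[3]], silent=TRUE), "try-error")
+```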