From 7adafad15cc608c21d3872ff92a94e89d8e91ef9 Mon Sep 17 00:00:00 2001
From: Paul Villafuerte
Date: Sun, 4 Sep 2022 13:07:05 -0400
Subject: [PATCH 1/7] Initial conversion to Rmd

---
 vignettes/Biostrings2Classes.Rmd | 150 +++++++++++++++++++++++++++++++
 1 file changed, 150 insertions(+)
 create mode 100644 vignettes/Biostrings2Classes.Rmd

diff --git a/vignettes/Biostrings2Classes.Rmd b/vignettes/Biostrings2Classes.Rmd
new file mode 100644
index 00000000..23eb1598
--- /dev/null
+++ b/vignettes/Biostrings2Classes.Rmd
@@ -0,0 +1,150 @@
+---
+author:
+- Hervé Pagès
+title: The *Biostrings* 2 classes (work in progress)
+---
+
+# Introduction
+
+This document briefly presents the new set of classes implemented in the
+*Biostrings* 2 package. Like the *Biostrings* 1 classes (found in
+*Biostrings* v 1.4.x), they were designed to make manipulation of big
+strings (like DNA or RNA sequences) easy and fast. This is achieved by
+keeping the 3 following ideas from the *Biostrings* 1 package: (1) use R
+external pointers to store the string data, (2) use bit patterns to
+encode the string data, (3) provide the user with a convenient class of
+objects where each instance can store a set of views *on the same* big
+string (these views being typically the matches returned by a search
+algorithm).
+
+However, there is a flaw in the *BioString* class design that prevents
+the search algorithms to return correct information about the matches
+(i.e. the views) that they found. The new classes address this issue by
+replacing the *BioString* class (implemented in *Biostrings* 1) by 2 new
+classes: (1) the *XString* class used to represent a *single* string,
+and (2) the *XStringViews* class used to represent a set of views *on
+the same* *XString* object, and by introducing new implementations and
+new interfaces for these 2 classes.
+
+# The *XString* class and its subsetting operator `[`
+
+The *XString* is in fact a virtual class and therefore cannot be
+instanciated.
Only subclasses (or subtypes) *BString*, *DNAString*,
+*RNAString* and *AAString* can. These classes are direct extensions of
+the *XString* class (no additional slot).
+
+A first *BString* object: \<\\>= library(Biostrings) b \<-
+BString(\"I am a BString object\") b length(b) @
+
+A *DNAString* object: \<\\>= d \<- DNAString(\"TTGAAAA-CTC-N\") d
+length(d) @ The differences with a *BString* object are: (1) only
+letters from the *IUPAC extended genetic alphabet* + the gap letter
+(`-`) are allowed and (2) each letter in the argument passed to the
+`DNAString` function is encoded in a special way before it's stored in
+the *DNAString* object.
+
+Access to the individual letters: \<\\>= d\[3\] d\[7:12\] d\[\]
+b\[length(b):1\] @ Only *in bounds* positive numeric subscripts are
+supported.
+
+In fact the subsetting operator for *XString* objects is not efficient
+and one should always use the `subseq` method to extract a substring
+from a big string: \<\\>= bb \<- subseq(b, 3, 6) dd1 \<- subseq(d,
+end=7) dd2 \<- subseq(d, start=8) @
+
+To *dump* an *XString* object as a character vector (of length 1), use
+the `toString` method: \<\\>= toString(dd2) @
+
+Note that `length(dd2)` is equivalent to `nchar(toString(dd2))` but the
+latter would be very inefficient on a big *DNAString* object.
+
+*\[TODO: Make a generic of the substr() function to work with XString
+objects. It will be essentially doing toString(subseq()).\]*
+
+# The `==` binary operator for *XString* objects
+
+The 2 following comparisons are `TRUE`: \<\\>= bb ==
+\"am a\" dd2 != DNAString(\"TG\") @
+
+When the 2 sides of `==` don't belong to the same class then the side
+belonging to the "lowest" class is first converted to an object
+belonging to the class of the other side (the "highest" class). The
+class (pseudo-)order is *character* \< *BString* \< *DNAString*. When
+both sides are *XString* objects of the same subtype (e.g.
both are
+*DNAString* objects) then the comparison is very fast because it only
+has to call the C standard function `memcmp()` and no memory allocation
+or string encoding/decoding is required.
+
+The 2 following expressions provoke an error because the right member
+can't be "upgraded" (converted) to an object of the same class than the
+left member: \<\\>= cat('> bb == \"\"') cat('> d == bb')
+@
+
+When comparing an *RNAString* object with a *DNAString* object, U and T
+are considered equals: \<\\>= r \<- RNAString(d) r r == d @
+
+# The *XStringViews* class and its subsetting operators `[` and `[[`
+
+An *XStringViews* object contains a set of views *on the same* *XString*
+object called the *subject* string. Here is an *XStringViews* object
+with 4 views: \<\\>= v4 \<- Views(dd2, start=3:0, end=5:8) v4
+length(v4) @
+
+Note that the 2 last views are *out of limits*.
+
+You can select a subset of views from an *XStringViews* object:
+\<\\>= v4\[4:2\] @
+
+The returned object is still an *XStringViews* object, even if we select
+only one element. You need to use double-brackets to extract a given
+view as an *XString* object: \<\\>= v4\[\[2\]\] @
+
+You can't extract a view that is *out of limits*: \<\\>=
+cat('> v4\[\[3\]\]') cat(try(v4\[\[3\]\], silent=TRUE)) @
+
+Note that, when `start` and `end` are numeric vectors and `i` is a
+*single* integer, `Views(b, start, end)[[i]]` is equivalent to
+`subseq(b, start[i], end[i])`.
+
+Subsetting also works with negative or logical values with the expected
+semantic (the same as for R built-in vectors): \<\\>= v4\[-3\]
+v4\[c(TRUE, FALSE)\] @ Note that the logical vector is recycled to the
+length of `v4`.
+
+# A few more *XStringViews* objects
+
+12 views (all of the same width): \<\\>= v12 \<-
+Views(DNAString(\"TAATAATG\"), start=-2:9, end=0:11) @
+
+This is the same as doing `Views(d, start=1, end=length(d))`:
+\<\\>= as(d, \"Views\") @
+
+Hence the following will always return the `d` object itself:
+\<\\>= as(d, \"Views\")\[\[1\]\] @
+
+3 *XStringViews* objects with no view: \<\\>= v12\[0\]
+v12\[FALSE\] Views(d) @
+
+# The `==` binary operator for *XStringViews* objects
+
+This operator is the vectorized version of the `==` operator defined
+previously for *XString* objects: \<\\>= v12 == DNAString(\"TAA\") @
+
+To display all the views in `v12` that are equals to a given view, you
+can type R cuties like: \<\\>= v12\[v12 == v12\[4\]\] v12\[v12 ==
+v12\[1\]\] @
+
+This is `TRUE`: \<\\>= v12\[3\] ==
+Views(RNAString(\"AU\"), start=0, end=2) @
+
+# The `start`, `end` and `width` methods
+
+\<\\>= start(v4) end(v4) width(v4) @
+
+Note that `start(v4)[i]` is equivalent to `start(v4[i])`, except that
+the former will not issue an error if `i` is out of bounds (same for
+`end` and `width` methods).
+
+Also, when `i` is a *single* integer, `width(v4)[i]` is equivalent to
+`length(v4[[i]])` except that the former will not issue an error if `i`
+is out of bounds or if view `v4[i]` is *out of limits*.

From 0326e70a2f9749c185ebacfb2794fa8c0c9fedcf Mon Sep 17 00:00:00 2001
From: Paul Villafuerte
Date: Wed, 7 Sep 2022 14:02:51 -0400
Subject: [PATCH 2/7] Biostrings2Classes Initial conversion from Rnw to Rmd

---
 vignettes/Biostrings2Classes.Rmd | 213 +++++++++++++++++++++++--------
 1 file changed, 161 insertions(+), 52 deletions(-)

diff --git a/vignettes/Biostrings2Classes.Rmd b/vignettes/Biostrings2Classes.Rmd
index 23eb1598..814203c8 100644
--- a/vignettes/Biostrings2Classes.Rmd
+++ b/vignettes/Biostrings2Classes.Rmd
@@ -1,70 +1,119 @@
 ---
-author:
-- Hervé Pagès
-title: The *Biostrings* 2 classes (work in progress)
+title: "The *Biostrings* 2 classes (work in progress)"
+author: "Hervé Pagès"
+date: "`r format(Sys.time(), '%B %d, %Y')`"
+vignette: >
+ %\VignetteIndexEntry{A short presentation of the basic classes defined in Biostrings 2}
+ %\VignetteKeywords{DNA, RNA, Sequence, Biostrings, Sequence alignment}
+ %\VignettePackage{Biostrings}
+ %\VignetteEncoding{UTF-8}
+ %\VignetteEngine{knitr::rmarkdown}
+output:
+ BiocStyle::html_document:
+ number_sections: true
+ toc: yes
+ toc_depth: 4
+editor_options:
+ markdown:
+ wrap: 72
 ---
 
 # Introduction
 
 This document briefly presents the new set of classes implemented in the
-*Biostrings* 2 package. Like the *Biostrings* 1 classes (found in
-*Biostrings* v 1.4.x), they were designed to make manipulation of big
+*Biostrings* 2 package. Like the *Biostrings* 1 classes (found in
+*Biostrings* v 1.4.x), they were designed to make manipulation of big
 strings (like DNA or RNA sequences) easy and fast.
This is achieved by
-keeping the 3 following ideas from the *Biostrings* 1 package: (1) use R
+keeping the 3 following ideas from the *Biostrings* 1 package: (1) use R
 external pointers to store the string data, (2) use bit patterns to
 encode the string data, (3) provide the user with a convenient class of
 objects where each instance can store a set of views *on the same* big
-string (these views being typically the matches returned by a search
+string (these views being typicallythe matches returned by a search
 algorithm).
 
 However, there is a flaw in the *BioString* class design that prevents
 the search algorithms to return correct information about the matches
 (i.e. the views) that they found. The new classes address this issue by
-replacing the *BioString* class (implemented in *Biostrings* 1) by 2 new
+replacing the *BioString* class (implemented in *Biostrings* 1) by 2 new
 classes: (1) the *XString* class used to represent a *single* string,
 and (2) the *XStringViews* class used to represent a set of views *on
-the same* *XString* object, and by introducing new implementations and
+the same* *XString* object, and by introducing new implementations and
 new interfaces for these 2 classes.
 
-# The *XString* class and its subsetting operator `[`
+# The *XString* class and its subsetting operator `[`
 
 The *XString* is in fact a virtual class and therefore cannot be
 instanciated. Only subclasses (or subtypes) *BString*, *DNAString*,
 *RNAString* and *AAString* can. These classes are direct extensions of
 the *XString* class (no additional slot).
 
-A first *BString* object: \<\\>= library(Biostrings) b \<-
-BString(\"I am a BString object\") b length(b) @
+A first *BString* object:
 
-A *DNAString* object: \<\\>= d \<- DNAString(\"TTGAAAA-CTC-N\") d
-length(d) @ The differences with a *BString* object are: (1) only
-letters from the *IUPAC extended genetic alphabet* + the gap letter
-(`-`) are allowed and (2) each letter in the argument passed to the
-`DNAString` function is encoded in a special way before it's stored in
-the *DNAString* object.
+```{r a1a, message=FALSE}
+library(Biostrings)
+```
 
-Access to the individual letters: \<\\>= d\[3\] d\[7:12\] d\[\]
-b\[length(b):1\] @ Only *in bounds* positive numeric subscripts are
-supported.
+```{r a1b}
+b <- BString("I am a BString object")
+b
+length(b)
+```
+
+A *DNAString* object:
+
+```{r a2}
+d <- DNAString("TTGAAAA-CTC-N")
+d
+length(d)
+```
+
+The differences with a *BString* object are: (1) only letters from the
+*IUPAC extended genetic alphabet* + the gap letter (`-`) are allowed and
+(2) each letter in the argument passed to the `DNAString` function is
+encoded in a special way before it's stored in the *DNAString* object.
+
+Access to the individual letters:
+
+```{r a3}
+d[3]
+d[7:12]
+d[]
+b[length(b):1]
+```
+
+Only *in bounds* positive numeric subscripts are supported
 
 In fact the subsetting operator for *XString* objects is not efficient
 and one should always use the `subseq` method to extract a substring
-from a big string: \<\\>= bb \<- subseq(b, 3, 6) dd1 \<- subseq(d,
-end=7) dd2 \<- subseq(d, start=8) @
+from a big string:
+
+```{r a4}
+bb <- subseq(b, 3, 6)
+dd1 <- subseq(d, end=7)
+dd2 <- subseq(d, start=8)
+```
 
 To *dump* an *XString* object as a character vector (of length 1), use
-the `toString` method: \<\\>= toString(dd2) @
+the `toString` method:
+
+```{r a5}
+toString(dd2)
+```
 
 Note that `length(dd2)` is equivalent to `nchar(toString(dd2))` but the
 latter would be very inefficient on a big *DNAString* object.
 
-*\[TODO: Make a generic of the substr() function to work with XString
-objects. It will be essentially doing toString(subseq()).\]*
+_**TODO: Make a generic of the substr() function to work with XString
+objects. It will be essentially doing toString(subseq()).**_
 
 # The `==` binary operator for *XString* objects
 
-The 2 following comparisons are `TRUE`: \<\\>= bb ==
-\"am a\" dd2 != DNAString(\"TG\") @
+The 2 following comparisons are `TRUE`:
+
+```{r b1, results="hide"}
+bb == "am a"
+dd2 != DNAString("TG")
+```
 
 When the 2 sides of `==` don't belong to the same class then the side
 belonging to the "lowest" class is first converted to an object
@@ -77,69 +126,129 @@ or string encoding/decoding is required.
 
 The 2 following expressions provoke an error because the right member
 can't be "upgraded" (converted) to an object of the same class than the
-left member: \<\\>= cat('> bb == \"\"') cat('> d == bb')
-@
+left member:
+
+```{r b2, echo=FALSE}
+cat('> bb == ""')
+cat('> d == bb')
+```
 
 When comparing an *RNAString* object with a *DNAString* object, U and T
-are considered equals: \<\\>= r \<- RNAString(d) r r == d @
+are considered equals:
+
+```{r b3}
+r <- RNAString(d)
+r
+r == d
+```
 
 # The *XStringViews* class and its subsetting operators `[` and `[[`
 
 An *XStringViews* object contains a set of views *on the same* *XString*
 object called the *subject* string. Here is an *XStringViews* object
-with 4 views: \<\\>= v4 \<- Views(dd2, start=3:0, end=5:8) v4
-length(v4) @
+with 4 views:
+
+```{r c1}
+v4 <- Views(dd2, start=3:0, end=5:8)
+v4
+length(v4)
+```
 
 Note that the 2 last views are *out of limits*.
 
 You can select a subset of views from an *XStringViews* object:
-\<\\>= v4\[4:2\] @
+
+```{r c3}
+v4[4:2]
+```
 
 The returned object is still an *XStringViews* object, even if we select
 only one element.
You need to use double-brackets to extract a given
-view as an *XString* object: \<\\>= v4\[\[2\]\] @
+view as an *XString* object:
+
+```{r c4}
+v4[[2]]
+```
+
+You can't extract a view that is *out of limits*:
 
-You can't extract a view that is *out of limits*: \<\\>=
-cat('> v4\[\[3\]\]') cat(try(v4\[\[3\]\], silent=TRUE)) @
+```{r c6,echo=FALSE}
+cat('> v4[[3]]')
+cat(try(v4[[3]], silent=TRUE))
+```
 
 Note that, when `start` and `end` are numeric vectors and `i` is a
 *single* integer, `Views(b, start, end)[[i]]` is equivalent to
 `subseq(b, start[i], end[i])`.
 
 Subsetting also works with negative or logical values with the expected
-semantic (the same as for R built-in vectors): \<\\>= v4\[-3\]
-v4\[c(TRUE, FALSE)\] @ Note that the logical vector is recycled to the
-length of `v4`.
+semantic (the same as for R built-in vectors):
+
+```{r c7}
+v4[-3]
+v4[c(TRUE, FALSE)]
+```
+
+Note that the logical vector is recycled to the length of `v4`.
 
 # A few more *XStringViews* objects
 
-12 views (all of the same width): \<\\>= v12 \<-
-Views(DNAString(\"TAATAATG\"), start=-2:9, end=0:11) @
+12 views (all of the same width):
+
+```{r d1}
+v12 <- Views(DNAString("TAATAATG"), start=-2:9, end=0:11)
+```
 
 This is the same as doing `Views(d, start=1, end=length(d))`:
-\<\\>= as(d, \"Views\") @
+
+```{r d2, results="hide"}
+as(d, "Views")
+```
 
 Hence the following will always return the `d` object itself:
-\<\\>= as(d, \"Views\")\[\[1\]\] @
 
-3 *XStringViews* objects with no view: \<\\>= v12\[0\]
-v12\[FALSE\] Views(d) @
+```{r d3, results="hide"}
+as(d, "Views")[[1]]
+```
+
+3 *XStringViews* objects with no view:
+
+```{r d4, results="hide"}
+v12[0]
+v12[FALSE]
+Views(d)
+```
 
 # The `==` binary operator for *XStringViews* objects
 
 This operator is the vectorized version of the `==` operator defined
-previously for *XString* objects: \<\\>= v12 == DNAString(\"TAA\") @
+previously for *XString* objects:
+
+```{r e1}
+v12 == DNAString("TAA")
+```
 
 To display all the views in `v12` that are
equals to a given view, you
-can type R cuties like: \<\\>= v12\[v12 == v12\[4\]\] v12\[v12 ==
-v12\[1\]\] @
+can type R cuties like:
+
+```{r e2}
+v12[v12 == v12[4]]
+v12[v12 == v12[1]]
+```
+
+This is `TRUE`:
 
-This is `TRUE`: \<\\>= v12\[3\] ==
-Views(RNAString(\"AU\"), start=0, end=2) @
+```{r e3, results="hide"}
+v12[3] == Views(RNAString("AU"), start=0, end=2)
+```
 
 # The `start`, `end` and `width` methods
 
-\<\\>= start(v4) end(v4) width(v4) @
+```{r f1}
+start(v4)
+end(v4)
+width(v4)
+```
 
 Note that `start(v4)[i]` is equivalent to `start(v4[i])`, except that
 the former will not issue an error if `i` is out of bounds (same for

From c344433b00a14e0b57c0b6cd803412a42b96a67f Mon Sep 17 00:00:00 2001
From: Paul Villafuerte
Date: Tue, 22 Nov 2022 10:31:15 -0500
Subject: [PATCH 3/7] Biostrings2classes Rnw to Rmd

---
 vignettes/Biostrings2Classes.Rmd | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/vignettes/Biostrings2Classes.Rmd b/vignettes/Biostrings2Classes.Rmd
index 814203c8..e8010935 100644
--- a/vignettes/Biostrings2Classes.Rmd
+++ b/vignettes/Biostrings2Classes.Rmd
@@ -1,6 +1,9 @@
 ---
 title: "The *Biostrings* 2 classes (work in progress)"
-author: "Hervé Pagès"
+author:
+- name: "Hervé Pagès"
+- name: "Paul Villafuerte"
+ affiliation: "Vignette translation from Sweave to Rmarkdown / HTML"
 date: "`r format(Sys.time(), '%B %d, %Y')`"
 vignette: >
 %\VignetteIndexEntry{A short presentation of the basic classes defined in Biostrings 2}
@@ -17,7 +20,7 @@ editor_options:
 markdown:
 wrap: 72
 ---
-
+
 # Introduction
 
 This document briefly presents the new set of classes implemented in the
@@ -81,7 +84,7 @@ d[]
 b[length(b):1]
 ```
 
-Only *in bounds* positive numeric subscripts are supported
+Only *in bounds* positive numeric subscripts are supported.
 
 In fact the subsetting operator for *XString* objects is not efficient
 and one should always use the `subseq` method to extract a substring
@@ -103,8 +106,8 @@ toString(dd2)
 Note that `length(dd2)` is equivalent to `nchar(toString(dd2))` but the
 latter would be very inefficient on a big *DNAString* object.
 
-_**TODO: Make a generic of the substr() function to work with XString
-objects. It will be essentially doing toString(subseq()).**_
+_**[TODO: Make a generic of the substr() function to work with XString
+objects. It will be essentially doing toString(subseq()).]**_
 
 # The `==` binary operator for *XString* objects
 

From ddd00b13e2e2d5b80e2fba7d4844aaa769d634e6 Mon Sep 17 00:00:00 2001
From: Paul Villafuerte
Date: Tue, 22 Nov 2022 10:34:50 -0500
Subject: [PATCH 4/7] Biostrings2classes Rnw to Rmd #2

---
 vignettes/Biostrings2Classes.Rmd | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/vignettes/Biostrings2Classes.Rmd b/vignettes/Biostrings2Classes.Rmd
index e8010935..373a1455 100644
--- a/vignettes/Biostrings2Classes.Rmd
+++ b/vignettes/Biostrings2Classes.Rmd
@@ -17,7 +17,7 @@ output:
 toc: yes
 toc_depth: 4
 editor_options:
- markdown:
+ markdown:
 wrap: 72
 ---
 

From 45d0d6e90a0d1a6eef216a2a607221c149d06363 Mon Sep 17 00:00:00 2001
From: Paul Villafuerte
Date: Tue, 22 Nov 2022 10:42:23 -0500
Subject: [PATCH 5/7] Biostrings2classes Rnw to Rmd #2

---
 vignettes/Biostrings2Classes.Rmd | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/vignettes/Biostrings2Classes.Rmd b/vignettes/Biostrings2Classes.Rmd
index 373a1455..77b9f676 100644
--- a/vignettes/Biostrings2Classes.Rmd
+++ b/vignettes/Biostrings2Classes.Rmd
@@ -18,7 +18,7 @@ output:
 toc_depth: 4
 editor_options:
 markdown:
- wrap: 72
+ wrap: 72
 ---
 
 # Introduction

From eb7562c45a3e404943ab280d85c9a15717c2f41a Mon Sep 17 00:00:00 2001
From: Paul Villafuerte
Date: Wed, 5 Jul 2023 17:27:38 -0400
Subject: [PATCH 6/7] Biostrings2Classes.Rnw to .Rmd, remove .Rnw

---
 vignettes/Biostrings2Classes.Rnw
| 294 -------------------------------
 1 file changed, 294 deletions(-)
 delete mode 100644 vignettes/Biostrings2Classes.Rnw

diff --git a/vignettes/Biostrings2Classes.Rnw b/vignettes/Biostrings2Classes.Rnw
deleted file mode 100644
index 103da3f7..00000000
--- a/vignettes/Biostrings2Classes.Rnw
+++ /dev/null
@@ -1,294 +0,0 @@
-%\VignetteIndexEntry{A short presentation of the basic classes defined in Biostrings 2}
-%\VignetteKeywords{DNA, RNA, Sequence, Biostrings, Sequence alignment}
-%\VignettePackage{Biostrings}
-
-%
-% NOTE -- ONLY EDIT THE .Rnw FILE!!! The .tex file is
-% likely to be overwritten.
-%
-\documentclass[11pt]{article}
-
-%\usepackage[authoryear,round]{natbib}
-%\usepackage{hyperref}
-
-
-\textwidth=6.2in
-\textheight=8.5in
-%\parskip=.3cm
-\oddsidemargin=.1in
-\evensidemargin=.1in
-\headheight=-.3in
-
-\newcommand{\scscst}{\scriptscriptstyle}
-\newcommand{\scst}{\scriptstyle}
-
-
-\newcommand{\Rfunction}[1]{{\texttt{#1}}}
-\newcommand{\Robject}[1]{{\texttt{#1}}}
-\newcommand{\Rpackage}[1]{{\textit{#1}}}
-\newcommand{\Rmethod}[1]{{\texttt{#1}}}
-\newcommand{\Rfunarg}[1]{{\texttt{#1}}}
-\newcommand{\Rclass}[1]{{\textit{#1}}}
-
-\textwidth=6.2in
-
-\bibliographystyle{plainnat}
-
-\begin{document}
-%\setkeys{Gin}{width=0.55\textwidth}
-
-\title{The \Rpackage{Biostrings}~2 classes (work in progress)}
-\author{Herv\'e Pag\`es}
-\maketitle
-
-\tableofcontents
-
-
-% ---------------------------------------------------------------------------
-
-\section{Introduction}
-
-This document briefly presents the new set of classes implemented in the
-\Rpackage{Biostrings}~2 package.
-Like the \Rpackage{Biostrings}~1 classes (found in \Rpackage{Biostrings}
-v~1.4.x), they were designed to make manipulation of big strings (like DNA
-or RNA sequences) easy and fast.
-This is achieved by keeping the 3 following ideas from the
-\Rpackage{Biostrings}~1 package:
-(1) use R external pointers to store the string data,
-(2) use bit patterns to encode the string data,
-(3) provide the user with a convenient class of objects where each instance
- can store a set of views {\it on the same} big string (these views being
- typically the matches returned by a search algorithm).
-
-However, there is a flaw in the \Rclass{BioString} class design
-that prevents the search algorithms to return correct information
-about the matches (i.e. the views) that they found.
-The new classes address this issue by replacing the \Rclass{BioString}
-class (implemented in \Rpackage{Biostrings}~1) by 2 new classes:
-(1) the \Rclass{XString} class used to represent a {\it single} string, and
-(2) the \Rclass{XStringViews} class used to represent a set of views
- {\it on the same} \Rclass{XString} object, and by introducing new
- implementations and new interfaces for these 2 classes.
-
-
-% ---------------------------------------------------------------------------
-
-\section{The \Rclass{XString} class and its subsetting operator~\Rmethod{[}}
-
-The \Rclass{XString} is in fact a virtual class and therefore cannot be
-instanciated. Only subclasses (or subtypes) \Rclass{BString},
-\Rclass{DNAString}, \Rclass{RNAString} and \Rclass{AAString} can.
-These classes are direct extensions of the \Rclass{XString} class (no
-additional slot).
-
-A first \Rclass{BString} object:
-<>=
-library(Biostrings)
-b <- BString("I am a BString object")
-b
-length(b)
-@
-
-A \Rclass{DNAString} object:
-<>=
-d <- DNAString("TTGAAAA-CTC-N")
-d
-length(d)
-@
-The differences with a \Rclass{BString} object are: (1) only letters from the
-{\it IUPAC extended genetic alphabet} + the gap letter ({\tt -}) are allowed
-and (2) each letter in the argument passed to the \Rfunction{DNAString}
-function is encoded in a special way before it's stored in the
-\Rclass{DNAString} object.
-
-Access to the individual letters:
-<>=
-d[3]
-d[7:12]
-d[]
-b[length(b):1]
-@
-Only {\it in bounds} positive numeric subscripts are supported.
-
-In fact the subsetting operator for \Rclass{XString} objects is not efficient
-and one should always use the \Rmethod{subseq} method to extract a substring
-from a big string:
-<>=
-bb <- subseq(b, 3, 6)
-dd1 <- subseq(d, end=7)
-dd2 <- subseq(d, start=8)
-@
-
-To {\it dump} an \Rclass{XString} object as a character vector (of length 1),
-use the \Rmethod{toString} method:
-<>=
-toString(dd2)
-@
-
-Note that \Robject{length(dd2)} is equivalent to
-\Robject{nchar(toString(dd2))} but the latter would be very inefficient
-on a big \Rclass{DNAString} object.
-
-{\it [TODO: Make a generic of the substr() function to work with
-XString objects. It will be essentially doing toString(subseq()).]}
-
-
-% ---------------------------------------------------------------------------
-
-\section{The \Rmethod{==} binary operator for \Rclass{XString} objects}
-
-The 2 following comparisons are \Robject{TRUE}:
-<>=
-bb == "am a"
-dd2 != DNAString("TG")
-@
-
-When the 2 sides of \Rmethod{==} don't belong to the same class
-then the side belonging to the ``lowest'' class is first converted
-to an object belonging to the class of the other side (the ``highest'' class).
-The class (pseudo-)order is \Rclass{character} < \Rclass{BString} < \Rclass{DNAString}.
-When both sides are \Rclass{XString} objects of the same subtype (e.g. both
-are \Rclass{DNAString} objects) then the comparison is very fast because it
-only has to call the C standard function {\tt memcmp()} and no memory allocation
-or string encoding/decoding is required.
-
-The 2 following expressions provoke an error because the right member can't
-be ``upgraded'' (converted) to an object of the same class than the left member:
-<>=
-cat('> bb == ""')
-cat('> d == bb')
-@
-
-When comparing an \Rclass{RNAString} object with a \Rclass{DNAString} object,
-U and T are considered equals:
-<>=
-r <- RNAString(d)
-r
-r == d
-@
-
-
-% ---------------------------------------------------------------------------
-
-\section{The \Rclass{XStringViews} class and its subsetting
-operators~\Rmethod{[} and~\Rmethod{[[}}
-
-An \Rclass{XStringViews} object contains a set of views {\it on the same}
-\Rclass{XString} object called the {\it subject} string.
-Here is an \Rclass{XStringViews} object with 4 views:
-<>=
-v4 <- Views(dd2, start=3:0, end=5:8)
-v4
-length(v4)
-@
-
-Note that the 2 last views are {\it out of limits}.
-
-You can select a subset of views from an \Rclass{XStringViews} object:
-<>=
-v4[4:2]
-@
-
-The returned object is still an \Rclass{XStringViews} object,
-even if we select only one element.
-You need to use double-brackets to extract a given view
-as an \Rclass{XString} object:
-<>=
-v4[[2]]
-@
-
-You can't extract a view that is {\it out of limits}:
-<>=
-cat('> v4[[3]]')
-cat(try(v4[[3]], silent=TRUE))
-@
-
-Note that, when \Robject{start} and \Robject{end} are numeric
-vectors and \Robject{i} is a {\it single} integer,
-\Robject{Views(b, start, end)[[i]]}
-is equivalent to \Robject{subseq(b, start[i], end[i])}.
-
-Subsetting also works with negative or logical values with the expected
-semantic (the same as for R built-in vectors):
-<>=
-v4[-3]
-v4[c(TRUE, FALSE)]
-@
-Note that the logical vector is recycled to the length of \Robject{v4}.
-
-
-% ---------------------------------------------------------------------------
-
-\section{A few more \Rclass{XStringViews} objects}
-
-12 views (all of the same width):
-<>=
-v12 <- Views(DNAString("TAATAATG"), start=-2:9, end=0:11)
-@
-
-This is the same as doing \Robject{Views(d, start=1, end=length(d))}:
-<>=
-as(d, "Views")
-@
-
-Hence the following will always return the \Robject{d} object itself:
-<>=
-as(d, "Views")[[1]]
-@
-
-3 \Rclass{XStringViews} objects with no view:
-<>=
-v12[0]
-v12[FALSE]
-Views(d)
-@
-
-
-% ---------------------------------------------------------------------------
-
-\section{The \Rmethod{==} binary operator for \Rclass{XStringViews} objects}
-
-This operator is the vectorized version of the \Rmethod{==} operator
-defined previously for \Rclass{XString} objects:
-<>=
-v12 == DNAString("TAA")
-@
-
-To display all the views in \Robject{v12} that are equals to a given view,
-you can type R cuties like:
-<>=
-v12[v12 == v12[4]]
-v12[v12 == v12[1]]
-@
-
-This is \Robject{TRUE}:
-<>=
-v12[3] == Views(RNAString("AU"), start=0, end=2)
-@
-
-
-% ---------------------------------------------------------------------------
-
-\section{The \Rmethod{start}, \Rmethod{end} and \Rmethod{width}
-methods}
-
-<>=
-start(v4)
-end(v4)
-width(v4)
-@
-
-Note that \Robject{start(v4)[i]} is equivalent to
-\Robject{start(v4[i])}, except that the former will not issue
-an error if \Robject{i} is out of bounds
-(same for \Rmethod{end} and \Rmethod{width} methods).
-
-Also, when \Robject{i} is a {\it single} integer,
-\Robject{width(v4)[i]} is equivalent to \Robject{length(v4[[i]])}
-except that the former will not issue an error
-if \Robject{i} is out of bounds or if view \Robject{v4[i]}
-is {\it out of limits}.
-
-
-\end{document}

From 10d42963406aac6c7ee810e1b736dd49cd5dbb7b Mon Sep 17 00:00:00 2001
From: Paul Villafuerte
Date: Thu, 6 Jul 2023 10:26:51 -0400
Subject: [PATCH 7/7] Biostrings2Classes.Rnw to .Rmd

---
 vignettes/Biostrings2Classes.Rmd | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/vignettes/Biostrings2Classes.Rmd b/vignettes/Biostrings2Classes.Rmd
index 77b9f676..2bee0e7b 100644
--- a/vignettes/Biostrings2Classes.Rmd
+++ b/vignettes/Biostrings2Classes.Rmd
@@ -24,21 +24,21 @@ editor_options:
 # Introduction
 
 This document briefly presents the new set of classes implemented in the
-*Biostrings* 2 package. Like the *Biostrings* 1 classes (found in
-*Biostrings* v 1.4.x), they were designed to make manipulation of big
+`r Biocpkg('Biostrings')` 2 package. Like the `r Biocpkg('Biostrings')`
+1 classes (found in `r Biocpkg('Biostrings')` v 1.4.x), they were designed to make manipulation of big
 strings (like DNA or RNA sequences) easy and fast. This is achieved by
-keeping the 3 following ideas from the *Biostrings* 1 package: (1) use R
+keeping the 3 following ideas from the `r Biocpkg('Biostrings')` 1 package: (1) use R
 external pointers to store the string data, (2) use bit patterns to
 encode the string data, (3) provide the user with a convenient class of
 objects where each instance can store a set of views *on the same* big
 string (these views being typicallythe matches returned by a search
 algorithm).
 
-However, there is a flaw in the *BioString* class design that prevents
+However, there is a flaw in the `r Biocpkg('Biostring')` class design that prevents
 the search algorithms to return correct information about the matches
 (i.e. the views) that they found.
The new classes address this issue by
-replacing the *BioString* class (implemented in *Biostrings* 1) by 2 new
-classes: (1) the *XString* class used to represent a *single* string,
+replacing the `r Biocpkg('Biostrings')` class (implemented in `r Biocpkg('Biostrings')` 1) by 2 new
+classes: (1) the `*XString* class used to represent a *single* string,
 and (2) the *XStringViews* class used to represent a set of views *on
 the same* *XString* object, and by introducing new implementations and
 new interfaces for these 2 classes.
@@ -185,7 +185,7 @@ Note that, when `start` and `end` are numeric vectors and `i` is a
 `subseq(b, start[i], end[i])`.
 
 Subsetting also works with negative or logical values with the expected
-semantic (the same as for R built-in vectors):
+semantic (the same as for $R$ built-in vectors):
 
 ```{r c7}
 v4[-3]
@@ -232,7 +232,7 @@ v12 == DNAString("TAA")
 ```
 
 To display all the views in `v12` that are equals to a given view, you
-can type R cuties like:
+can type $R$ cuties like:
 
 ```{r e2}
 v12[v12 == v12[4]]