Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Issue #65 #112

Draft
wants to merge 5 commits into
base: devel
Choose a base branch
from
Draft

fix: Issue #65 #112

wants to merge 5 commits into from

Conversation

ahl27
Copy link
Collaborator

@ahl27 ahl27 commented Jun 8, 2024

THIS IS A WIP

Early stage work-in-progress solution for discussion in #65.

Goals:

  1. solve misleading "error" when BString object contains a null byte (as.raw(0))
  2. improve UX for converting between BString and raw/integer vectors

The issue in (1) is that a BString containing a null byte throws an error whenever R tries to display the position(s) containins the null byte. This leads users to think that the object itself is corrupted, when in fact it's purely an error with R attempting to display it on the console. as.integer() and other methods work properly on these objects.

This issue would be more obvious if more people used BString objects with values in the complete 0:255 range, but building a BString from bytes is currently very difficult. Helper functions to create BStrings from raw vectors would be preferable.

This PR will be blocked from merging until it's actually finalized; currently displaying to organize thoughts and reprexes.

@ahl27 ahl27 marked this pull request as draft June 8, 2024 21:59
@ahl27
Copy link
Collaborator Author

ahl27 commented Jun 8, 2024

Illustration of current functionality in this PR:

> ## going to have to improve this workflow, but for now we're stuck with as(as(as.raw(...), "XRaw"), "BString")
> x <- as(as(as.raw(0:255),"XRaw"),"BString")
>
> ## Here's the misleading error
> x
256-letter BString object
Error in XVector:::extract_character_from_XRaw_by_ranges(x, start, width,  : 
  embedded nul in string: '\0\001\002\003\004\005\006\a\b\t\n\v\f\r\016\017\020\021\022\023\024\025\026\027\030\031\032\033\034\035\036\037 !"#'
>
> ## note that the backend values are still correct
> as.integer(x[1:10])
 [1] 0 1 2 3 4 5 6 7 8 9
>
> ## New functionality, activated by setting this option to TRUE
> options(Biostrings.showRaw=TRUE)
> x
256-letter BString object
seq: ⣿⡠⡡⡢⡣⡤⡥⡦⡧⡨⡩⡪⡫⡬⡭⡮⡯⡰⡱⡲⡳⡴⡵⡶⡷⡸⡹⡺⡻⡼⡽⡾ !"#...⣜⣝⣞⣟⣠⣡⣢⣣⣤⣥⣦⣧⣨⣩⣪⣫⣬⣭⣮⣯⣰⣱⣲⣳⣴⣵⣶⣷⣸⣹⣺⣻⣼⣽⣾⣿
> as.integer(x[1:10])
 [1] 0 1 2 3 4 5 6 7 8 9
> as.character(x)
[1] "⣿⡠⡡⡢⡣⡤⡥⡦⡧⡨⡩⡪⡫⡬⡭⡮⡯⡰⡱⡲⡳⡴⡵⡶⡷⡸⡹⡺⡻⡼⡽⡾ !\"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~⡿⢀⢁⢂⢃⢄⢅⢆⢇⢈⢉⢊⢋⢌⢍⢎⢏⢐⢑⢒⢓⢔⢕⢖⢗⢘⢙⢚⢛⢜⢝⢞⢟⢠⢡⢢⢣⢤⢥⢦⢧⢨⢩⢪⢫⢬⢭⢮⢯⢰⢱⢲⢳⢴⢵⢶⢷⢸⢹⢺⢻⢼⢽⢾⢿⣀⣁⣂⣃⣄⣅⣆⣇⣈⣉⣊⣋⣌⣍⣎⣏⣐⣑⣒⣓⣔⣕⣖⣗⣘⣙⣚⣛⣜⣝⣞⣟⣠⣡⣢⣣⣤⣥⣦⣧⣨⣩⣪⣫⣬⣭⣮⣯⣰⣱⣲⣳⣴⣵⣶⣷⣸⣹⣺⣻⣼⣽⣾⣿"
> 
> ## Consistency with other objects, showing expected behavior on a DNAString
> y <- DNAString(paste(DNA_ALPHABET, collapse=''))
> extract_character_from_XString_by_ranges(y, start=c(1L, 4L), width=c(2L,3L), collapse=TRUE)
[1] "ACTMR"
> extract_character_from_XString_by_ranges(y, start=c(1L, 4L), width=c(2L,3L), collapse=FALSE)
[1] "AC"  "TMR"
> extract_character_from_XString_by_positions(y, c(1L, 4L, 9L), collapse=TRUE)
[1] "ATY"
> extract_character_from_XString_by_positions(y, c(1L, 4L, 9L), collapse=FALSE)
[1] "A" "T" "Y"
> 
> ## Now for BString
> extract_character_from_XString_by_ranges(x, start=c(1L, 68L, 128L), width=c(5L,10L,5L), collapse=TRUE)
[1] "⣿⡠⡡⡢⡣CDEFGHIJKL⡿⢀⢁⢂⢃"
> extract_character_from_XString_by_ranges(x, start=c(1L, 68L, 128L), width=c(5L,10L,5L), collapse=FALSE)
[1] "⣿⡠⡡⡢⡣"      "CDEFGHIJKL" "⡿⢀⢁⢂⢃"     
> extract_character_from_XString_by_positions(x, c(1L, 4L, 9L, 68L, 128L), collapse=TRUE)
[1] "⣿⡢⡧C⡿"
> extract_character_from_XString_by_positions(x, c(1L, 4L, 9L, 68L, 128L), collapse=FALSE)
[1] "⣿" "⡢" "⡧" "C" "⡿"

Invalid characters are displayed differently depending on machine support:

  • By default, all undisplayable characters map to ?.
  • If unicode is supported, the braille character set is used to display values. Each bytevalue in 1:255 maps to a unique character. Note that 0 also maps to 255 due to machine restrictions. This could be solved, but would require an additional call to XVector:::extract_character_from_XRaw_by_positions or XVector:::extract_character_from_XRaw_by_ranges to be able to identify null bytes.
  • If unicode is not supported but multibyte characters are, all undisplayable characters map to . While this doesn't allow distinguishing undisplayable characters, it does allow distinguishing all displayable characters from all undisplayable ones.

Why do we care about which characters we can distinguish? These methods support conversions like as.character. Ideally, people would use built-in methods for comparing strings, but people may want to work with character strings directly. In the current setup, character conversions of BString objects aren't great--undisplayable values map to multibyte characters, which can cause lots of errors when trying to work with them as strings. Having a as.character conversion between undisplayable bytes and parseable characters lets us work with them as strings, even if that isn't quite as clean as using the Biostrings toolbox directly.


Another avenue to consider is to instead use something like as.raw to directly convert the values of the BString object and work with them that way. For instance, the following would also work:

> ## BSTRING_RAW_LOOKUP defined in zzz.R
> x <- as(as(as.raw(0:255),"XRaw"),"BString")
> test <- as.integer(as.raw(x))
> test[test==0] <- 256L
> paste(BSTRING_RAW_LOOKUP[test], collapse='')

We could use XVector:::subseq to pull out subranges of the object, convert to raw, and then map them to characters manually.

There's a bit of a rabbit hole we could fall into with this fix. Ideally, we just fix the show method to display non-displayable characters correctly and move on. I think there's a (valid) argument to be made that XString objects are intended for strings, not for arbitrary input. Thus, does it make sense to add a lot of additional functionality for objects that aren't strings? Would it not make more sense to encourage usage of XRaw if people are using lots of raw vectors?

@ahl27 ahl27 mentioned this pull request Jun 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant