forked from hadley/r4ds
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathstrings.qmd
639 lines (466 loc) · 24.1 KB
/
strings.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
# Strings {#sec-strings}
```{r}
#| results: "asis"
#| echo: false
source("_common.R")
status("complete")
```
## Introduction
So far, you've used a bunch of strings without learning much about the details.
Now it's time to dive into them, learn what makes strings tick, and master some of the powerful string manipulation tools you have at your disposal.
We'll begin with the details of creating strings and character vectors.
You'll then dive into creating strings from data, then the opposite; extracting strings from data.
We'll then discuss tools that work with individual letters.
The chapter finishes with functions that work with individual letters and a brief discussion of where your expectations from English might steer you wrong when working with other languages.
We'll keep working with strings in the next chapter, where you'll learn more about the power of regular expressions.
### Prerequisites
In this chapter, we'll use functions from the stringr package, which is part of the core tidyverse.
We'll also use the babynames data since it provides some fun strings to manipulate.
```{r}
#| label: setup
#| message: false
library(tidyverse)
library(babynames)
```
You can quickly tell when you're using a stringr function because all stringr functions start with `str_`.
This is particularly useful if you use RStudio because typing `str_` will trigger autocomplete, allowing you to jog your memory of the available functions.
```{r}
#| echo: false
knitr::include_graphics("screenshots/stringr-autocomplete.png")
```
## Creating a string
We've created strings in passing earlier in the book but didn't discuss the details.
Firstly, you can create a string using either single quotes (`'`) or double quotes (`"`).
There's no difference in behavior between the two, so in the interests of consistency, the [tidyverse style guide](https://style.tidyverse.org/syntax.html#character-vectors) recommends using `"`, unless the string contains multiple `"`.
```{r}
string1 <- "This is a string"
string2 <- 'If I want to include a "quote" inside a string, I use single quotes'
```
If you forget to close a quote, you'll see `+`, the continuation prompt:
> "This is a string without a closing quote
+
+
+ HELP I'M STUCK IN A STRING
If this happens to you and you can't figure out which quote to close, press Escape to cancel and try again.
### Escapes
To include a literal single or double quote in a string, you can use `\` to "escape" it:
```{r}
double_quote <- "\"" # or '"'
single_quote <- '\'' # or "'"
```
So if you want to include a literal backslash in your string, you'll need to escape it: `"\\"`:
```{r}
backslash <- "\\"
```
Beware that the printed representation of a string is not the same as the string itself because the printed representation shows the escapes (in other words, when you print a string, you can copy and paste the output to recreate that string).
To see the raw contents of the string, use `str_view()`[^strings-1]:
[^strings-1]: Or use the base R function `writeLines()`.
```{r}
x <- c(single_quote, double_quote, backslash)
x
str_view(x)
```
### Raw strings {#sec-raw-strings}
Creating a string with multiple quotes or backslashes gets confusing quickly.
To illustrate the problem, let's create a string that contains the contents of the code block where we define the `double_quote` and `single_quote` variables:
```{r}
tricky <- "double_quote <- \"\\\"\" # or '\"'
single_quote <- '\\'' # or \"'\""
str_view(tricky)
```
That's a lot of backslashes!
(This is sometimes called [leaning toothpick syndrome](https://en.wikipedia.org/wiki/Leaning_toothpick_syndrome).) To eliminate the escaping, you can instead use a **raw string**[^strings-2]:
[^strings-2]: Available in R 4.0.0 and above.
```{r}
tricky <- r"(double_quote <- "\"" # or '"'
single_quote <- '\'' # or "'")"
str_view(tricky)
```
A raw string usually starts with `r"(` and finishes with `)"`.
But if your string contains `)"` you can instead use `r"[]"` or `r"{}"`, and if that's still not enough, you can insert any number of dashes to make the opening and closing pairs unique, e.g. `` `r"--()--" ``, `` `r"---()---" ``, etc. Raw strings are flexible enough to handle any text.
### Other special characters
As well as `\"`, `\'`, and `\\`, there are a handful of other special characters that may come in handy. The most common are `\n`, a new line, and `\t`, tab. You'll also sometimes see strings containing Unicode escapes that start with `\u` or `\U`. This is a way of writing non-English characters that work on all systems. You can see the complete list of other special characters in `?Quotes`.
```{r}
x <- c("one\ntwo", "one\ttwo", "\u00b5", "\U0001f604")
x
str_view(x)
```
Note that `str_view()` uses a blue background for tabs to make them easier to spot.
One of the challenges of working with text is that there's a variety of ways that white space can end up in the text, so this background helps you recognize that something strange is going on.
### Exercises
1. Create strings that contain the following values:
1. `He said "That's amazing!"`
2. `\a\b\c\d`
3. `\\\\\\`
2. Create the string in your R session and print it.
What happens to the special "\\u00a0"?
How does `str_view()` display it?
Can you do a little googling to figure out what this special character is?
```{r}
x <- "This\u00a0is\u00a0tricky"
```
## Creating many strings from data
Now that you've learned the basics of creating a string or two by "hand", we'll go into the details of creating strings from other strings.
This will help you solve the common problem where you have some text you wrote that you want to combine with strings from a data frame.
For example, you might combine "Hello" with a `name` variable to create a greeting.
We'll show you how to do this with `str_c()` and `str_glue()` and how you can use them with `mutate()`.
That naturally raises the question of what string functions you might use with `summarize()`, so we'll finish this section with a discussion of `str_flatten()`, which is a summary function for strings.
### `str_c()`
`str_c()` takes any number of vectors as arguments and returns a character vector:
```{r}
str_c("x", "y")
str_c("x", "y", "z")
str_c("Hello ", c("John", "Susan"))
```
`str_c()` is very similar to the base `paste0()`, but is designed to be used with `mutate()` by obeying the usual tidyverse rules for recycling and propagating missing values:
```{r}
df <- tibble(name = c("Flora", "David", "Terra", NA))
df |> mutate(greeting = str_c("Hi ", name, "!"))
```
If you want missing values to display in another way, use `coalesce()` to replace them.
Depending on what you want, you might use it either inside or outside of `str_c()`:
```{r}
df |>
mutate(
greeting1 = str_c("Hi ", coalesce(name, "you"), "!"),
greeting2 = coalesce(str_c("Hi ", name, "!"), "Hi!")
)
```
### `str_glue()` {#sec-glue}
If you are mixing many fixed and variable strings with `str_c()`, you'll notice that you type a lot of `"`s, making it hard to see the overall goal of the code. An alternative approach is provided by the [glue package](https://glue.tidyverse.org) via `str_glue()`[^strings-3]. You give it a single string that has a special feature: anything inside `{}` will be evaluated like it's outside of the quotes:
[^strings-3]: If you're not using stringr, you can also access it directly with `glue::glue()`.
```{r}
df |> mutate(greeting = str_glue("Hi {name}!"))
```
As you can see, `str_glue()` currently converts missing values to the string `"NA"` unfortunately making it inconsistent with `str_c()`.
You also might wonder what happens if you need to include a regular `{` or `}` in your string.
You're on the right track if you guess you'll need to escape it somehow.
The trick is that glue uses a slightly different escaping technique; instead of prefixing with special character like `\`, you double up the special characters:
```{r}
df |> mutate(greeting = str_glue("{{Hi {name}!}}"))
```
### `str_flatten()`
`str_c()` and `glue()` work well with `mutate()` because their output is the same length as their inputs.
What if you want a function that works well with `summarize()`, i.e. something that always returns a single string?
That's the job of `str_flatten()`[^strings-4]: it takes a character vector and combines each element of the vector into a single string:
[^strings-4]: The base R equivalent is `paste()` used with the `collapse` argument.
```{r}
str_flatten(c("x", "y", "z"))
str_flatten(c("x", "y", "z"), ", ")
str_flatten(c("x", "y", "z"), ", ", last = ", and ")
```
This makes it work well with `summarize()`:
```{r}
df <- tribble(
~ name, ~ fruit,
"Carmen", "banana",
"Carmen", "apple",
"Marvin", "nectarine",
"Terence", "cantaloupe",
"Terence", "papaya",
"Terence", "madarin"
)
df |>
group_by(name) |>
summarize(fruits = str_flatten(fruit, ", "))
```
### Exercises
1. Compare and contrast the results of `paste0()` with `str_c()` for the following inputs:
```{r}
#| eval: false
str_c("hi ", NA)
str_c(letters[1:2], letters[1:3])
```
2. What's the difference between `paste()` and `paste0()`?
How can you recreate the equivalent of `paste()` with `str_c()`?
3. Convert the following expressions from `str_c()` to `str_glue()` or vice versa:
a. `str_c("The price of ", food, " is ", price)`
b. `str_glue("I'm {age} years old and live in {country}")`
c. `str_c("\\section{", title, "}")`
## Extracting data from strings
It's very common for multiple variables to be crammed together into a single string.
In this section, you'll learn how to use four tidyr functions to extract them:
- `df |> separate_longer_delim(col, delim)`
- `df |> separate_longer_position(col, width)`
- `df |> separate_wider_delim(col, delim, names)`
- `df |> separate_wider_position(col, widths)`
If you look closely, you can see there's a common pattern here: `separate_`, then `longer` or `wider`, then `_`, then by `delim` or `position`.
That's because these four functions are composed of two simpler primitives:
- Just like with `pivot_longer()` and `pivot_wider()`, `_longer` functions make the input data frame longer by creating new rows and `_wider` functions make the input data frame wider by generating new columns.
- `delim` splits up a string with a delimiter like `", "` or `" "`; `position` splits at specified widths, like `c(3, 5, 2)`.
We'll return to the last member of this family, `separate_regex_wider()`, in @sec-regular-expressions.
It's the most flexible of the `wider` functions, but you need to know something about regular expressions before you can use it.
The following two sections will give you the basic idea behind these separate functions, first separating into rows (which is a little simpler) and then separating into columns.
We'll finish off by discussing the tools that the `wider` functions give you to diagnose problems.
### Separating into rows
Separating a string into rows tends to be most useful when the number of components varies from row to row.
The most common case is requiring `separate_longer_delim()` to split based on a delimiter:
```{r}
df1 <- tibble(x = c("a,b,c", "d,e", "f"))
df1 |>
separate_longer_delim(x, delim = ",")
```
It's rarer to see `separate_longer_position()` in the wild, but some older datasets do use a very compact format where each character is used to record a value:
```{r}
df2 <- tibble(x = c("1211", "131", "21"))
df2 |>
separate_longer_position(x, width = 1)
```
### Separating into columns {#sec-string-columns}
Separating a string into columns tends to be most useful when there are a fixed number of components in each string, and you want to spread them into columns.
They are slightly more complicated than their `longer` equivalents because you need to name the columns.
For example, in this following dataset, `x` is made up of a code, an edition number, and a year, separated by `"."`.
To use `separate_wider_delim()`, we supply the delimiter and the names in two arguments:
```{r}
df3 <- tibble(x = c("a10.1.2022", "b10.2.2011", "e15.1.2015"))
df3 |>
separate_wider_delim(
x,
delim = ".",
names = c("code", "edition", "year")
)
```
If a specific piece is not useful you can use an `NA` name to omit it from the results:
```{r}
df3 |>
separate_wider_delim(
x,
delim = ".",
names = c("code", NA, "year")
)
```
`separate_wider_position()` works a little differently because you typically want to specify the width of each column.
So you give it a named integer vector, where the name gives the name of the new column, and the value is the number of characters it occupies.
You can omit values from the output by not naming them:
```{r}
df4 <- tibble(x = c("202215TX", "202122LA", "202325CA"))
df4 |>
separate_wider_position(
x,
widths = c(year = 4, age = 2, state = 2)
)
```
### Diagnosing widening problems
`separate_wider_delim()`[^strings-5] requires a fixed and known set of columns.
What happens if some of the rows don't have the expected number of pieces?
There are two possible problems, too few or too many pieces, so `separate_wider_delim()` provides two arguments to help: `too_few` and `too_many`. Let's first look at the `too_few` case with the following sample dataset:
[^strings-5]: The same principles apply to `separate_wider_position()` and `separate_wider_regex()`.
```{r}
#| error: true
df <- tibble(x = c("1-1-1", "1-1-2", "1-3", "1-3-2", "1"))
df |>
separate_wider_delim(
x,
delim = "-",
names = c("x", "y", "z")
)
```
You'll notice that we get an error, but the error gives us some suggestions on how you might proceed.
Let's start by debugging the problem:
```{r}
debug <- df |>
separate_wider_delim(
x,
delim = "-",
names = c("x", "y", "z"),
too_few = "debug"
)
debug
```
When you use the debug mode, you get three extra columns added to the output: `x_ok`, `x_pieces`, and `x_remainder` (if you separate a variable with a different name, you'll get a different prefix).
Here, `x_ok` lets you quickly find the inputs that failed:
```{r}
debug |> filter(!x_ok)
```
`x_pieces` tells us how many pieces were found, compared to the expected 3 (the length of `names`).
`x_remainder` isn't useful when there are too few pieces, but we'll see it again shortly.
Sometimes looking at this debugging information will reveal a problem with your delimiter strategy or suggest that you need to do more preprocessing before separating.
In that case, fix the problem upstream and make sure to remove `too_few = "debug"` to ensure that new problems become errors.
In other cases, you may want to fill in the missing pieces with `NA`s and move on.
That's the job of `too_few = "align_start"` and `too_few = "align_end"` which allow you to control where the `NA`s should go:
```{r}
df |>
separate_wider_delim(
x,
delim = "-",
names = c("x", "y", "z"),
too_few = "align_start"
)
```
The same principles apply if you have too many pieces:
```{r}
#| error: true
df <- tibble(x = c("1-1-1", "1-1-2", "1-3-5-6", "1-3-2", "1-3-5-7-9"))
df |>
separate_wider_delim(
x,
delim = "-",
names = c("x", "y", "z")
)
```
But now, when we debug the result, you can see the purpose of `x_remainder`:
```{r}
debug <- df |>
separate_wider_delim(
x,
delim = "-",
names = c("x", "y", "z"),
too_many = "debug"
)
debug |> filter(!x_ok)
```
You have a slightly different set of options for handling too many pieces: you can either silently "drop" any additional pieces or "merge" them all into the final column:
```{r}
df |>
separate_wider_delim(
x,
delim = "-",
names = c("x", "y", "z"),
too_many = "drop"
)
df |>
separate_wider_delim(
x,
delim = "-",
names = c("x", "y", "z"),
too_many = "merge"
)
```
## Letters
In this section, we'll introduce you to functions that allow you to work with the individual letters within a string.
You'll learn how to find the length of a string, extract substrings, and handle long strings in plots and tables.
### Length
`str_length()` tells you the number of letters in the string:
```{r}
str_length(c("a", "R for data science", NA))
```
You could use this with `count()` to find the distribution of lengths of US babynames and then with `filter()` to look at the longest names, which happen to have 15 letters[^strings-6]:
[^strings-6]: Looking at these entries, we'd guess that the babynames data drops spaces or hyphens and truncates after 15 letters.
```{r}
babynames |>
count(length = str_length(name), wt = n)
babynames |>
filter(str_length(name) == 15) |>
count(name, wt = n, sort = TRUE)
```
### Subsetting
You can extract parts of a string using `str_sub(string, start, end)`, where `start` and `end` are the positions where the substring should start and end.
The `start` and `end` arguments are inclusive, so the length of the returned string will be `end - start + 1`:
```{r}
x <- c("Apple", "Banana", "Pear")
str_sub(x, 1, 3)
```
You can use negative values to count back from the end of the string: -1 is the last character, -2 is the second to last character, etc.
```{r}
str_sub(x, -3, -1)
```
Note that `str_sub()` won't fail if the string is too short: it will just return as much as possible:
```{r}
str_sub("a", 1, 5)
```
We could use `str_sub()` with `mutate()` to find the first and last letter of each name:
```{r}
babynames |>
mutate(
first = str_sub(name, 1, 1),
last = str_sub(name, -1, -1)
)
```
### Exercises
1. When computing the distribution of the length of babynames, why did we use `wt = n`?
2. Use `str_length()` and `str_sub()` to extract the middle letter from each baby name. What will you do if the string has an even number of characters?
3. Are there any major trends in the length of babynames over time? What about the popularity of first and last letters?
## Non-English text {#sec-other-languages}
So far, we've focused on English language text which is particularly easy to work with for two reasons.
Firstly, the English alphabet is relatively simple: there are just 26 letters.
Secondly (and maybe more importantly), the computing infrastructure we use today was predominantly designed by English speakers.
Unfortunately, we don't have room for a full treatment of non-English languages.
Still, we wanted to draw your attention to some of the biggest challenges you might encounter: encoding, letter variations, and locale-dependent functions.
### Encoding
When working with non-English text, the first challenge is often the **encoding**.
To understand what's going on, we need to dive into how computers represent strings.
In R, we can get at the underlying representation of a string using `charToRaw()`:
```{r}
charToRaw("Hadley")
```
Each of these six hexadecimal numbers represents one letter: `48` is H, `61` is a, and so on.
The mapping from hexadecimal number to character is called the encoding, and in this case, the encoding is called ASCII.
ASCII does a great job of representing English characters because it's the **American** Standard Code for Information Interchange.
Things aren't so easy for languages other than English.
In the early days of computing, there were many competing standards for encoding non-English characters.
For example, there were two different encodings for Europe: Latin1 (aka ISO-8859-1) was used for Western European languages, and Latin2 (aka ISO-8859-2) was used for Central European languages.
In Latin1, the byte `b1` is "±", but in Latin2, it's "ą"!
Fortunately, today there is one standard that is supported almost everywhere: UTF-8.
UTF-8 can encode just about every character used by humans today and many extra symbols like emojis.
readr uses UTF-8 everywhere.
This is a good default but will fail for data produced by older systems that don't use UTF-8.
If this happens, your strings will look weird when you print them.
Sometimes just one or two characters might be messed up; other times, you'll get complete gibberish.
For example here are two inline CSVs with unusual encodings[^strings-7]:
[^strings-7]: Here I'm using the special `\x` to encode binary data directly into a string.
```{r}
#| message: false
x1 <- "text\nEl Ni\xf1o was particularly bad this year"
read_csv(x1)
x2 <- "text\n\x82\xb1\x82\xf1\x82\xc9\x82\xbf\x82\xcd"
read_csv(x2)
```
To read these correctly, you specify the encoding via the `locale` argument:
```{r}
#| message: false
read_csv(x1, locale = locale(encoding = "Latin1"))
read_csv(x2, locale = locale(encoding = "Shift-JIS"))
```
How do you find the correct encoding?
If you're lucky, it'll be included somewhere in the data documentation.
Unfortunately, that's rarely the case, so readr provides `guess_encoding()` to help you figure it out.
It's not foolproof and works better when you have lots of text (unlike here), but it's a reasonable place to start.
Expect to try a few different encodings before you find the right one.
Encodings are a rich and complex topic; we've only scratched the surface here.
If you'd like to learn more, we recommend reading the detailed explanation at <http://kunststube.net/encoding/>.
### Letter variations
Working in languages with accents poses a significant challenge when determining the position of letters (e.g. with `str_length()` and `str_sub()`) as accented letters might be encoded as a single individual character (e.g. ü) or as two characters by combining an unaccented letter (e.g. u) with a diacritic mark (e.g. ¨).
For example, this code shows two ways of representing ü that look identical:
```{r}
u <- c("\u00fc", "u\u0308")
str_view(u)
```
But both strings differ in length, and their first characters are different:
```{r}
str_length(u)
str_sub(u, 1, 1)
```
Finally, note that a comparison of these strings with `==` interprets these strings as different, while the handy `str_equal()` function in stringr recognizes that both have the same appearance:
```{r}
u[[1]] == u[[2]]
str_equal(u[[1]], u[[2]])
```
### Locale-dependent functions
Finally, there are a handful of stringr functions whose behavior depends on your **locale**.
A locale is similar to a language but includes an optional region specifier to handle regional variations within a language.
A locale is specified by a lower-case language abbreviation, optionally followed by a `_` and an upper-case region identifier.
For example, "en" is English, "en_GB" is British English, and "en_US" is American English.
If you don't already know the code for your language, [Wikipedia](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) has a good list, and you can see which are supported in stringr by looking at `stringi::stri_locale_list()`.
Base R string functions automatically use the locale set by your operating system.
This means that base R string functions do what you expect for your language, but your code might work differently if you share it with someone who lives in a different country.
To avoid this problem, stringr defaults to English rules by using the "en" locale and requires you to specify the `locale` argument to override it.
Fortunately, there are two sets of functions where the locale really matters: changing case and sorting.
The rules for changing cases differ among languages.
For example, Turkish has two i's: with and without a dot.
Since they're two distinct letters, they're capitalized differently:
```{r}
str_to_upper(c("i", "ı"))
str_to_upper(c("i", "ı"), locale = "tr")
```
Sorting strings depends on the order of the alphabet, and the order of the alphabet is not the same in every language[^strings-8]!
Here's an example: in Czech, "ch" is a compound letter that appears after `h` in the alphabet.
[^strings-8]: Sorting in languages that don't have an alphabet, like Chinese, is more complicated still.
```{r}
str_sort(c("a", "c", "ch", "h", "z"))
str_sort(c("a", "c", "ch", "h", "z"), locale = "cs")
```
This also comes up when sorting strings with `dplyr::arrange()`, which is why it also has a `locale` argument.
## Summary
In this chapter, you've learned about some of the power of the stringr package: how to create, combine, and extract strings, and about some of the challenges you might face with non-English strings.
Now it's time to learn one of the most important and powerful tools for working with strings: regular expressions.
Regular expressions are a very concise but very expressive language for describing patterns within strings and are the topic of the next chapter.