-outcome_df$usePatternUDS[1:20] %>%
-  str_count(pattern = "o")
+# Answer to exercise above, for you to confirm that you did it correctly:
+outcome_df <-
+  outcomesCTN0094 %>%
+  select(who, usePatternUDS, RsT_ctnNinetyFour_2023)
+
+# Inspect the data
+outcome_df
# A tibble: 3,560 × 3
+ who usePatternUDS RsT_ctnNinetyFour_2023
+ <dbl> <chr> <dbl>
+ 1 1 ooooooooooooooo 1
+ 2 2 *---oo-o-o-o+oo 12
+ 3 3 o-ooo-ooooooooooooooooo 7
+ 4 4 -------------------o-o-o 21
+ 5 5 ooooooooooooooo 1
+ 6 6 *oooooooooooooo 1
+ 7 7 ----oooooooooooooooooooo 5
+ 8 8 ooooooooooooooooooooooooo 1
+ 9 9 oooooooooooooooooooooo 1
+10 10 ------+++-++++o+++o+-o 11
+# ℹ 3,550 more rows
+# Look at the opioid use pattern of the first 20 participants
+outcome_df %>%
+  slice(1:20) %>%
+  pull(usePatternUDS) %>%
+  str_count(pattern = "o")
[1] 15 7 21 3 15 14 20 25 22 3 6 6 11 19 25 6 0 25 7 25
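When counting runs of missed visits, keep in mind that str_count() counts non-overlapping matches, so a long run of "o" symbols contributes fewer matches of "ooo" than one might expect. The sketch below uses a made-up use pattern (not a row of outcome_df), and the lookahead pattern is only one possible way to count overlapping runs:

```r
library(stringr)

# Hypothetical use pattern; not taken from outcome_df
toyPattern <- "o-ooo-ooooooo"

# Non-overlapping matches of three missed visits in a row
nRuns <- str_count(toyPattern, pattern = "ooo")

# Overlapping matches: the lookahead (?=ooo) matches at every position
#   where three consecutive "o" symbols begin
nRunsOverlap <- str_count(toyPattern, pattern = "(?=ooo)")

c(nonOverlapping = nRuns, overlapping = nRunsOverlap)
```

Which of the two counts is the right one depends on how "three missed visits in a row" is defined for the analysis.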
Functions to Know
Replacing one Pattern with Another
In the fruit vector, we could replace all the vowels with upper case letters to help children identify vowels within words. Recall that str_replace() only replaces the first match in the character string, so we will use str_replace_all(). This will give an example of piping multiple string commands together (which is often how we perform string manipulation).
-fruit %>%
-  str_replace_all(pattern = "a", replacement = "A") %>%
-  str_replace_all(pattern = "e", replacement = "E") %>%
-  str_replace_all(pattern = "i", replacement = "I") %>%
-  str_replace_all(pattern = "o", replacement = "O") %>%
-  str_replace_all(pattern = "u", replacement = "U")
+fruit %>%
+  str_replace_all(pattern = "a", replacement = "A") %>%
+  str_replace_all(pattern = "e", replacement = "E") %>%
+  str_replace_all(pattern = "i", replacement = "I") %>%
+  str_replace_all(pattern = "o", replacement = "O") %>%
+  str_replace_all(pattern = "u", replacement = "U")
[1] "ApplE" "AprIcOt" "AvOcAdO"
[4] "bAnAnA" "bEll pEppEr" "bIlbErry"
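As an alternative to five piped calls, str_replace_all() also accepts a named character vector, where the names are the patterns and the values are the replacements. This is a minimal sketch of that option, not the approach used in the lesson; fruitToy and vowelMap are names made up for illustration:

```r
library(stringr)

# First three entries of the fruit vector, as seen in the output above
fruitToy <- c("apple", "apricot", "avocado")

# Named vector: names are the patterns, values are the replacements
vowelMap <- c("a" = "A", "e" = "E", "i" = "I", "o" = "O", "u" = "U")

# Each pattern/replacement pair is applied in turn to every string
fruitUpperVowels <- str_replace_all(fruitToy, vowelMap)
fruitUpperVowels
```

The piped version makes each step visible, while the named vector keeps the mapping in one place if it needs to be reused.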
@@ -556,12 +582,12 @@ Replaci
Removing Characters that Match a Pattern
In much of text analysis, sentences are analyzed without the “filler words” (known as stop words), such as “and”, “to”, “the”, “of”, “a”, “was”, “is”, etc. We can remove these words from our set of sentences.
-sentences[1:20] %>%
-  str_remove_all(pattern = "and") %>%
-  str_remove_all(pattern = "to") %>%
-  str_remove_all(pattern = "the") %>%
-  str_remove_all(pattern = "of") %>%
-  str_remove_all(pattern = "a")
+sentences[1:20] %>%
+  str_remove_all(pattern = "and") %>%
+  str_remove_all(pattern = "to") %>%
+  str_remove_all(pattern = "the") %>%
+  str_remove_all(pattern = "of") %>%
+  str_remove_all(pattern = "a")
[1] "The birch cnoe slid on smooth plnks."
[2] "Glue sheet drk blue bckground."
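Notice in the output above that removing the pattern "a" also strips that letter from inside other words ("canoe" becomes "cnoe"). One possible remedy, sketched here under the assumption that stop words should only be removed when they stand alone, is to wrap the stop words in word boundaries (\\b). The sentence and object names below are made up for illustration:

```r
library(stringr)

# Hypothetical sentence in the style of the sentences vector
toySentence <- "The birch canoe slid on the smooth planks."

# \\b marks a word boundary, so "a" only matches when it is a whole word
stopWordsRegex <- "\\b(and|to|the|of|a)\\b"

# Remove the whole-word matches, then collapse the leftover double spaces
removed <- str_remove_all(toySentence, pattern = stopWordsRegex)
cleaned <- str_squish(removed)
cleaned
```

The capitalised "The" at the start of the sentence is still not matched, because the pattern is case sensitive.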
@@ -614,10 +640,10 @@ R
Changing Case
In the above example, some of the stop words were not removed because they were at the start of the sentence (and therefore had a capital letter). We can change all the letters in a string to be the same case (which makes pattern matching easier) with the str_to_lower() and str_to_upper() functions. Notice that we added the str_to_lower() call in the pipeline before removing the stop words.
-sentences[1:20] %>%
-  str_to_lower() %>%
-  str_remove_all(pattern = "and ") %>%
-  str_remove_all(pattern = "the ")
+sentences[1:20] %>%
+  str_to_lower() %>%
+  str_remove_all(pattern = "and ") %>%
+  str_remove_all(pattern = "the ")
[1] "birch canoe slid on smooth planks."
[2] "glue sheet to dark blue background."
@@ -682,23 +708,23 @@ Functions to Know
In my experience, the functions in this section are most useful when dealing with very organized text data. For example, my students and I were working on a dataset that recorded the heights of participants as text. The entries of this data table would have been something like this:
-heightsF_char <- c("60in", "68in", "66in", "60in", "65in", "62in", "63in")
-heightsM_char <- c("72in", "68in", "73in", "65in", "71in", "66in", "67in")
+heightsF_char <- c("60in", "68in", "66in", "60in", "65in", "62in", "63in")
+heightsM_char <- c("72in", "68in", "73in", "65in", "71in", "66in", "67in")
Substrings by Position
If we know that the information we want is always in the same position, then we can create a substring using only the “letters” between these positions with str_sub().
-# Count forward (from the start of the string):
-heightsF_char %>%
-  str_sub(start = 1, end = 2)
+# Count forward (from the start of the string):
+heightsF_char %>%
+  str_sub(start = 1, end = 2)
[1] "60" "68" "66" "60" "65" "62" "63"
-# Count backwards (from the end of the string):
-heightsF_char %>%
-  str_sub(start = -4, end = -3)
+# Count backwards (from the end of the string):
+heightsF_char %>%
+  str_sub(start = -4, end = -3)
[1] "60" "68" "66" "60" "65" "62" "63"
@@ -721,9 +747,9 @@ Substrings by Posit
Substrings by Pattern
Instead, if we know that the information we want always follows the same pattern, then we can extract the matching pattern with str_extract().
-heightsF_char %>%
-  # We want the numeric digits (\\d) that are two characters long ({2})
-  str_extract(pattern = "\\d{2}")
+heightsF_char %>%
+  # We want the numeric digits (\\d) that are two characters long ({2})
+  str_extract(pattern = "\\d{2}")
[1] "60" "68" "66" "60" "65" "62" "63"
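The pattern "\\d{2}" works here because every height has exactly two digits. If the values could have one or three digits, a quantifier such as "\\d+" (one or more digits) may be a safer choice. The following is a small sketch with made-up heights, not data from the lesson:

```r
library(stringr)

# Hypothetical heights with one, two, and three digits
heightsToy_char <- c("9in", "60in", "100in")

# "\\d{2}" would return NA for "9in" and only "10" for "100in";
#   "\\d+" takes the whole run of digits instead
heightsToy_num <- as.numeric(str_extract(heightsToy_char, pattern = "\\d+"))
heightsToy_num
```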
@@ -766,13 +792,13 @@ Functions to Know
String Lengths
The str_length() function is useful when dealing with \(n\)-ary words. These are sets of letters or numbers where each single symbol from a pre-defined set of \(n\) possible symbols represents a state in a system. Examples include the use pattern “words” in the outcome_df data set; DNA/RNA (“CCCCAACGTGTG” is a string of letters where each single letter represents one of the four DNA nucleotide bases: Cytosine, Adenine, Thymine, and Guanine); or class attendance (“PPPPPAPP” represents a student’s attendance record over eight weeks as “Present” or “Absent”).
-# How many nucleotides in the strand?
-str_length("CCCCAACGTGTG")
+# How many nucleotides in the strand?
+str_length("CCCCAACGTGTG")
[1] 12
-# How many weeks of attendance data?
-str_length("PPPPPAPP")
+# How many weeks of attendance data?
+str_length("PPPPPAPP")
[1] 8
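Because str_length() and str_count() are both vectorised, they can be combined to summarise several \(n\)-ary words at once. This sketch assumes a second, made-up attendance record alongside the one above, and the object names are illustrative only:

```r
library(stringr)

# First record is from the text above; the second is hypothetical
attendance <- c("PPPPPAPP", "PAPAPPPP")

nWeeks <- str_length(attendance)                 # weeks recorded per student
nAbsent <- str_count(attendance, pattern = "A")  # weeks marked "Absent"

# Proportion of weeks each student was present
propPresent <- (nWeeks - nAbsent) / nWeeks
propPresent
```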
@@ -783,31 +809,31 @@ Trimming Strings
This comes up for me most often when dealing with very long labels in ggplot figures. Sometimes a factor label is really long, and ggplot tries to fit the whole label in the figure, which ends up making the whole figure look weird.
Here’s an example. I’m going to create a simple data set with one very long factor label.
-bookPages_df <- tibble(
-  title = c("Germinal", "Frankenstein; or, The Modern Prometheus"),
-  author = c("Emile Zola", "Mary Shelley"),
-  pageCountOriginal = c(591L, 362L),
-  year = c(1885, 1818)
-)
-
-# Original
-ggplot(data = bookPages_df) +
-  aes(x = year, y = pageCountOriginal, shape = title) +
-  geom_point()
+bookPages_df <- tibble(
+  title = c("Germinal", "Frankenstein; or, The Modern Prometheus"),
+  author = c("Emile Zola", "Mary Shelley"),
+  pageCountOriginal = c(591L, 362L),
+  year = c(1885, 1818)
+)
+
+# Original
+ggplot(data = bookPages_df) +
+  aes(x = year, y = pageCountOriginal, shape = title) +
+  geom_point()
Now I’m going to truncate the very long title of Frankenstein.
-# Truncated text
-bookPages_df %>%
-  mutate(
-    title = str_trunc(title, width = 15)
-  ) %>%
-  ggplot() +
-  aes(x = year, y = pageCountOriginal, shape = title) +
-  geom_point()
+# Truncated text
+bookPages_df %>%
+  mutate(
+    title = str_trunc(title, width = 15)
+  ) %>%
+  ggplot() +
+  aes(x = year, y = pageCountOriginal, shape = title) +
+  geom_point()
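To see what str_trunc() does to the labels without drawing the figure, it can be applied to the titles directly. This is a small sketch outside the plotting pipeline; note that the width counts the trailing ellipsis, so only 12 characters of the original title survive when width = 15:

```r
library(stringr)

titles <- c("Germinal", "Frankenstein; or, The Modern Prometheus")

# Strings at or under the width are returned unchanged; longer strings are
#   cut so that the text plus the "..." ellipsis is exactly 15 characters
shortTitles <- str_trunc(titles, width = 15)
shortTitles
str_length(shortTitles)
```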
@@ -817,21 +843,21 @@ Trimming Strings
Padding Strings
This comes up when I’m trying to create file names on a computer. Here’s the issue:
-1:11 %>%
-  as.character() %>%
-  sort()
+1:11 %>%
+  as.character() %>%
+  sort()
[1] "1" "10" "11" "2" "3" "4" "5" "6" "7" "8" "9"
When the computer turns numbers into characters, it sorts them alphabetically rather than numerically, so “10” and “11” land before “2”. I want 10 and 11 to come last, but the computer doesn’t interpret these strings as numbers the way that I do. The solution is to pad the numbers on the left with “0” so that the ordering is preserved:
-1:11 %>%
-  as.character() %>%
-  # Set the width to 2 digits if 99 is enough, but increase to 3 digits in case
-  #   I need to go past 99 (up to 999)
-  str_pad(width = 3, side = "left", pad = "0") %>%
-  sort()
+1:11 %>%
+  as.character() %>%
+  # Set the width to 2 digits if 99 is enough, but increase to 3 digits in case
+  #   I need to go past 99 (up to 999)
+  str_pad(width = 3, side = "left", pad = "0") %>%
+  sort()
[1] "001" "002" "003" "004" "005" "006" "007" "008" "009" "010" "011"
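For the file name use case, the padded numbers can then be pasted into a name template. The following is a minimal sketch; the "results_" prefix and ".csv" extension are made up for illustration:

```r
library(stringr)

# Pad the file indices to three digits, as in the example above
fileIDs <- str_pad(as.character(1:11), width = 3, side = "left", pad = "0")

# Hypothetical file name template
fileNames <- str_c("results_", fileIDs, ".csv")

# Sorting the names as text now keeps them in numeric order
sort(fileNames)
```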
@@ -848,40 +874,40 @@ Modifying Strings in Tibbles
Example: Plotting Participant Heights
Here I show an entire (simplified) workflow to take in the height data and plot it by biological sex.
-### Create tidy data ###
-heights_df <- tibble(
-  is_female = c(
-    rep(TRUE, length(heightsF_char)),
-    rep(FALSE, length(heightsM_char))
-  ),
-  heights = c(heightsF_char, heightsM_char)
-)
-
-### Wrangle the Data ###
-heightsClean_df <-
-  heights_df %>%
-  # Step 1: Split the Units into another column
-  mutate(
-    units = str_sub(heights, start = -2, end = -1)
-  ) %>%
-  # Step 2: Extract the height values
-  mutate(
-    value = str_extract(heights, pattern = "\\d{2}")
-  ) %>%
-  # Step 3: change heights from character to numeric
-  mutate(
-    value = as.numeric(value)
-  ) %>%
-  # Step 4: remove the original column (check your work beforehand)
-  select(-heights) %>%
-  # Step 5: rename
-  rename(height = value)
-
-### Plot the Relationship ###
-ggplot(data = heightsClean_df) +
-  theme_classic() +
-  aes(x = is_female, y = height) +
-  geom_boxplot()
+### Create tidy data ###
+heights_df <- tibble(
+  is_female = c(
+    rep(TRUE, length(heightsF_char)),
+    rep(FALSE, length(heightsM_char))
+  ),
+  heights = c(heightsF_char, heightsM_char)
+)
+
+### Wrangle the Data ###
+heightsClean_df <-
+  heights_df %>%
+  # Step 1: Split the Units into another column
+  mutate(
+    units = str_sub(heights, start = -2, end = -1)
+  ) %>%
+  # Step 2: Extract the height values
+  mutate(
+    value = str_extract(heights, pattern = "\\d{2}")
+  ) %>%
+  # Step 3: change heights from character to numeric
+  mutate(
+    value = as.numeric(value)
+  ) %>%
+  # Step 4: remove the original column (check your work beforehand)
+  select(-heights) %>%
+  # Step 5: rename
+  rename(height = value)
+
+### Plot the Relationship ###
+ggplot(data = heightsClean_df) +
+  theme_classic() +
+  aes(x = is_female, y = height) +
+  geom_boxplot()
diff --git a/search.json b/search.json
index 5070435..b1196ad 100644
--- a/search.json
+++ b/search.json
@@ -151,7 +151,7 @@
"href": "lessons/lesson10_stringr.html#counting-matches",
"title": "Lesson 10: Wrangling Character Strings with stringr",
"section": "Counting Matches",
- "text": "Counting Matches\nIn the outcome_df data set, each symbol in the column usePatternUDS represents the patient status during a routine weekly clinic visit. The o symbol is used to represent a week when a clinical trial participant failed to visit the clinic for follow-up care. We can count how in many weeks each trial participant was missing (since this is an example, we will only look a the first 20 participants):\n\noutcome_df$usePatternUDS[1:20] %>% \n str_count(pattern = \"o\")\n\n [1] 15 7 21 3 15 14 20 25 22 3 6 6 11 19 25 6 0 25 7 25\n\n\n\n\n\n\n\n\nExercise\n\n\n\nMissing 3 clinic visits in a row is often a strong prognostic signal for a negative health outcome. Count the number of times per participant that the pattern “ooo” is seen. Use the first 20 patients only."
+ "text": "Counting Matches\nIn the outcome_df data set, each symbol in the column usePatternUDS represents the patient status during a routine weekly clinic visit. The o symbol is used to represent a week when a clinical trial participant failed to visit the clinic for follow-up care. We can count how in many weeks each trial participant was missing (since this is an example, we will only look a the first 20 participants):\n\n# Answer to exercise above, for you to confirm that you did it correctly:\noutcome_df <- \n outcomesCTN0094 %>% \n select(who, usePatternUDS, RsT_ctnNinetyFour_2023)\n\n# Inspect the data\noutcome_df\n\n# A tibble: 3,560 × 3\n who usePatternUDS RsT_ctnNinetyFour_2023\n \n 1 1 ooooooooooooooo 1\n 2 2 *---oo-o-o-o+oo 12\n 3 3 o-ooo-ooooooooooooooooo 7\n 4 4 -------------------o-o-o 21\n 5 5 ooooooooooooooo 1\n 6 6 *oooooooooooooo 1\n 7 7 ----oooooooooooooooooooo 5\n 8 8 ooooooooooooooooooooooooo 1\n 9 9 oooooooooooooooooooooo 1\n10 10 ------+++-++++o+++o+-o 11\n# ℹ 3,550 more rows\n\n# Look at the opioid use pattern of the first 20 participants\noutcome_df %>%\n slice(1:20) %>% \n pull(usePatternUDS) %>% \n str_count(pattern = \"o\")\n\n [1] 15 7 21 3 15 14 20 25 22 3 6 6 11 19 25 6 0 25 7 25\n\n\n\n\n\n\n\n\nExercise\n\n\n\nMissing 3 clinic visits in a row is often a strong prognostic signal for a negative health outcome. Count the number of times per participant that the pattern “ooo” is seen. Use the first 20 patients only."
},
{
"objectID": "lessons/lesson10_stringr.html#functions-to-know-1",
diff --git a/sitemap.xml b/sitemap.xml
index ced0ba8..4e15ecc 100644
--- a/sitemap.xml
+++ b/sitemap.xml
@@ -2,74 +2,74 @@
https://gabrielodom.github.io/PHC6701_r4ds/index.html
- 2024-01-03T15:34:37.505Z
+ 2024-01-03T16:45:26.780Z
https://gabrielodom.github.io/PHC6701_r4ds/about.html
- 2024-01-03T15:34:38.027Z
+ 2024-01-03T16:45:27.325Z
https://gabrielodom.github.io/PHC6701_r4ds/lessons/lesson01_greater_data_science.html
- 2024-01-03T15:34:40.506Z
+ 2024-01-03T16:45:29.790Z
https://gabrielodom.github.io/PHC6701_r4ds/lessons/lesson10_stringr.html
- 2024-01-03T15:34:45.629Z
+ 2024-01-03T16:45:34.994Z
https://gabrielodom.github.io/PHC6701_r4ds/lessons/lesson07_lists_and_tibbles.html
- 2024-01-03T15:34:50.632Z
+ 2024-01-03T16:45:39.961Z
https://gabrielodom.github.io/PHC6701_r4ds/lessons/lesson06_atomic_vectors.html
- 2024-01-03T15:34:53.957Z
+ 2024-01-03T16:45:43.306Z
https://gabrielodom.github.io/PHC6701_r4ds/lessons/lesson04s_examples.html
- 2024-01-03T15:35:10.541Z
+ 2024-01-03T16:45:59.699Z
https://gabrielodom.github.io/PHC6701_r4ds/lessons/lesson11_functions.html
- 2024-01-03T15:35:13.274Z
+ 2024-01-03T16:46:02.402Z
https://gabrielodom.github.io/PHC6701_r4ds/lessons/lesson09s_dplyr_v_base.html
- 2024-01-03T15:35:17.787Z
+ 2024-01-03T16:46:06.810Z
https://gabrielodom.github.io/PHC6701_r4ds/lessons/lesson04_ggplot.html
- 2024-01-03T15:35:28.711Z
+ 2024-01-03T16:46:17.637Z
https://gabrielodom.github.io/PHC6701_r4ds/lessons/lesson12_purrr.html
- 2024-01-03T15:35:35.254Z
+ 2024-01-03T16:46:23.995Z
https://gabrielodom.github.io/PHC6701_r4ds/lessons/lesson05_RStudio_projects.html
- 2024-01-03T15:35:37.803Z
+ 2024-01-03T16:46:26.624Z
https://gabrielodom.github.io/PHC6701_r4ds/lessons/lesson02s_scripts.html
- 2024-01-03T15:35:39.256Z
+ 2024-01-03T16:46:28.002Z
https://gabrielodom.github.io/PHC6701_r4ds/lessons/lesson08_Data_Read_Write.html
- 2024-01-03T15:35:43.952Z
+ 2024-01-03T16:46:32.454Z
https://gabrielodom.github.io/PHC6701_r4ds/lessons/lesson13_eda_w_tidyverse.html
- 2024-01-03T15:35:51.889Z
+ 2024-01-03T16:46:40.431Z
https://gabrielodom.github.io/PHC6701_r4ds/lessons/lesson02_introduction_to_R.html
- 2024-01-03T15:35:54.467Z
+ 2024-01-03T16:46:42.968Z
https://gabrielodom.github.io/PHC6701_r4ds/lessons/lesson03_introduction_to_Quarto.html
- 2024-01-03T15:35:55.156Z
+ 2024-01-03T16:46:43.647Z
https://gabrielodom.github.io/PHC6701_r4ds/lessons/lesson09_dplyr.html
- 2024-01-03T15:36:09.833Z
+ 2024-01-03T16:46:57.965Z