diff --git a/.nojekyll b/.nojekyll
index 7b79c48..b2fd269 100644
--- a/.nojekyll
+++ b/.nojekyll
@@ -1 +1 @@
-80dfce92
\ No newline at end of file
+e8ad3335
\ No newline at end of file
diff --git a/lessons/lesson10_stringr.html b/lessons/lesson10_stringr.html
index 660b122..5ff7241 100644
--- a/lessons/lesson10_stringr.html
+++ b/lessons/lesson10_stringr.html
@@ -464,8 +464,34 @@

Finding Matches

Counting Matches

In the outcome_df data set, each symbol in the column usePatternUDS represents the patient status during a routine weekly clinic visit. The o symbol represents a week when a clinical trial participant failed to visit the clinic for follow-up care. We can count the number of weeks in which each trial participant was missing (since this is an example, we will only look at the first 20 participants):

-
outcome_df$usePatternUDS[1:20] %>% 
-  str_count(pattern = "o")
+
# Answer to exercise above, for you to confirm that you did it correctly:
+outcome_df <- 
+  outcomesCTN0094 %>% 
+  select(who, usePatternUDS, RsT_ctnNinetyFour_2023)
+
+# Inspect the data
+outcome_df
+
+
# A tibble: 3,560 × 3
+     who usePatternUDS             RsT_ctnNinetyFour_2023
+   <dbl> <chr>                                      <dbl>
+ 1     1 ooooooooooooooo                                1
+ 2     2 *---oo-o-o-o+oo                               12
+ 3     3 o-ooo-ooooooooooooooooo                        7
+ 4     4 -------------------o-o-o                      21
+ 5     5 ooooooooooooooo                                1
+ 6     6 *oooooooooooooo                                1
+ 7     7 ----oooooooooooooooooooo                       5
+ 8     8 ooooooooooooooooooooooooo                      1
+ 9     9 oooooooooooooooooooooo                         1
+10    10 ------+++-++++o+++o+-o                        11
+# ℹ 3,550 more rows
+
+
# Count the weeks with a missed visit ("o") for the first 20 participants
+outcome_df %>%
+  slice(1:20) %>% 
+  pull(usePatternUDS) %>% 
+  str_count(pattern = "o")
 [1] 15  7 21  3 15 14 20 25 22  3  6  6 11 19 25  6  0 25  7 25
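To see what str_count() returns without loading the trial data, here is a minimal sketch on a made-up vector. The name toy_patterns and its values are assumptions for illustration only; they are not rows of outcome_df.

```r
library(stringr)

# Hypothetical visit patterns (NOT taken from outcome_df)
toy_patterns <- c("ooo-o", "-----", "*oo+o")

# One count per string: the number of "o" symbols in each pattern
missed <- str_count(toy_patterns, pattern = "o")
missed
```

The result has one integer per input string, which is why the output above has 20 values for 20 participants.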
@@ -502,12 +528,12 @@

Functions to Know

Replacing one Pattern with Another

In the fruit vector, we could replace all the vowels with upper case letters to help children identify vowels within words. Recall that str_replace() only replaces the first match in the character string, so we will use str_replace_all(). This will give an example of piping multiple string commands together (which is often how we perform string manipulation).

-
fruit %>% 
-  str_replace_all(pattern = "a", replacement = "A") %>% 
-  str_replace_all(pattern = "e", replacement = "E") %>% 
-  str_replace_all(pattern = "i", replacement = "I") %>% 
-  str_replace_all(pattern = "o", replacement = "O") %>% 
-  str_replace_all(pattern = "u", replacement = "U")
+
fruit %>% 
+  str_replace_all(pattern = "a", replacement = "A") %>% 
+  str_replace_all(pattern = "e", replacement = "E") %>% 
+  str_replace_all(pattern = "i", replacement = "I") %>% 
+  str_replace_all(pattern = "o", replacement = "O") %>% 
+  str_replace_all(pattern = "u", replacement = "U")
 [1] "ApplE"             "AprIcOt"           "AvOcAdO"          
  [4] "bAnAnA"            "bEll pEppEr"       "bIlbErry"         
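As an alternative sketch, str_replace_all() also accepts a named character vector, where each name is a pattern and each value is its replacement. The objects toy_fruit and vowel_map below are hypothetical names for illustration; the first three values of the fruit vector are copied by hand.

```r
library(stringr)

# First three entries of `fruit`, typed by hand for a self-contained example
toy_fruit <- c("apple", "apricot", "avocado")

# Names are the patterns to find; values are the replacements
vowel_map <- c("a" = "A", "e" = "E", "i" = "I", "o" = "O", "u" = "U")

upper_vowels <- str_replace_all(toy_fruit, vowel_map)
upper_vowels
```

This gives the same result as the five piped calls for these words; the piped version is arguably easier to read when learning, so which one to use is a matter of taste.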
@@ -556,12 +582,12 @@ 

Replaci

Removing Characters that Match a Pattern

In much of text analysis, sentences are analyzed without the “filler words” (known as stop words), such as “and”, “to”, “the”, “of”, “a”, “was”, “is”, etc. We can remove these words from our set of sentences.

-
sentences[1:20] %>% 
-  str_remove_all(pattern = "and") %>% 
-  str_remove_all(pattern = "to") %>% 
-  str_remove_all(pattern = "the") %>% 
-  str_remove_all(pattern = "of") %>% 
-  str_remove_all(pattern = "a")
+
sentences[1:20] %>% 
+  str_remove_all(pattern = "and") %>% 
+  str_remove_all(pattern = "to") %>% 
+  str_remove_all(pattern = "the") %>% 
+  str_remove_all(pattern = "of") %>% 
+  str_remove_all(pattern = "a")
 [1] "The birch cnoe slid on  smooth plnks."   
  [2] "Glue  sheet   drk blue bckground."       
@@ -614,10 +640,10 @@ 

R

Changing Case

In the above example, some of the stop words were not removed because they were at the start of the sentence (and therefore had a capital letter). We can change all the letters in a string to be the same case (which makes pattern matching easier) with the str_to_lower() and str_to_upper() functions. Notice that we added the str_to_lower() call in the pipeline before removing the stop words.

-
sentences[1:20] %>% 
-  str_to_lower() %>% 
-  str_remove_all(pattern = "and ") %>% 
-  str_remove_all(pattern = "the ")
+
sentences[1:20] %>% 
+  str_to_lower() %>% 
+  str_remove_all(pattern = "and ") %>% 
+  str_remove_all(pattern = "the ")
 [1] "birch canoe slid on smooth planks."        
  [2] "glue sheet to dark blue background."       
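Note that a bare pattern such as "a" also matches letters inside other words (this is why "canoe" became "cnoe" in the earlier output). One possible refinement, sketched here under the assumption that we only want whole words removed, is to wrap the stop words in word boundaries (\\b) and ignore case. The names toy_sentences and stop_pattern are made up for this example.

```r
library(stringr)

# Two hand-typed sentences (hypothetical stand-ins for `sentences`)
toy_sentences <- c(
  "The birch canoe slid on the smooth planks.",
  "Glue the sheet to the dark blue background."
)

# \\b marks a word boundary, so "a" no longer matches inside "canoe"
stop_pattern <- regex("\\b(and|to|the|of|a)\\b", ignore_case = TRUE)

# Remove whole stop words, then collapse the leftover runs of spaces
no_stops <- str_remove_all(toy_sentences, pattern = stop_pattern)
cleaned <- str_squish(no_stops)
cleaned
```

Because ignore_case = TRUE is set, this sketch does not need str_to_lower() first, so the original capitalisation of the remaining words is kept.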
@@ -682,23 +708,23 @@ 

Functions to Know

In my experience, the functions in this section are most useful when dealing with very organized text data. For example, my students and I were working on a dataset that recorded the heights of participants as text. The entries of this data table would have been something like this:

-
heightsF_char <- c("60in", "68in", "66in", "60in", "65in", "62in", "63in")
-heightsM_char <- c("72in", "68in", "73in", "65in", "71in", "66in", "67in")
+
heightsF_char <- c("60in", "68in", "66in", "60in", "65in", "62in", "63in")
+heightsM_char <- c("72in", "68in", "73in", "65in", "71in", "66in", "67in")

Substrings by Position

If we know that the information we want is always in the same position, then we can create a substring using only the “letters” between these positions with str_sub().

-
# Count forward (from the start of the string):
-heightsF_char %>% 
-  str_sub(start = 1, end = 2)
+
# Count forward (from the start of the string):
+heightsF_char %>% 
+  str_sub(start = 1, end = 2)
[1] "60" "68" "66" "60" "65" "62" "63"
-
# Count backwards (from the end of the string):
-heightsF_char %>% 
-  str_sub(start = -4, end = -3)
+
# Count backwards (from the end of the string):
+heightsF_char %>% 
+  str_sub(start = -4, end = -3)
[1] "60" "68" "66" "60" "65" "62" "63"
@@ -721,9 +747,9 @@

Substrings by Posit

Substrings by Pattern

Instead, if we know that the information we want always follows the same pattern, then we can extract the matching pattern with str_extract().

-
heightsF_char %>% 
-  # We want the numeric digits (\\d) that are two characters long ({2})
-  str_extract(pattern = "\\d{2}")
+
heightsF_char %>% 
+  # We want the numeric digits (\\d) that are two characters long ({2})
+  str_extract(pattern = "\\d{2}")
[1] "60" "68" "66" "60" "65" "62" "63"
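The pattern "\\d{2}" assumes every height has exactly two digits. As a hedged sketch of what happens otherwise, the made-up vector toy_heights below (not from the original data) mixes one-, two-, and three-digit values; the pattern "\\d+" (one or more digits) is one way to handle that.

```r
library(stringr)

# Hypothetical heights with differing numbers of digits
toy_heights <- c("60in", "9in", "100in")

# Exactly two digits: misses "9" and cuts "100" short
two_digits <- str_extract(toy_heights, pattern = "\\d{2}")

# One or more digits: takes the whole leading number
all_digits <- str_extract(toy_heights, pattern = "\\d+")

two_digits
all_digits
```

When no match is found, str_extract() returns NA for that element, which is a useful signal that the data are less regular than expected.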
@@ -766,13 +792,13 @@

Functions to Know

String Lengths

The str_length() function is useful when dealing with \(n\)-ary words. These are sets of letters or numbers where each single symbol from a pre-defined set of \(n\) possible symbols represents a state in a system. Examples include the use pattern “words” in the outcome_df data set; DNA/RNA (“CCCCAACGTGTG” is a string of letters where each single letter represents one of the four DNA nucleotide bases: Cytosine, Adenine, Thymine, and Guanine); or class attendance (“PPPPPAPP” represents a student’s attendance record over eight weeks as “Present” or “Absent”).

-
# How many nucleotides in the strand?
-str_length("CCCCAACGTGTG")
+
# How many nucleotides in the strand?
+str_length("CCCCAACGTGTG")
[1] 12
-
# How many weeks of attendance data?
-str_length("PPPPPAPP")
+
# How many weeks of attendance data?
+str_length("PPPPPAPP")
[1] 8
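Combining str_length() with str_count() gives proportions for such words. The sketch below uses a hypothetical attendance vector (the second record is invented for illustration) to compute the share of weeks each student was absent.

```r
library(stringr)

# Hypothetical attendance records: "P" = present, "A" = absent
attendance <- c("PPPPPAPP", "PAPAPPPP")

# Weeks absent divided by total weeks recorded, per student
prop_absent <- str_count(attendance, pattern = "A") / str_length(attendance)
prop_absent
```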
@@ -783,31 +809,31 @@

Trimming Strings

Trimming comes up for me most often when dealing with very long labels in ggplot figures. Sometimes a factor label is really long, and ggplot tries to fit the whole label in the figure, which can squeeze the plotting area and make the figure hard to read.

Here’s an example. I’m going to create a simple data set with one very long factor label.

-
bookPages_df <- tibble(
-  title = c("Germinal", "Frankenstein; or, The Modern Prometheus"),
-  author = c("Emile Zola", "Mary Shelley"),
-  pageCountOriginal = c(591L, 362L),
-  year = c(1885, 1818)
-)
-
-# Original
-ggplot(data = bookPages_df) + 
-  aes(x = year, y = pageCountOriginal, shape = title) + 
-  geom_point()
+
bookPages_df <- tibble(
+  title = c("Germinal", "Frankenstein; or, The Modern Prometheus"),
+  author = c("Emile Zola", "Mary Shelley"),
+  pageCountOriginal = c(591L, 362L),
+  year = c(1885, 1818)
+)
+
+# Original
+ggplot(data = bookPages_df) + 
+  aes(x = year, y = pageCountOriginal, shape = title) + 
+  geom_point()

Now I’m going to truncate the very long title of Frankenstein.

-
# Truncated text
-bookPages_df %>% 
-  mutate(
-    title = str_trunc(title, width = 15)
-  ) %>% 
-  ggplot() +
-    aes(x = year, y = pageCountOriginal, shape = title) + 
-    geom_point()
+
# Truncated text
+bookPages_df %>% 
+  mutate(
+    title = str_trunc(title, width = 15)
+  ) %>% 
+  ggplot() +
+    aes(x = year, y = pageCountOriginal, shape = title) + 
+    geom_point()
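To check what the truncated labels look like before plotting, here is a small sketch that applies str_trunc() to the two titles directly. The vector name titles is an assumption for this example. Note that the width counts the ellipsis too, so width = 15 keeps 12 characters plus "...".

```r
library(stringr)

# The two titles from bookPages_df, typed by hand
titles <- c("Germinal", "Frankenstein; or, The Modern Prometheus")

# Strings at or under the width are returned unchanged
short_titles <- str_trunc(titles, width = 15)
short_titles
```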

@@ -817,21 +843,21 @@

Trimming Strings

Padding Strings

Padding comes up when I’m trying to create file names on a computer. Here’s the issue:

-
1:11 %>% 
-  as.character() %>% 
-  sort()
+
1:11 %>% 
+  as.character() %>% 
+  sort()
 [1] "1"  "10" "11" "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9" 

When the computer turns numbers into characters, the numeric ordering is lost, because character strings are sorted alphabetically, one character at a time. I want 10 and 11 to come last, but the computer doesn’t interpret these values the way that I do. The solution is to pad the numbers on the left with “0” so that the ordering is preserved:

-
1:11 %>% 
-  as.character() %>% 
-  # Set the width to 2 digits if 99 is enough, but increase to 3 digits in case
-  #   I need to go past 99 (up to 999)
-  str_pad(width = 3, side = "left", pad = "0") %>% 
-  sort()
+
1:11 %>% 
+  as.character() %>% 
+  # Set the width to 2 digits if 99 is enough, but increase to 3 digits in case
+  #   I need to go past 99 (up to 999)
+  str_pad(width = 3, side = "left", pad = "0") %>% 
+  sort()
 [1] "001" "002" "003" "004" "005" "006" "007" "008" "009" "010" "011"
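Applied to the file-name use case, a minimal sketch might look like the following. The "sample_" prefix and ".csv" extension are invented for illustration, as are the object names.

```r
library(stringr)

ids <- 1:11

# Pad each ID to 3 characters, then glue on a hypothetical prefix and extension
padded_ids <- str_pad(as.character(ids), width = 3, side = "left", pad = "0")
file_names <- str_c("sample_", padded_ids, ".csv")

# Sorting the names as text now agrees with the numeric order of the IDs
sort(file_names)
```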
@@ -848,40 +874,40 @@

Modifying Strings in Tibbles

Example: Plotting Participant Heights

Here I show an entire (simplified) workflow to take in the height data and plot it by biological sex.

-
###  Create tidy data  ###
-heights_df <- tibble(
-  is_female = c(
-    rep(TRUE, length(heightsF_char)),
-    rep(FALSE, length(heightsM_char))
-  ),
-  heights = c(heightsF_char, heightsM_char)
-)
-
-###  Wrangle the Data  ###
-heightsClean_df <- 
-  heights_df %>% 
-  # Step 1: Split the Units into another column
-  mutate(
-    units = str_sub(heights, start = -2, end = -1)
-  ) %>% 
-  # Step 2: Extract the height values
-  mutate(
-    value = str_extract(heights, pattern = "\\d{2}")
-  ) %>% 
-  # Step 3: change heights from character to numeric
-  mutate(
-    value = as.numeric(value)
-  ) %>% 
-  # Step 4: remove the original column (check your work beforehand)
-  select(-heights) %>% 
-  # Step 5: rename
-  rename(height = value)
-
-###  Plot the Relationship  ###
-ggplot(data = heightsClean_df) + 
-  theme_classic() + 
-  aes(x = is_female, y = height) + 
-  geom_boxplot()
+
###  Create tidy data  ###
+heights_df <- tibble(
+  is_female = c(
+    rep(TRUE, length(heightsF_char)),
+    rep(FALSE, length(heightsM_char))
+  ),
+  heights = c(heightsF_char, heightsM_char)
+)
+
+###  Wrangle the Data  ###
+heightsClean_df <- 
+  heights_df %>% 
+  # Step 1: Split the Units into another column
+  mutate(
+    units = str_sub(heights, start = -2, end = -1)
+  ) %>% 
+  # Step 2: Extract the height values
+  mutate(
+    value = str_extract(heights, pattern = "\\d{2}")
+  ) %>% 
+  # Step 3: change heights from character to numeric
+  mutate(
+    value = as.numeric(value)
+  ) %>% 
+  # Step 4: remove the original column (check your work beforehand)
+  select(-heights) %>% 
+  # Step 5: rename
+  rename(height = value)
+
+###  Plot the Relationship  ###
+ggplot(data = heightsClean_df) + 
+  theme_classic() + 
+  aes(x = is_female, y = height) + 
+  geom_boxplot()

diff --git a/search.json b/search.json index 5070435..b1196ad 100644 --- a/search.json +++ b/search.json @@ -151,7 +151,7 @@ "href": "lessons/lesson10_stringr.html#counting-matches", "title": "Lesson 10: Wrangling Character Strings with stringr", "section": "Counting Matches", - "text": "Counting Matches\nIn the outcome_df data set, each symbol in the column usePatternUDS represents the patient status during a routine weekly clinic visit. The o symbol is used to represent a week when a clinical trial participant failed to visit the clinic for follow-up care. We can count how in many weeks each trial participant was missing (since this is an example, we will only look a the first 20 participants):\n\noutcome_df$usePatternUDS[1:20] %>% \n str_count(pattern = \"o\")\n\n [1] 15 7 21 3 15 14 20 25 22 3 6 6 11 19 25 6 0 25 7 25\n\n\n\n\n\n\n\n\nExercise\n\n\n\nMissing 3 clinic visits in a row is often a strong prognostic signal for a negative health outcome. Count the number of times per participant that the pattern “ooo” is seen. Use the first 20 patients only." + "text": "Counting Matches\nIn the outcome_df data set, each symbol in the column usePatternUDS represents the patient status during a routine weekly clinic visit. The o symbol is used to represent a week when a clinical trial participant failed to visit the clinic for follow-up care. 
We can count how in many weeks each trial participant was missing (since this is an example, we will only look a the first 20 participants):\n\n# Answer to exercise above, for you to confirm that you did it correctly:\noutcome_df <- \n outcomesCTN0094 %>% \n select(who, usePatternUDS, RsT_ctnNinetyFour_2023)\n\n# Inspect the data\noutcome_df\n\n# A tibble: 3,560 × 3\n who usePatternUDS RsT_ctnNinetyFour_2023\n \n 1 1 ooooooooooooooo 1\n 2 2 *---oo-o-o-o+oo 12\n 3 3 o-ooo-ooooooooooooooooo 7\n 4 4 -------------------o-o-o 21\n 5 5 ooooooooooooooo 1\n 6 6 *oooooooooooooo 1\n 7 7 ----oooooooooooooooooooo 5\n 8 8 ooooooooooooooooooooooooo 1\n 9 9 oooooooooooooooooooooo 1\n10 10 ------+++-++++o+++o+-o 11\n# ℹ 3,550 more rows\n\n# Look at the opioid use pattern of the first 20 participants\noutcome_df %>%\n slice(1:20) %>% \n pull(usePatternUDS) %>% \n str_count(pattern = \"o\")\n\n [1] 15 7 21 3 15 14 20 25 22 3 6 6 11 19 25 6 0 25 7 25\n\n\n\n\n\n\n\n\nExercise\n\n\n\nMissing 3 clinic visits in a row is often a strong prognostic signal for a negative health outcome. Count the number of times per participant that the pattern “ooo” is seen. Use the first 20 patients only." 
   },
   {
     "objectID": "lessons/lesson10_stringr.html#functions-to-know-1",
diff --git a/sitemap.xml b/sitemap.xml
index ced0ba8..4e15ecc 100644
--- a/sitemap.xml
+++ b/sitemap.xml
@@ -2,74 +2,74 @@
 https://gabrielodom.github.io/PHC6701_r4ds/index.html
- 2024-01-03T15:34:37.505Z
+ 2024-01-03T16:45:26.780Z
 https://gabrielodom.github.io/PHC6701_r4ds/about.html
- 2024-01-03T15:34:38.027Z
+ 2024-01-03T16:45:27.325Z
 https://gabrielodom.github.io/PHC6701_r4ds/lessons/lesson01_greater_data_science.html
- 2024-01-03T15:34:40.506Z
+ 2024-01-03T16:45:29.790Z
 https://gabrielodom.github.io/PHC6701_r4ds/lessons/lesson10_stringr.html
- 2024-01-03T15:34:45.629Z
+ 2024-01-03T16:45:34.994Z
 https://gabrielodom.github.io/PHC6701_r4ds/lessons/lesson07_lists_and_tibbles.html
- 2024-01-03T15:34:50.632Z
+ 2024-01-03T16:45:39.961Z
 https://gabrielodom.github.io/PHC6701_r4ds/lessons/lesson06_atomic_vectors.html
- 2024-01-03T15:34:53.957Z
+ 2024-01-03T16:45:43.306Z
 https://gabrielodom.github.io/PHC6701_r4ds/lessons/lesson04s_examples.html
- 2024-01-03T15:35:10.541Z
+ 2024-01-03T16:45:59.699Z
 https://gabrielodom.github.io/PHC6701_r4ds/lessons/lesson11_functions.html
- 2024-01-03T15:35:13.274Z
+ 2024-01-03T16:46:02.402Z
 https://gabrielodom.github.io/PHC6701_r4ds/lessons/lesson09s_dplyr_v_base.html
- 2024-01-03T15:35:17.787Z
+ 2024-01-03T16:46:06.810Z
 https://gabrielodom.github.io/PHC6701_r4ds/lessons/lesson04_ggplot.html
- 2024-01-03T15:35:28.711Z
+ 2024-01-03T16:46:17.637Z
 https://gabrielodom.github.io/PHC6701_r4ds/lessons/lesson12_purrr.html
- 2024-01-03T15:35:35.254Z
+ 2024-01-03T16:46:23.995Z
 https://gabrielodom.github.io/PHC6701_r4ds/lessons/lesson05_RStudio_projects.html
- 2024-01-03T15:35:37.803Z
+ 2024-01-03T16:46:26.624Z
 https://gabrielodom.github.io/PHC6701_r4ds/lessons/lesson02s_scripts.html
- 2024-01-03T15:35:39.256Z
+ 2024-01-03T16:46:28.002Z
 https://gabrielodom.github.io/PHC6701_r4ds/lessons/lesson08_Data_Read_Write.html
- 2024-01-03T15:35:43.952Z
+ 2024-01-03T16:46:32.454Z
 https://gabrielodom.github.io/PHC6701_r4ds/lessons/lesson13_eda_w_tidyverse.html
- 2024-01-03T15:35:51.889Z
+ 2024-01-03T16:46:40.431Z
 https://gabrielodom.github.io/PHC6701_r4ds/lessons/lesson02_introduction_to_R.html
- 2024-01-03T15:35:54.467Z
+ 2024-01-03T16:46:42.968Z
 https://gabrielodom.github.io/PHC6701_r4ds/lessons/lesson03_introduction_to_Quarto.html
- 2024-01-03T15:35:55.156Z
+ 2024-01-03T16:46:43.647Z
 https://gabrielodom.github.io/PHC6701_r4ds/lessons/lesson09_dplyr.html
- 2024-01-03T15:36:09.833Z
+ 2024-01-03T16:46:57.965Z