epiverse-trace · avallecam · Jun 17, 2024 · Jun 3, 2024 · Jun 12, 2024 · Jun 12, 2024
diff --git a/config.yaml b/config.yaml
@@ -11,7 +11,7 @@
 carpentry: 'incubator'
 
 # Overall title for pages.
-title: 'Using delays to quantify transmission'
+title: 'Read and clean case data, and make a linelist for outbreak analytics with R'
 
 # Date the lesson was created (YYYY-MM-DD, this is empty by default)
 created:

diff --git a/episodes/clean-data.Rmd b/episodes/clean-data.Rmd
@@ -15,11 +15,33 @@
 
 - Explain how to clean, curate, and standardize case data using `{cleanepi}` package
 - Demonstrate how to convert case data to `linelist` data
+- Perform essential data-cleaning operations on a raw case dataset.
 
 ::::::::::::::::::::::::::::::::::::::::::::::::
 
+::::::::::::::::::::: prereq
+
+This episode requires you to:
+
+- Download the [simulated_ebola_2.csv](https://epiverse-trace.github.io/tutorials-early/data/simulated_ebola_2.csv) file.
+- Save it in the `data/` folder.
+
+:::::::::::::::::::::
+
 ## Introduction
-In the process of analyzing outbreak data, it's essential to ensure that the dataset is clean, curated, standardized, and validate to facilitate accurate and reproducible analysis. This episode focuses on cleaning epidemics and outbreaks data using the [cleanepi](https://epiverse-trace.github.io/cleanepi/) package, and validate it using the [linelist](https://epiverse-trace.github.io/linelist/) package. For demonstration purposes, we'll work with a simulated dataset of Ebola cases.
+In the process of analyzing outbreak data, it's essential to ensure that the dataset is clean, curated, standardized, and valid to facilitate accurate and reproducible analysis. This episode focuses on cleaning epidemic and outbreak data using the [cleanepi](https://epiverse-trace.github.io/cleanepi/) package, and validating it using the [linelist](https://epiverse-trace.github.io/linelist/) package. For demonstration purposes, we'll work with a simulated dataset of Ebola cases.
+
+::::::::::::::::::: checklist
+
+### The double-colon
+
+The double-colon `::` in R lets you call a specific function from a package without attaching the entire package with `library()`.
+
+For example, `dplyr::filter(data, condition)` uses `filter()` from the `{dplyr}` package.
+
+This helps us remember which package each function comes from and avoids namespace conflicts.
+
+:::::::::::::::::::
 
 
 The first step is to import the dataset following the guidelines outlined in the [Read case data](../episodes/read-cases.Rmd) episode. This involves loading the dataset into our environment and view its structure and content. 
@@ -30,9 +52,9 @@
 library("here")
 
 # Read data
-# e.g.: if path to file is data/raw-data/simulated_ebola_2.csv then:
+# e.g.: if path to file is data/simulated_ebola_2.csv then:
 raw_ebola_data <- rio::import(
-  here::here("data", "raw-data", "simulated_ebola_2.csv")
+  here::here("data", "simulated_ebola_2.csv")
 )
 ```
 
@@ -83,14 +105,18 @@
 
 If you want to maintain certain column names without subjecting them to the standardization process, you can utilize the `keep` parameter of the `standardize_column_names()` function. This parameter accepts a vector of column names that are intended to be kept unchanged.
 
-**Exercise:** Standardize the column names of the input dataset, but keep the “V1” column as is.
+::::::::::::::::::::::::::::::::::::: challenge
+
+Standardize the column names of the input dataset, but keep the “V1” column as it is.
+
+::::::::::::::::::::::::::::::::::::::::::::::::
 
 ### Removing irregularities
 
 Raw data may contain irregularities such as duplicated and empty rows and columns, as well as constant columns. `remove_duplicates` and `remove_constants` functions from `{cleanepi}`  remove such irregularities as demonstrated in the below code chunk. 
 
 ```{r}
-sim_ebola_data <- cleanepi::remove_constant(sim_ebola_data)
+sim_ebola_data <- cleanepi::remove_constants(sim_ebola_data)
 sim_ebola_data <- cleanepi::remove_duplicates(sim_ebola_data)
 ```
 
@@ -101,24 +127,24 @@
 In addition to the irregularities, raw data can contain missing values that may be encoded by different strings, including the empty string. To ensure robust analysis, it is good practice to replace all missing values with `NA` in the entire dataset. Below is a code snippet demonstrating how you can achieve this in `{cleanepi}`:
 
 ```{r}
-sim_ebola_data <- cleanepi::replace_missing_values(sim_ebola_data)
+sim_ebola_data <- cleanepi::replace_missing_values(
+  data = sim_ebola_data,
+  na_strings = ""
+)
 ```
 
 ### Validating subject IDs
 
 Each entry in the dataset represents a subject and should be distinguishable by a specific column formatted in a particular way, such as falling within a specified range, containing certain prefixes and/or suffixes, containing a specific number of characters. The `{cleanepi}` package offers the `check_subject_ids` function designed precisely for this task as shown in the below code chunk. This function validates whether they are unique and meet the required criteria.
 
-```{r}
-# remove this chunk code once {cleanepi} is updated.
-# The coercion made here will be accounted for within {cleanepi}
-sim_ebola_data$case_id <- as.character(sim_ebola_data$case_id)
-```
 
 ```{r}
-sim_ebola_data <- cleanepi::check_subject_ids(sim_ebola_data,
-  target_columns = "case_id",
-  range = c(0, 15000)
-)
+sim_ebola_data <-
+  cleanepi::check_subject_ids(
+    data = sim_ebola_data,
+    target_columns = "case_id",
+    range = c(0, 15000)
+  )
 ```
 
 Note that our simulated dataset does contain duplicated subject IDs.
@@ -168,8 +194,7 @@
  ```{r, warning=FALSE}
 sim_ebola_data <- cleanepi::check_date_sequence(
   data = sim_ebola_data,
-  target_columns = c("date_onset", "date_sample"),
-  remove = TRUE
+  target_columns = c("date_onset", "date_sample")
 )
  ```
 
@@ -212,7 +237,7 @@
  until the date this document was generated (`r Sys.Date()`).
 
 ```{r}
-sim_ebola_data <- cleanepi::span(
+sim_ebola_data <- cleanepi::timespan(
   sim_ebola_data,
   target_column = "date_sample",
   end_date = Sys.Date(),
@@ -234,13 +259,11 @@
 
 Furthermore, you can combine multiple data cleaning tasks via the pipe operator `|>`, as shown in the code snippet below.
 ```{r}
-# remove the line below once Karim has updated cleanepi
-raw_ebola_data$`case id` <- as.character(raw_ebola_data$`case id`)
 # PERFORM THE OPERATIONS USING THE pipe SYNTAX
 cleaned_data <- raw_ebola_data |>
   cleanepi::standardize_column_names(keep = "V1", rename = NULL) |>
-  cleanepi::replace_missing_values(target_columns = NULL) |>
-  cleanepi::remove_constant(cutoff = 1.0) |>
+  cleanepi::replace_missing_values(na_strings = "") |>
+  cleanepi::remove_constants(cutoff = 1.0) |>
   cleanepi::remove_duplicates(target_columns = NULL) |>
   cleanepi::standardize_dates(
     target_columns = c("date_onset", "date_sample"),
@@ -266,7 +289,7 @@

 You can view the report using `cleanepi::print_report()` function. 

 ![Example of data cleaning report generated by `{cleanepi}`](fig/report_demo.png)

 ## Validating and tagging case data
 In outbreak analysis, once you have completed the initial steps of reading and cleaning the case data,
@@ -280,7 +303,8 @@
 
 ```{r,warning=FALSE}
 library("linelist")
-data <- linelist::make_linelist(cleaned_data,
+data <- linelist::make_linelist(
+  x = cleaned_data,
   id = "case_id",
   age = "age",
   date_onset = "date_onset",
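
Note on the `clean-data.Rmd` changes: the three basic steps touched by this diff (replacing missing-value strings, removing constant columns, removing duplicated rows) can be illustrated conceptually with base R alone. The sketch below is only an illustration under stated assumptions: the `toy` data frame and its columns are hypothetical, and the code is not how `{cleanepi}` implements `replace_missing_values()`, `remove_constants()`, or `remove_duplicates()`.

```r
# Hypothetical toy data: "" encodes a missing value, `country` is constant,
# and rows 1 and 3 are exact duplicates of each other.
toy <- data.frame(
  case_id = c("1", "2", "1", "4"),
  sex     = c("m", "", "m", "f"),
  country = c("X", "X", "X", "X"),
  stringsAsFactors = FALSE
)

# 1. Replace the empty string with NA everywhere
#    (conceptually what `na_strings = ""` asks for)
toy[toy == ""] <- NA

# 2. Drop columns that hold a single distinct value
is_constant <- vapply(
  toy,
  function(column) length(unique(column)) <= 1,
  logical(1)
)
toy <- toy[, !is_constant, drop = FALSE]

# 3. Drop rows that duplicate an earlier row
toy <- toy[!duplicated(toy), , drop = FALSE]

print(toy)
```

The order mirrors the piped version in the episode (missing values first, then constants, then duplicates), so that a column made of `""` and one other value is judged after the empty strings have become `NA`.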

diff --git a/episodes/delays-functions.Rmd b/episodes/delays-functions.Rmd
@@ -90,9 +90,11 @@
 
 ### The double-colon
 
-The double-colon `::` in R is used to access functions or objects from a specific package without loading the entire package into the current environment. This allows for a more targeted approach to using package components and helps avoid namespace conflicts.
+The double-colon `::` in R lets you call a specific function from a package without attaching the entire package with `library()`.
 
-`::` lets you call a specific function from a package by explicitly mentioning the package name. For example, `dplyr::filter(data, condition)` uses `filter()` from the `{dplyr}` package without loading the entire package.
+For example, `dplyr::filter(data, condition)` uses `filter()` from the `{dplyr}` package.
+
+This helps us remember which package each function comes from and avoids namespace conflicts.
 
 :::::::::::::::::::
 
@@ -111,7 +113,7 @@

 If you need it, read in detail about the [R probability functions for the normal distribution](https://sakai.unc.edu/access/content/group/3d1eb92e-7848-4f55-90c3-7c72a54e7e43/public/docs/lectures/lecture13.htm#probfunc), each of its definitions and identify in which part of a distribution they are located!

 ![The four probability functions for the normal distribution ([Jack Weiss, 2012](https://sakai.unc.edu/access/content/group/3d1eb92e-7848-4f55-90c3-7c72a54e7e43/public/docs/lectures/lecture13.htm#probfunc))](fig/fig5a-normaldistribution.png)

 ::::::::::::::::::::

@@ -150,7 +152,7 @@
 
 ::::::::::::::::::::::::::::::::: challenge
 
-### Window for contact tracing and the Serial interval
+### Window for contact tracing and the serial interval
 
 The **serial interval** is important in the optimisation of contact tracing since it provides a time window for the containment of a disease spread ([Fine, 2003](https://academic.oup.com/aje/article/158/11/1039/162725)). Depending on the serial interval, we can evaluate the need to expand the number of days pre-onset to consider in the contact tracing to include more backwards contacts ([Davis et al., 2020](https://assets.publishing.service.gov.uk/media/61e9ab3f8fa8f50597fb3078/S0523_Oxford_-_Backwards_contact_tracing.pdf)).
 
@@ -247,7 +249,7 @@
 
 ::::::::::::::::::::::::::::::::: challenge
 
-### Length of quarantine and Incubation period
+### Length of quarantine and incubation period
 
 The **incubation period** distribution is a useful delay to assess the length of active monitoring or quarantine ([Lauer et al., 2020](https://www.acpjournals.org/doi/10.7326/M20-0504)). Similarly, delays from symptom onset to recovery (or death) will determine the required duration of health care and case isolation ([Cori et al., 2017](https://royalsocietypublishing.org/doi/10.1098/rstb.2016.0371)).
 
@@ -406,7 +408,7 @@
 
 ::::::::::::::::::::::::::::::::: challenge
 
-### Use an Incubation period for COVID-19 to estimate Rt
+### Use an incubation period for COVID-19 to estimate Rt
 
 Estimate the time-varying reproduction number for the first 60 days of the `example_confirmed` data set from `{EpiNow2}`. Access to an incubation period for COVID-19 from `{epiparameter}` to use it as a reporting delay.
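
Note on the `delays-functions.Rmd` context: the episode points readers to the four probability functions for the normal distribution. As a minimal sketch, the base R `{stats}` functions `dnorm()`, `pnorm()`, `qnorm()`, and `rnorm()` can be tried on a standard normal distribution. The choice of `mean = 0`, `sd = 1`, and the seed are assumptions made only for illustration.

```r
# Density: height of the curve at x = 0
density_at_zero <- stats::dnorm(0, mean = 0, sd = 1)

# Cumulative distribution: probability of observing a value <= 0
cumulative_at_zero <- stats::pnorm(0, mean = 0, sd = 1)

# Quantile: value below which 50% of the distribution lies (the median)
median_value <- stats::qnorm(0.5, mean = 0, sd = 1)

# Random generation: five draws (seed fixed only for reproducibility)
set.seed(1)
random_draws <- stats::rnorm(5, mean = 0, sd = 1)

print(c(density_at_zero, cumulative_at_zero, median_value))
print(random_draws)
```

The same `d`, `p`, `q`, `r` prefixes apply to other base R distributions, for example `dgamma()` or `plnorm()`.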
 

diff --git a/episodes/delays-reuse.Rmd b/episodes/delays-reuse.Rmd
@@ -35,7 +35,7 @@

 Infectious diseases follow an infection cycle, which usually includes the following phases: presymptomatic period, symptomatic period and recovery period, as described by their [natural history](../learners/reference.md#naturalhistory). These time periods can be used to understand transmission dynamics and inform disease prevention and control interventions.

 ![Definition of key time periods. From [Xiang et al, 2021](https://www.sciencedirect.com/science/article/pii/S2468042721000038)](fig/time-periods.jpg)


 ::::::::::::::::: callout
@@ -61,6 +61,19 @@
 library(tidyverse)
 ```
 
+::::::::::::::::::: checklist
+
+### The double-colon
+
+The double-colon `::` in R lets you call a specific function from a package without attaching the entire package with `library()`.
+
+For example, `dplyr::filter(data, condition)` uses `filter()` from the `{dplyr}` package.
+
+This helps us remember which package each function comes from and avoids namespace conflicts.
+
+:::::::::::::::::::
+
+
 ## The problem
 
 If we want to estimate the transmissibility of an infection, it's common to use a package such as `{EpiEstim}` or `{EpiNow2}`. However, both require some epidemiological information as an input. For example, in `{EpiNow2}` we use `EpiNow2::dist_spec()` to specify a [generation time](../learners/reference.md#generationtime) as a probability `distribution` adding its `mean`, standard deviation (`sd`), and maximum value (`max`). To specify a `generation_time` that follows a _Gamma_ distribution with mean $\mu = 4$, standard deviation $\sigma = 2$, and a maximum value of 20, we write:
@@ -99,12 +112,12 @@

 The generation time, jointly with the reproduction number ($R$), provide valuable insights on the strength of transmission and inform the implementation of control measures. Given a $R>1$, the shorter the generation time, the earlier the incidence of disease cases will grow.

 ![Video from the MRC Centre for Global Infectious Disease Analysis, Ep 76. Science In Context - Epi Parameter Review Group with Dr Anne Cori (27-07-2023) at <https://youtu.be/VvpYHhFDIjI?si=XiUyjmSV1gKNdrrL>](fig/reproduction-generation-time.png)

 In calculating the effective reproduction number ($R_{t}$), the *generation time* distribution is often approximated by the [serial interval](../learners/reference.md#serialinterval) distribution.
 This frequent approximation is because it is easier to observe and measure the onset of symptoms than the onset of infectiousness.

 ![A schematic of the relationship of different time periods of transmission between an infector and an infectee in a transmission pair. Exposure window is defined as the time interval having viral exposure, and transmission window is defined as the time interval for onward transmission with respect to the infection time ([Chung Lau et al., 2021](https://academic.oup.com/jid/article/224/10/1664/6356465)).](fig/serial-interval-observed.jpeg)

 However, using the *serial interval* as an approximation of the *generation time* is primarily valid for diseases in which infectiousness starts after symptom onset ([Chung Lau et al., 2021](https://academic.oup.com/jid/article/224/10/1664/6356465)). In cases where infectiousness starts before symptom onset, the serial intervals can have negative values, which is the case for diseases with pre-symptomatic transmission ([Nishiura et al., 2020](https://www.ijidonline.com/article/S1201-9712(20)30119-3/fulltext#gr2)).

@@ -116,13 +129,13 @@

 When we calculate the *serial interval*, we see that not all case pairs have the same time length. We will observe this variability for any case pair and individual time period, including the [incubation period](../learners/reference.md#incubation) and [infectious period](../learners/reference.md#infectiousness).

 ![Serial intervals of possible case pairs in (a) COVID-19 and (b) MERS-CoV. Pairs represent a presumed infector and their presumed infectee plotted by date of symptom onset ([Althobaity et al., 2022](https://www.sciencedirect.com/science/article/pii/S2468042722000537#fig6)).](fig/serial-interval-pairs.jpg)

 To summarise these data from individual and pair time periods, we can find the **statistical distributions** that best fit the data ([McFarland et al., 2023](https://www.eurosurveillance.org/content/10.2807/1560-7917.ES.2023.28.27.2200806)).

 <!-- add a reference about good practices to estimate distributions -->

 ![Fitted serial interval distribution for (a) COVID-19 and (b) MERS-CoV based on reported transmission pairs in Saudi Arabia. We fitted three commonly used distributions, Lognormal, Gamma, and Weibull distributions, respectively ([Althobaity et al., 2022](https://www.sciencedirect.com/science/article/pii/S2468042722000537#fig5)).](fig/seria-interval-fitted-distributions.jpg)

 Statistical distributions are summarised in terms of their **summary statistics** like the *location* (mean and percentiles) and *spread* (variance or standard deviation) of the distribution, or with their **distribution parameters** that inform about the *form* (shape and rate/scale) of the distribution. These estimated values can be reported with their **uncertainty** (95% confidence intervals).

@@ -141,7 +154,7 @@
 | MERS-CoV | 14.08(13.1–15.2) | 2.58(2.50–2.68) | 0.44(0.39–0.5) |
 | COVID-19 | 5.2(4.2–6.5) | 1.45(1.31–1.61) | 0.63(0.54–0.74) |
 
-Table: Serial interval estimates using Gamma, Weibull, and Log normal distributions. 95% confidence intervals for the shape and scale (logmean and sd for Log normal) parameters are shown in brackets ([Althobaity et al., 2022](https://www.sciencedirect.com/science/article/pii/S2468042722000537#tbl3)).
+Table: Serial interval estimates using Gamma, Weibull, and Log Normal distributions. 95% confidence intervals for the shape and scale (logmean and sd for Log Normal) parameters are shown in brackets ([Althobaity et al., 2022](https://www.sciencedirect.com/science/article/pii/S2468042722000537#tbl3)).
 
 :::::::::::::::::::::::::
 
@@ -151,12 +164,12 @@
 
 Assume that COVID-19 and SARS have similar reproduction number values and that the serial interval approximates the generation time. 
 
-Given the Serial interval of both infections in the figure below: 
+Given the serial interval of both infections in the figure below: 
 
 - Which one would be harder to control? 
 - Why do you conclude that?

 ![Serial interval of novel coronavirus (COVID-19) infections overlaid with a published distribution of SARS. ([Nishiura et al., 2020](https://www.ijidonline.com/article/S1201-9712(20)30119-3/fulltext))](fig/serial-interval-covid-sars.jpg)

 ::::::::::::::::: hint

@@ -251,7 +264,7 @@
 
 ::::::::::::::::: spoiler
 
-### Why do we have a 'NA' entry?
+### Why do we have an 'NA' entry?
 
 Entries with a missing value (`<NA>`) in the `prob_distribution` column are *non-parameterised* entries. They have summary statistics but no probability distribution. Compare these two outputs:
 
@@ -633,7 +646,7 @@
 
 ::::::::::::::::: discussion
 
-### The distribution Zoo
+### The distribution zoo
 
 Explore this shinyapp called **The Distribution Zoo**!
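
Note on the `delays-reuse.Rmd` context: the episode distinguishes summary statistics (mean, standard deviation) from distribution parameters (shape and rate/scale), and uses a Gamma generation time with mean 4, standard deviation 2, and maximum 20 as its example. The sketch below shows the standard moment-matching conversion for a Gamma distribution using base R only. The variable names are hypothetical, and this is not a call to `{EpiNow2}` or `{epiparameter}`.

```r
# Summary statistics taken from the episode's example
gamma_mean <- 4
gamma_sd <- 2

# Moment matching for a Gamma distribution:
#   mean = shape / rate,  variance = shape / rate^2
gamma_shape <- gamma_mean^2 / gamma_sd^2
gamma_rate <- gamma_mean / gamma_sd^2

# Check: the parameters reproduce the summary statistics
recovered_mean <- gamma_shape / gamma_rate
recovered_sd <- sqrt(gamma_shape) / gamma_rate

# Probability mass that lies beyond the maximum value of 20
mass_beyond_max <- stats::pgamma(
  20,
  shape = gamma_shape,
  rate = gamma_rate,
  lower.tail = FALSE
)

print(c(shape = gamma_shape, rate = gamma_rate))
print(mass_beyond_max)
```

With these inputs very little probability mass lies above 20, which suggests that truncating at that maximum discards almost nothing from the distribution.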
 

diff --git a/episodes/describe-cases.Rmd b/episodes/describe-cases.Rmd
@@ -1,5 +1,5 @@
 ---
-title: 'Aggregate and visulaize'
+title: 'Aggregate and visualize'
 teaching: 20
 exercises: 10
 ---
@@ -29,6 +29,19 @@ packages. A key observation in EDA of epidemic analysis is capturing the relatio
 reported cases, spanning various categories (confirmed, hospitalized, deaths, and recoveries), locations, and other 
 demographic factors such as gender, age, etc.  
 
+ ::::::::::::::::::: checklist
+
+### The double-colon
+
+The double-colon `::` in R lets you call a specific function from a package without attaching the entire package with `library()`.
+
+For example, `dplyr::filter(data, condition)` uses `filter()` from the `{dplyr}` package.
+
+This helps us remember which package each function comes from and avoids namespace conflicts.
+
+:::::::::::::::::::
+
+
 ## Synthetic outbreak data
 
 To illustrate the process of conducting EDA on outbreak data, we will generate a line list 
@@ -101,8 +114,6 @@ linelist <- simulist::sim_linelist(
   hosp_death_risk = 0.5,
   non_hosp_death_risk = 0.05,
   outbreak_start_date = as.Date("2023-01-01"),
-  add_names = TRUE,
-  add_ct = TRUE,
   outbreak_size = c(1000, 10000),
   population_age = c(1, 90),
   case_type_probs = c(suspected = 0.2, probable = 0.1, confirmed = 0.7),
@@ -166,7 +177,8 @@ dialy_incidence_data_2 <- incidence2::incidence(
 )
 
 # Complete missing dates in the incidence object
-incidence2::complete_dates(dialy_incidence_data_2,
+incidence2::complete_dates(
+  x = dialy_incidence_data_2,
   expand = TRUE,
   fill = 0L, by = 1L,
   allow_POSIXct = FALSE
@@ -186,21 +198,25 @@ library("ggplot2")
 library("tracetheme")
 
 # Plot daily incidence data
-base::plot(dialy_incidence_data) + ggplot2::labs(
-  x = "Time (in days)",
-  y = "Dialy cases"
-) + tracetheme::theme_trace()
+base::plot(dialy_incidence_data) +
+  ggplot2::labs(
+    x = "Time (in days)",
+    y = "Daily cases"
+  ) +
+  tracetheme::theme_trace()
 ``` 
 
 
 ```{r, message=FALSE, warning=FALSE}
 
 # Plot weekly incidence data
 
-base::plot(weekly_incidence_data) + ggplot2::labs(
-  x = "Time (in days)",
-  y = "weekly cases"
-) + tracetheme::theme_trace()
+base::plot(weekly_incidence_data) +
+  ggplot2::labs(
+    x = "Time (in days)",
+    y = "weekly cases"
+  ) +
+  tracetheme::theme_trace()
 ``` 
 
 ::::::::::::::::::::::::::::::::::::: challenge 
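
Note on the `describe-cases.Rmd` changes: the reformatted `incidence2::complete_dates()` call fills in dates that have no reported cases. As a rough conceptual sketch of that idea in base R only (the onset dates are hypothetical, and this is not how `{incidence2}` implements the function), daily counts can be completed with zeros like this:

```r
# Hypothetical onset dates with gaps on 2 Jan, 4 Jan and 5 Jan
onset_dates <- as.Date(
  c("2023-01-01", "2023-01-01", "2023-01-03", "2023-01-06")
)

# Every calendar day between the first and last observed date
all_days <- seq(min(onset_dates), max(onset_dates), by = "day")

# Count cases per day; days without cases get a count of zero because
# they are present as factor levels even though no observation matches
daily_counts <- table(
  factor(as.character(onset_dates), levels = as.character(all_days))
)

completed_incidence <- data.frame(
  date = all_days,
  count = as.integer(daily_counts)
)

print(completed_incidence)
```

Filling the gaps with explicit zeros matters for plotting and for downstream estimation, since a missing row and a day with zero cases are otherwise indistinguishable.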

diff --git a/episodes/quantify-transmissibility.Rmd b/episodes/quantify-transmissibility.Rmd
@@ -61,9 +61,11 @@ library(tidyverse)
 
 ### The double-colon
 
-The double-colon `::` in R is used to access functions or objects from a specific package without loading the entire package into the current environment. This allows for a more targeted approach to using package components and helps avoid namespace conflicts.
+The double-colon `::` in R lets you call a specific function from a package without attaching the entire package with `library()`.
 
-`::` lets you call a specific function from a package by explicitly mentioning the package name. For example, `dplyr::filter(data, condition)` uses `filter()` from the `{dplyr}` package without loading the entire package.
+For example, `dplyr::filter(data, condition)` uses `filter()` from the `{dplyr}` package.
+
+This helps us remember which package each function comes from and avoids namespace conflicts.
 
 :::::::::::::::::::
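
Note on the double-colon checklist repeated across these episodes: a short sketch of the behaviour it describes, using only packages shipped with R. `{tools}` is not attached in a default session, so it stands in for any package you have not called `library()` on. The local `filter()` helper is hypothetical and exists only to show a name clash.

```r
# Call one function from {tools} without attaching the package
file_extension <- tools::file_ext("simulated_ebola_2.csv")

# {tools} is still absent from the search path after the call
tools_is_attached <- "package:tools" %in% search()

# A hypothetical local helper that reuses the name `filter`
filter <- function(x) x[x > 2]

# The unqualified name finds the local helper ...
local_result <- filter(1:5)

# ... while the qualified name still reaches the {stats} moving-average filter
stats_result <- stats::filter(1:5, rep(1 / 3, 3))

print(file_extension)
print(tools_is_attached)
print(local_result)
```

Writing the package prefix makes each call unambiguous, which is the reason the lessons use `package::function()` throughout.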