Fix read describe cases #39

Merged
merged 67 commits on Apr 29, 2024
Changes from 56 commits
Commits
17b8d2e
Fix render issues in read-cases file
Degoot-AM Mar 27, 2024
9e8a0d7
Adding early simple analysis episode
Degoot-AM Mar 27, 2024
c7118ef
Reading citation file via file.path command to satisfy lintr requirem…
Degoot-AM Mar 27, 2024
9da41f7
Adding peak time section
Degoot-AM Mar 27, 2024
96a3a12
Adding moving averge section
Degoot-AM Mar 27, 2024
2822131
Adding data cleaning episode
Degoot-AM Mar 28, 2024
91a3052
adding simulated Ebola cases
Degoot-AM Mar 28, 2024
3bf2fc9
adding skim section
Degoot-AM Mar 28, 2024
f5fc027
the cleanepi problem
Degoot-AM Mar 28, 2024
9ec8cf7
importing packages quietly
Degoot-AM Apr 1, 2024
e016c42
Merge branch 'main' into fix-read-describe-cases
Degoot-AM Apr 1, 2024
2cc9158
add colunm name stuff
Degoot-AM Apr 3, 2024
e3c92cb
replacing missing values
Degoot-AM Apr 4, 2024
11562f2
add converting to numeric values
Degoot-AM Apr 4, 2024
38cb2be
add sectioning
Degoot-AM Apr 4, 2024
3d23a96
update episodes/clean-data.Rmd
Karim-Mane Apr 5, 2024
87f9aca
update episodes/describe-cases.Rmd
Karim-Mane Apr 5, 2024
c900566
update episodes/read-cases.Rmd
Karim-Mane Apr 5, 2024
48c4bcb
update episodes/simple-analysis.Rmd
Karim-Mane Apr 5, 2024
275ac5a
fix linters
Karim-Mane Apr 5, 2024
cb586d2
linting all files
Degoot-AM Apr 5, 2024
c6b77b9
add dictionary_based substitution section
Degoot-AM Apr 6, 2024
7ad7914
Add time span section
Degoot-AM Apr 6, 2024
bc46bc8
add check sequence of dates section
Degoot-AM Apr 6, 2024
4e14722
add multiple operations at once section.
Degoot-AM Apr 7, 2024
afb867e
review episodes/clean-data.Rmd
Karim-Mane Apr 8, 2024
8f67960
fix linters
Karim-Mane Apr 8, 2024
16e513a
add reporting sections
Degoot-AM Apr 8, 2024
8326df6
add linelist section.
Degoot-AM Apr 8, 2024
d392b6f
add print report
Degoot-AM Apr 9, 2024
1158b34
add EDA
Degoot-AM Apr 9, 2024
c1985c7
add simulation data
Degoot-AM Apr 9, 2024
334db52
add synthetic data
Degoot-AM Apr 9, 2024
230fc5f
add tracetheme
Degoot-AM Apr 9, 2024
5836db8
add key points section
Degoot-AM Apr 9, 2024
de6cbe9
add questions section
Degoot-AM Apr 9, 2024
7a50094
add epi-curves
Degoot-AM Apr 9, 2024
d84cc5a
add complete data
Degoot-AM Apr 9, 2024
5eaa916
change title
Degoot-AM Apr 16, 2024
3b0d452
Update episodes/read-cases.Rmd
Degoot-AM Apr 16, 2024
fb4ceb4
Update episodes/read-cases.Rmd
Degoot-AM Apr 16, 2024
4f6b53f
Update episodes/read-cases.Rmd
Degoot-AM Apr 16, 2024
ab543ae
Update episodes/read-cases.Rmd
Degoot-AM Apr 16, 2024
88a98f0
Update episodes/read-cases.Rmd
Degoot-AM Apr 16, 2024
17e1680
Update episodes/read-cases.Rmd
Degoot-AM Apr 16, 2024
6b42190
Update episodes/read-cases.Rmd
Degoot-AM Apr 16, 2024
0490700
replace library
Degoot-AM Apr 16, 2024
c3ac84b
Update episodes/read-cases.Rmd
Degoot-AM Apr 16, 2024
cf31e1b
fix here issues
Degoot-AM Apr 16, 2024
4a30890
fix issues
Degoot-AM Apr 16, 2024
f9d6e58
fix lintr issues
Degoot-AM Apr 16, 2024
91b5c36
replace laodnamespace with library
Degoot-AM Apr 17, 2024
5fe1350
replace library
Degoot-AM Apr 17, 2024
b792a69
add keypoints section
Degoot-AM Apr 17, 2024
3d44420
change early to simple
Degoot-AM Apr 17, 2024
5fc05de
adjust times for teaching and exercises
Degoot-AM Apr 18, 2024
486775b
Update episodes/describe-cases.Rmd
Degoot-AM Apr 24, 2024
9d5661b
Update episodes/read-cases.Rmd
Degoot-AM Apr 24, 2024
62b7b45
Update episodes/read-cases.Rmd
Degoot-AM Apr 24, 2024
aa86824
Update episodes/clean-data.Rmd
Degoot-AM Apr 24, 2024
42a74eb
Update episodes/clean-data.Rmd
Degoot-AM Apr 24, 2024
d9e1738
separate dev denvironment from learning environment
Degoot-AM Apr 25, 2024
05228be
added zip file
Degoot-AM Apr 25, 2024
65f5d3d
fix zip data issues
Degoot-AM Apr 26, 2024
0faddc4
add iinline spaces and explanations
Degoot-AM Apr 26, 2024
e5c9a43
fix linter issue
Degoot-AM Apr 26, 2024
1f956a0
fix error messages in preliminary rendering
avallecam Apr 29, 2024
4 changes: 4 additions & 0 deletions .lintr
@@ -12,5 +12,9 @@ linters: all_linters(
default_undesirable_functions,
library = NULL # this is fine in Rmd files
)
),
unused_import_linter(
allow_ns_usage = TRUE,
except_packages = c("bit64", "data.table", "tidyverse")
)
)
8 changes: 4 additions & 4 deletions config.yaml
@@ -59,10 +59,10 @@ contact: '[email protected]'

# Order of episodes in your lesson
episodes:
#- read-cases.Rmd
#- clean-data.Rmd
#- describe-cases.Rmd
#- simple-analysis.Rmd
- read-cases.Rmd
- clean-data.Rmd
- describe-cases.Rmd
- simple-analysis.Rmd
- delays-reuse.Rmd
- delays-functions.Rmd

272 changes: 257 additions & 15 deletions episodes/clean-data.Rmd
@@ -1,7 +1,7 @@
---
title: 'Clean outbreaks data'
teaching: 10
exercises: 2
title: 'Clean and validate'
teaching: 20
exercises: 10
---

:::::::::::::::::::::::::::::::::::::: questions
@@ -13,30 +13,272 @@

::::::::::::::::::::::::::::::::::::: objectives

- Explain how clean, curate, and standardize case data using `{cleanepi}` package
- Demonstrate how to covert case data to a `linelist` object
- Explain how to clean, curate, and standardize case data using the `{cleanepi}` package
- Demonstrate how to convert case data to `linelist` data

::::::::::::::::::::::::::::::::::::::::::::::::

## Introduction
In the process of analyzing outbreak data, it's essential to ensure that the dataset is clean, curated, and standardized to facilitate accurate and reproducible analysis. To achieve this, we will utilize the [cleanepi](https://epiverse-trace.github.io/cleanepi/) package. For demonstration purposes, we'll work with a simulated dataset of Ebola cases.
In the process of analyzing outbreak data, it's essential to ensure that the dataset is clean, curated, standardized, and validated to facilitate accurate and reproducible analysis. This episode focuses on cleaning epidemic and outbreak data using the [cleanepi](https://epiverse-trace.github.io/cleanepi/) package, and validating it using the [linelist](https://epiverse-trace.github.io/linelist/) package. For demonstration purposes, we'll work with a simulated dataset of Ebola cases.


The first step is to import the dataset following the guidelines outlined in the [Read case data](../episodes/read-cases.Rmd) episode. This involves loading the dataset into our environment and viewing its structure and content.

```{r, message=FALSE}
library("rio")
library("here")
raw_ebola_data <- rio::import(
here::here("built", "data", "simulated_ebola_2.csv")
)
utils::head(raw_ebola_data, 5)
```

## A quick inspection

Quick exploration and inspection of the dataset are crucial before diving into any analysis tasks. The `{cleanepi}` package simplifies this process with the `scan_data()` function. Let's take a look at how you can use it:

```{r}
requireNamespace("rio")
sim_ebola_data <- rio::import(file.path("data", "simulated_ebola.csv",
fsep = "/"))
utils::head(sim_ebola_data, 5)
library("cleanepi")
cleanepi::scan_data(raw_ebola_data)
```


The results provide an overview of the content of every column, including the column names and the percentage of each data type per column.
You can see that the column names in the dataset are descriptive but lack consistency, as some are composed of multiple words separated by white spaces. Additionally, some columns contain more than one data type, and there are missing values in others.

## Common operations

This section demonstrates how to perform some common data cleaning operations using the `{cleanepi}` package.

### Standardizing column names

For this example dataset, standardizing column names typically involves removing spaces and connecting different words with “_”. This practice helps maintain consistency and readability in the dataset.
However, the function used for standardizing column names offers more options. Type `?cleanepi::standardize_column_names` for more details.

```{r}
requireNamespace("cleanepi")
cleanepi::scan_data(sim_ebola_data)
sim_ebola_data <- cleanepi::standardize_column_names(raw_ebola_data)
names(sim_ebola_data)
```

::::::::::::::::::::::::::::::::::::: challenge

- What differences can you observe in the column names?

::::::::::::::::::::::::::::::::::::::::::::::::

If you want to maintain certain column names without subjecting them to the standardization process, you can utilize the `keep` parameter of the `standardize_column_names()` function. This parameter accepts a vector of column names that are intended to be kept unchanged.

**Exercise:** Standardize the column names of the input dataset, but keep the “V1” column as is.
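A minimal sketch of one possible solution, reusing the `keep` argument described above (not evaluated here, so it does not alter the objects used later in the episode):

```{r, eval=FALSE}
# Standardize all column names except "V1", which is kept unchanged
cleanepi::standardize_column_names(raw_ebola_data, keep = "V1")
```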

### Removing irregularities

Raw data may contain irregularities such as duplicated and empty rows and columns, as well as constant columns. The `remove_constant()` and `remove_duplicates()` functions from `{cleanepi}` remove such irregularities, as demonstrated in the code chunk below.

```{r}
sim_ebola_data <- cleanepi::remove_constant(sim_ebola_data)
sim_ebola_data <- cleanepi::remove_duplicates(sim_ebola_data)
```

Note that our simulated Ebola dataset does not contain duplicated or constant rows or columns.
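Since the simulated data gives these functions nothing to remove, here is a small sketch on made-up toy data (hypothetical, not part of the Ebola dataset) that makes their effect visible:

```{r, eval=FALSE}
# Toy data with one duplicated row and one constant column
toy <- data.frame(
  id     = c(1, 2, 2),
  status = c("confirmed", "probable", "probable"),
  source = "simulation"
)

toy |>
  cleanepi::remove_constant() |>  # drops the constant "source" column
  cleanepi::remove_duplicates()   # drops one of the two identical rows
```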

### Replacing missing values

In addition to the above irregularities, raw data can contain missing values, which may be encoded by different strings, including empty strings. To ensure robust analysis, it is good practice to replace all missing values with `NA` across the entire dataset. Below is a code snippet demonstrating how you can achieve this in `{cleanepi}`:

```{r}
sim_ebola_data <- cleanepi::replace_missing_values(sim_ebola_data)
```

### Validating subject IDs

Each entry in the dataset represents a subject and should be distinguishable by a specific column formatted in a particular way, such as falling within a specified range, containing certain prefixes and/or suffixes, or containing a specific number of characters. The `{cleanepi}` package offers the `check_subject_ids()` function designed precisely for this task, as shown in the code chunk below. This function validates whether the subject IDs are unique and meet the required criteria.

```{r}
# remove this code chunk once `{cleanepi}` is updated. The coercion made here
# will be accounted for within `{cleanepi}`
sim_ebola_data$case_id <- as.character(sim_ebola_data$case_id)
```

```{r}
sim_ebola_data <- cleanepi::check_subject_ids(sim_ebola_data,
  target_columns = "case_id",
  range = c(0, 15000)
)
```

Note that our simulated dataset does contain duplicated subject IDs.

### Standardizing dates

An epidemic dataset typically contains date columns for different events, such as the date of infection or the date of symptom onset. These dates can come in different formats, and it is good practice to standardize them. The `{cleanepi}` package provides functionality for converting date columns in epidemic datasets into ISO format, ensuring consistency across the different date columns. Here's how you can use it on our simulated dataset:

```{r}
sim_ebola_data <- cleanepi::standardize_dates(
  sim_ebola_data,
  target_columns = c(
    "date_onset",
    "date_sample"
  )
)

utils::head(sim_ebola_data)
```

This function converts the values in the target columns into the **Ymd** format; if `target_columns = NULL`, it automatically detects the date columns within the dataset and converts them.
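As a sketch of that automatic detection (not evaluated here; the object name is arbitrary):

```{r, eval=FALSE}
# Let {cleanepi} detect all date columns and convert them to Ymd format
auto_standardized <- cleanepi::standardize_dates(
  raw_ebola_data,
  target_columns = NULL
)
```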

### Converting to numeric values

In the raw dataset, some columns can contain a mixture of character and numeric values, and you may want to convert the character values explicitly into numeric. For example, in our simulated dataset, some entries in the age column are written in words.
The `convert_to_numeric()` function in `{cleanepi}` performs such conversions, as illustrated in the code chunk below.

```{r}
sim_ebola_data <- cleanepi::convert_to_numeric(sim_ebola_data,
target_columns = "age"
)
utils::head(sim_ebola_data)
```

## Epidemiology related operations

In addition to common data cleansing tasks, such as those discussed in the above section, the `{cleanepi}` package offers
additional functionalities tailored specifically for processing and analyzing outbreak and epidemic data. This section
covers some of these specialized tasks.

### Checking sequence of dated-events

Ensuring the correct order and sequence of dated events is crucial in epidemiological data analysis, especially
when analyzing infectious diseases where the timing of events like symptom onset and sample collection is essential.
The `{cleanepi}` package provides a helpful function called `check_date_sequence()` precisely for this purpose.

Here's an example code chunk demonstrating the usage of the `check_date_sequence()` function on our simulated Ebola dataset:

```{r, warning=FALSE}
sim_ebola_data <- cleanepi::check_date_sequence(
  data = sim_ebola_data,
  target_columns = c("date_onset", "date_sample"),
  remove = TRUE
)
```

This functionality is crucial for ensuring data integrity and accuracy in epidemiological analyses, as it helps identify
any inconsistencies or errors in the chronological order of events, allowing you to address them appropriately.

### Dictionary-based substitution

In the realm of data pre-processing, it's common to encounter scenarios where certain columns in a dataset, such as the “gender” column in our simulated Ebola dataset,
are expected to have specific values or factors. However, it's also common for unexpected or erroneous values to appear in these columns, which need to be replaced with appropriate values. The `{cleanepi}` package offers support for dictionary-based substitution, a method that allows you to replace values in specific columns based on mappings defined in a dictionary.
This approach ensures consistency and accuracy in data cleaning.

Moreover, `{cleanepi}` provides a built-in dictionary specifically tailored for epidemiological data. The example dictionary below includes mappings for the “gender” column.

```{r}
test_dict <- base::readRDS(
system.file("extdata", "test_dict.RDS", package = "cleanepi")
)
base::print(test_dict)
```

Now, we can use this dictionary to standardize values of the “gender” column according to predefined categories. Below is an example code chunk demonstrating how to utilize this functionality:

```{r}
sim_ebola_data <- cleanepi::clean_using_dictionary(
  sim_ebola_data,
  dictionary = test_dict
)
utils::head(sim_ebola_data)
```

This approach simplifies the data cleaning process, ensuring that categorical data in epidemiological datasets is accurately categorized and ready for further analysis.

> Note that when a column in the dataset contains values that are not in the dictionary, `clean_using_dictionary()` will raise an error. Users can use the `cleanepi::add_to_dictionary()` function to include the missing values in the dictionary. See the corresponding section in the package [vignette](https://epiverse-trace.github.io/cleanepi/articles/cleanepi.html) for more details.

### Calculating time span between different date events

In epidemiological data analysis, it is also useful to track and analyze time-dependent events, such as the progression of a disease outbreak or the duration between sample collection and analysis.
The `{cleanepi}` package offers a convenient function for calculating the time elapsed between two dated events at different time scales. For example, the code snippet below uses the `span()` function to compute the time elapsed from each case's sampling date until the date this document was generated (`r Sys.Date()`).

```{r}
sim_ebola_data <- cleanepi::span(
  sim_ebola_data,
  target_column = "date_sample",
  end_date = Sys.Date(),
  span_unit = "years",
  span_column_name = "time_since_sampling_date",
  span_remainder_unit = "months"
)
utils::head(sim_ebola_data)
```

After executing the `span()` function, two new columns named `time_since_sampling_date` and `remainder_months` are added to the **sim_ebola_data** dataset, containing the calculated time elapsed since the date of sampling for each case, measured in years, and the remaining time measured in months.
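As a quick check, you can summarise the two new columns with base R (a minimal sketch, using the column names created above):

```{r, eval=FALSE}
# Distribution of elapsed years and remaining months since the sampling date
summary(sim_ebola_data$time_since_sampling_date)
summary(sim_ebola_data$remainder_months)
```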

## Multiple operations at once

Performing data cleaning operations individually can be time-consuming and error-prone. The `{cleanepi}` package simplifies this process by offering a convenient wrapper function called `clean_data()`, which applies a series of predefined data cleaning operations to the input dataset and allows you to perform multiple operations at once.

Alternatively, you can combine multiple data cleaning tasks via the pipe operator `|>`, as shown in the code snippet below.

```{r}
# remove the line below once `{cleanepi}` is updated; the coercion made here
# will then be handled within `{cleanepi}`
raw_ebola_data$`case id` <- as.character(raw_ebola_data$`case id`)

# perform the operations using the pipe syntax
cleaned_data <- raw_ebola_data |>
  cleanepi::standardize_column_names(keep = "V1", rename = NULL) |>
  cleanepi::replace_missing_values(target_columns = NULL) |>
  cleanepi::remove_constant(cutoff = 1.0) |>
  cleanepi::remove_duplicates(target_columns = NULL) |>
  cleanepi::standardize_dates(
    target_columns = c("date_onset", "date_sample"),
    error_tolerance = 0.4,
    format = NULL,
    timeframe = NULL
  ) |>
  cleanepi::check_subject_ids(
    target_columns = "case_id",
    range = c(1, 15000)
  ) |>
  cleanepi::convert_to_numeric(target_columns = "age") |>
  cleanepi::clean_using_dictionary(dictionary = test_dict)
```

## Printing the clean report

The `{cleanepi}` package generates a comprehensive report detailing the findings and actions of all data cleansing
operations conducted during the analysis. This report is presented as a webpage with multiple sections. Each section
corresponds to a specific data cleansing operation, and clicking on each section allows you to access the results of
that particular operation. This interactive approach enables users to efficiently review and analyze the outcomes of
individual cleansing steps within the broader data cleansing process.

You can view the report using the `cleanepi::print_report()` function.
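A minimal sketch of that call (not evaluated here, since it opens the report in a browser; check `?cleanepi::print_report` for the arguments available in your installed version):

```{r, eval=FALSE}
# Build and open the interactive cleaning report for the cleaned dataset
cleanepi::print_report(cleaned_data)
```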

![Example of data cleaning report generated by `{cleanepi}`](fig/report_demo.png)

Check warning on line 254 in episodes/clean-data.Rmd (GitHub Actions / Build markdown source files if valid): [image missing alt-text]: fig/report_demo.png

## Validating and tagging case data
In outbreak analysis, once you have completed the initial steps of reading and cleaning the case data,
it's essential to establish an additional foundational layer to ensure the integrity and reliability of subsequent
analyses. Specifically, this involves verifying the presence and correct data type of certain input columns within
your dataset, a process commonly referred to as "tagging." Additionally, it's crucial to implement measures to
validate that these tagged columns are not inadvertently deleted during further data processing steps.

This is achieved by converting the cleaned case data into a `linelist` object using the `{linelist}` package, as shown in the code chunk below.

```{r}
library("linelist")
data <- linelist::make_linelist(cleaned_data,
  id = "case_id",
  age = "age",
  date_onset = "date_onset",
  date_reporting = "date_sample",
  gender = "gender"
)
utils::head(data, 7)
```
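To see what tagging buys you, the sketch below (not evaluated; it assumes the `{linelist}` helpers `tags()` and `validate_linelist()` are available in your installed version) lists the tagged columns and validates them:

```{r, eval=FALSE}
# Show which columns have been tagged in the linelist object
linelist::tags(data)

# Check that the tagged columns are present and have the expected types
linelist::validate_linelist(data)
```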

::::::::::::::::::::::::::::::::::::: keypoints

- Use the `{cleanepi}` package to clean and standardize epidemic and outbreak data
- Use the `{linelist}` package to tag, validate, and prepare case data for downstream analysis.

::::::::::::::::::::::::::::::::::::::::::::::::
