Commit
1 parent f5dc5e7, commit 66e4076
Showing 14 changed files with 15,689 additions and 5 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,221 @@ | ||
--- | ||
title: 'Clean outbreaks data' | ||
teaching: 10 | ||
exercises: 2 | ||
--- | ||
|
||
:::::::::::::::::::::::::::::::::::::: questions | ||
|
||
- How can you clean and standardize case data? | ||
- How can you convert a raw dataset into a `linelist` object? | ||
|
||
:::::::::::::::::::::::::::::::::::::::::::::::: | ||
|
||
::::::::::::::::::::::::::::::::::::: objectives | ||
|
||
- Explain how to clean, curate, and standardize case data using the `{cleanepi}` package | ||
- Demonstrate how to convert case data to `linelist` data | ||
|
||
:::::::::::::::::::::::::::::::::::::::::::::::: | ||
|
||
## Introduction | ||
|
||
When analyzing outbreak data, it's essential to ensure that the dataset is clean, curated, standardized, and validated to facilitate accurate and reproducible analysis. This episode focuses on cleaning epidemic and outbreak data using the [cleanepi](https://epiverse-trace.github.io/cleanepi/) package, and validating it using the [linelist](https://epiverse-trace.github.io/linelist/) package. For demonstration purposes, we'll work with a simulated dataset of Ebola cases. | ||
|
||
|
||
The first step is to import the dataset following the guidelines outlined in the [Read case data](../episodes/read-cases.Rmd) episode. This involves loading the dataset into our environment and viewing its structure and content. | ||
|
||
|
||
```r | ||
requireNamespace("rio", quietly = TRUE) | ||
raw_ebola_data <- rio::import( | ||
file.path("episodes", "data", "simulated_ebola_2.csv") | ||
) | ||
``` | ||
|
||
```{.error} | ||
Error: No such file: episodes/data/simulated_ebola_2.csv | ||
``` | ||
|
||
```r | ||
utils::head(raw_ebola_data, 5) | ||
``` | ||
|
||
```{.error} | ||
Error in eval(expr, envir, enclos): object 'raw_ebola_data' not found | ||
``` | ||
|
||
## Quick inspection | ||
|
||
Quick exploration and inspection of the dataset are crucial before diving into any analysis tasks. The `{cleanepi}` package simplifies this process with the `scan_data()` function. Let's take a look at how you can use it: | ||
|
||
|
||
```r | ||
requireNamespace("cleanepi", quietly = TRUE) | ||
cleanepi::scan_data(raw_ebola_data) | ||
``` | ||
|
||
```{.error} | ||
Error in eval(expr, envir, enclos): object 'raw_ebola_data' not found | ||
``` | ||
|
||
|
||
The results provide an overview of the content of every column, including the column names and the percentage of each data type per column. | ||
You can see that the column names in the dataset are descriptive but lack consistency, as some of them are composed of multiple words separated by white spaces. Additionally, some columns contain more than one data type, and others contain missing values. | ||
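
To make the idea of a per-column overview concrete, here is a minimal base R sketch. It is not how `scan_data()` is implemented; the toy data frame and the helper `summarise_column()` are made up for illustration. For each column, it reports the share of entries that are missing, numeric-like, or other text.

```r
# Toy data frame standing in for the raw case data (values are made up).
toy_data <- data.frame(
  "case id" = c("1", "2", "three", NA),
  age = c("23", "forty", "", "51"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Hypothetical helper: for one column, compute the share of missing,
# numeric-like, and other (character) entries.
summarise_column <- function(x) {
  x <- as.character(x)
  is_missing <- is.na(x) | x == ""
  is_numeric <- !is_missing & !is.na(suppressWarnings(as.numeric(x)))
  c(
    missing = mean(is_missing),
    numeric = mean(is_numeric),
    character = mean(!is_missing & !is_numeric)
  )
}

# One row per column, one column per category.
column_summary <- t(sapply(toy_data, summarise_column))
column_summary
```

In this toy example, both columns mix numeric-like entries with text and missing values, which is the kind of inconsistency the cleaning steps in this episode address.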
|
||
## Common data cleaning operations | ||
|
||
This section demonstrates how to perform some common data cleaning operations using the `{cleanepi}` package. | ||
|
||
### Standardizing column names | ||
|
||
For this example dataset, standardizing column names typically involves removing spaces and connecting different words with “_”. This practice helps maintain consistency and readability in the dataset. | ||
However, the function used for standardizing column names offers more options. Type `?cleanepi::standardize_column_names` for more details. | ||
|
||
|
||
```r | ||
sim_ebola_data <- cleanepi::standardize_column_names(raw_ebola_data) | ||
``` | ||
|
||
```{.error} | ||
Error in eval(expr, envir, enclos): object 'raw_ebola_data' not found | ||
``` | ||
|
||
```r | ||
names(sim_ebola_data) | ||
``` | ||
|
||
```{.error} | ||
Error in eval(expr, envir, enclos): object 'sim_ebola_data' not found | ||
``` | ||
|
||
::::::::::::::::::::::::::::::::::::: challenge | ||
|
||
- What differences can you observe in the column names? | ||
|
||
:::::::::::::::::::::::::::::::::::::::::::::::: | ||
|
||
If you want to maintain certain column names without subjecting them to the standardization process, you can utilize the `keep` parameter of the `standardize_column_names()` function. This parameter accepts a vector of column names that are intended to be kept unchanged. | ||
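
The logic of standardizing names while protecting some of them can be sketched in base R. This is only an illustration under stated assumptions: the helper `standardize_names_sketch()` and the toy data are hypothetical and do not reproduce the rules that `standardize_column_names()` actually applies.

```r
# Hypothetical helper: lower-case the names and join words with "_",
# leaving the columns listed in `keep` untouched.
standardize_names_sketch <- function(data, keep = NULL) {
  old_names <- names(data)
  new_names <- tolower(trimws(old_names))
  # Collapse any run of non-alphanumeric characters into a single underscore.
  new_names <- gsub("[^a-z0-9]+", "_", new_names)
  # Drop underscores left at the start or end of a name.
  new_names <- gsub("^_+|_+$", "", new_names)
  # Restore the original name for every column listed in `keep`.
  unchanged <- old_names %in% keep
  new_names[unchanged] <- old_names[unchanged]
  names(data) <- new_names
  data
}

# Toy data with inconsistent column names (values are made up).
toy_data <- data.frame(
  V1 = 1:2,
  "Date onset" = c("2014-05-01", "2014-05-03"),
  "Case ID" = c("a1", "a2"),
  check.names = FALSE
)

names(standardize_names_sketch(toy_data, keep = "V1"))
```

With `keep = "V1"`, the first column keeps its original capitalization while the other names become `date_onset` and `case_id`.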
|
||
**Exercise:** Standardize the column names of the input dataset, but keep the “V1” column as is. | ||
|
||
### Removing irregularities | ||
|
||
Raw data may contain irregularities such as duplicated and empty rows and columns, as well as constant columns. The `remove_duplicates()` and `remove_constants()` functions from `{cleanepi}` remove such irregularities, as demonstrated in the code chunk below. | ||
|
||
|
||
```r | ||
sim_ebola_data <- cleanepi::remove_constants(sim_ebola_data) | ||
``` | ||
|
||
```{.error} | ||
Error in eval(expr, envir, enclos): object 'sim_ebola_data' not found | ||
``` | ||
|
||
```r | ||
sim_ebola_data <- cleanepi::remove_duplicates(sim_ebola_data) | ||
``` | ||
|
||
```{.error} | ||
Error in eval(expr, envir, enclos): object 'sim_ebola_data' not found | ||
``` | ||
|
||
Note that our simulated Ebola dataset contains neither duplicated nor constant rows or columns. | ||
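
Since the simulated data has none of these irregularities, a toy example may help show what they look like. The base R sketch below uses made-up data and is not the `{cleanepi}` implementation; it drops empty rows, then empty and constant columns, then duplicated rows.

```r
# Toy data with an empty row, a constant column, an empty column,
# and a duplicated row (values are made up).
toy_data <- data.frame(
  case_id = c("a1", "a2", "a2", NA),
  region = c("west", "west", "west", NA),
  age = c(23, 40, 40, NA),
  empty_col = NA,
  stringsAsFactors = FALSE
)

# Drop rows in which every value is missing.
no_empty_rows <- toy_data[rowSums(!is.na(toy_data)) > 0, , drop = FALSE]

# Drop columns that hold at most one distinct non-missing value
# (this covers both empty and constant columns).
is_constant <- vapply(
  no_empty_rows,
  function(x) length(unique(x[!is.na(x)])) <= 1,
  logical(1)
)
no_constants <- no_empty_rows[, !is_constant, drop = FALSE]

# Drop rows that exactly repeat an earlier row.
deduplicated <- no_constants[!duplicated(no_constants), , drop = FALSE]
deduplicated
```

The order of the steps matters in this sketch: removing the constant `region` column first is what makes the two `a2` rows compare as exact duplicates on the remaining columns.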
|
||
### Replacing missing values | ||
|
||
In addition to these irregularities, raw data can contain missing values that may be encoded by different strings, including the empty string. To ensure robust analysis, it is good practice to replace all missing values with `NA` in the entire dataset. Below is a code snippet demonstrating how you can achieve this in `{cleanepi}`: | ||
|
||
|
||
```r | ||
sim_ebola_data <- cleanepi::replace_missing_values(sim_ebola_data) | ||
``` | ||
|
||
```{.error} | ||
Error in eval(expr, envir, enclos): object 'sim_ebola_data' not found | ||
``` | ||
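
As a rough base R illustration of the same idea, the sketch below replaces a set of placeholder strings with `NA` across all columns. The toy data and the list in `missing_codes` are assumptions made for this example; they are not the defaults used by `replace_missing_values()`.

```r
# Toy data in which missing values are encoded in several ways
# (values are made up).
toy_data <- data.frame(
  case_id = c("a1", "a2", "a3"),
  sex = c("m", "", "-99"),
  region = c("N/A", "west", "east"),
  stringsAsFactors = FALSE
)

# Strings treated as "missing" in this sketch; this list is an assumption,
# adapt it to the coding used in your own data.
missing_codes <- c("", "-99", "N/A")

# Replace every matching entry with NA, column by column,
# while keeping the data frame structure.
toy_clean <- toy_data
toy_clean[] <- lapply(toy_clean, function(x) {
  x[x %in% missing_codes] <- NA
  x
})
toy_clean
```

After this step, functions such as `is.na()` treat all three encodings in the same way.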
|
||
### Validating subject IDs | ||
|
||
Each entry in the dataset represents a subject and should be distinguishable by a specific column formatted in a particular way, such as falling within a specified range, containing certain prefixes and/or suffixes, or containing a specific number of characters. The `{cleanepi}` package offers the `check_subject_ids()` function designed precisely for this task, as shown in the code chunk below. This function validates whether the subject IDs are unique and meet the required criteria. | ||
|
||
|
||
```r | ||
# Remove this code chunk once {cleanepi} is updated; the coercion made here will then be handled within {cleanepi}. | ||
sim_ebola_data$case_id <- as.character(sim_ebola_data$case_id) | ||
``` | ||
|
||
```{.error} | ||
Error in eval(expr, envir, enclos): object 'sim_ebola_data' not found | ||
``` | ||
|
||
|
||
```r | ||
sim_ebola_data <- cleanepi::check_subject_ids(sim_ebola_data, | ||
target_columns = "case_id") | ||
``` | ||
|
||
```{.error} | ||
Error in eval(expr, envir, enclos): object 'sim_ebola_data' not found | ||
``` | ||
|
||
Note that our simulated dataset does contain duplicated subject IDs. | ||
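
To illustrate the kinds of checks involved, here is a minimal base R sketch that flags missing, duplicated, and malformed IDs. The ID convention encoded in `id_pattern` and the example IDs are hypothetical, and the sketch does not reproduce the behaviour or output of `check_subject_ids()`.

```r
# Made-up subject IDs: one duplicate, one malformed, one missing.
case_ids <- c("PS001P2", "PS002P2", "PS002P2", "PS04P2", NA)

# Hypothetical ID convention for this sketch: prefix "PS", three digits,
# suffix "P2", seven characters in total.
id_pattern <- "^PS[0-9]{3}P2$"

is_missing <- is.na(case_ids)
# Flag the second and later occurrences of a repeated ID.
is_duplicated <- !is_missing & duplicated(case_ids)
# Flag IDs that do not follow the assumed convention.
is_malformed <- !is_missing & !grepl(id_pattern, case_ids)

id_report <- data.frame(
  case_id = case_ids,
  missing = is_missing,
  duplicated = is_duplicated,
  malformed = is_malformed
)
id_report
```

Reporting the problems rather than silently dropping rows leaves the decision about how to resolve each case to you.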
|
||
### Standardizing dates | ||
|
||
An epidemic dataset typically contains date columns for different events, such as the date of infection and the date of symptom onset. These dates can come in different formats, and it is good practice to standardize them. The `{cleanepi}` package provides functionality for converting date columns in epidemic datasets into ISO format, ensuring consistency across the different date columns. Here's how you can use it on our simulated dataset: | ||
|
||
|
||
```r | ||
sim_ebola_data <- cleanepi::standardize_dates(sim_ebola_data, | ||
target_columns = c("date_onset", | ||
"date_sample")) | ||
``` | ||
|
||
```{.error} | ||
Error in eval(expr, envir, enclos): object 'sim_ebola_data' not found | ||
``` | ||
|
||
```r | ||
utils::head(sim_ebola_data) | ||
``` | ||
|
||
```{.error} | ||
Error in eval(expr, envir, enclos): object 'sim_ebola_data' not found | ||
``` | ||
|
||
This function converts the values in the target columns into the **Ymd** format. If `target_columns = NULL`, it automatically detects the date columns within the dataset and converts them. | ||
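
The idea of unifying mixed date formats can be sketched in base R by trying a list of candidate formats in turn. This is an illustration only: the helper `parse_dates_sketch()`, the example values, and the list of formats are assumptions, and `standardize_dates()` may resolve formats differently.

```r
# Made-up dates written in three different formats, plus one unusable entry.
raw_dates <- c("2014-05-01", "03/05/2014", "2014/05/07", "unknown")

# Candidate formats tried in order; assumed for this sketch.
# Ambiguous inputs such as "03/05/2014" are read as day/month/year here.
candidate_formats <- c("%Y-%m-%d", "%d/%m/%Y", "%Y/%m/%d")

# Hypothetical helper: parse each entry with the first format that fits.
parse_dates_sketch <- function(x, formats) {
  parsed <- rep(as.Date(NA), length(x))
  for (fmt in formats) {
    # Only retry the entries that no earlier format could parse.
    todo <- is.na(parsed)
    parsed[todo] <- as.Date(x[todo], format = fmt)
  }
  parsed
}

iso_dates <- parse_dates_sketch(raw_dates, candidate_formats)
format(iso_dates, "%Y-%m-%d")
```

Entries that match none of the formats stay `NA`, so they can be reviewed rather than guessed at. The order of `candidate_formats` decides how ambiguous day/month values are read, which is worth checking against how your data was recorded.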
|
||
### Converting to numeric values | ||
|
||
In the raw dataset, some columns can contain a mixture of character and numeric values, and you may want to convert the character values explicitly into numeric ones. For example, in our simulated dataset, some entries in the age column are written in words. | ||
The `convert_to_numeric()` function in `{cleanepi}` performs this conversion, as illustrated in the code chunk below. | ||
Note that this function calls functions from the `{numberize}` package. | ||
|
||
|
||
```r | ||
sim_ebola_data <- cleanepi::convert_to_numeric(sim_ebola_data, | ||
target_columns = "age") | ||
``` | ||
|
||
```{.error} | ||
Error in eval(expr, envir, enclos): object 'sim_ebola_data' not found | ||
``` | ||
|
||
```r | ||
utils::head(sim_ebola_data) | ||
``` | ||
|
||
```{.error} | ||
Error in eval(expr, envir, enclos): object 'sim_ebola_data' not found | ||
``` | ||
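
To show what converting numbers written in words involves, here is a deliberately small base R sketch. The helper `words_to_numeric_sketch()` and its lookup table are hypothetical and cover only the words in the example; this is not how `{numberize}` or `convert_to_numeric()` work internally.

```r
# Made-up ages: some numeric, some written in words, one unusable entry.
ages_raw <- c("23", "forty", "seven", "51", "unknown")

# Tiny lookup table for this sketch only; it covers just the words
# used in the example.
word_values <- c(seven = 7, forty = 40)

# Hypothetical helper: convert numeric-looking entries directly and
# translate known number words through the lookup table.
words_to_numeric_sketch <- function(x, lookup) {
  # Entries that already look numeric convert directly; the rest become NA.
  result <- suppressWarnings(as.numeric(x))
  # Fill in entries whose lower-cased text matches a word in the lookup.
  matched <- match(tolower(trimws(x)), names(lookup))
  from_words <- is.na(result) & !is.na(matched)
  result[from_words] <- unname(lookup[matched[from_words]])
  result
}

ages_numeric <- words_to_numeric_sketch(ages_raw, word_values)
ages_numeric
```

Entries that are neither numeric nor in the lookup table, such as "unknown", remain `NA` in the result.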
|
||
## Epidemiology related operations | ||
In addition to common data cleaning tasks, such as those discussed in the previous section, the `{cleanepi}` package offers functionality tailored specifically to processing and analyzing outbreak and epidemic data. This section covers some of these specialized tasks. | ||
### Dictionary-based substitution | ||
|
||
### Calculating age at different time scales | ||
|
||
### Calculating age categories | ||
|
||
## Multiple operations at once |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -59,10 +59,10 @@ contact: '[email protected]' | |
|
||
# Order of episodes in your lesson | ||
episodes: | ||
#- read-cases.Rmd | ||
#- clean-data.Rmd | ||
#- describe-cases.Rmd | ||
#- simple-analysis.Rmd | ||
- read-cases.Rmd | ||
- clean-data.Rmd | ||
- describe-cases.Rmd | ||
- simple-analysis.Rmd | ||
- delays-reuse.Rmd | ||
- quantify-transmissibility.Rmd | ||
- delays-functions.Rmd | ||
|