differences for PR #39
actions-user committed Apr 5, 2024
1 parent f5dc5e7 commit 66e4076
Showing 14 changed files with 15,689 additions and 5 deletions.
221 changes: 221 additions & 0 deletions clean-data.md
@@ -0,0 +1,221 @@
---
title: 'Clean outbreaks data'
teaching: 10
exercises: 2
---

:::::::::::::::::::::::::::::::::::::: questions

- How can we clean and standardize case data?
- How can we convert a raw dataset into a `linelist` object?

::::::::::::::::::::::::::::::::::::::::::::::::

::::::::::::::::::::::::::::::::::::: objectives

- Explain how to clean, curate, and standardize case data using the `{cleanepi}` package
- Demonstrate how to convert case data to `linelist` data

::::::::::::::::::::::::::::::::::::::::::::::::

## Introduction

When analyzing outbreak data, it's essential to ensure that the dataset is clean, curated, standardized, and validated to facilitate accurate and reproducible analysis. This episode focuses on cleaning epidemic and outbreak data using the [cleanepi](https://epiverse-trace.github.io/cleanepi/) package, and validating it using the [linelist](https://epiverse-trace.github.io/linelist/) package. For demonstration purposes, we'll work with a simulated dataset of Ebola cases.


The first step is to import the dataset following the guidelines outlined in the [Read case data](../episodes/read-cases.Rmd) episode. This involves loading the dataset into our environment and viewing its structure and content.


```r
requireNamespace("rio", quietly = TRUE)
raw_ebola_data <- rio::import(
  file.path("episodes", "data", "simulated_ebola_2.csv")
)
```

```{.error}
Error: No such file: episodes/data/simulated_ebola_2.csv
```

```r
utils::head(raw_ebola_data, 5)
```

```{.error}
Error in eval(expr, envir, enclos): object 'raw_ebola_data' not found
```

## Quick inspection

Quick exploration and inspection of the dataset are crucial before diving into any analysis tasks. The `{cleanepi}` package simplifies this process with the `scan_data()` function. Let's take a look at how you can use it:


```r
requireNamespace("cleanepi", quietly = TRUE)
cleanepi::scan_data(raw_ebola_data)
```

```{.error}
Error in eval(expr, envir, enclos): object 'raw_ebola_data' not found
```


The results provide an overview of the content of every column, including column names and the percentage of each data type per column.
You can see that the column names in the dataset are descriptive but lack consistency, as some of them are composed of multiple words separated by white spaces. Additionally, some columns contain more than one data type, and there are missing values in others.

## Common data cleaning operations

This section demonstrates how to perform some common data cleaning operations using the `{cleanepi}` package.

### Standardizing column names

For this example dataset, standardizing column names typically involves removing spaces and connecting different words with “_”. This practice helps maintain consistency and readability in the dataset.
However, the function used for standardizing column names offers more options. Type `?cleanepi::standardize_column_names` for more details.


```r
sim_ebola_data <- cleanepi::standardize_column_names(raw_ebola_data)
```

```{.error}
Error in eval(expr, envir, enclos): object 'raw_ebola_data' not found
```

```r
names(sim_ebola_data)
```

```{.error}
Error in eval(expr, envir, enclos): object 'sim_ebola_data' not found
```

::::::::::::::::::::::::::::::::::::: challenge

- What differences can you observe in the column names?

::::::::::::::::::::::::::::::::::::::::::::::::

If you want to keep certain column names out of the standardization process, you can use the `keep` parameter of the `standardize_column_names()` function. This parameter accepts a vector of column names that should be left unchanged.
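
To make the idea concrete, here is a minimal base R sketch of what this kind of standardization does. The helper `to_snake_case()` and the example names are hypothetical illustrations, not part of `{cleanepi}`, and the package's own rules may differ in detail.

```r
# Hypothetical helper (not part of {cleanepi}): lower-case the names,
# replace runs of non-alphanumeric characters with "_", and leave any
# names listed in `keep` untouched.
to_snake_case <- function(x, keep = character(0)) {
  out <- tolower(trimws(x))
  out <- gsub("[^a-z0-9]+", "_", out)  # spaces, dots, etc. become "_"
  out <- gsub("^_+|_+$", "", out)      # drop leading/trailing "_"
  out[x %in% keep] <- x[x %in% keep]   # restore the names to keep
  out
}

# Made-up column names for illustration only
example_names <- c("V1", "case id", "date onset", "Date.Sample")
standardized_names <- to_snake_case(example_names, keep = "V1")
print(standardized_names)
```

The `keep` argument here only mimics the behaviour described above, which may help when thinking about the exercise that follows.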

**Exercise:** Standardize the column names of the input dataset, but keep the “V1” column as is.

### Removing irregularities

Raw data may contain irregularities such as duplicated rows, empty rows and columns, and constant columns. The `remove_duplicates()` and `remove_constants()` functions from `{cleanepi}` remove such irregularities, as demonstrated in the code chunk below.


```r
sim_ebola_data <- cleanepi::remove_constants(sim_ebola_data)
```

```{.error}
Error in eval(expr, envir, enclos): object 'sim_ebola_data' not found
```

```r
sim_ebola_data <- cleanepi::remove_duplicates(sim_ebola_data)
```

```{.error}
Error in eval(expr, envir, enclos): object 'sim_ebola_data' not found
```

Note that our simulated Ebola dataset does not contain duplicated or constant rows or columns.
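
Since the simulated dataset has no such irregularities, the following base R sketch shows the same idea on a small made-up data frame. The object `toy_data` and its columns are assumptions for illustration, and the code does not reproduce the internals of `{cleanepi}`.

```r
# Made-up data: the `country` column is constant, and rows 2 and 3 are duplicates
toy_data <- data.frame(
  id      = c(1, 2, 2, 3),
  sex     = c("m", "f", "f", "m"),
  country = rep("X", 4)
)

# A column is constant when it holds at most one distinct value
is_constant <- vapply(
  toy_data,
  function(column) length(unique(column)) <= 1,
  logical(1)
)
toy_clean <- toy_data[, !is_constant, drop = FALSE]

# Keep only the first occurrence of each duplicated row
toy_clean <- toy_clean[!duplicated(toy_clean), , drop = FALSE]
print(toy_clean)
```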

### Replacing missing values

In addition to these irregularities, raw data can contain missing values that may be encoded by different strings, including the empty string. To ensure robust analysis, it is good practice to replace all missing values with `NA` in the entire dataset. Below is a code snippet demonstrating how you can achieve this in `{cleanepi}`:


```r
sim_ebola_data <- cleanepi::replace_missing_values(sim_ebola_data)
```

```{.error}
Error in eval(expr, envir, enclos): object 'sim_ebola_data' not found
```
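
As a rough illustration of what such a replacement involves, the base R sketch below maps a set of placeholder strings to `NA`. The vector `na_strings` and the toy data are hypothetical; which strings `{cleanepi}` treats as missing by default is not shown here.

```r
# Hypothetical set of strings that encode a missing value
na_strings <- c("", "NA", "-99", "unknown")

# Made-up data for illustration only
toy_data <- data.frame(
  age = c("12", "-99", "", "40"),
  sex = c("m", "unknown", "f", "NA"),
  stringsAsFactors = FALSE
)

# Replace every placeholder with NA, column by column
toy_data[] <- lapply(toy_data, function(column) {
  column[column %in% na_strings] <- NA
  column
})
print(toy_data)
```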

### Validating subject IDs

Each entry in the dataset represents a subject and should be distinguishable by a specific column formatted in a particular way, such as falling within a specified range, containing certain prefixes and/or suffixes, or containing a specific number of characters. The `{cleanepi}` package offers the `check_subject_ids()` function for this task, as shown in the code chunk below. This function checks whether the IDs are unique and meet the required criteria.


```r
# Temporary workaround: coerce case_id to character.
# Remove this chunk once {cleanepi} is updated to handle the coercion internally.
sim_ebola_data$case_id <- as.character(sim_ebola_data$case_id)
```

```{.error}
Error in eval(expr, envir, enclos): object 'sim_ebola_data' not found
```


```r
sim_ebola_data <- cleanepi::check_subject_ids(
  sim_ebola_data,
  target_columns = "case_id"
)
```

```{.error}
Error in eval(expr, envir, enclos): object 'sim_ebola_data' not found
```

Note that our simulated dataset does contain duplicated subject IDs.
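
The kinds of checks described above can be sketched in base R as follows. The IDs and the pattern `id_pattern` (prefix `EB` followed by three digits) are invented for illustration and are not the format used in the simulated dataset.

```r
# Made-up subject IDs: one duplicate, one malformed entry, one missing value
subject_ids <- c("EB001", "EB002", "EB002", "X17", NA)

# Hypothetical expected format: prefix "EB" followed by exactly three digits
id_pattern <- "^EB[0-9]{3}$"

is_missing <- is.na(subject_ids)

# Flag every occurrence of a repeated ID, not only the later ones
is_duplicated <- !is_missing &
  (duplicated(subject_ids) | duplicated(subject_ids, fromLast = TRUE))

# Flag non-missing IDs that do not match the expected format
is_malformed <- !is_missing & !grepl(id_pattern, subject_ids)

id_report <- data.frame(subject_ids, is_missing, is_duplicated, is_malformed)
print(id_report)
```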

### Standardizing dates

An epidemic dataset typically contains date columns for different events, such as the date of infection and the date of symptom onset. These dates can come in different formats, and it is good practice to unify them. The `{cleanepi}` package provides functionality for converting date columns in epidemic datasets into ISO format, ensuring consistency across the different date columns. Here's how you can use it on our simulated dataset:


```r
sim_ebola_data <- cleanepi::standardize_dates(
  sim_ebola_data,
  target_columns = c("date_onset", "date_sample")
)
```

```{.error}
Error in eval(expr, envir, enclos): object 'sim_ebola_data' not found
```

```r
utils::head(sim_ebola_data)
```

```{.error}
Error in eval(expr, envir, enclos): object 'sim_ebola_data' not found
```

This function converts the values in the target columns into the **Ymd** format. If `target_columns = NULL`, it automatically detects the date columns within the dataset and converts them as well.
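
To show why mixed formats need special handling, here is a minimal base R sketch that tries several candidate formats in turn. The helper `parse_mixed_dates()`, its default formats, and the example values are assumptions for illustration; this is not how `{cleanepi}` is implemented.

```r
# Hypothetical helper: try each candidate format in order and keep the
# first one that parses each value.
parse_mixed_dates <- function(x,
                              formats = c("%Y-%m-%d", "%d/%m/%Y", "%Y.%m.%d")) {
  out <- rep(as.Date(NA), length(x))
  for (fmt in formats) {
    todo <- is.na(out) & !is.na(x)  # values not parsed yet
    out[todo] <- as.Date(x[todo], format = fmt)
  }
  out
}

# Made-up dates written in three different formats
raw_dates <- c("2014-05-01", "03/05/2014", "2014.05.17", "not a date", NA)
parsed_dates <- parse_mixed_dates(raw_dates)
print(parsed_dates)
```

The order of `formats` matters for ambiguous values: `"03/05/2014"` is read here as 3 May because day-first is listed, and a month-first format would give a different date.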

### Converting to numeric values

In a raw dataset, some columns can contain a mixture of character and numeric values, and you may want to convert the character values explicitly into numeric ones. For example, in our simulated dataset, some entries in the age column are written in words.
The `convert_to_numeric()` function in `{cleanepi}` performs this conversion, as illustrated in the code chunk below.
Note that this function calls functions from the `{numberize}` package.


```r
sim_ebola_data <- cleanepi::convert_to_numeric(
  sim_ebola_data,
  target_columns = "age"
)
```

```{.error}
Error in eval(expr, envir, enclos): object 'sim_ebola_data' not found
```

```r
utils::head(sim_ebola_data)
```

```{.error}
Error in eval(expr, envir, enclos): object 'sim_ebola_data' not found
```
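
The idea behind this conversion can be sketched in base R with a small lookup table. The vector `word_values` and the helper `words_to_numeric()` are hypothetical and only cover zero to ten; `{numberize}` handles far more cases, and this sketch does not reproduce it.

```r
# Hypothetical lookup table covering only the words zero to ten
word_values <- c(
  zero = 0, one = 1, two = 2, three = 3, four = 4, five = 5,
  six = 6, seven = 7, eight = 8, nine = 9, ten = 10
)

# Hypothetical helper: parse digits first, then fall back to the lookup table
words_to_numeric <- function(x) {
  cleaned <- tolower(trimws(x))
  out <- suppressWarnings(as.numeric(cleaned))
  is_word <- is.na(out) & cleaned %in% names(word_values)
  out[is_word] <- unname(word_values[cleaned[is_word]])
  out
}

# Made-up age values mixing digits, words, and unusable entries
raw_age <- c("23", "seven", " Ten ", "unknown", NA)
numeric_age <- words_to_numeric(raw_age)
print(numeric_age)
```

Entries that are neither digits nor known words stay `NA`, so they remain visible for manual review.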

## Epidemiology related operations

In addition to common data cleaning tasks, such as those discussed in the previous section, the `{cleanepi}` package offers additional functionalities tailored specifically for processing and analyzing outbreak and epidemic data. This section covers some of these specialized tasks.

### Dictionary-based substitution

### Calculating age at different time scales

### Calculating age categories

## Multiple operations at once
8 changes: 4 additions & 4 deletions config.yaml
@@ -59,10 +59,10 @@ contact: '[email protected]'

# Order of episodes in your lesson
episodes:
#- read-cases.Rmd
#- clean-data.Rmd
#- describe-cases.Rmd
#- simple-analysis.Rmd
- read-cases.Rmd
- clean-data.Rmd
- describe-cases.Rmd
- simple-analysis.Rmd
- delays-reuse.Rmd
- quantify-transmissibility.Rmd
- delays-functions.Rmd