Commit

differences for PR #70
actions-user committed Jun 17, 2024
1 parent 0496c4d commit 9fff5f4
Showing 17 changed files with 382 additions and 192 deletions.
129 changes: 83 additions & 46 deletions clean-data.md
@@ -15,11 +15,33 @@ exercises: 10

- Explain how to clean, curate, and standardize case data using the `{cleanepi}` package
- Demonstrate how to convert case data to `linelist` data
- Perform essential data-cleaning operations on a raw case dataset.

::::::::::::::::::::::::::::::::::::::::::::::::

::::::::::::::::::::: prereq

This episode requires you to:

- Download the [simulated_ebola_2.csv](https://epiverse-trace.github.io/tutorials-early/data/simulated_ebola_2.csv)
- Save it in the `data/` folder.

:::::::::::::::::::::

## Introduction
In the process of analyzing outbreak data, it's essential to ensure that the dataset is clean, curated, standardized, and validate to facilitate accurate and reproducible analysis. This episode focuses on cleaning epidemics and outbreaks data using the [cleanepi](https://epiverse-trace.github.io/cleanepi/) package, and validate it using the [linelist](https://epiverse-trace.github.io/linelist/) package. For demonstration purposes, we'll work with a simulated dataset of Ebola cases.
In the process of analyzing outbreak data, it's essential to ensure that the dataset is clean, curated, standardized, and valid to facilitate accurate and reproducible analysis. This episode focuses on cleaning epidemic and outbreak data using the [cleanepi](https://epiverse-trace.github.io/cleanepi/) package and validating it using the [linelist](https://epiverse-trace.github.io/linelist/) package. For demonstration purposes, we'll work with a simulated dataset of Ebola cases.

::::::::::::::::::: checklist

### The double-colon

The double-colon `::` in R lets you call a specific function from a package without attaching the entire package to the current session with `library()`.

For example, `dplyr::filter(data, condition)` uses `filter()` from the `{dplyr}` package.

This helps us remember which package a function comes from and avoids namespace conflicts.

:::::::::::::::::::
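
As a small illustration that relies only on packages shipped with base R, the sketch below calls two functions through `::` without any `library()` call. The `ages` vector is made up for this example and is not part of the lesson data.

```r
# `ages` is a made-up vector used only for illustration.
ages <- c(90, 25, 54, 90, 74)

# Call each function through its package name; no library() call is needed.
median_age <- stats::median(ages)
first_two <- utils::head(ages, n = 2)

median_age
first_two
```

Writing the package name next to each call also makes it obvious where a function comes from when two packages export the same name.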


The first step is to import the dataset following the guidelines outlined in the [Read case data](../episodes/read-cases.Rmd) episode. This involves loading the dataset into our environment and viewing its structure and content.
@@ -31,9 +53,9 @@ library("rio")
library("here")

# Read data
# e.g.: if path to file is data/raw-data/simulated_ebola_2.csv then:
# e.g.: if path to file is data/simulated_ebola_2.csv then:
raw_ebola_data <- rio::import(
here::here("data", "raw-data", "simulated_ebola_2.csv")
here::here("data", "simulated_ebola_2.csv")
)
```

@@ -107,15 +129,26 @@ names(sim_ebola_data)

If you want to maintain certain column names without subjecting them to the standardization process, you can utilize the `keep` parameter of the `standardize_column_names()` function. This parameter accepts a vector of column names that are intended to be kept unchanged.
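
To make the idea of keeping a column concrete, here is a rough base R sketch. The helper `standardize_names_sketch()` is hypothetical and is not the `{cleanepi}` implementation; it assumes that standardization means lower-casing names and replacing runs of non-alphanumeric characters with underscores. The `toy` data frame is made up for the example.

```r
# Hypothetical helper, NOT the {cleanepi} implementation: lower-case the names
# and replace runs of non-alphanumeric characters with "_", except for the
# names listed in `keep`, which are left untouched.
standardize_names_sketch <- function(data, keep = NULL) {
  old_names <- names(data)
  new_names <- tolower(gsub("[^A-Za-z0-9]+", "_", old_names))
  new_names <- gsub("^_+|_+$", "", new_names)  # trim leading/trailing "_"
  unchanged <- old_names %in% keep
  new_names[unchanged] <- old_names[unchanged]
  names(data) <- new_names
  data
}

# Made-up data frame with untidy column names.
toy <- data.frame(
  V1 = 1:2,
  `case id` = c(10, 11),
  `Date Onset` = c("a", "b"),
  check.names = FALSE
)

kept <- names(standardize_names_sketch(toy, keep = "V1"))
not_kept <- names(standardize_names_sketch(toy))
kept
not_kept
```

In this sketch, `V1` survives unchanged only when it is listed in `keep`; otherwise it is lower-cased like every other name.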

**Exercise:** Standardize the column names of the input dataset, but keep the “V1” column as is.
::::::::::::::::::::::::::::::::::::: challenge

Standardize the column names of the input dataset, but keep the “V1” column as it is.

::::::::::::::::::::::::::::::::::::::::::::::::

### Removing irregularities

Raw data may contain irregularities such as duplicated or empty rows and columns, as well as constant columns. The `remove_duplicates()` and `remove_constants()` functions from `{cleanepi}` remove such irregularities, as demonstrated in the code chunk below.


```r
sim_ebola_data <- cleanepi::remove_constant(sim_ebola_data)
sim_ebola_data <- cleanepi::remove_constants(sim_ebola_data)
```

```{.error}
Error: 'remove_constants' is not an exported object from 'namespace:cleanepi'
```

```r
sim_ebola_data <- cleanepi::remove_duplicates(sim_ebola_data)
```
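
For intuition about what these two steps do, the base R sketch below drops constant columns and exact duplicate rows from a made-up data frame. It is an illustration under simplified assumptions, not the `{cleanepi}` implementation, and it ignores options such as a cutoff for near-constant columns.

```r
# Made-up data: `country` is constant and rows 2 and 3 are identical.
toy <- data.frame(
  id = c(1, 2, 2, 3),
  country = c("A", "A", "A", "A"),
  age = c(30, 41, 41, 25)
)

# Drop columns that hold a single distinct value.
is_constant <- vapply(toy, function(col) length(unique(col)) <= 1, logical(1))
no_constants <- toy[, !is_constant, drop = FALSE]

# Drop rows that are exact copies of an earlier row.
deduplicated <- no_constants[!duplicated(no_constants), , drop = FALSE]
deduplicated
```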

@@ -127,32 +160,35 @@ In addition to the regularities, raw data can contain missing values that may be


```r
sim_ebola_data <- cleanepi::replace_missing_values(sim_ebola_data)
sim_ebola_data <- cleanepi::replace_missing_values(
data = sim_ebola_data,
na_strings = ""
)
```
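
The effect of treating the empty string as a missing-value marker can be sketched in base R as follows. The `toy` data frame is made up, and the loop is only an approximation of the idea behind `na_strings`, not the package's own code.

```r
# Made-up data where missing entries were recorded as empty strings.
toy <- data.frame(
  gender = c("male", "", "female"),
  status = c("confirmed", "", ""),
  stringsAsFactors = FALSE
)

# Strings that should be treated as missing.
na_strings <- ""

# Replace every listed string with NA, column by column.
toy[] <- lapply(toy, function(col) {
  col[col %in% na_strings] <- NA
  col
})

colSums(is.na(toy))
```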

### Validating subject IDs

Each entry in the dataset represents a subject and should be distinguishable by a specific column formatted in a particular way, such as falling within a specified range, containing certain prefixes and/or suffixes, or containing a specific number of characters. The `{cleanepi}` package offers the `check_subject_ids()` function for this task, as shown in the code chunk below. This function checks whether the IDs are unique and meet the required criteria.


```r
# remove this chunk code once {cleanepi} is updated.
# The coercion made here will be accounted for within {cleanepi}
sim_ebola_data$case_id <- as.character(sim_ebola_data$case_id)
```


```r
sim_ebola_data <- cleanepi::check_subject_ids(sim_ebola_data,
target_columns = "case_id",
range = c(0, 15000)
)
sim_ebola_data <-
cleanepi::check_subject_ids(
data = sim_ebola_data,
target_columns = "case_id",
range = c(0, 15000)
)
```

```{.output}
Found 1957 duplicated rows. Please consult the report for more details.
```

```{.error}
Error in parse_vector(x, col_number(), na = na, locale = locale, trim_ws = trim_ws): is.character(x) is not TRUE
```

Note that our simulated dataset does contain duplicated subject IDs.
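
The two kinds of checks described above, uniqueness and range, can be sketched in base R. The `ids` vector is made up (the last two values are deliberately invalid), and this is a simplified illustration rather than what `check_subject_ids()` does internally.

```r
# Made-up IDs: one duplicate, one above the allowed range, one non-numeric.
ids <- c("14905", "13043", "14905", "20001", "abc")
id_range <- c(0, 15000)

# Flag every ID that appears more than once.
is_duplicated <- duplicated(ids) | duplicated(ids, fromLast = TRUE)

# Flag IDs that are not numeric or fall outside the allowed range.
numeric_ids <- suppressWarnings(as.numeric(ids))
out_of_range <- is.na(numeric_ids) |
  numeric_ids < id_range[1] |
  numeric_ids > id_range[2]

data.frame(ids, is_duplicated, out_of_range)
```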

### Standardizing dates
@@ -223,8 +259,7 @@ Here's an example code chunk demonstrating the usage of `check_date_sequence()`
```r
sim_ebola_data <- cleanepi::check_date_sequence(
data = sim_ebola_data,
target_columns = c("date_onset", "date_sample"),
remove = TRUE
target_columns = c("date_onset", "date_sample")
)
```
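
As a rough base R picture of what a date-sequence check looks for, the sketch below flags rows where the onset date falls after the sample date. The dates are made up, and the logic is an assumption about the general idea, not the `{cleanepi}` implementation.

```r
# Made-up dates: the third row has an onset date after the sample date.
toy <- data.frame(
  date_onset = as.Date(c("2015-03-15", "2014-10-19", "2016-01-23")),
  date_sample = as.Date(c("2015-04-06", "2014-12-31", "2015-12-01"))
)

# A row is out of sequence when both dates are present and onset > sample.
bad_sequence <- !is.na(toy$date_onset) &
  !is.na(toy$date_sample) &
  toy$date_onset > toy$date_sample

toy[bad_sequence, ]
```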

@@ -286,18 +321,25 @@ This approach simplifies the data cleaning process, ensuring that categorical da

In epidemiological data analysis it is also useful to track and analyze time-dependent events, such as the progression of a disease outbreak or the duration between sample collection and analysis.
The `{cleanepi}` package offers a convenient function for calculating the time elapsed between two dated events at different time scales. For example, the code snippet below uses the `span()` function to compute the time elapsed from the date of sampling for each case
until the date this document was generated (2024-05-21).
until the date this document was generated (2024-06-17).


```r
sim_ebola_data <- cleanepi::span(
sim_ebola_data <- cleanepi::timespan(
sim_ebola_data,
target_column = "date_sample",
end_date = Sys.Date(),
span_unit = "years",
span_column_name = "time_since_sampling_date",
span_remainder_unit = "months"
)
```

```{.error}
Error: 'timespan' is not an exported object from 'namespace:cleanepi'
```

```r
utils::head(sim_ebola_data)
```

@@ -309,13 +351,6 @@ utils::head(sim_ebola_data)
4 4 14675 90 <NA> <NA> 2014-10-19 2014-12-31
5 5 12648 74 female <NA> 2014-06-08 2016-10-10
6 6 14274 76 female <NA> <NA> 2016-01-23
time_since_sampling_date remainder_months
1 9 1.52
2 10 4.62
3 9 2.66
4 9 4.72
5 7 7.44
6 8 3.97
```

After executing the `span()` function, two new columns named `time_since_sampling_date` and `remainder_months` are added to the **sim_ebola_data** dataset, containing the calculated time elapsed since the date of sampling for each case, measured in years, and the remaining time measured in months.
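
The split into whole years plus remaining months can be approximated in base R as sketched below. The conversion factors (365.25 days per year, one twelfth of that per month) are assumptions made for this illustration, so the remainders need not match the package's output exactly.

```r
# Two sample dates and a fixed end date, used only for illustration.
date_sample <- as.Date(c("2015-04-06", "2014-01-03"))
end_date <- as.Date("2024-06-17")

# Elapsed time in days.
elapsed_days <- as.numeric(difftime(end_date, date_sample, units = "days"))

# Assumed conversion factors for this sketch.
days_per_year <- 365.25
days_per_month <- days_per_year / 12

# Whole years, then the leftover days expressed in months.
years <- elapsed_days %/% days_per_year
remainder_months <- round((elapsed_days %% days_per_year) / days_per_month, 2)

data.frame(date_sample, years, remainder_months)
```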
@@ -330,13 +365,11 @@ The `clean_data()` function applies a series of predefined data cleaning operati
Furthermore, you can combine multiple data cleaning tasks with the pipe operator (`|>`), as shown in the code snippet below.

```r
# remove the line below once Karim has updated cleanepi
raw_ebola_data$`case id` <- as.character(raw_ebola_data$`case id`)
# PERFORM THE OPERATIONS USING THE pipe SYNTAX
cleaned_data <- raw_ebola_data |>
cleanepi::standardize_column_names(keep = "V1", rename = NULL) |>
cleanepi::replace_missing_values(target_columns = NULL) |>
cleanepi::remove_constant(cutoff = 1.0) |>
cleanepi::replace_missing_values(na_strings = "") |>
cleanepi::remove_constants(cutoff = 1.0) |>
cleanepi::remove_duplicates(target_columns = NULL) |>
cleanepi::standardize_dates(
target_columns = c("date_onset", "date_sample"),
@@ -352,8 +385,8 @@ cleaned_data <- raw_ebola_data |>
cleanepi::clean_using_dictionary(dictionary = test_dict)
```

```{.output}
Found 1957 duplicated rows. Please consult the report for more details.
```{.error}
Error: 'remove_constants' is not an exported object from 'namespace:cleanepi'
```
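
For readers new to the native pipe, the minimal base R sketch below shows that `x |> f()` is just another way of writing `f(x)`. It assumes R 4.1 or later, where `|>` is available; the `values` vector is made up.

```r
# Requires R >= 4.1 for the native pipe operator.
values <- c(4, 9, 16)

# Nested call: read from the inside out.
nested <- sum(sqrt(values))

# Piped call: read from left to right, one step at a time.
piped <- values |>
  sqrt() |>
  sum()

identical(nested, piped)
```

Each step receives the result of the previous one as its first argument, which is why the cleaning functions in the pipeline above omit the `data` argument.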

## Printing the clean report
@@ -381,29 +414,33 @@ it's essential to establish an additional foundational layer to ensure the integ

```r
library("linelist")
data <- linelist::make_linelist(cleaned_data,
data <- linelist::make_linelist(
x = cleaned_data,
id = "case_id",
age = "age",
date_onset = "date_onset",
date_reporting = "date_sample",
gender = "gender"
)
```

```{.error}
Error in eval(expr, envir, enclos): object 'cleaned_data' not found
```

```r
utils::head(data, 7)
```

```{.output}
// linelist object
V1 case_id age gender status date_onset date_sample
1 1 14905 90 male confirmed 2015-03-15 2015-04-06
2 2 13043 25 female <NA> <NA> 2014-01-03
3 3 14364 54 female <NA> 2014-02-09 2015-03-03
4 4 14675 90 <NA> <NA> 2014-10-19 2014-12-31
5 5 12648 74 female <NA> 2014-06-08 2016-10-10
6 6 14274 76 female <NA> <NA> 2016-01-23
7 7 14132 16 male confirmed <NA> 2015-10-05
// tags: id:case_id, date_onset:date_onset, date_reporting:date_sample, gender:gender, age:age
1 function (..., list = character(), package = NULL, lib.loc = NULL,
2 verbose = getOption("verbose"), envir = .GlobalEnv, overwrite = TRUE)
3 {
4 fileExt <- function(x) {
5 db <- grepl("\\\\.[^.]+\\\\.(gz|bz2|xz)$", x)
6 ans <- sub(".*\\\\.", "", x)
7 ans[db] <- sub(".*\\\\.([^.]+\\\\.)(gz|bz2|xz)$", "\\\\1\\\\2",
```

::::::::::::::::::::::::::::::::::::: keypoints
2 changes: 1 addition & 1 deletion config.yaml
@@ -11,7 +11,7 @@
carpentry: 'incubator'

# Overall title for pages.
title: 'Using delays to quantify transmission'
title: 'Read and clean case data, and make linelist for outbreak analytics with R'

# Date the lesson was created (YYYY-MM-DD, this is empty by default)
created:
28 changes: 15 additions & 13 deletions delays-functions.md
@@ -92,9 +92,11 @@ withr::local_options(list(mc.cores = 4))

### The double-colon

The double-colon `::` in R is used to access functions or objects from a specific package without loading the entire package into the current environment. This allows for a more targeted approach to using package components and helps avoid namespace conflicts.
The double-colon `::` in R lets you call a specific function from a package without attaching the entire package to the current session with `library()`.

`::` lets you call a specific function from a package by explicitly mentioning the package name. For example, `dplyr::filter(data, condition)` uses `filter()` from the `{dplyr}` package without loading the entire package.
For example, `dplyr::filter(data, condition)` uses `filter()` from the `{dplyr}` package.

This helps us remember which package a function comes from and avoids namespace conflicts.

:::::::::::::::::::

@@ -160,8 +162,8 @@ generate(covid_serialint, times = 10)
```

```{.output}
[1] 5.016854 2.638262 2.623872 5.541008 4.366855 1.278456 5.674894 1.252402
[9] 5.943339 5.343587
[1] 16.482674 3.429268 6.163090 3.236345 5.161239 3.902167 11.302638
[8] 3.316425 5.806946 6.079917
```

::::::::: instructor
@@ -177,7 +179,7 @@ Access to the reference documentation (Help files) for these functions is access

::::::::::::::::::::::::::::::::: challenge

### Window for contact tracing and the Serial interval
### Window for contact tracing and the serial interval

The **serial interval** is important in the optimisation of contact tracing since it provides a time window for the containment of a disease spread ([Fine, 2003](https://academic.oup.com/aje/article/158/11/1039/162725)). Depending on the serial interval, we can evaluate the need to expand the number of days pre-onset to consider in the contact tracing to include more backwards contacts ([Davis et al., 2020](https://assets.publishing.service.gov.uk/media/61e9ab3f8fa8f50597fb3078/S0523_Oxford_-_Backwards_contact_tracing.pdf)).

@@ -305,7 +307,7 @@ covid_serialint_discrete_max <-

::::::::::::::::::::::::::::::::: challenge

### Length of quarantine and Incubation period
### Length of quarantine and incubation period

The **incubation period** distribution is a useful delay to assess the length of active monitoring or quarantine ([Lauer et al., 2020](https://www.acpjournals.org/doi/10.7326/M20-0504)). Similarly, delays from symptom onset to recovery (or death) will determine the required duration of health care and case isolation ([Cori et al., 2017](https://royalsocietypublishing.org/doi/10.1098/rstb.2016.0371)).

@@ -470,10 +472,10 @@ epinow_estimates_cg <- epinow(
```

```{.output}
WARN [2024-06-03 11:32:27] epinow: There were 12 divergent transitions after warmup. See
WARN [2024-06-17 20:42:40] epinow: There were 2 divergent transitions after warmup. See
https://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
to find out why this is a problem and how to eliminate them. -
WARN [2024-06-03 11:32:27] epinow: Examine the pairs() plot to diagnose sampling problems
WARN [2024-06-17 20:42:40] epinow: Examine the pairs() plot to diagnose sampling problems
-
```

@@ -505,7 +507,7 @@ The **delay distribution** could be inferred jointly with the underlying times o

::::::::::::::::::::::::::::::::: challenge

### Use an Incubation period for COVID-19 to estimate Rt
### Use an incubation period for COVID-19 to estimate Rt

Estimate the time-varying reproduction number for the first 60 days of the `example_confirmed` data set from `{EpiNow2}`. Access an incubation period for COVID-19 from `{epiparameter}` and use it as a reporting delay.

@@ -599,10 +601,10 @@ epinow_estimates_cgi <- epinow(
```

```{.output}
WARN [2024-06-03 11:34:24] epinow: There were 6 divergent transitions after warmup. See
WARN [2024-06-17 20:44:38] epinow: There were 3 divergent transitions after warmup. See
https://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
to find out why this is a problem and how to eliminate them. -
WARN [2024-06-03 11:34:24] epinow: Examine the pairs() plot to diagnose sampling problems
WARN [2024-06-17 20:44:38] epinow: Examine the pairs() plot to diagnose sampling problems
-
```

@@ -746,10 +748,10 @@ epinow_estimates_egi <- epinow(
```

```{.output}
WARN [2024-06-03 11:38:02] epinow: There were 2 divergent transitions after warmup. See
WARN [2024-06-17 20:48:04] epinow: There were 9 divergent transitions after warmup. See
https://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
to find out why this is a problem and how to eliminate them. -
WARN [2024-06-03 11:38:02] epinow: Examine the pairs() plot to diagnose sampling problems
WARN [2024-06-17 20:48:04] epinow: Examine the pairs() plot to diagnose sampling problems
-
```
