Fix read describe cases #39

Merged
merged 67 commits on Apr 29, 2024
Changes from 56 commits
Commits
17b8d2e
Fix render issues in read-cases file
Degoot-AM Mar 27, 2024
9e8a0d7
Adding early simple analysis episode
Degoot-AM Mar 27, 2024
c7118ef
Reading citation file via file.path command to satisfy lintr requirem…
Degoot-AM Mar 27, 2024
9da41f7
Adding peak time section
Degoot-AM Mar 27, 2024
96a3a12
Adding moving averge section
Degoot-AM Mar 27, 2024
2822131
Adding data cleaning episode
Degoot-AM Mar 28, 2024
91a3052
adding simulated Ebola cases
Degoot-AM Mar 28, 2024
3bf2fc9
adding skim section
Degoot-AM Mar 28, 2024
f5fc027
the cleanepi problem
Degoot-AM Mar 28, 2024
9ec8cf7
importing packages quietly
Degoot-AM Apr 1, 2024
e016c42
Merge branch 'main' into fix-read-describe-cases
Degoot-AM Apr 1, 2024
2cc9158
add colunm name stuff
Degoot-AM Apr 3, 2024
e3c92cb
replacing missing values
Degoot-AM Apr 4, 2024
11562f2
add converting to numeric values
Degoot-AM Apr 4, 2024
38cb2be
add sectioning
Degoot-AM Apr 4, 2024
3d23a96
update episodes/clean-data.Rmd
Karim-Mane Apr 5, 2024
87f9aca
update episodes/describe-cases.Rmd
Karim-Mane Apr 5, 2024
c900566
update episodes/read-cases.Rmd
Karim-Mane Apr 5, 2024
48c4bcb
update episodes/simple-analysis.Rmd
Karim-Mane Apr 5, 2024
275ac5a
fix linters
Karim-Mane Apr 5, 2024
cb586d2
linting all files
Degoot-AM Apr 5, 2024
c6b77b9
add dictionary_based substitution section
Degoot-AM Apr 6, 2024
7ad7914
Add time span section
Degoot-AM Apr 6, 2024
bc46bc8
add check sequence of dates section
Degoot-AM Apr 6, 2024
4e14722
add multiple operations at once section.
Degoot-AM Apr 7, 2024
afb867e
review episodes/clean-data.Rmd
Karim-Mane Apr 8, 2024
8f67960
fix linters
Karim-Mane Apr 8, 2024
16e513a
add reporting sections
Degoot-AM Apr 8, 2024
8326df6
add linelist section.
Degoot-AM Apr 8, 2024
d392b6f
add print report
Degoot-AM Apr 9, 2024
1158b34
add EDA
Degoot-AM Apr 9, 2024
c1985c7
add simulation data
Degoot-AM Apr 9, 2024
334db52
add synthetic data
Degoot-AM Apr 9, 2024
230fc5f
add tracetheme
Degoot-AM Apr 9, 2024
5836db8
add key points section
Degoot-AM Apr 9, 2024
de6cbe9
add questions section
Degoot-AM Apr 9, 2024
7a50094
add epi-curves
Degoot-AM Apr 9, 2024
d84cc5a
add complete data
Degoot-AM Apr 9, 2024
5eaa916
change title
Degoot-AM Apr 16, 2024
3b0d452
Update episodes/read-cases.Rmd
Degoot-AM Apr 16, 2024
fb4ceb4
Update episodes/read-cases.Rmd
Degoot-AM Apr 16, 2024
4f6b53f
Update episodes/read-cases.Rmd
Degoot-AM Apr 16, 2024
ab543ae
Update episodes/read-cases.Rmd
Degoot-AM Apr 16, 2024
88a98f0
Update episodes/read-cases.Rmd
Degoot-AM Apr 16, 2024
17e1680
Update episodes/read-cases.Rmd
Degoot-AM Apr 16, 2024
6b42190
Update episodes/read-cases.Rmd
Degoot-AM Apr 16, 2024
0490700
replace library
Degoot-AM Apr 16, 2024
c3ac84b
Update episodes/read-cases.Rmd
Degoot-AM Apr 16, 2024
cf31e1b
fix here issues
Degoot-AM Apr 16, 2024
4a30890
fix issues
Degoot-AM Apr 16, 2024
f9d6e58
fix lintr issues
Degoot-AM Apr 16, 2024
91b5c36
replace laodnamespace with library
Degoot-AM Apr 17, 2024
5fe1350
replace library
Degoot-AM Apr 17, 2024
b792a69
add keypoints section
Degoot-AM Apr 17, 2024
3d44420
change early to simple
Degoot-AM Apr 17, 2024
5fc05de
adjust times for teaching and exercises
Degoot-AM Apr 18, 2024
486775b
Update episodes/describe-cases.Rmd
Degoot-AM Apr 24, 2024
9d5661b
Update episodes/read-cases.Rmd
Degoot-AM Apr 24, 2024
62b7b45
Update episodes/read-cases.Rmd
Degoot-AM Apr 24, 2024
aa86824
Update episodes/clean-data.Rmd
Degoot-AM Apr 24, 2024
42a74eb
Update episodes/clean-data.Rmd
Degoot-AM Apr 24, 2024
d9e1738
separate dev denvironment from learning environment
Degoot-AM Apr 25, 2024
05228be
added zip file
Degoot-AM Apr 25, 2024
65f5d3d
fix zip data issues
Degoot-AM Apr 26, 2024
0faddc4
add iinline spaces and explanations
Degoot-AM Apr 26, 2024
e5c9a43
fix linter issue
Degoot-AM Apr 26, 2024
1f956a0
fix error messages in preliminary rendering
avallecam Apr 29, 2024
4 changes: 4 additions & 0 deletions .lintr
@@ -12,5 +12,9 @@ linters: all_linters(
default_undesirable_functions,
library = NULL # this is fine in Rmd files
)
),
unused_import_linter(
allow_ns_usage = TRUE,
except_packages = c("bit64", "data.table", "tidyverse")
)
)
8 changes: 4 additions & 4 deletions config.yaml
@@ -59,10 +59,10 @@ contact: '[email protected]'

# Order of episodes in your lesson
episodes:
#- read-cases.Rmd
#- clean-data.Rmd
#- describe-cases.Rmd
#- simple-analysis.Rmd
- read-cases.Rmd
- clean-data.Rmd
- describe-cases.Rmd
- simple-analysis.Rmd
- delays-reuse.Rmd
- delays-functions.Rmd

272 changes: 257 additions & 15 deletions episodes/clean-data.Rmd
@@ -1,7 +1,7 @@
---
title: 'Clean outbreaks data'
teaching: 10
exercises: 2
title: 'Clean and validate'
teaching: 20
exercises: 10
---

:::::::::::::::::::::::::::::::::::::: questions
@@ -13,30 +13,272 @@

::::::::::::::::::::::::::::::::::::: objectives

- Explain how clean, curate, and standardize case data using `{cleanepi}` package
- Demonstrate how to covert case data to a `linelist` object
- Explain how to clean, curate, and standardize case data using the `{cleanepi}` package
- Demonstrate how to convert case data to `linelist` data

::::::::::::::::::::::::::::::::::::::::::::::::

## Introduction
In the process of analyzing outbreak data, it's essential to ensure that the dataset is clean, curated, and standardized to facilitate accurate and reproducible analysis. To achieve this, we will utilize the [cleanepi](https://epiverse-trace.github.io/cleanepi/) package. For demonstration purposes, we'll work with a simulated dataset of Ebola cases.
In the process of analyzing outbreak data, it's essential to ensure that the dataset is clean, curated, standardized, and validated to facilitate accurate and reproducible analysis. This episode focuses on cleaning epidemic and outbreak data using the [cleanepi](https://epiverse-trace.github.io/cleanepi/) package, and validating it using the [linelist](https://epiverse-trace.github.io/linelist/) package. For demonstration purposes, we'll work with a simulated dataset of Ebola cases.


The first step is to import the dataset following the guidelines outlined in the [Read case data](../episodes/read-cases.Rmd) episode. This involves loading the dataset into our environment and viewing its structure and content.

```{r, message=FALSE}
library("rio")
library("here")
raw_ebola_data <- rio::import(
here::here("built", "data", "simulated_ebola_2.csv")
)
utils::head(raw_ebola_data, 5)
```

## A quick inspection

Quick exploration and inspection of the dataset are crucial before diving into any analysis tasks. The `{cleanepi}` package simplifies this process with the `scan_data()` function. Let's take a look at how you can use it:

```{r}
requireNamespace("rio")
sim_ebola_data <- rio::import(file.path("data", "simulated_ebola.csv",
fsep = "/"))
utils::head(sim_ebola_data, 5)
library("cleanepi")
cleanepi::scan_data(raw_ebola_data)
```


The results provide an overview of the content of every column, including the column names and the percentage of each data type per column.
You can see that the column names in the dataset are descriptive but lack consistency, as some are composed of multiple words separated by white spaces. Additionally, some columns contain more than one data type, and there are missing values in others.

## Common operations

This section demonstrates how to perform some common data cleaning operations using the `{cleanepi}` package.

### Standardizing column names

For this example dataset, standardizing column names typically involves removing spaces and connecting different words with “_”. This practice helps maintain consistency and readability in the dataset.
However, the function used for standardizing column names offers more options. Type `?cleanepi::standardize_column_names` for more details.

```{r}
requireNamespace("cleanepi")
cleanepi::scan_data(sim_ebola_data)
sim_ebola_data <- cleanepi::standardize_column_names(raw_ebola_data)
names(sim_ebola_data)
```

::::::::::::::::::::::::::::::::::::: challenge

- What differences can you observe in the column names?

::::::::::::::::::::::::::::::::::::::::::::::::

If you want to maintain certain column names without subjecting them to the standardization process, you can utilize the `keep` parameter of the `standardize_column_names()` function. This parameter accepts a vector of column names that are intended to be kept unchanged.

**Exercise:** Standardize the column names of the input dataset, but keep the “V1” column as is.
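A minimal sketch of one possible solution, reusing the `keep` argument described above (not evaluated here, so it does not alter the objects used later in the episode):

```{r, eval=FALSE}
# Standardize all column names except "V1", which is kept unchanged
cleanepi::standardize_column_names(raw_ebola_data, keep = "V1")
```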

### Removing irregularities

Raw data may contain irregularities such as duplicated and empty rows and columns, as well as constant columns. The `remove_constant()` and `remove_duplicates()` functions from `{cleanepi}` remove such irregularities, as demonstrated in the code chunk below.

```{r}
sim_ebola_data <- cleanepi::remove_constant(sim_ebola_data)
sim_ebola_data <- cleanepi::remove_duplicates(sim_ebola_data)
```

Note that our simulated Ebola dataset does not contain duplicated or constant rows or columns.
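Since the simulated data gives these functions nothing to remove, here is a small sketch on made-up toy data (hypothetical, not part of the Ebola dataset) that makes their effect visible:

```{r, eval=FALSE}
# Toy data with one duplicated row and one constant column
toy <- data.frame(
  id     = c(1, 2, 2),
  status = c("confirmed", "probable", "probable"),
  source = "simulation"
)

toy |>
  cleanepi::remove_constant() |>  # drops the constant "source" column
  cleanepi::remove_duplicates()   # drops one of the two identical rows
```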

### Replacing missing values

In addition to the above irregularities, raw data can contain missing values, which may be encoded by different strings, including empty strings. To ensure robust analysis, it is good practice to replace all missing values with `NA` across the entire dataset. Below is a code snippet demonstrating how you can achieve this in `{cleanepi}`:

```{r}
sim_ebola_data <- cleanepi::replace_missing_values(sim_ebola_data)
```

### Validating subject IDs

Each entry in the dataset represents a subject and should be distinguishable by a specific column formatted in a particular way, such as falling within a specified range, containing certain prefixes and/or suffixes, or containing a specific number of characters. The `{cleanepi}` package offers the `check_subject_ids()` function designed precisely for this task, as shown in the code chunk below. This function validates whether the subject IDs are unique and meet the required criteria.

```{r}
# remove this code chunk once `{cleanepi}` is updated. The coercion made here
# will be accounted for within `{cleanepi}`
sim_ebola_data$case_id <- as.character(sim_ebola_data$case_id)
```

```{r}
sim_ebola_data <- cleanepi::check_subject_ids(sim_ebola_data,
  target_columns = "case_id",
  range = c(0, 15000)
)
```

Note that our simulated dataset does contain duplicated subject IDs.

### Standardizing dates

An epidemic dataset typically contains date columns for different events, such as the date of infection or the date of symptom onset. These dates can come in different formats, and it is good practice to standardize them. The `{cleanepi}` package provides functionality for converting date columns in epidemic datasets into ISO format, ensuring consistency across the different date columns. Here's how you can use it on our simulated dataset:

```{r}
sim_ebola_data <- cleanepi::standardize_dates(
  sim_ebola_data,
  target_columns = c(
    "date_onset",
    "date_sample"
  )
)

utils::head(sim_ebola_data)
```

This function converts the values in the target columns into the **Ymd** format; if `target_columns = NULL`, it automatically detects the date columns within the dataset and converts them.
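As a sketch of that automatic detection (not evaluated here; the object name is arbitrary):

```{r, eval=FALSE}
# Let {cleanepi} detect all date columns and convert them to Ymd format
auto_standardized <- cleanepi::standardize_dates(
  raw_ebola_data,
  target_columns = NULL
)
```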

### Converting to numeric values

In the raw dataset, some columns can contain a mixture of character and numeric values, and you may want to convert the character values explicitly into numeric. For example, in our simulated dataset, some entries in the age column are written in words.
The `convert_to_numeric()` function in `{cleanepi}` performs such conversions, as illustrated in the code chunk below.

```{r}
sim_ebola_data <- cleanepi::convert_to_numeric(sim_ebola_data,
target_columns = "age"
)
utils::head(sim_ebola_data)
```

## Epidemiology related operations

In addition to common data cleansing tasks, such as those discussed in the above section, the `{cleanepi}` package offers
additional functionalities tailored specifically for processing and analyzing outbreak and epidemic data. This section
covers some of these specialized tasks.

### Checking sequence of dated-events

Ensuring the correct order and sequence of dated events is crucial in epidemiological data analysis, especially
when analyzing infectious diseases where the timing of events like symptom onset and sample collection is essential.
The `{cleanepi}` package provides a helpful function called `check_date_sequence()` precisely for this purpose.

Here's an example code chunk demonstrating the usage of the `check_date_sequence()` function on our simulated Ebola dataset:

```{r, warning=FALSE}
sim_ebola_data <- cleanepi::check_date_sequence(
  data = sim_ebola_data,
  target_columns = c("date_onset", "date_sample"),
  remove = TRUE
)
```

This functionality is crucial for ensuring data integrity and accuracy in epidemiological analyses, as it helps identify
any inconsistencies or errors in the chronological order of events, allowing you to address them appropriately.

### Dictionary-based substitution

In the realm of data pre-processing, it's common to encounter scenarios where certain columns in a dataset, such as the “gender” column in our simulated Ebola dataset,
are expected to have specific values or factors. However, it's also common for unexpected or erroneous values to appear in these columns, which need to be replaced with appropriate values. The `{cleanepi}` package offers support for dictionary-based substitution, a method that allows you to replace values in specific columns based on mappings defined in a dictionary.
This approach ensures consistency and accuracy in data cleaning.

Moreover, `{cleanepi}` provides a built-in dictionary specifically tailored for epidemiological data. The example dictionary below includes mappings for the “gender” column.

```{r}
test_dict <- base::readRDS(
system.file("extdata", "test_dict.RDS", package = "cleanepi")
)
base::print(test_dict)
```

Now, we can use this dictionary to standardize values of the “gender” column according to predefined categories. Below is an example code chunk demonstrating how to utilize this functionality:

```{r}
sim_ebola_data <- cleanepi::clean_using_dictionary(
  sim_ebola_data,
  dictionary = test_dict
)
utils::head(sim_ebola_data)
```

This approach simplifies the data cleaning process, ensuring that categorical data in epidemiological datasets is accurately categorized and ready for further analysis.

> Note that when a column in the dataset contains values that are not in the dictionary, `clean_using_dictionary()` will raise an error. Users can use the `cleanepi::add_to_dictionary()` function to include the missing values in the dictionary. See the corresponding section in the package [vignette](https://epiverse-trace.github.io/cleanepi/articles/cleanepi.html) for more details.

### Calculating time span between different date events

In epidemiological data analysis, it is also useful to track and analyze time-dependent events, such as the progression of a disease outbreak or the duration between sample collection and analysis.
The `{cleanepi}` package offers a convenient function for calculating the time elapsed between two dated events at different time scales. For example, the code snippet below uses the `span()` function to compute the time elapsed from each case's sampling date until the date this document was generated (`r Sys.Date()`).

```{r}
sim_ebola_data <- cleanepi::span(
  sim_ebola_data,
  target_column = "date_sample",
  end_date = Sys.Date(),
  span_unit = "years",
  span_column_name = "time_since_sampling_date",
  span_remainder_unit = "months"
)
utils::head(sim_ebola_data)
```

After executing the `span()` function, two new columns named `time_since_sampling_date` and `remainder_months` are added to the **sim_ebola_data** dataset, containing the calculated time elapsed since the date of sampling for each case, measured in years, and the remaining time measured in months.
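As a quick check, you can summarise the two new columns with base R (a minimal sketch, using the column names created above):

```{r, eval=FALSE}
# Distribution of elapsed years and remaining months since the sampling date
summary(sim_ebola_data$time_since_sampling_date)
summary(sim_ebola_data$remainder_months)
```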

## Multiple operations at once

Performing data cleaning operations individually can be time-consuming and error-prone. The `{cleanepi}` package simplifies this process by offering a convenient wrapper function called `clean_data()`, which applies a series of predefined data cleaning operations to the input dataset and allows you to perform multiple operations at once.

Alternatively, you can combine multiple data cleaning tasks via the pipe operator `|>`, as shown in the code snippet below.

```{r}
# remove the line below once `{cleanepi}` is updated; the coercion made here
# will then be handled within `{cleanepi}`
raw_ebola_data$`case id` <- as.character(raw_ebola_data$`case id`)

# perform the operations using the pipe syntax
cleaned_data <- raw_ebola_data |>
  cleanepi::standardize_column_names(keep = "V1", rename = NULL) |>
  cleanepi::replace_missing_values(target_columns = NULL) |>
  cleanepi::remove_constant(cutoff = 1.0) |>
  cleanepi::remove_duplicates(target_columns = NULL) |>
  cleanepi::standardize_dates(
    target_columns = c("date_onset", "date_sample"),
    error_tolerance = 0.4,
    format = NULL,
    timeframe = NULL
  ) |>
  cleanepi::check_subject_ids(
    target_columns = "case_id",
    range = c(1, 15000)
  ) |>
  cleanepi::convert_to_numeric(target_columns = "age") |>
  cleanepi::clean_using_dictionary(dictionary = test_dict)
```

## Printing the clean report

The `{cleanepi}` package generates a comprehensive report detailing the findings and actions of all data cleansing
operations conducted during the analysis. This report is presented as a webpage with multiple sections. Each section
corresponds to a specific data cleansing operation, and clicking on each section allows you to access the results of
that particular operation. This interactive approach enables users to efficiently review and analyze the outcomes of
individual cleansing steps within the broader data cleansing process.

You can view the report using the `cleanepi::print_report()` function.
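A minimal sketch of that call (not evaluated here, since it opens the report in a browser; check `?cleanepi::print_report` for the arguments available in your installed version):

```{r, eval=FALSE}
# Build and open the interactive cleaning report for the cleaned dataset
cleanepi::print_report(cleaned_data)
```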

![Example of data cleaning report generated by `{cleanepi}`](fig/report_demo.png)

Check warning on line 254 in episodes/clean-data.Rmd (GitHub Actions / Build markdown source files if valid): [image missing alt-text]: fig/report_demo.png

## Validating and tagging case data
In outbreak analysis, once you have completed the initial steps of reading and cleaning the case data,
it's essential to establish an additional foundational layer to ensure the integrity and reliability of subsequent
analyses. Specifically, this involves verifying the presence and correct data type of certain input columns within
your dataset, a process commonly referred to as "tagging." Additionally, it's crucial to implement measures to
validate that these tagged columns are not inadvertently deleted during further data processing steps.

This is achieved by converting the cleaned case data into a `linelist` object using the `{linelist}` package, as shown in the code chunk below.

```{r}
library("linelist")
data <- linelist::make_linelist(cleaned_data,
  id = "case_id",
  age = "age",
  date_onset = "date_onset",
  date_reporting = "date_sample",
  gender = "gender"
)
utils::head(data, 7)
```
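To see what tagging buys you, the sketch below (not evaluated; it assumes the `{linelist}` helpers `tags()` and `validate_linelist()` are available in your installed version) lists the tagged columns and validates them:

```{r, eval=FALSE}
# Show which columns have been tagged in the linelist object
linelist::tags(data)

# Check that the tagged columns are present and have the expected types
linelist::validate_linelist(data)
```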

::::::::::::::::::::::::::::::::::::: keypoints

- Use the `{cleanepi}` package to clean and standardize epidemic and outbreak data
- Use the `{linelist}` package to tag, validate, and prepare case data for downstream analysis.

::::::::::::::::::::::::::::::::::::::::::::::::
