Fix read describe cases #39

Merged
merged 67 commits into from
Apr 29, 2024
Commits
17b8d2e
Fix render issues in read-cases file
Degoot-AM Mar 27, 2024
9e8a0d7
Adding early simple analysis episode
Degoot-AM Mar 27, 2024
c7118ef
Reading citation file via file.path command to satisfy lintr requirem…
Degoot-AM Mar 27, 2024
9da41f7
Adding peak time section
Degoot-AM Mar 27, 2024
96a3a12
Adding moving averge section
Degoot-AM Mar 27, 2024
2822131
Adding data cleaning episode
Degoot-AM Mar 28, 2024
91a3052
adding simulated Ebola cases
Degoot-AM Mar 28, 2024
3bf2fc9
adding skim section
Degoot-AM Mar 28, 2024
f5fc027
the cleanepi problem
Degoot-AM Mar 28, 2024
9ec8cf7
importing packages quietly
Degoot-AM Apr 1, 2024
e016c42
Merge branch 'main' into fix-read-describe-cases
Degoot-AM Apr 1, 2024
2cc9158
add colunm name stuff
Degoot-AM Apr 3, 2024
e3c92cb
replacing missing values
Degoot-AM Apr 4, 2024
11562f2
add converting to numeric values
Degoot-AM Apr 4, 2024
38cb2be
add sectioning
Degoot-AM Apr 4, 2024
3d23a96
update episodes/clean-data.Rmd
Karim-Mane Apr 5, 2024
87f9aca
update episodes/describe-cases.Rmd
Karim-Mane Apr 5, 2024
c900566
update episodes/read-cases.Rmd
Karim-Mane Apr 5, 2024
48c4bcb
update episodes/simple-analysis.Rmd
Karim-Mane Apr 5, 2024
275ac5a
fix linters
Karim-Mane Apr 5, 2024
cb586d2
linting all files
Degoot-AM Apr 5, 2024
c6b77b9
add dictionary_based substitution section
Degoot-AM Apr 6, 2024
7ad7914
Add time span section
Degoot-AM Apr 6, 2024
bc46bc8
add check sequence of dates section
Degoot-AM Apr 6, 2024
4e14722
add multiple operations at once section.
Degoot-AM Apr 7, 2024
afb867e
review episodes/clean-data.Rmd
Karim-Mane Apr 8, 2024
8f67960
fix linters
Karim-Mane Apr 8, 2024
16e513a
add reporting sections
Degoot-AM Apr 8, 2024
8326df6
add linelist section.
Degoot-AM Apr 8, 2024
d392b6f
add print report
Degoot-AM Apr 9, 2024
1158b34
add EDA
Degoot-AM Apr 9, 2024
c1985c7
add simulation data
Degoot-AM Apr 9, 2024
334db52
add synthetic data
Degoot-AM Apr 9, 2024
230fc5f
add tracetheme
Degoot-AM Apr 9, 2024
5836db8
add key points section
Degoot-AM Apr 9, 2024
de6cbe9
add questions section
Degoot-AM Apr 9, 2024
7a50094
add epi-curves
Degoot-AM Apr 9, 2024
d84cc5a
add complete data
Degoot-AM Apr 9, 2024
5eaa916
change title
Degoot-AM Apr 16, 2024
3b0d452
Update episodes/read-cases.Rmd
Degoot-AM Apr 16, 2024
fb4ceb4
Update episodes/read-cases.Rmd
Degoot-AM Apr 16, 2024
4f6b53f
Update episodes/read-cases.Rmd
Degoot-AM Apr 16, 2024
ab543ae
Update episodes/read-cases.Rmd
Degoot-AM Apr 16, 2024
88a98f0
Update episodes/read-cases.Rmd
Degoot-AM Apr 16, 2024
17e1680
Update episodes/read-cases.Rmd
Degoot-AM Apr 16, 2024
6b42190
Update episodes/read-cases.Rmd
Degoot-AM Apr 16, 2024
0490700
replace library
Degoot-AM Apr 16, 2024
c3ac84b
Update episodes/read-cases.Rmd
Degoot-AM Apr 16, 2024
cf31e1b
fix here issues
Degoot-AM Apr 16, 2024
4a30890
fix issues
Degoot-AM Apr 16, 2024
f9d6e58
fix lintr issues
Degoot-AM Apr 16, 2024
91b5c36
replace laodnamespace with library
Degoot-AM Apr 17, 2024
5fe1350
replace library
Degoot-AM Apr 17, 2024
b792a69
add keypoints section
Degoot-AM Apr 17, 2024
3d44420
change early to simple
Degoot-AM Apr 17, 2024
5fc05de
adjust times for teaching and exercises
Degoot-AM Apr 18, 2024
486775b
Update episodes/describe-cases.Rmd
Degoot-AM Apr 24, 2024
9d5661b
Update episodes/read-cases.Rmd
Degoot-AM Apr 24, 2024
62b7b45
Update episodes/read-cases.Rmd
Degoot-AM Apr 24, 2024
aa86824
Update episodes/clean-data.Rmd
Degoot-AM Apr 24, 2024
42a74eb
Update episodes/clean-data.Rmd
Degoot-AM Apr 24, 2024
d9e1738
separate dev denvironment from learning environment
Degoot-AM Apr 25, 2024
05228be
added zip file
Degoot-AM Apr 25, 2024
65f5d3d
fix zip data issues
Degoot-AM Apr 26, 2024
0faddc4
add iinline spaces and explanations
Degoot-AM Apr 26, 2024
e5c9a43
fix linter issue
Degoot-AM Apr 26, 2024
1f956a0
fix error messages in preliminary rendering
avallecam Apr 29, 2024
3,537 changes: 3,537 additions & 0 deletions cleanepi_report__2024-04-04Thut_15-26-23.html

Large diffs are not rendered by default.

8 changes: 4 additions & 4 deletions config.yaml
@@ -59,10 +59,10 @@ contact: '[email protected]'

# Order of episodes in your lesson
episodes:
#- read-cases.Rmd
#- clean-data.Rmd
#- describe-cases.Rmd
#- simple-analysis.Rmd
- read-cases.Rmd
- clean-data.Rmd
- describe-cases.Rmd
- simple-analysis.Rmd
- delays-reuse.Rmd
- delays-functions.Rmd

103 changes: 95 additions & 8 deletions episodes/clean-data.Rmd
@@ -14,29 +14,116 @@ exercises: 2
::::::::::::::::::::::::::::::::::::: objectives

- Explain how to clean, curate, and standardize case data using the `{cleanepi}` package
- Demonstrate how to convert case data to a `linelist` object
- Demonstrate how to convert case data to `linelist` data

::::::::::::::::::::::::::::::::::::::::::::::::

## Introduction
In the process of analyzing outbreak data, it's essential to ensure that the dataset is clean, curated, and standardized to facilitate accurate and reproducible analysis. To achieve this, we will utilize the [cleanepi](https://epiverse-trace.github.io/cleanepi/) package. For demonstration purposes, we'll work with a simulated dataset of Ebola cases.
In the process of analyzing outbreak data, it's essential to ensure that the dataset is clean, curated, standardized, and validated to facilitate accurate and reproducible analysis. This episode focuses on cleaning epidemic and outbreak data using the [cleanepi](https://epiverse-trace.github.io/cleanepi/) package, and validating it using the [linelist](https://epiverse-trace.github.io/linelist/) package. For demonstration purposes, we'll work with a simulated dataset of Ebola cases.


The first step is to import the dataset following the guidelines outlined in the [Read case data](../episodes/read-cases.Rmd) episode. This involves loading the dataset into our environment and viewing its structure and content.

```{r}
requireNamespace("rio")
sim_ebola_data <- rio::import(file.path("data", "simulated_ebola.csv",
requireNamespace("rio", quietly = TRUE)
raw_ebola_data <- rio::import(file.path("data", "simulated_ebola_2.csv",
fsep = "/"))
utils::head(sim_ebola_data, 5)
utils::head(raw_ebola_data, 5)
```

## Quick inspection
Quick exploration and inspection of the dataset are crucial before diving into any analysis tasks. The `{cleanepi}` package simplifies this process with the `scan_data()` function. Let's take a look at how you can use it:

```{r}
requireNamespace("cleanepi")
cleanepi::scan_data(sim_ebola_data)
requireNamespace("cleanepi", quietly = TRUE)
cleanepi::scan_data(raw_ebola_data)
```


The results provide a summary of each column, including column names, data types, and number of missing values. You can see that the column names in the dataset are descriptive but lack consistency, as they are composed of multiple words separated by white spaces. Additionally, some columns contain more than one data type, and there are missing values present.

## Common data cleaning operations

This section demonstrates how to perform some common data cleaning operations using the `{cleanepi}` package.

### Standardizing column names

Standardizing column names typically involves removing spaces and connecting different words with a special character such as an underscore (_) or a dot (.). This practice helps maintain consistency and readability in the dataset.

To that end, the `{cleanepi}` package provides the `standardize_column_names()` function for standardizing and reformatting column names.
```{r, }
sim_ebola_data <- cleanepi::standardize_column_names(raw_ebola_data)
names(sim_ebola_data)
```

::::::::::::::::::::::::::::::::::::: challenge

- What differences can you observe in the column names?

::::::::::::::::::::::::::::::::::::::::::::::::

If you want to maintain certain column names without subjecting them to the standardization process, you can utilize the `keep` parameter in the `standardize_column_names()` function. This parameter accepts a vector of column names that are intended to be kept unchanged.
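
To make the idea concrete, the following base-R sketch mimics what standardizing column names with a set of protected names could look like. It is only an illustration under stated assumptions: `standardize_names_sketch()` and the toy column names are hypothetical and are not part of `{cleanepi}`, whose actual output format may differ.

```r
# Hypothetical helper for illustration only; NOT the {cleanepi} implementation.
standardize_names_sketch <- function(col_names, keep = character(0)) {
  cleaned <- tolower(trimws(col_names))
  # Collapse every run of non-alphanumeric characters into one underscore.
  cleaned <- gsub("[^a-z0-9]+", "_", cleaned)
  # Drop underscores left over at the start or end of a name.
  cleaned <- gsub("^_+|_+$", "", cleaned)
  # Names listed in `keep` are returned exactly as supplied.
  ifelse(col_names %in% keep, col_names, cleaned)
}

# Toy column names (assumed, not taken from the simulated Ebola data).
toy_names <- c("Case ID", "Date onset", "date.sample ", "Sex")
standardized <- standardize_names_sketch(toy_names, keep = "Sex")
print(standardized)
```

Here `"Sex"` is left untouched because it is listed in `keep`, while the other names are lower-cased and joined with underscores.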

### Removing irregularities

Raw data may contain irregularities such as duplicated rows, empty rows and columns, and constant columns. The `remove_duplicates()` and `remove_constants()` functions from `{cleanepi}` remove such irregularities, as demonstrated in the code chunk below.

```{r}
sim_ebola_data <- cleanepi::remove_constants(sim_ebola_data)
sim_ebola_data <- cleanepi::remove_duplicates(sim_ebola_data)
```
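
As a rough illustration of what "constant" and "duplicated" mean here, the base-R sketch below applies both ideas to a small toy data frame. The data and object names are made up for this example, and the code is not how `{cleanepi}` implements these steps.

```r
# Toy data (assumed): `country` is constant and the third row repeats the second.
toy <- data.frame(
  case_id = c("a1", "a2", "a2", "a3"),
  age = c(23, 41, 41, 35),
  country = c("X", "X", "X", "X"),
  stringsAsFactors = FALSE
)

# A column is constant when it holds at most one distinct value.
is_constant <- vapply(toy, function(col) length(unique(col)) <= 1L, logical(1))
no_constants <- toy[, !is_constant, drop = FALSE]

# duplicated() flags every row that repeats an earlier row.
deduplicated <- no_constants[!duplicated(no_constants), , drop = FALSE]
print(deduplicated)
```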

Note that our simulated Ebola dataset does not contain duplicated or constant rows or columns.

### Replacing missing values

In addition to these irregularities, raw data can contain missing values that may be encoded by different strings, including the empty string. To ensure robust analysis, it is good practice to replace all missing values in the entire dataset with a single marker, usually `NA`. Below is a code snippet demonstrating how you can achieve this in `{cleanepi}`:

```{r}
sim_ebola_data <- cleanepi::replace_missing_values(sim_ebola_data)
```
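
The underlying idea can be sketched in base R as follows. The toy data and the vector of missing-value codes (`na_codes`) are assumptions made for this example; the set of strings that `{cleanepi}` recognises by default may be different.

```r
# Toy data (assumed): missing gender is encoded as "" and "not available".
toy <- data.frame(
  case_id = c("a1", "a2", "a3", "a4"),
  gender = c("m", "", "f", "not available"),
  stringsAsFactors = FALSE
)

# Hypothetical list of strings that should be treated as missing.
na_codes <- c("", "not available", "-99")

# Replace every matching entry with NA, column by column.
toy[] <- lapply(toy, function(col) {
  col[col %in% na_codes] <- NA
  col
})
print(toy)
```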

### Validating subject IDs

Each entry in the dataset represents a subject and should be distinguishable by a specific ID column whose values are formatted in a particular way, such as falling within a specified range or containing certain prefixes and suffixes. The `{cleanepi}` package offers the `check_subject_ids()` function for this task, as shown in the code chunk below. This function checks whether the IDs are unique and meet the required criteria.

```{r}
sim_ebola_data <- cleanepi::check_subject_ids(sim_ebola_data,
target_columns = "case_id")
```
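
The kinds of checks described above (missing, duplicated, or badly formatted IDs) can be sketched in base R as shown below. The ID values and the expected pattern (`case_` followed by three digits) are assumptions chosen for illustration and do not describe the simulated Ebola data or the internals of `{cleanepi}`.

```r
# Toy IDs (assumed): one duplicate, one with an unexpected format, one missing.
ids <- c("case_001", "case_002", "case_002", "id-003", NA)

is_missing <- is.na(ids)
# duplicated() marks the second and later occurrences of a value.
is_duplicate <- duplicated(ids) & !is_missing
# Hypothetical format rule: prefix "case_" followed by exactly three digits.
has_bad_format <- !is_missing & !grepl("^case_[0-9]{3}$", ids)

id_report <- data.frame(id = ids, is_missing, is_duplicate, has_bad_format)
print(id_report)
```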

Note that our simulated dataset does contain duplicated subject IDs.

### Standardizing dates

An epidemic dataset typically contains date columns for different events, such as the date of infection and the date of symptom onset. These dates can come in different formats, and it is good practice to unify them. The `{cleanepi}` package provides functionality for converting date columns to a single, consistent format. Here's how you can use it on our simulated dataset:

```{r}
sim_ebola_data <- cleanepi::standardize_dates(sim_ebola_data,
target_columns = c("date_onset", "date_sample"))

utils::head(sim_ebola_data)
```

This function converts the values in the given columns to the **YMD** format, or to any other specified date format. If no columns are given, it automatically detects the date columns within the dataset.
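
To illustrate what unifying mixed date formats involves, here is a minimal base-R sketch that tries a list of candidate formats in turn. The helper `parse_mixed_dates()`, the two candidate formats, and the toy values are all assumptions for this example; `{cleanepi}` uses its own, more thorough logic.

```r
# Hypothetical helper for illustration only; NOT the {cleanepi} implementation.
parse_mixed_dates <- function(x, formats = c("%Y-%m-%d", "%d/%m/%Y")) {
  parsed <- rep(as.Date(NA), length(x))
  for (fmt in formats) {
    # Only retry the entries that no earlier format could parse.
    todo <- is.na(parsed) & !is.na(x)
    parsed[todo] <- as.Date(x[todo], format = fmt)
  }
  parsed
}

# Toy values (assumed): ISO dates mixed with day/month/year dates.
raw_dates <- c("2014-05-01", "03/12/2014", NA)
clean_dates <- parse_mixed_dates(raw_dates)
print(clean_dates)
```

Trying formats one after another is simple, but the order matters when a value could match more than one format, so the candidate list should be chosen with the data in mind.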

### Converting to numeric values

In the raw dataset, some columns can contain a mixture of character and numerical values, and you may want to convert them explicitly to numeric values. For example, in our simulated dataset, some entries in the age column are written in words.
The `convert_to_numeric()` function in `{cleanepi}` performs this conversion, as illustrated in the code chunk below:

```{r}
sim_ebola_data <- cleanepi::convert_to_numeric(sim_ebola_data,
target_columns = "age")
utils::head(sim_ebola_data)
```
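
A simplified version of this idea can be sketched in base R with a small lookup table. The helper `to_numeric_sketch()`, the lookup table, and the toy ages are hypothetical; this sketch only handles the words listed in the table and is not how `{cleanepi}` performs the conversion.

```r
# Hypothetical lookup table covering only a handful of number words.
word_values <- c(one = 1, two = 2, three = 3, five = 5, seven = 7, ten = 10)

# Hypothetical helper for illustration only; NOT the {cleanepi} implementation.
to_numeric_sketch <- function(x, lookup) {
  # Entries that already look like numbers convert directly.
  out <- suppressWarnings(as.numeric(x))
  # Remaining entries are matched, case-insensitively, against the lookup table.
  is_word <- is.na(out) & tolower(x) %in% names(lookup)
  out[is_word] <- unname(lookup[tolower(x[is_word])])
  out
}

# Toy ages (assumed): digits mixed with words and one missing value.
ages <- c("23", "seven", "41", "Ten", NA)
numeric_ages <- to_numeric_sketch(ages, word_values)
print(numeric_ages)
```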

## Epidemiology related operations
In addition to common data cleansing tasks, such as those discussed in the previous section, the `{cleanepi}` package offers additional functionalities tailored specifically for processing and analyzing outbreak and epidemic data. This section covers some of these specialized tasks.

### Dictionary-based substitution

### Calculating age at different time scales

### Calculating age categories

The results provide a summary of each column, including column names, data types, number of missing values, and summary statistics for numerical columns.
## Multiple operations at once