differences for PR #39
actions-user committed Apr 1, 2024
1 parent d02aaa0 commit d996d68
Showing 12 changed files with 530 additions and 5 deletions.
73 changes: 73 additions & 0 deletions clean-data.md
@@ -0,0 +1,73 @@
---
title: 'Clean outbreaks data'
teaching: 10
exercises: 2
---

:::::::::::::::::::::::::::::::::::::: questions

- How to clean and standardize case data?
- How to convert a raw dataset into a `linelist` object?

::::::::::::::::::::::::::::::::::::::::::::::::

::::::::::::::::::::::::::::::::::::: objectives

- Explain how to clean, curate, and standardize case data using the `{cleanepi}` package
- Demonstrate how to convert case data to a `linelist` object

::::::::::::::::::::::::::::::::::::::::::::::::

## Introduction
When analyzing outbreak data, it's essential to ensure that the dataset is clean, curated, standardized, and validated to facilitate accurate and reproducible analysis. This episode focuses on cleaning epidemic and outbreak data using the [cleanepi](https://epiverse-trace.github.io/cleanepi/) package, and validating it using the [linelist](https://epiverse-trace.github.io/linelist/) package. For demonstration purposes, we'll work with a simulated dataset of Ebola cases.


The first step is to import the dataset following the guidelines outlined in the [Read case data](../episodes/read-cases.Rmd) episode. This involves loading the dataset into our environment and viewing its structure and content.


```r
requireNamespace("rio", quietly = TRUE)
sim_ebola_data <- rio::import(file.path("data", "simulated_ebola.csv",
fsep = "/"))
utils::head(sim_ebola_data, 5)
```

```{.output}
case id age gender status date onset date sample district cheifdom
1 14905 90 1 confirmed 03/15/2015 06/04/2015 M D
2 13043 twenty-five 2 Sep /11/Y 03/01/2014 W X
3 14364 54 f <NA> 09/02/2014 03/03/2015 N M
4 14675 ninety <NA> 10/19/2014 31/ 12 /14 D W
5 12648 74 F 08/06/2014 10/10/2016 V Y
age_cat
1 65+
2 twenty-five
3 25-64
4 ninety
5 65+
```

## Quick inspection
Quick exploration and inspection of the dataset are crucial before diving into any analysis tasks. The `{cleanepi}` package simplifies this process with the `scan_data()` function. Let's take a look at how you can use it:


```r
requireNamespace("cleanepi", quietly = TRUE)
cleanepi::scan_data(sim_ebola_data)
```

```{.output}
Field_names missing numeric date character logical
1 case id 0.000000 1.0000 0.000000 0.000000 0
2 age 0.064600 0.8348 0.000000 0.100600 0
3 gender 0.157867 0.0472 0.000000 0.794933 0
4 status 0.053533 0.0000 0.000000 0.946467 0
5 date onset 0.000067 0.0000 0.915733 0.084200 0
6 date sample 0.000133 0.0000 0.999867 0.000000 0
7 district 0.000000 0.0000 0.000000 1.000000 0
8 cheifdom 0.000000 0.0000 0.000000 1.000000 0
9 age_cat 0.064600 0.0000 0.000000 0.935400 0
```


The results provide a summary of each column: the column name, followed by the proportion of its values that are missing, numeric, date, character, or logical.
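
As a rough cross-check of the `missing` column, the proportion of missing values per column can also be computed in base R. The sketch below uses a small made-up data frame (`toy_cases` is hypothetical, not the simulated Ebola dataset) and counts only `NA` values, so it approximates what `scan_data()` reports rather than reimplementing it.

```r
# Hypothetical toy data standing in for a raw line list
toy_cases <- data.frame(
  age = c("90", "twenty-five", NA, "54"),
  gender = c("1", NA, NA, "F"),
  stringsAsFactors = FALSE
)
# Proportion of NA values in each column
missing_prop <- colMeans(is.na(toy_cases))
print(missing_prop)
```

Here one of four `age` values and two of four `gender` values are `NA`, so the proportions are 0.25 and 0.5.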
8 changes: 4 additions & 4 deletions config.yaml
@@ -59,10 +59,10 @@ contact: '[email protected]'

# Order of episodes in your lesson
episodes:
#- read-cases.Rmd
#- clean-data.Rmd
#- describe-cases.Rmd
#- simple-analysis.Rmd
- read-cases.Rmd
- clean-data.Rmd
- describe-cases.Rmd
- simple-analysis.Rmd
- delays-reuse.Rmd
- delays-functions.Rmd

136 changes: 136 additions & 0 deletions describe-cases.md
@@ -0,0 +1,136 @@
---
title: 'Aggregate and visualize'
teaching: 10
exercises: 2
---

:::::::::::::::::::::::::::::::::::::: questions

- How to aggregate case data?
- How to visualize aggregated data?
- What is the distribution of cases by time, place, gender, and age?

::::::::::::::::::::::::::::::::::::::::::::::::

::::::::::::::::::::::::::::::::::::: objectives

- Convert case data to incidence
- Create epidemic curves from incidence data
- Estimate the growth rate from incidence curves
- Create quick descriptive and comparison tables

::::::::::::::::::::::::::::::::::::::::::::::::

## Introduction

A comprehensive description of the data is pivotal for conducting insightful explanatory and exploratory analyses. This episode focuses on describing and visualizing epidemic data. The examples are built around the **Covid-19 case data from England** dataset contained in the [outbreaks](http://www.reconverse.org/outbreaks/) package. The first step is to read this dataset, and we recommend using the [readr](../links.md#readr) package for this purpose (or an alternative method outlined in the [Read case data](../episodes/read-cases.Rmd) episode).


```r
requireNamespace("outbreaks", quietly = TRUE)
covid19_eng_case_data <- outbreaks::covid19_england_nhscalls_2020
utils::head(covid19_eng_case_data, 5)
```

```{.output}
site_type date sex age ccg_code
1 111 2020-03-18 female missing e38000062
2 111 2020-03-18 female missing e38000163
3 111 2020-03-18 female 0-18 e38000001
4 111 2020-03-18 female 0-18 e38000002
5 111 2020-03-18 female 0-18 e38000004
ccg_name count postcode
1 nhs_gloucestershire_ccg 1 gl34fe
2 nhs_south_tyneside_ccg 1 ne325nn
3 nhs_airedale_wharfedale_and_craven_ccg 8 bd57jr
4 nhs_ashford_ccg 7 tn254ab
5 nhs_barking_and_dagenham_ccg 35 rm13ae
nhs_region day weekday
1 South West 0 rest_of_week
2 North East and Yorkshire 0 rest_of_week
3 North East and Yorkshire 0 rest_of_week
4 South East 0 rest_of_week
5 London 0 rest_of_week
```

## Incidence data

Downstream analysis typically involves working with aggregated data rather than individual cases. This requires aggregating case data into incidence data. The [incidence2](https://www.reconverse.org/incidence2/articles/incidence2.html){.external target="_blank"} package offers an essential function, `incidence()`, for grouping case data, usually centered around dated events and/or other factors. The code chunk below demonstrates the creation of an `incidence2` object from `covid19_eng_case_data` based on the `date` column.


```r
requireNamespace("incidence2", quietly = TRUE)
covid19_eng_incidence_data <- incidence2::incidence(
covid19_eng_case_data,
date_index = "date"
)
utils::head(covid19_eng_incidence_data, 5)
```

```{.output}
# incidence: 5 x 3
# count vars: date
date_index count_variable count
* <date> <chr> <int>
1 2020-03-18 date 2579
2 2020-03-19 date 2602
3 2020-03-20 date 2615
4 2020-03-21 date 2588
5 2020-03-22 date 2603
```
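
To make the aggregation step concrete, the following base-R sketch counts cases per date in a small made-up line list. The `toy_cases` data frame and its values are hypothetical, and the code only illustrates the idea of turning one row per case into one row per date. It is not how `{incidence2}` is implemented.

```r
# Hypothetical toy line list: one row per reported case
toy_cases <- data.frame(
  date = as.Date(c("2020-03-18", "2020-03-18", "2020-03-19",
                   "2020-03-19", "2020-03-19", "2020-03-21"))
)
# Count the number of rows (cases) observed on each date
toy_counts <- as.data.frame(table(date = toy_cases$date),
                            responseName = "count")
# table() turns the dates into a factor, so convert them back
toy_counts$date <- as.Date(as.character(toy_counts$date))
print(toy_counts)
```

Dates with no cases (2020-03-20 in this toy example) do not appear in the result, because only observed dates are tabulated.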

The `incidence2` object can be visualized using the `plot()` function from base R.


```r
base::plot(covid19_eng_incidence_data)
```

<img src="fig/describe-cases-rendered-unnamed-chunk-3-1.png" style="display: block; margin: auto;" />

Moreover, the `incidence()` function accepts further arguments for aggregating case data by a dated event together with other factors, such as the individual's gender or the sampling location. In the example below, we calculate weekly counts of Covid-19 cases in England, grouped by `sex`.


```r
weekly_covid19_eng_incidence <- incidence2::incidence(
covid19_eng_case_data,
date_index = "date",
interval = "week",
groups = "sex"
)
base::plot(weekly_covid19_eng_incidence, angle = 45)
```

<img src="fig/describe-cases-rendered-unnamed-chunk-4-1.png" style="display: block; margin: auto;" />
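
The idea behind grouped weekly counts can be sketched in base R as well. The example below uses a made-up `toy_cases` data frame (hypothetical values) and assumes weeks that start on Monday, which is the default of `cut()` for dates. The week convention used by `{incidence2}` may differ, so treat this as an illustration only.

```r
# Hypothetical toy line list with a date and a grouping variable
toy_cases <- data.frame(
  date = as.Date(c("2020-03-16", "2020-03-18", "2020-03-22",
                   "2020-03-23", "2020-03-25")),
  sex = c("female", "male", "female", "female", "male")
)
# Label each case with the Monday that starts its week
toy_cases$week_start <- as.Date(cut(toy_cases$date, breaks = "week"))
# Cross-tabulate weeks against sex to obtain grouped weekly counts
weekly_counts <- table(week_start = toy_cases$week_start,
                       sex = toy_cases$sex)
print(weekly_counts)
```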



::::::::::::::::::::::::::::::::::::: challenge

## Challenge 1: Can you do it?

- Using the above `covid19_eng_case_data` dataset, produce monthly epi-curves for Covid-19 cases in England, grouped by region.

:::::::::::::::::::::::: solution


```r
monthly_covid19_eng_incidence <- incidence2::incidence(
covid19_eng_case_data,
date_index = "date",
interval = "month",
groups = "nhs_region"
)
base::plot(monthly_covid19_eng_incidence, angle = 45)
```

:::::::::::::::::::::::::::::::::

::::::::::::::::::::::::::::::::::::::::::::::::


::::::::::::::::::::::::::::::::::::: keypoints

- Use `{incidence2}` to aggregate case data based on a dated event.
::::::::::::::::::::::::::::::::::::::::::::::::

Binary file added fig/describe-cases-rendered-unnamed-chunk-3-1.png
Binary file added fig/describe-cases-rendered-unnamed-chunk-4-1.png
6 changes: 5 additions & 1 deletion md5sum.txt
@@ -1,9 +1,13 @@
"file" "checksum" "built" "date"
"CODE_OF_CONDUCT.md" "549f00b0992a7743c2bc16ea6ce3db57" "site/built/CODE_OF_CONDUCT.md" "2024-03-28"
"LICENSE.md" "14377518ee654005a18cf28549eb30e3" "site/built/LICENSE.md" "2024-03-28"
"config.yaml" "cd36225fa14f3e67eae7aae27bddd294" "site/built/config.yaml" "2024-03-28"
"config.yaml" "437ef25251f4b72151e2a2470846f1ae" "site/built/config.yaml" "2024-04-01"
"index.md" "32bc80d6f4816435cc0e01540cb2a513" "site/built/index.md" "2024-03-28"
"links.md" "fe82d0a436c46f4b07b82684ed2cceaf" "site/built/links.md" "2024-03-28"
"episodes/read-cases.Rmd" "c340765bc592b7b0dac4c0835da1e4d3" "site/built/read-cases.md" "2024-04-01"
"episodes/clean-data.Rmd" "b3b9b7dfe8d81ebbb32a7d4a48f55062" "site/built/clean-data.md" "2024-04-01"
"episodes/describe-cases.Rmd" "8a79897e610d379b568579673f1fedab" "site/built/describe-cases.md" "2024-04-01"
"episodes/simple-analysis.Rmd" "6285471aee2e99f796ff20bf6ccfb5a2" "site/built/simple-analysis.md" "2024-04-01"
"episodes/delays-reuse.Rmd" "f0f01aa200908903fd18ca72cff0eac7" "site/built/delays-reuse.md" "2024-03-28"
"episodes/delays-functions.Rmd" "1b8c594905ee34befa02f1912256b37f" "site/built/delays-functions.md" "2024-03-28"
"instructors/instructor-notes.md" "ca3834a1b0f9e70c4702aa7a367a6bb5" "site/built/instructor-notes.md" "2024-03-28"
132 changes: 132 additions & 0 deletions read-cases.md
@@ -0,0 +1,132 @@
---
title: 'Read case data'
teaching: 20
exercises: 2
editor_options:
chunk_output_type: inline
---

:::::::::::::::::::::::::::::::::::::: questions
- Where do you usually store your outbreak data?
- How many different data formats can I read?
- Is it possible to import data from databases and health APIs?
::::::::::::::::::::::::::::::::::::::::::::::::

::::::::::::::::::::::::::::::::::::: objectives

- Explain how to import outbreak data from different sources into the `R`
environment for analysis.
::::::::::::::::::::::::::::::::::::::::::::::::

::::::::::::::::::::::::::::::::::::: prereq

## Prerequisites

This episode requires you to be familiar with:

**Data science** : Basic programming with R.
:::::::::::::::::::::::::::::::::

## Introduction

The initial step in outbreak analysis involves importing the target dataset into the `R` environment from various sources. Outbreak data is typically stored in files of diverse formats, relational database management systems (RDBMS), or health information system (HIS) application program interfaces (APIs) such as [REDCap](https://www.project-redcap.org/) and [DHIS2](https://dhis2.org/). The latter option is particularly well-suited for storing institutional health data. This episode will elucidate the process of reading cases from these sources.

## Reading from files

Several packages are available for importing outbreak data stored in individual files into `R`. These include [rio](http://gesistsa.github.io/rio/), [readr](https://readr.tidyverse.org/) from the `tidyverse`, [io](https://bitbucket.org/djhshih/io/src/master/), and [ImportExport](https://cran.r-project.org/web/packages/ImportExport/index.html). Together, these packages offer methods to read single or multiple files in a wide range of formats.

The example below shows how to import a `csv` file into the `R` environment using the `rio` package.

```r
requireNamespace("rio", quietly = TRUE)
case_data <- rio::import(file.path("data", "ebola_cases.csv", fsep = "/"))
head(case_data, 5)
```

```{.output}
date confirm
1 2014-05-18 1
2 2014-05-20 2
3 2014-05-21 4
4 2014-05-22 6
5 2014-05-23 1
```

Similarly, you can import files of other formats such as `tsv`, `xlsx`, etc.
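
As a minimal sketch of reading another format, the example below writes a small made-up data frame to a temporary tab-separated file and reads it back. The `toy_cases` data and the temporary path are hypothetical and exist only for illustration.

```r
requireNamespace("rio", quietly = TRUE)
# Hypothetical toy data written to a temporary tab-separated file
toy_cases <- data.frame(id = 1:3, confirm = c(1L, 2L, 4L))
tsv_path <- tempfile(fileext = ".tsv")
rio::export(toy_cases, tsv_path)
# import() infers the file format from the ".tsv" extension
toy_tsv <- rio::import(tsv_path)
print(toy_tsv)
```

Because the format is inferred from the file extension, the same `rio::import()` call works unchanged for the `csv` example above.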

::::::::::::::::::::::::::::::::: challenge

### Reading compressed data
Take 1 minute:
- Is it possible to read compressed data in `R`?

::::::::::::::::: hint

You can install the optional packages that `{rio}` needs to support its full range of file formats as follows:

```r
requireNamespace("rio", quietly = TRUE)
rio::install_formats()
```

::::::::::::::::::::::

::::::::::::::::: solution


```r
requireNamespace("rio", quietly = TRUE)
rio::import(file.path("path_name", "file_name.zip", fsep = "/"))
```

::::::::::::::::::::::::::

:::::::::::::::::::::::::::::::::::::::::::


## Reading from databases

The [DBI](https://dbi.r-dbi.org/) package serves as a versatile interface for interacting with database management systems (DBMS) across different back-ends or servers. It offers a uniform method for accessing and retrieving data from various database systems.


The following code chunk demonstrates how to create a temporary SQLite database in memory, store the `case_data` dataframe as a table within it, and subsequently read from it:


```r
requireNamespace("DBI", quietly = TRUE)
requireNamespace("RSQLite", quietly = TRUE)
# Create a temporary SQLite database in memory
db_con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
# Store the 'case_data' dataframe as a table named 'cases'
# in the SQLite database
DBI::dbWriteTable(db_con, "cases", case_data)
# Read data from the 'cases' table
result <- DBI::dbReadTable(db_con, "cases")
# Close the database connection
DBI::dbDisconnect(db_con)
# View the result
base::print(utils::head(result))
```

```{.output}
date confirm
1 16208 1
2 16210 2
3 16211 4
4 16212 6
5 16213 1
6 16214 2
```

This code first establishes a connection to an SQLite database created in memory using the `dbConnect()` function. Then, it writes the `case_data` dataframe into a table named 'cases' within the database using the `dbWriteTable()` function. Subsequently, it reads the data from the 'cases' table using the `dbReadTable()` function. Finally, it closes the database connection with the `dbDisconnect()` function. More examples about SQL databases and R can be found [here](https://datacarpentry.org/R-ecology-lesson/05-r-and-databases.html).
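
In the output above, the `date` column comes back as numbers rather than dates. This is consistent with SQLite having no dedicated date type, so the `Date` values appear to have been stored as the number of days since 1970-01-01. A minimal base-R sketch of converting them back, using the first three values from the output (the variable names are illustrative):

```r
# Day counts as returned by dbReadTable() in the output above
date_as_number <- c(16208, 16210, 16211)
# Interpret the numbers as days since the Unix epoch
restored_dates <- as.Date(date_as_number, origin = "1970-01-01")
print(restored_dates)
```

The restored values (2014-05-18, 2014-05-20, 2014-05-21) match the first three dates in the original `case_data` output.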

## Reading from HIS APIs

Health-related data are also increasingly stored in specialized HIS APIs like **Fingertips**, **GoData**, **REDCap**, and **DHIS2**. In such cases, you can use the [readepi](https://epiverse-trace.github.io/readepi/) package, which enables reading data from HIS APIs.
-[TBC]

::::::::::::::::::::::::::::::::::::: keypoints
- Use `{rio}`, `{io}`, `{readr}`, and `{ImportExport}` to read data from individual files.
- Use `{DBI}` to read data from databases.
- Use `{readepi}` to read data from HIS APIs.
::::::::::::::::::::::::::::::::::::::::::::::::