differences for PR #39

epiverse-trace · Apr 1, 2024 · d996d68 · d996d68
1 parent d02aaa0
commit d996d68
Showing 12 changed files with 530 additions and 5 deletions.
diff --git a/clean-data.md b/clean-data.md
@@ -0,0 +1,73 @@
+---
+title: 'Clean outbreaks data'
+teaching: 10
+exercises: 2
+---
+
+:::::::::::::::::::::::::::::::::::::: questions 
+
+- How to clean and standardize case data?
+- How to convert a raw dataset into a `linelist` object?
+
+::::::::::::::::::::::::::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::: objectives
+
+- Explain how to clean, curate, and standardize case data using the `{cleanepi}` package
+- Demonstrate how to convert case data to a `linelist` object
+
+::::::::::::::::::::::::::::::::::::::::::::::::
+
+## Introduction
+In the process of analyzing outbreak data, it's essential to ensure that the dataset is clean, curated, standardized, and validated to facilitate accurate and reproducible analysis. This episode focuses on cleaning epidemic and outbreak data using the [cleanepi](https://epiverse-trace.github.io/cleanepi/) package, and validating it using the [linelist](https://epiverse-trace.github.io/linelist/) package. For demonstration purposes, we'll work with a simulated dataset of Ebola cases.
+
+
+The first step is to import the dataset following the guidelines outlined in the [Read case data](../episodes/read-cases.Rmd) episode. This involves loading the dataset into our environment and viewing its structure and content.
+
+
+```r
+requireNamespace("rio", quietly = TRUE)
+sim_ebola_data <- rio::import(file.path("data", "simulated_ebola.csv",
+                                        fsep = "/"))
+utils::head(sim_ebola_data, 5)
+```
+
+```{.output}
+  case id         age gender    status date onset date sample district cheifdom
+1   14905          90      1 confirmed 03/15/2015  06/04/2015        M        D
+2   13043 twenty-five      2            Sep /11/Y  03/01/2014        W        X
+3   14364          54      f      <NA> 09/02/2014  03/03/2015        N        M
+4   14675      ninety   <NA>           10/19/2014  31/ 12 /14        D        W
+5   12648          74      F           08/06/2014  10/10/2016        V        Y
+      age_cat
+1         65+
+2 twenty-five
+3       25-64
+4      ninety
+5         65+
+```
+
+##  Quick inspection
+Quick exploration and inspection of the dataset are crucial before diving into any  analysis tasks. The `{cleanepi}` package simplifies this process with the `scan_data()` function. Let's take a look at how you can use it:
+
+
+```r
+requireNamespace("cleanepi", quietly = TRUE)
+cleanepi::scan_data(sim_ebola_data)
+```
+
+```{.output}
+  Field_names  missing numeric     date character logical
+1     case id 0.000000  1.0000 0.000000  0.000000       0
+2         age 0.064600  0.8348 0.000000  0.100600       0
+3      gender 0.157867  0.0472 0.000000  0.794933       0
+4      status 0.053533  0.0000 0.000000  0.946467       0
+5  date onset 0.000067  0.0000 0.915733  0.084200       0
+6 date sample 0.000133  0.0000 0.999867  0.000000       0
+7    district 0.000000  0.0000 0.000000  1.000000       0
+8    cheifdom 0.000000  0.0000 0.000000  1.000000       0
+9     age_cat 0.064600  0.0000 0.000000  0.935400       0
+```
+
+
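+The result summarizes each column of the dataset: for every field name, it reports the proportion of values that are missing, numeric, date, character, or logical. For example, about 8% of the `date onset` values are read as character rather than as dates, which flags entries that need standardizing.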
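+
+To make the `missing` column concrete, the sketch below computes the proportion of missing values per column with base R. The data frame `toy_data` is hypothetical, and this is only an illustration of the idea, not how `{cleanepi}` implements `scan_data()`.
+
+```r
+# Hypothetical toy data with some missing entries
+toy_data <- data.frame(
+  age = c(90, NA, 54, NA),
+  status = c("confirmed", NA, "suspected", "confirmed")
+)
+# Share of NA values in each column
+missing_share <- vapply(
+  toy_data,
+  function(column) mean(is.na(column)),
+  numeric(1)
+)
+print(missing_share)
+```
+
+Here half of the `age` values and a quarter of the `status` values are missing.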
diff --git a/config.yaml b/config.yaml
@@ -59,10 +59,10 @@ contact: '[email protected]'
 
 # Order of episodes in your lesson
 episodes:
-#- read-cases.Rmd
-#- clean-data.Rmd
-#- describe-cases.Rmd
-#- simple-analysis.Rmd
+- read-cases.Rmd
+- clean-data.Rmd
+- describe-cases.Rmd
+- simple-analysis.Rmd
 - delays-reuse.Rmd
 - delays-functions.Rmd
 

diff --git a/describe-cases.md b/describe-cases.md
@@ -0,0 +1,136 @@
+---
+title: 'Aggregate and visualize'
+teaching: 10
+exercises: 2
+---
+
+:::::::::::::::::::::::::::::::::::::: questions 
+
+- How to aggregate case data? 
+- How to visualize aggregated data?
+- What is the distribution of cases in time, place, gender, and age?
+
+::::::::::::::::::::::::::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::: objectives
+
+- Convert case data to incidence 
+- Create epidemic curves from incidence data
+- Estimate the growth rate from incidence curves
+- Create quick descriptive and comparison tables
+
+::::::::::::::::::::::::::::::::::::::::::::::::
+
+## Introduction
+
+A comprehensive description of data is pivotal for conducting insightful explanatory and exploratory analyses. This episode focuses on describing and visualizing epidemic data. The examples are built around the **Covid-19 case data from England** dataset that is contained in the [outbreaks](http://www.reconverse.org/outbreaks/) package. The first step is to read this dataset, and we recommend using the [readr](../links.md#readr) package for this purpose (or an alternative method outlined in the [Read case data](../episodes/read-cases.Rmd) episode).
+
+
+```r
+requireNamespace("outbreaks", quietly = TRUE)
+covid19_eng_case_data <- outbreaks::covid19_england_nhscalls_2020
+utils::head(covid19_eng_case_data, 5)
+```
+
+```{.output}
+  site_type       date    sex     age  ccg_code
+1       111 2020-03-18 female missing e38000062
+2       111 2020-03-18 female missing e38000163
+3       111 2020-03-18 female    0-18 e38000001
+4       111 2020-03-18 female    0-18 e38000002
+5       111 2020-03-18 female    0-18 e38000004
+                                ccg_name count postcode
+1                nhs_gloucestershire_ccg     1   gl34fe
+2                 nhs_south_tyneside_ccg     1  ne325nn
+3 nhs_airedale_wharfedale_and_craven_ccg     8   bd57jr
+4                        nhs_ashford_ccg     7  tn254ab
+5           nhs_barking_and_dagenham_ccg    35   rm13ae
+                nhs_region day      weekday
+1               South West   0 rest_of_week
+2 North East and Yorkshire   0 rest_of_week
+3 North East and Yorkshire   0 rest_of_week
+4               South East   0 rest_of_week
+5                   London   0 rest_of_week
+```
+
+## Incidence data
+
+Downstream analysis involves working with aggregated data rather than individual cases. This requires aggregating case data into incidence data. The [incidence2](https://www.reconverse.org/incidence2/articles/incidence2.html){.external target="_blank"} package offers an essential function, `incidence()`, for grouping case data, usually centered around dated events and/or other factors. The code chunk below creates an `incidence2` object from `covid19_eng_case_data`, using the `date` column as the date index.
+
+
+```r
+requireNamespace("incidence2", quietly = TRUE)
+covid19_eng_incidence_data <- incidence2::incidence(
+  covid19_eng_case_data,
+  date_index = "date"
+)
+utils::head(covid19_eng_incidence_data, 5)
+```
+
+```{.output}
+# incidence:  5 x 3
+# count vars: date
+  date_index count_variable count
+* <date>     <chr>          <int>
+1 2020-03-18 date            2579
+2 2020-03-19 date            2602
+3 2020-03-20 date            2615
+4 2020-03-21 date            2588
+5 2020-03-22 date            2603
+```
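+
+Conceptually, with only a date index supplied, this step counts how many rows share each date. The sketch below reproduces that idea with base R on a small hypothetical data frame (`toy_cases`); it is an illustration only, not how `{incidence2}` is implemented.
+
+```r
+# Hypothetical line list: one row per record
+toy_cases <- data.frame(
+  date = as.Date(c("2020-03-18", "2020-03-18", "2020-03-19")),
+  sex = c("female", "male", "female")
+)
+# Count the rows that fall on each date
+daily_counts <- stats::aggregate(
+  x = list(count = rep(1L, nrow(toy_cases))),
+  by = list(date_index = toy_cases$date),
+  FUN = sum
+)
+print(daily_counts)
+```
+
+The result has one row per date, with two records on 2020-03-18 and one on 2020-03-19.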
+
+The `incidence2` object can be visualized using the `plot()` function from base R.
+
+
+```r
+base::plot(covid19_eng_incidence_data)
+```
+
+<img src="fig/describe-cases-rendered-unnamed-chunk-3-1.png" style="display: block; margin: auto;" />
+
+Moreover, `{incidence2}` can aggregate case data based on a dated event together with other factors, such as the individual's gender or the sampling location. In the example below, we calculate weekly counts of Covid-19 cases in England, grouped by `sex`.
+
+
+```r
+weekly_covid19_eng_incidence <- incidence2::incidence(
+  covid19_eng_case_data,
+  date_index = "date",
+  interval = "week",
+  groups = "sex"
+)
+base::plot(weekly_covid19_eng_incidence, angle = 45)
+```
+
+<img src="fig/describe-cases-rendered-unnamed-chunk-4-1.png" style="display: block; margin: auto;" />
+
+
+
+::::::::::::::::::::::::::::::::::::: challenge 
+
+## Challenge 1: Can you do it?
+
+ - Using the above `covid19_eng_case_data` dataset, produce monthly epi-curves for Covid-19 cases in England, grouped by region.
+
+:::::::::::::::::::::::: solution 
+
+
+```r
+monthly_covid19_eng_incidence <- incidence2::incidence(
+  covid19_eng_case_data,
+  date_index = "date",
+  interval = "month",
+  groups = "nhs_region"
+)
+base::plot(monthly_covid19_eng_incidence, angle = 45)
+```
+
+:::::::::::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::
+
+
+::::::::::::::::::::::::::::::::::::: keypoints 
+
+- Use `{incidence2}` to aggregate case data based on a dated event.
+::::::::::::::::::::::::::::::::::::::::::::::::
+
diff --git a/fig/describe-cases-rendered-unnamed-chunk-3-1.png b/fig/describe-cases-rendered-unnamed-chunk-3-1.png
diff --git a/fig/describe-cases-rendered-unnamed-chunk-4-1.png b/fig/describe-cases-rendered-unnamed-chunk-4-1.png
diff --git a/fig/simple-analysis-rendered-unnamed-chunk-1-1.png b/fig/simple-analysis-rendered-unnamed-chunk-1-1.png
diff --git a/fig/simple-analysis-rendered-unnamed-chunk-2-1.png b/fig/simple-analysis-rendered-unnamed-chunk-2-1.png
diff --git a/fig/simple-analysis-rendered-unnamed-chunk-5-1.png b/fig/simple-analysis-rendered-unnamed-chunk-5-1.png
diff --git a/fig/simple-analysis-rendered-unnamed-chunk-6-1.png b/fig/simple-analysis-rendered-unnamed-chunk-6-1.png
diff --git a/md5sum.txt b/md5sum.txt
@@ -1,9 +1,13 @@
 "file" "checksum" "built" "date"
 "CODE_OF_CONDUCT.md" "549f00b0992a7743c2bc16ea6ce3db57" "site/built/CODE_OF_CONDUCT.md" "2024-03-28"
 "LICENSE.md" "14377518ee654005a18cf28549eb30e3" "site/built/LICENSE.md" "2024-03-28"
-"config.yaml" "cd36225fa14f3e67eae7aae27bddd294" "site/built/config.yaml" "2024-03-28"
+"config.yaml" "437ef25251f4b72151e2a2470846f1ae" "site/built/config.yaml" "2024-04-01"
 "index.md" "32bc80d6f4816435cc0e01540cb2a513" "site/built/index.md" "2024-03-28"
 "links.md" "fe82d0a436c46f4b07b82684ed2cceaf" "site/built/links.md" "2024-03-28"
+"episodes/read-cases.Rmd" "c340765bc592b7b0dac4c0835da1e4d3" "site/built/read-cases.md" "2024-04-01"
+"episodes/clean-data.Rmd" "b3b9b7dfe8d81ebbb32a7d4a48f55062" "site/built/clean-data.md" "2024-04-01"
+"episodes/describe-cases.Rmd" "8a79897e610d379b568579673f1fedab" "site/built/describe-cases.md" "2024-04-01"
+"episodes/simple-analysis.Rmd" "6285471aee2e99f796ff20bf6ccfb5a2" "site/built/simple-analysis.md" "2024-04-01"
 "episodes/delays-reuse.Rmd" "f0f01aa200908903fd18ca72cff0eac7" "site/built/delays-reuse.md" "2024-03-28"
 "episodes/delays-functions.Rmd" "1b8c594905ee34befa02f1912256b37f" "site/built/delays-functions.md" "2024-03-28"
 "instructors/instructor-notes.md" "ca3834a1b0f9e70c4702aa7a367a6bb5" "site/built/instructor-notes.md" "2024-03-28"

diff --git a/read-cases.md b/read-cases.md
@@ -0,0 +1,132 @@
+---
+title: 'Read case data'
+teaching: 20
+exercises: 2
+editor_options: 
+  chunk_output_type: inline
+---
+
+:::::::::::::::::::::::::::::::::::::: questions 
+- Where do you usually store your outbreak data?
+- How many different data formats can I read? 
+- Is it possible to import data from databases and health APIs?
+::::::::::::::::::::::::::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::: objectives
+
+- Explain how to import outbreak data from different sources into the `R`
+environment for analysis.
+::::::::::::::::::::::::::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::: prereq
+
+## Prerequisites
+
+This episode requires you to be familiar with:
+
+**Data science** : Basic programming with R.
+:::::::::::::::::::::::::::::::::
+
+## Introduction
+
+The initial step in outbreak analysis involves importing the target dataset into the `R` environment from various sources. Outbreak data is typically stored in files of diverse formats, relational database management systems (RDBMS), or health information system (HIS) application programming interfaces (APIs) such as [REDCap](https://www.project-redcap.org/) and [DHIS2](https://dhis2.org/). The latter option is particularly well-suited for storing institutional health data. This episode explains how to read cases from these sources.
+
+## Reading from files 
+
+Several packages are available for importing outbreak data stored in individual files into `R`. These include [rio](http://gesistsa.github.io/rio/), [readr](https://readr.tidyverse.org/) from the `tidyverse`, [io](https://bitbucket.org/djhshih/io/src/master/), and [ImportExport](https://cran.r-project.org/web/packages/ImportExport/index.html). Together, these packages offer methods to read single or multiple files in a wide range of formats.
+
+The example below shows how to import a `csv` file into the `R` environment using the `rio` package.
+
+```r
+requireNamespace("rio", quietly = TRUE)
+case_data <- rio::import(file.path("data", "ebola_cases.csv", fsep = "/"))
+head(case_data, 5)
+```
+
+```{.output}
+        date confirm
+1 2014-05-18       1
+2 2014-05-20       2
+3 2014-05-21       4
+4 2014-05-22       6
+5 2014-05-23       1
+```
+
+Similarly, you can import files of other formats such as `tsv`, `xlsx`, etc.
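+
+As a minimal sketch of that point, the chunk below writes a small tab-separated file to a temporary location and reads it back with `rio::import()`, which picks the reader from the file extension. The data frame `toy_cases` and the temporary file are hypothetical and exist only for this illustration.
+
+```r
+requireNamespace("rio", quietly = TRUE)
+# Hypothetical toy data written to a temporary .tsv file
+toy_cases <- data.frame(
+  date = c("2014-05-18", "2014-05-20"),
+  confirm = c(1L, 2L)
+)
+tmp_tsv <- tempfile(fileext = ".tsv")
+utils::write.table(
+  toy_cases, tmp_tsv,
+  sep = "\t", row.names = FALSE, quote = FALSE
+)
+# Same call as for the csv file; only the extension differs
+tsv_data <- rio::import(tmp_tsv)
+utils::head(tsv_data)
+# Remove the temporary file
+unlink(tmp_tsv)
+```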
+
+::::::::::::::::::::::::::::::::: challenge
+
+###  Reading compressed data 
+Take 1 minute:
+- Is it possible to read compressed data in `R`?
+
+::::::::::::::::: hint
+
+Some file formats need optional packages before `{rio}` can read them. You can install these packages, and so enable the full range of supported formats, as follows:
+
+```r
+requireNamespace("rio", quietly = TRUE)
+rio::install_formats()
+```
+
+::::::::::::::::::::::
+
+::::::::::::::::: solution
+
+
+```r
+requireNamespace("rio", quietly = TRUE)
+rio::import(file.path("path_name", "file_name.zip", fsep = "/"))
+```
+
+::::::::::::::::::::::::::
+
+:::::::::::::::::::::::::::::::::::::::::::
+
+
+## Reading from databases
+
+The [DBI](https://dbi.r-dbi.org/) package serves as a versatile interface for interacting with database management systems (DBMS) across different back-ends or servers. It offers a uniform method for accessing and retrieving data from various database systems.
+
+
+The following code chunk demonstrates how to create a temporary SQLite database in memory, store the `case_data` dataframe as a table within it, and subsequently read from it:
+
+
+```r
+requireNamespace("DBI", quietly = TRUE)
+requireNamespace("RSQLite", quietly = TRUE)
+# Create a temporary SQLite database in memory
+db_con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
+# Store the 'case_data' dataframe as a table named 'cases'
+# in the SQLite database
+DBI::dbWriteTable(db_con, "cases", case_data)
+# Read data from the 'cases' table
+result <- DBI::dbReadTable(db_con, "cases")
+# Close the database connection
+DBI::dbDisconnect(db_con)
+# View the result
+base::print(utils::head(result))
+```
+
+```{.output}
+   date confirm
+1 16208       1
+2 16210       2
+3 16211       4
+4 16212       6
+5 16213       1
+6 16214       2
+```
+
+This code first establishes a connection to an SQLite database created in memory using the `dbConnect()` function. Then, it writes the `case_data` dataframe into a table named 'cases' within the database using the `dbWriteTable()` function. Subsequently, it reads the data from the 'cases' table using the `dbReadTable()` function. Finally, it closes the database connection with the `dbDisconnect()` function. More examples about SQL databases and R can be found [here](https://datacarpentry.org/R-ecology-lesson/05-r-and-databases.html).
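+
+The `date` column in the output above comes back as plain numbers rather than dates. A likely explanation, stated here as an assumption about the `{RSQLite}` defaults rather than something this episode confirms, is that SQLite has no dedicated date type, so `Date` values are stored as the number of days since 1970-01-01. Under that assumption, a minimal base R sketch to restore them is:
+
+```r
+# Values copied from the first rows of the output above
+stored_days <- c(16208, 16210, 16211)
+# Interpret the numbers as days since 1970-01-01 (assumed storage format)
+restored_dates <- as.Date(stored_days, origin = "1970-01-01")
+print(restored_dates)
+```
+
+The first value maps to 2014-05-18, which matches the first row of the original `case_data`.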
+
+## Reading from HIS APIs
+
+Health-related data are also increasingly stored in specialized HIS APIs like **Fingertips**, **GoData**, **REDCap**, and **DHIS2**. In such cases, one can use the [readepi](https://epiverse-trace.github.io/readepi/) package, which enables reading data from HIS APIs.
+-[TBC]
+
+::::::::::::::::::::::::::::::::::::: keypoints 
+- Use `{rio}`, `{io}`, `{readr}`, and `{ImportExport}` to read data from individual files.
+- Use `{DBI}` to read data from databases.
+- Use `{readepi}` to read data from HIS APIs.
+::::::::::::::::::::::::::::::::::::::::::::::::