differences for PR #143
actions-user committed Oct 1, 2024
1 parent 54d4fb5 commit fd0c6f3
Showing 4 changed files with 2,936 additions and 31 deletions.
83 changes: 83 additions & 0 deletions config.yaml
@@ -0,0 +1,83 @@
#------------------------------------------------------------
# Values for this lesson.
#------------------------------------------------------------

# Which carpentry is this (swc, dc, lc, or cp)?
# swc: Software Carpentry
# dc: Data Carpentry
# lc: Library Carpentry
# cp: Carpentries (to use for instructor training for instance)
# incubator: The Carpentries Incubator
carpentry: 'incubator'

# Overall title for pages.
title: 'Read and clean case data, and make linelist for outbreak analytics with R'

# Date the lesson was created (YYYY-MM-DD, this is empty by default)
created:

# Comma-separated list of keywords for the lesson
keywords:

# Life cycle stage of the lesson
# possible values: pre-alpha, alpha, beta, stable
life_cycle: 'pre-alpha'

# License of the lesson materials (recommended CC-BY 4.0)
license: 'CC-BY 4.0'

# Link to the source repository for this lesson
source: 'https://github.com/epiverse-trace/tutorials-early'

# Default branch of your lesson
branch: 'main'

# Who to contact if there are any issues
contact: '[email protected]'

# Navigation ------------------------------------------------
#
# Use the following menu items to specify the order of
# individual pages in each dropdown section. Leave blank to
# include all pages in the folder.
#
# Example -------------
#
# episodes:
# - introduction.md
# - first-steps.md
#
# learners:
# - setup.md
#
# instructors:
# - instructor-notes.md
#
# profiles:
# - one-learner.md
# - another-learner.md

# Order of episodes in your lesson
episodes:
- read-cases.Rmd
- clean-data.Rmd
- validate.Rmd
- describe-cases.Rmd

# Information for Learners
learners:

# Information for Instructors
instructors:

# Learner Profiles
profiles:

# Customisation ---------------------------------------------
#
# This space below is where custom yaml items (e.g. pinning
# sandpaper and varnish versions) should live


varnish: epiverse-trace/varnish@epiversetheme
sandpaper: epiverse-trace/sandpaper@patch-renv-github-bug
2 changes: 1 addition & 1 deletion md5sum.txt
@@ -4,7 +4,7 @@
"config.yaml" "2459741082ec78d75a14dd77c7a79ff3" "site/built/config.yaml" "2024-10-01"
"index.md" "32bc80d6f4816435cc0e01540cb2a513" "site/built/index.md" "2024-10-01"
"links.md" "fe82d0a436c46f4b07b82684ed2cceaf" "site/built/links.md" "2024-10-01"
"episodes/read-cases.Rmd" "fe84511fc9f9e53a32e97eaddd50085e" "site/built/read-cases.md" "2024-10-01"
"episodes/read-cases.Rmd" "0cc3d1167d456a2abb3a4e36b568467b" "site/built/read-cases.md" "2024-10-01"
"episodes/clean-data.Rmd" "355f1c880ed37e616c6de248bf5fffc4" "site/built/clean-data.md" "2024-10-01"
"episodes/validate.Rmd" "2c9cff27170992bd479f827fbee4d623" "site/built/validate.md" "2024-10-01"
"episodes/describe-cases.Rmd" "1ce7ad65092aa8fa51158e1387576339" "site/built/describe-cases.md" "2024-10-01"
160 changes: 130 additions & 30 deletions read-cases.md
@@ -89,6 +89,22 @@ ebola_confirmed

Similarly, you can import files in other formats, such as `tsv` and `xlsx`.

:::::::::::::::::::: checklist

### Why should we use the {here} package?

The `{here}` package is designed to simplify file referencing in R projects by providing a reliable way to construct file paths relative to the project root. Here are the three key reasons to use it:

- **Relative Paths**: Allows you to use relative file paths with respect to the `R` Project, making your code more portable and less error-prone.

- **Cross-Environment Compatibility**: Works across different operating systems (Windows, Mac, Linux) without needing to adjust file paths. This notation `here::here("data", "ebola_cases.csv")` avoids using `"data\ebola_cases.csv"` in some and `"data/ebola_cases.csv"` in others!

- **Reduces Errors**: Avoids the need to use `setwd()` or absolute paths, reducing errors in scripts shared across machines. This avoids notations like `"C:/Users/mycomputer/Documents/projects/helloworld"`.

The `{here}` package is ideal for adding one more layer of reproducibility to your work. If you are interested in reproducibility, we invite you to [read this tutorial to increase the openness, sustainability, and reproducibility of your epidemic analysis with R](https://epiverse-trace.github.io/research-compendium/).

::::::::::::::::::::
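
As a minimal sketch of the points above, the snippet below builds the same location with base R's `file.path()` and with `here::here()`. It assumes the `{here}` package is installed; the file name is reused from the example above and does not need to exist for the path to be built.

```r
# Relative path built with base R: resolved against the current working directory
path_base <- file.path("data", "ebola_cases.csv")

# Path built with {here}: anchored at the project root that {here} detects
path_here <- here::here("data", "ebola_cases.csv")

# The project root that {here} is using in this session
project_root <- here::here()

print(path_base)
print(path_here)
```

Because `path_here` starts at the detected project root, the same call should resolve to the right file whether the script is run from the project folder or from one of its subfolders.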

::::::::::::::::::::::::::::::::: challenge

### Reading compressed data
@@ -129,72 +145,156 @@ rio::import(here::here("data", "Marburg.zip"))
The [DBI](https://dbi.r-dbi.org/) package serves as a versatile interface for interacting with database management
systems (DBMS) across different back-ends or servers. It offers a uniform method for accessing and retrieving data from various database systems.

::::::::::::: discuss

The following code chunk demonstrates how to create a temporary SQLite database in memory, store the `ebola_confirmed` as a table on it, and subsequently read it:
### When to read directly from a database?

We can use database interface packages to optimize memory usage. If we process the database with "queries" (e.g., select, filter, summarise) before extraction, we can reduce the memory load in our RStudio session. Conversely, conducting all data manipulation outside the database management system can lead to occupying more disk space than desired or running out of memory.

:::::::::::::

The following code chunks demonstrate in five steps how to create a temporary SQLite database in memory, store the `ebola_confirmed` data frame as a table in it, read and extract data from it, and close the connection:

### 1. Connect with a database

First, we establish a connection to an SQLite database created in memory using `DBI::dbConnect()`.


``` r
library(DBI)
library(RSQLite)

# Create a temporary SQLite database in memory
db_con <- DBI::dbConnect(
db_connection <- DBI::dbConnect(
drv = RSQLite::SQLite(),
dbname = ":memory:"
)
```

::::::::::::::::: callout

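A real-life connection to a remote database would look like this (the driver, host, user, and password helper shown are illustrative placeholders):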

```r
# in real-life
db_connection <- DBI::dbConnect(
RSQLite::SQLite(),
host = "database.epiversetrace.com",
user = "juanito",
password = epiversetrace::askForPassword("Database password")
)
```

:::::::::::::::::

### 2. Write a local data frame as a table in a database

Then, we can write the `ebola_confirmed` into a table named `cases` within the database using the `DBI::dbWriteTable()` function.


``` r
# Store the 'ebola_confirmed' dataframe as a table named 'cases'
# in the SQLite database
DBI::dbWriteTable(
conn = db_con,
conn = db_connection,
name = "cases",
value = ebola_confirmed
)
```

# Read data from the 'cases' table
result <- DBI::dbReadTable(
conn = db_con,
name = "cases"
)
In a database framework, you can have more than one table. Each table can belong to a specific `entity` (e.g., patients, care units, jobs). All tables will be related by a common ID or `primary key`.
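
As a rough sketch of this idea, the snippet below stores two small related tables in a separate in-memory SQLite database and lists them. The `patients` and `visits` tables, their columns, and their values are hypothetical and only serve to illustrate a shared key (`patient_id`).

```r
library(DBI)

# Separate in-memory database used only for this illustration
con_example <- DBI::dbConnect(RSQLite::SQLite(), dbname = ":memory:")

# Hypothetical entities: one row per patient, one row per visit
patients <- data.frame(patient_id = 1:3, age = c(34, 51, 27))
visits <- data.frame(
  visit_id = 1:4,
  patient_id = c(1L, 1L, 2L, 3L), # key that links each visit to a patient
  unit = c("A", "B", "A", "C")
)

DBI::dbWriteTable(conn = con_example, name = "patients", value = patients)
DBI::dbWriteTable(conn = con_example, name = "visits", value = visits)

# List the tables now stored in the database
tables_available <- DBI::dbListTables(con_example)
print(tables_available)

DBI::dbDisconnect(conn = con_example)
```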

# Close the database connection
DBI::dbDisconnect(conn = db_con)
### 3. Read data from a table in a database

# View the result
result %>%
dplyr::as_tibble() # for a simple data frame output
<!-- Subsequently, we read the data from the `cases` table using `DBI::dbReadTable()`. -->

<!-- ```{r,warning=FALSE,message=FALSE} -->
<!-- # Read data from the 'cases' table -->
<!-- extracted_data <- DBI::dbReadTable( -->
<!-- conn = db_connection, -->
<!-- name = "cases" -->
<!-- ) -->
<!-- ``` -->

Subsequently, we read the data from the `cases` table using `dplyr::tbl()`.


``` r
# Read one table from the database
mytable_db <- dplyr::tbl(src = db_connection, "cases")
```

If we apply `{dplyr}` verbs to this SQLite database table, these verbs will be translated to SQL queries.


``` r
# Show the SQL queries translated
mytable_db %>%
dplyr::filter(confirm > 50) %>%
dplyr::arrange(desc(confirm)) %>%
dplyr::show_query()
```

``` output
# A tibble: 120 × 2
date confirm
<int> <int>
1 16208 1
2 16210 2
3 16211 4
4 16212 6
5 16213 1
6 16214 2
7 16216 10
8 16217 8
9 16218 2
10 16219 12
# ℹ 110 more rows
<SQL>
SELECT `cases`.*
FROM `cases`
WHERE (`confirm` > 50.0)
ORDER BY `confirm` DESC
```

### 4. Extract data from the database

Use `dplyr::collect()` to force computation of a database query and extract the output to your local computer.


``` r
# Pull all data down to a local tibble
extracted_data <- mytable_db %>%
dplyr::filter(confirm > 50) %>%
dplyr::arrange(desc(confirm)) %>%
dplyr::collect()
```

The `extracted_data` object represents the extracted data, ideally after specifying queries that reduce its size.


``` r
# View the extracted_data
extracted_data %>%
dplyr::as_tibble() # for a simple data frame output
```

This code first establishes a connection to an SQLite database created in memory using `dbConnect()`. Then, it writes the `ebola_confirmed` into a table named 'cases' within the database using the `dbWriteTable()` function. Subsequently, it reads the data from the 'cases' table using `dbReadTable()`. Finally, it closes the database connection with `dbDisconnect()`.
``` output
# A tibble: 3 × 2
date confirm
<int> <int>
1 16329 84
2 16328 68
3 16330 56
```
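
To illustrate how much a query can shrink what is pulled into the local session, here is a small self-contained sketch that summarises inside the database and collects a single row. The `cases_demo` values are made up for the illustration, and the snippet assumes `{dplyr}`, `{dbplyr}`, and `{RSQLite}` are installed.

```r
library(DBI)
library(dplyr)

# Separate in-memory database with a small, made-up table of daily counts
con_demo <- DBI::dbConnect(RSQLite::SQLite(), dbname = ":memory:")
cases_demo <- data.frame(
  date = 1:6,
  confirm = c(10L, 60L, 5L, 80L, 55L, 2L)
)
DBI::dbWriteTable(conn = con_demo, name = "cases", value = cases_demo)

# Filter and summarise in the database; only the one-row summary is collected
summary_local <- dplyr::tbl(src = con_demo, "cases") %>%
  dplyr::filter(confirm > 50) %>%
  dplyr::summarise(
    n_days = dplyr::n(),
    total_confirmed = sum(confirm, na.rm = TRUE)
  ) %>%
  dplyr::collect()

print(summary_local)

DBI::dbDisconnect(conn = con_demo)
```

Here the local session receives one row instead of the full table, which is the pattern to aim for when the source table is large.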

:::::::::::::::::::::: callout

### Run SQL queries in R using dbplyr

We can use database interface packages to optimize memory usage. If we process the database with "queries" (e.g., select, filter, summarise) before extraction, we can reduce the memory load in our RStudio session. Conversely, conducting all data manipulation outside the database management system can lead to occupying more disk space than desired running out of memory.
Practice how to make relational database SQL queries using multiple `{dplyr}` verbs like `dplyr::left_join()` among tables before pulling down data to your local session with `dplyr::collect()`!

Read this [tutorial episode on SQL databases and R](https://datacarpentry.org/R-ecology-lesson/05-r-and-databases.html#complex-database-queries) to practice how to make relational database SQL queries using multiple {dplyr} verbs like `left_join()` among tables before pulling down data to your local session with `collect()`!
You can also review the `{dbplyr}` R package. For a step-by-step tutorial about SQL, we recommend this [tutorial about data management with SQL for Ecologists](https://datacarpentry.org/sql-ecology-lesson/). You will find that SQL is close to `{dplyr}`!

::::::::::::::::::::::
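
A minimal sketch of such a relational query is shown below. It uses a separate in-memory database with two hypothetical tables (`patients` and `visits`, linked by `patient_id`); the join and the filter are applied in the database, and only the result is pulled down with `dplyr::collect()`.

```r
library(DBI)
library(dplyr)

# Separate in-memory database with two hypothetical, related tables
con_join <- DBI::dbConnect(RSQLite::SQLite(), dbname = ":memory:")
DBI::dbWriteTable(
  conn = con_join,
  name = "patients",
  value = data.frame(patient_id = 1:3, age = c(34, 51, 27))
)
DBI::dbWriteTable(
  conn = con_join,
  name = "visits",
  value = data.frame(patient_id = c(1L, 1L, 2L), unit = c("A", "B", "A"))
)

patients_db <- dplyr::tbl(src = con_join, "patients")
visits_db <- dplyr::tbl(src = con_join, "visits")

# Join and filter in the database, then extract the result
joined_local <- patients_db %>%
  dplyr::left_join(visits_db, by = "patient_id") %>%
  dplyr::filter(age > 30) %>%
  dplyr::collect()

print(joined_local)

DBI::dbDisconnect(conn = con_join)
```

Replacing `dplyr::collect()` with `dplyr::show_query()` in the pipeline prints the SQL that the verbs are translated to.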


### 5. Close the database connection

Finally, we can close the database connection with `dbDisconnect()`.


``` r
# Close the database connection
DBI::dbDisconnect(conn = db_connection)
```
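
When database access is wrapped in a function, one possible pattern is to register the disconnection with `on.exit()`, so the connection is closed even if a later step fails. The helper below, `read_cases_once()`, is a hypothetical name used only for this sketch; it is not part of `{DBI}`.

```r
library(DBI)

# Hypothetical helper: writes a data frame to a temporary database and reads it back
read_cases_once <- function(data) {
  con <- DBI::dbConnect(RSQLite::SQLite(), dbname = ":memory:")
  # Close the connection when the function exits, whether or not an error occurs
  on.exit(DBI::dbDisconnect(conn = con), add = TRUE)

  DBI::dbWriteTable(conn = con, name = "cases", value = data)
  DBI::dbReadTable(conn = con, name = "cases")
}

# Made-up input used only to exercise the helper
cases_back <- read_cases_once(data.frame(date = 1:3, confirm = c(1L, 2L, 4L)))
print(cases_back)
```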

## Reading from HIS APIs

Health-related data are also increasingly stored in specialized HIS APIs like **Fingertips**, **GoData**, **REDCap**, and **DHIS2**. In such cases, one can use the [readepi](https://epiverse-trace.github.io/readepi/) package, which enables reading data from HIS APIs.