differences for PR #143

epiverse-trace · Oct 1, 2024 · fd0c6f3 · fd0c6f3
1 parent 54d4fb5
commit fd0c6f3
Showing 4 changed files with 2,936 additions and 31 deletions.
diff --git a/config.yaml b/config.yaml
@@ -0,0 +1,83 @@
+#------------------------------------------------------------
+# Values for this lesson.
+#------------------------------------------------------------
+
+# Which carpentry is this (swc, dc, lc, or cp)?
+# swc: Software Carpentry
+# dc: Data Carpentry
+# lc: Library Carpentry
+# cp: Carpentries (to use for instructor training for instance)
+# incubator: The Carpentries Incubator
+carpentry: 'incubator'
+
+# Overall title for pages.
+title: 'Read and clean case data, and make linelist for outbreak analytics with R'
+
+# Date the lesson was created (YYYY-MM-DD, this is empty by default)
+created: 
+
+# Comma-separated list of keywords for the lesson
+keywords: 
+
+# Life cycle stage of the lesson
+# possible values: pre-alpha, alpha, beta, stable
+life_cycle: 'pre-alpha'
+
+# License of the lesson materials (recommended CC-BY 4.0)
+license: 'CC-BY 4.0'
+
+# Link to the source repository for this lesson
+source: 'https://github.com/epiverse-trace/tutorials-early'
+
+# Default branch of your lesson
+branch: 'main'
+
+# Who to contact if there are any issues
+contact: '[email protected]'
+
+# Navigation ------------------------------------------------
+#
+# Use the following menu items to specify the order of
+# individual pages in each dropdown section. Leave blank to
+# include all pages in the folder.
+#
+# Example -------------
+#
+# episodes:
+# - introduction.md
+# - first-steps.md
+#
+# learners:
+# - setup.md
+#
+# instructors:
+# - instructor-notes.md
+#
+# profiles:
+# - one-learner.md
+# - another-learner.md
+
+# Order of episodes in your lesson
+episodes: 
+- read-cases.Rmd
+- clean-data.Rmd
+- validate.Rmd
+- describe-cases.Rmd
+
+# Information for Learners
+learners: 
+
+# Information for Instructors
+instructors: 
+
+# Learner Profiles
+profiles: 
+
+# Customisation ---------------------------------------------
+#
+# This space below is where custom yaml items (e.g. pinning
+# sandpaper and varnish versions) should live
+
+
+varnish: epiverse-trace/varnish@epiversetheme
+sandpaper: epiverse-trace/sandpaper@patch-renv-github-bug
diff --git a/md5sum.txt b/md5sum.txt
@@ -4,7 +4,7 @@
 "config.yaml" "2459741082ec78d75a14dd77c7a79ff3" "site/built/config.yaml" "2024-10-01"
 "index.md" "32bc80d6f4816435cc0e01540cb2a513" "site/built/index.md" "2024-10-01"
 "links.md" "fe82d0a436c46f4b07b82684ed2cceaf" "site/built/links.md" "2024-10-01"
-"episodes/read-cases.Rmd" "fe84511fc9f9e53a32e97eaddd50085e" "site/built/read-cases.md" "2024-10-01"
+"episodes/read-cases.Rmd" "0cc3d1167d456a2abb3a4e36b568467b" "site/built/read-cases.md" "2024-10-01"
 "episodes/clean-data.Rmd" "355f1c880ed37e616c6de248bf5fffc4" "site/built/clean-data.md" "2024-10-01"
 "episodes/validate.Rmd" "2c9cff27170992bd479f827fbee4d623" "site/built/validate.md" "2024-10-01"
 "episodes/describe-cases.Rmd" "1ce7ad65092aa8fa51158e1387576339" "site/built/describe-cases.md" "2024-10-01"

diff --git a/read-cases.md b/read-cases.md
@@ -89,6 +89,22 @@ ebola_confirmed
 
 Similarly, you can import files of other formats such as `tsv`, `xlsx`, ... etc.
 
+:::::::::::::::::::: checklist
+
+### Why should we use the {here} package?
+
+The `{here}` package is designed to simplify file referencing in R projects by providing a reliable way to construct file paths relative to the project root. Here are the three key reasons to use it:
+
+- **Relative Paths**: Allows you to use relative file paths with respect to the `R` Project, making your code more portable and less error-prone.
+
+- **Cross-Environment Compatibility**: Works across different operating systems (Windows, Mac, Linux) without needing to adjust file paths. The notation `here::here("data", "ebola_cases.csv")` avoids writing `"data\ebola_cases.csv"` on some systems and `"data/ebola_cases.csv"` on others!
+
+- **Reduces Errors**: Avoids the need to use `setwd()` or absolute paths, reducing errors in scripts shared across machines. This avoids notations like `"C:/Users/mycomputer/Documents/projects/helloworld"`.
+
+The `{here}` package is ideal for adding one more layer of reproducibility to your work. If you are interested in reproducibility, we invite you to [read this tutorial to increase the openness, sustainability, and reproducibility of your epidemic analysis with R](https://epiverse-trace.github.io/research-compendium/).
+
+::::::::::::::::::::
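+
+As a minimal sketch of the points above, the chunk below builds a path with `here::here()` and imports the file only if it exists. It assumes a `data/ebola_cases.csv` file sits inside your project and that `{rio}` is installed; the `cases_path` name is only illustrative.
+
+```r
+library(here)
+
+# Build the path from its components, relative to the project root
+cases_path <- here::here("data", "ebola_cases.csv")
+
+# The last component is the same on any operating system
+basename(cases_path) # "ebola_cases.csv"
+
+# Import only if the file is present (assumes {rio} is installed)
+if (file.exists(cases_path)) {
+  ebola_cases <- rio::import(cases_path)
+}
+```
+
+Because the path is assembled from components, the same script can run unchanged on a collaborator's machine, whatever folder the project lives in.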
+
 ::::::::::::::::::::::::::::::::: challenge
 
 ###  Reading compressed data 
@@ -129,72 +145,156 @@ rio::import(here::here("data", "Marburg.zip"))
 The [DBI](https://dbi.r-dbi.org/) package serves as a versatile interface for interacting with database management 
 systems (DBMS) across different back-ends or servers. It offers a uniform method for accessing and retrieving data from various database systems.
 
+::::::::::::: discuss
 
-The following code chunk demonstrates how to create a temporary SQLite database in memory, store the `ebola_confirmed` as a table on it, and subsequently read it:
+### When to read directly from a database?
+
+We can use database interface packages to optimize memory usage. If we process the database with "queries" (e.g., select, filter, summarise) before extraction, we can reduce the memory load in our RStudio session. Conversely, conducting all data manipulation outside the database management system can occupy more memory than desired and cause the session to run out of memory.
+
+:::::::::::::
+
+The following code chunks demonstrate in five steps how to create a temporary SQLite database in memory, store the `ebola_confirmed` data frame as a table in it, read it back, and close the connection:
+
+### 1. Connect with a database
+
+First, we establish a connection to an SQLite database created in memory using `DBI::dbConnect()`. 
 
 
 ``` r
 library(DBI)
 library(RSQLite)
 
 # Create a temporary SQLite database in memory
-db_con <- DBI::dbConnect(
+db_connection <- DBI::dbConnect(
   drv = RSQLite::SQLite(),
   dbname = ":memory:"
 )
+```
+
+::::::::::::::::: callout
 
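+A real-life connection to a remote database would look similar to this illustrative example, where the host, user, and password prompt are placeholders: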
+
+```r
+# in real-life
+db_connection <- DBI::dbConnect(
+  RSQLite::SQLite(), 
+  host = "database.epiversetrace.com",
+  user = "juanito",
+  password = epiversetrace::askForPassword("Database password")
+)
+```
+
+:::::::::::::::::
+
+### 2. Write a local data frame as a table in a database
+
+Then, we can write the `ebola_confirmed` data frame to a table named `cases` within the database using the `DBI::dbWriteTable()` function.
+
+
+``` r
 # Store the 'ebola_confirmed' dataframe as a table named 'cases'
 # in the SQLite database
 DBI::dbWriteTable(
-  conn = db_con,
+  conn = db_connection,
   name = "cases",
   value = ebola_confirmed
 )
+```
 
-# Read data from the 'cases' table
-result <- DBI::dbReadTable(
-  conn = db_con,
-  name = "cases"
-)
+In a database framework, you can have more than one table. Each table can belong to a specific `entity` (e.g., patients, care units, jobs). All tables will be related by a common ID or `primary key`.
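+
+To illustrate how tables relate through a common ID, here is a minimal sketch that uses two hypothetical toy tables, `patients` and `visits`, linked by a made-up `patient_id` column. None of these names come from the lesson data, and the chunk assumes `{dbplyr}` is installed so that `{dplyr}` verbs can run on a database table.
+
+```r
+library(DBI)
+library(RSQLite)
+library(dplyr)
+
+# Separate in-memory database, so the lesson connection is untouched
+toy_connection <- DBI::dbConnect(RSQLite::SQLite(), dbname = ":memory:")
+
+# Hypothetical entities: one row per patient, one row per visit
+patients <- data.frame(patient_id = 1:3, age = c(34L, 51L, 27L))
+visits <- data.frame(
+  visit_id = 1:4,
+  patient_id = c(1L, 1L, 2L, 3L),
+  care_unit = c("A", "B", "A", "C")
+)
+
+DBI::dbWriteTable(conn = toy_connection, name = "patients", value = patients)
+DBI::dbWriteTable(conn = toy_connection, name = "visits", value = visits)
+
+# Join inside the database on the common ID, then pull the result down
+visits_with_age <- dplyr::tbl(toy_connection, "visits") %>%
+  dplyr::left_join(dplyr::tbl(toy_connection, "patients"), by = "patient_id") %>%
+  dplyr::collect()
+
+DBI::dbDisconnect(conn = toy_connection)
+```
+
+Each visit keeps its own row and gains the `age` of the matching patient, so the join is resolved by the database before anything reaches the R session.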
 
-# Close the database connection
-DBI::dbDisconnect(conn = db_con)
+### 3. Read data from a table in a database
 
-# View the result
-result %>%
-  dplyr::as_tibble() # for a simple data frame output
+<!-- Subsequently, we read the data from the `cases` table using `DBI::dbReadTable()`. -->
+
+<!-- ```{r,warning=FALSE,message=FALSE} -->
+<!-- # Read data from the 'cases' table -->
+<!-- extracted_data <- DBI::dbReadTable( -->
+<!--   conn = db_connection, -->
+<!--   name = "cases" -->
+<!-- ) -->
+<!-- ``` -->
+
+Subsequently, we read the data from the `cases` table using `dplyr::tbl()`.
+
+
+``` r
+# Read one table from the database
+mytable_db <- dplyr::tbl(src = db_connection, "cases")
+```
+
+If we apply `{dplyr}` verbs to this SQLite database table, these verbs will be translated to SQL queries.
+
+
+``` r
+# Show the SQL queries translated
+mytable_db %>%
+  dplyr::filter(confirm > 50) %>%
+  dplyr::arrange(desc(confirm)) %>%
+  dplyr::show_query()
 ```
 
 ``` output
-# A tibble: 120 × 2
-    date confirm
-   <int>   <int>
- 1 16208       1
- 2 16210       2
- 3 16211       4
- 4 16212       6
- 5 16213       1
- 6 16214       2
- 7 16216      10
- 8 16217       8
- 9 16218       2
-10 16219      12
-# ℹ 110 more rows
+<SQL>
+SELECT `cases`.*
+FROM `cases`
+WHERE (`confirm` > 50.0)
+ORDER BY `confirm` DESC
+```
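+
+The translated SQL can also be sent to the database as plain text. The sketch below is one way to do that with `DBI::dbGetQuery()`, using a small hypothetical `toy_cases` data frame (made-up values, not the lesson data) in its own in-memory database.
+
+```r
+library(DBI)
+library(RSQLite)
+
+# Separate in-memory database holding made-up values
+toy_connection <- DBI::dbConnect(RSQLite::SQLite(), dbname = ":memory:")
+toy_cases <- data.frame(
+  date = c(1L, 2L, 3L, 4L),
+  confirm = c(12L, 84L, 56L, 3L)
+)
+DBI::dbWriteTable(conn = toy_connection, name = "cases", value = toy_cases)
+
+# Send a hand-written query equivalent to filter() followed by arrange(desc())
+toy_result <- DBI::dbGetQuery(
+  conn = toy_connection,
+  statement = "SELECT * FROM cases WHERE confirm > 50 ORDER BY confirm DESC"
+)
+
+DBI::dbDisconnect(conn = toy_connection)
+
+# Two rows remain, ordered from 84 down to 56
+toy_result
+```
+
+Writing the query through `{dplyr}` verbs and writing it as SQL text reach the same rows; the verbs simply save you from typing the SQL by hand.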
+
+### 4. Extract data from the database
+
+Use `dplyr::collect()` to force computation of a database query and extract the output to your local computer.
+
+
+``` r
+# Pull all data down to a local tibble
+extracted_data <- mytable_db %>%
+  dplyr::filter(confirm > 50) %>%
+  dplyr::arrange(desc(confirm)) %>%
+  dplyr::collect()
+```
+
+The `extracted_data` object holds the extracted data, ideally after specifying queries that reduce its size.
+
+
+``` r
+# View the extracted_data
+extracted_data %>%
+  dplyr::as_tibble() # for a simple data frame output
 ```
 
-This code first establishes a connection to an SQLite database created in memory using `dbConnect()`. Then, it writes the `ebola_confirmed` into a table named 'cases' within the database using the `dbWriteTable()` function. Subsequently, it reads the data from the 'cases' table using `dbReadTable()`. Finally, it closes the database connection with `dbDisconnect()`.
+``` output
+# A tibble: 3 × 2
+   date confirm
+  <int>   <int>
+1 16329      84
+2 16328      68
+3 16330      56
+```
 
 :::::::::::::::::::::: callout
 
 ### Run SQL queries in R using dbplyr
 
-We can use database interface packages to optimize memory usage. If we process the database with "queries" (e.g., select, filter, summarise) before extraction, we can reduce the memory load in our RStudio session. Conversely, conducting all data manipulation outside the database management system can lead to occupying more disk space than desired running out of memory.
+Practice how to make relational database SQL queries using multiple `{dplyr}` verbs like `dplyr::left_join()` among tables before pulling down data to your local session with `dplyr::collect()`! 
 
-Read this [tutorial episode on SQL databases and R](https://datacarpentry.org/R-ecology-lesson/05-r-and-databases.html#complex-database-queries) to practice how to make relational database SQL queries using multiple {dplyr} verbs like `left_join()` among tables before pulling down data to your local session with `collect()`!
+You can also review the `{dbplyr}` R package. For a step-by-step tutorial about SQL, we recommend this [tutorial about data management with SQL for Ecologists](https://datacarpentry.org/sql-ecology-lesson/). You will find it close to `{dplyr}`!
 
 ::::::::::::::::::::::
 
 
+### 5. Close the database connection
+
+Finally, we can close the database connection with `DBI::dbDisconnect()`.
+
+
+``` r
+# Close the database connection
+DBI::dbDisconnect(conn = db_connection)
+```
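+
+When the database steps live inside a function, one possible pattern is to register the disconnection with `on.exit()` so the connection is closed even if a later step fails. This is only a sketch: the `read_cases_over()` helper and the `toy_cases` values are hypothetical, and the chunk assumes `{dbplyr}` is installed.
+
+```r
+library(DBI)
+library(RSQLite)
+library(dplyr)
+
+# Hypothetical helper: open, query, and always close the connection
+read_cases_over <- function(data, threshold) {
+  temp_connection <- DBI::dbConnect(RSQLite::SQLite(), dbname = ":memory:")
+  # Registered now, runs when the function exits, even after an error
+  on.exit(DBI::dbDisconnect(temp_connection), add = TRUE)
+
+  DBI::dbWriteTable(conn = temp_connection, name = "cases", value = data)
+
+  # `!!threshold` injects the local value into the translated SQL
+  dplyr::tbl(temp_connection, "cases") %>%
+    dplyr::filter(confirm > !!threshold) %>%
+    dplyr::collect()
+}
+
+# Made-up values for illustration only
+toy_cases <- data.frame(date = 1:3, confirm = c(10L, 60L, 80L))
+cases_over_50 <- read_cases_over(toy_cases, threshold = 50)
+```
+
+The result is collected before the function returns, so `cases_over_50` is an ordinary local tibble that no longer depends on the closed connection.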
+
 ## Reading from HIS APIs
 
 Health-related data are also increasingly stored in specialized HIS APIs like **Fingertips**, **GoData**, **REDCap**, and **DHIS2**. In such cases, one can use the [readepi](https://epiverse-trace.github.io/readepi/) package, which enables reading data from HIS APIs.