-
Notifications
You must be signed in to change notification settings - Fork 13
/
Copy path095-lazy-evalutation-and-timing.Rmd
163 lines (126 loc) · 6.83 KB
/
095-lazy-evalutation-and-timing.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
# Lazy Evaluation and Execution Environment {#chapter_lazy-evaluation-and-timing}
> This chapter:
>
> * Builds on the lazy loading discussion in the previous chapter
> * Demonstrates how the use of the `dplyr::collect()` creates a boundary between code that is sent to a dbms and code that is executed locally
## Setup
The following packages are used in this chapter:
```{r chapter package list, echo=TRUE, message=FALSE, warning=FALSE}
library(tidyverse)
library(DBI)
library(RPostgres)
library(dbplyr)
require(knitr)
library(bookdown)
library(sqlpetr)
sleep_default <- 3
```
If you have not yet set up the Docker container with PostgreSQL and the dvdrental database, go back to [those instructions][Build the pet-sql Docker Image] to configure your environment. Otherwise, start your `adventureworks` container:
```{r check on adventureworks}
sqlpetr::sp_docker_start("adventureworks")
Sys.sleep(sleep_default)
```
Connect to the database:
```{r connect to postgresql}
con <- dbConnect(
RPostgres::Postgres(),
# without the previous and next lines, some functions fail with bigint data
# so change int64 to integer
bigint = "integer",
host = "localhost",
user = Sys.getenv("DEFAULT_POSTGRES_USER_NAME"),
password = Sys.getenv("DEFAULT_POSTGRES_PASSWORD"),
dbname = "adventureworks",
port = 5432)
```
Here is a simple string of `dplyr` verbs similar to the query used to illustrate issues in the last chapter:
Note that in the previous example we follow this book's convention of creating a connection object to each table and fully qualifying function names (e.g., specifying the package). In practice, it's possible and convenient to use more abbreviated notation.
```{r simplify the simpe join}
Q <- tbl(con, in_schema("sales", "salesperson")) %>%
left_join(tbl(con, in_schema("humanresources", "employee")), by = c("businessentityid" = "businessentityid")) %>%
select(birthdate, saleslastyear)
Q
```
### Experiment overview
Think of `Q` as a black box for the moment. The following examples will show how `Q` is interpreted differently by different functions. It's important to remember in the following discussion that the "**and then**" operator (`%>%`) actually wraps the subsequent code inside the preceding code so that `Q %>% print()` is equivalent to `print(Q)`.
**Notation**
> |Symbol|Explanation
> |----|-------------
> | ![](screenshots/green-check.png)| A single green check indicates that some rows are returned. <br>
> | ![](screenshots/green-check.png) ![](screenshots/green-check.png)| Two green checks indicate that all the rows are returned.
> | ![](screenshots/red-x.png) |The red X indicates that no rows are returned.
>
> R code | Result
> -------| --------------
> **Time-based, execution environment issues** |
> [`Qc <- Q %>% count(saleslastyear, sort = TRUE)`](#lazy_q_build) | ![](screenshots/red-x.png) **Extends** the lazy query object
>
>
The next chapter will discuss how to build queries and how to explore intermediate steps. But first, the following subsections provide a more detailed discussion of each row in the preceding table.
### Time-based, execution environment issues
Remember that if the expression is assigned to an object, it is not executed. If an expression is entered on the command line or appears in your script by itself, a `print()` function is implied.
> *These two are different:*
> Q %>% sum(saleslastyear)
> Q_query <- Q %>% sum(saleslastyear)
>
This behavior is the basis of a useful debugging and development process where queries are built up incrementally.
### Q %>% `more dplyr` {#lazy_q_build}
![](screenshots/green-check.png) Because the following statement implies a `print()` function at the end, we can run it repeatedly, adding dplyr expressions, and only get 10 rows back. Every time we add a dplyr expression to a chain, R will rewrite the SQL code. For example:
As we understand more about the data, we simply add dplyr expressions to pinpoint what we are looking for:
```{r three more dplyr}
Q %>% filter(saleslastyear > 40) %>%
arrange(desc(saleslastyear))
```
```{r one more dplyr}
Q %>% summarize(total_sales = sum(saleslastyear, na.rm = TRUE), sales_persons_count = n())
```
![](screenshots/green-check.png) When all the accumulated `dplyr` verbs are executed, they are submitted to the dbms and the number of rows that are returned follow the same rules as discussed above.
### Interspersing SQL and dplyr
```{r date in both}
Q %>%
# mutate(birthdate = date(birthdate)) %>%
show_query()
# Need to come up with a different example illustrating where
# the `collect` statement goes.
# sales_person_table %>%
# mutate(birthdate = date(birthdate))
#
# try(sales_person_table %>%
# mutate(birthdate = lubridate::date(birthdate))
# )
#
# sales_person_table %>% collect() %>%
# mutate(birthdate = lubridate::date(birthdate))
```
This may not be relevant in the context where it turns out that dates in adventureworks come through as date!
The idea is to show how functions are interpreted BEFORE sending to the SQL translator.
```{r}
to_char <- function(date, fmt) {return(fmt)}
# sales_person_table %>%
# mutate(birthdate = to_char(birthdate, "YYYY-MM")) %>%
# show_query()
#
# sales_person_table %>%
# mutate(birthdate = to_char(birthdate, "YYYY-MM"))
```
### Many handy R functions can't be translated to SQL
![](screenshots/green-check.png) It just so happens that PostgreSQL has a `date` function that does the same thing as the `date` function in the `lubridate` package. In the following code the `date` function is executed by PostreSQL.
```{r postgresql homonym}
# sales_person_table %>% mutate(birthdate = date(birthdate))
```
![](screenshots/green-check.png) ![](screenshots/green-check.png) If we specify that we want to use the `lubridate` version (or any number of other R functions) they are passed to the dbms unless we explicitly tell `dplyr` to stop translating and bring the results back to the R environment for local processing.
```{r collect as delimiter}
try(sales_person_table %>% collect() %>%
mutate(birthdate = lubridate::date(birthdate)))
```
### Further lazy execution examples
See more examples of lazy execution [here](https://datacarpentry.org/R-ecology-lesson/05-r-and-databases.html).
## Disconnect from the database and stop Docker
```{r}
dbDisconnect(con)
sp_docker_stop("adventureworks")
```
## Other resources
* Benjamin S. Baumer. 2017. A Grammar for Reproducible and Painless Extract-Transform-Load Operations on Medium Data. [https://arxiv.org/abs/1708.07073](https://arxiv.org/abs/1708.07073)
* dplyr Reference documentation: Remote tables. [https://dplyr.tidyverse.org/reference/index.html#section-remote-tables](https://dplyr.tidyverse.org/reference/index.html#section-remote-tables)
* Data Carpentry. SQL Databases and R. [https://datacarpentry.org/R-ecology-lesson/05-r-and-databases.html](https://datacarpentry.org/R-ecology-lesson/05-r-and-databases.html)