-
Notifications
You must be signed in to change notification settings - Fork 13
/
Copy path160-anti-join-cost-comparisons.Rmd
184 lines (146 loc) · 4.67 KB
/
160-anti-join-cost-comparisons.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
# Anti-join cost comparisons {#chapter_anti-join-cost-comparisons}
```{r setup, echo=FALSE, message=FALSE, warning=FALSE}
# These packages are called in almost every chapter of the book:
library(tidyverse)
library(DBI)
library(RPostgres)
library(glue)
library(here)
require(knitr)
library(dbplyr)
library(sqlpetr)
```
Verify Docker is up and running:
```{r}
sp_check_that_docker_is_up()
```
Verify pet DB is available, it may be stopped.
```{r}
sp_show_all_docker_containers()
```
Start up the `docker-pet` container
```{r}
sp_docker_start("sql-pet")
```
Now connect to the database with R
```{r}
# need to wait for Docker & Postgres to come up before connecting.
con <- sp_get_postgres_connection(
user = Sys.getenv("DEFAULT_POSTGRES_USER_NAME"),
password = Sys.getenv("DEFAULT_POSTGRES_PASSWORD"),
dbname = "dvdrental",
seconds_to_test = 30, connection_tab = TRUE
)
```
```{r collapse=TRUE}
source(file=here('book-src/sql_pet_data.R'),echo=TRUE)
```
Explain plans [here](https://robots.thoughtbot.com/reading-an-explain-analyze-query-plan)
## SQL anti join Costs
```{r}
sql_aj1 <- dbGetQuery(
con,
"explain analyze
select 'aj' join_type, customer_id, first_name, last_name, c.store_id,count(*) ajs
from customer c left outer join store s on c.store_id = s.store_id
where s.store_id is null
group by customer_id, first_name, last_name, c.store_id
order by c.customer_id;"
)
sql_aj2 <- dbGetQuery(
con,
"explain analyze
select 'aj' join_type, customer_id, first_name, last_name, c.store_id,count(*) ajs
from customer c
where c.store_id NOT IN (select store_id from store)
group by customer_id, first_name, last_name, c.store_id
order by c.customer_id;"
)
sql_aj3 <- dbGetQuery(
con,
"explain analyze
select 'aj' join_type, customer_id, first_name, last_name, c.store_id,count(*) ajs
from customer c
where not exists (select s.store_id from store s where s.store_id = c.store_id)
group by customer_id, first_name, last_name, c.store_id
order by c.customer_id
"
)
```
##### SQL Costs
```{r}
print(glue("sql_aj1 loj-null costs=", sql_aj1[1, 1]))
print(glue("sql_aj2 not in costs=", sql_aj2[1, 1]))
print(glue("sql_aj3 not exist costs=", sql_aj3[1, 1]))
```
## dplyr Anti joins
In this next section we look at two methods to implemnt an anti join in dplyr.
```{r}
customer_table <- tbl(con, "customer") # DBI::dbReadTable(con, "customer")
rental_table <- tbl(con, "rental") # DBI::dbReadTable(con, "rental")
# Method 1. dplyr anti_join
daj1 <-
anti_join(customer_table, rental_table, by = "customer_id", suffix = c(".c", ".r")) %>%
select(c("first_name", "last_name", "email")) %>%
explain()
```
```{r}
customer_table <- tbl(con, "customer") # DBI::dbReadTable(con, "customer")
rental_table <- tbl(con, "rental") # DBI::dbReadTable(con, "rental")
# Method 2. dplyr loj with NA
daj2 <-
left_join(customer_table, rental_table, by = c("customer_id", "customer_id"), suffix = c(".c", ".r")) %>%
filter(is.na(rental_id)) %>%
select(c("first_name", "last_name", "email")) %>%
explain()
```
<!--
### dplyr Costs
```
<PLAN>
Hash Anti Join (cost=510.99..529.72 rows=1 width=45)
Hash Cond: ("TBL_LEFT".customer_id = "TBL_RIGHT".customer_id)
-> Seq Scan on customer "TBL_LEFT" (cost=0.00..14.99 rows=599 width=49)
-> Hash (cost=310.44..310.44 rows=16044 width=2)
-> Seq Scan on rental "TBL_RIGHT" (cost=0.00..310.44 rows=16044 width=2)
```
```
<PLAN>
Hash Right Join (cost=22.48..375.33 rows=1 width=45)
Hash Cond: ("TBL_RIGHT".customer_id = "TBL_LEFT".customer_id)
Filter: ("TBL_RIGHT".rental_id IS NULL)
-> Seq Scan on rental "TBL_RIGHT" (cost=0.00..310.44 rows=16044 width=6)
-> Hash (cost=14.99..14.99 rows=599 width=49)
-> Seq Scan on customer "TBL_LEFT" (cost=0.00..14.99 rows=599 width=49)
```
-->
In this example, the dplyr anti_join verb is *`r 529.72/375.33` to `r 510.99/22.48`* times more expensive than the left outer join with a null condition.
```{r}
sql_aj1 <- dbGetQuery(
con,
"explain analyze select c.customer_id,count(*) lojs
from customer c left outer join rental r on c.customer_id = r.customer_id
where r.customer_id is null
group by c.customer_id
order by c.customer_id;"
)
sp_print_df(sql_aj1)
sql_aj1
sql_aj3 <- dbGetQuery(
con,
"explain analyze
select c.customer_id,count(*) lojs
from customer c
where not exists (select customer_id from rental r where c.customer_id = r.customer_id)
group by c.customer_id
"
)
print(glue("sql_aj1 loj-null costs=", sql_aj1[1, 1]))
print(glue("sql_aj3 not exist costs=", sql_aj3[1, 1]))
```
## Cleaning up
Always have R disconnect from the database when you're done and stop the Adventureworks Container
```{r}
dbDisconnect(con)
sp_docker_stop("adventureworks")
```