# Programming in R {#programming}
## Objectives & Resources
Now we are going to build a little analysis. We will learn to automate our analyses with a for loop. We will make figures, and save them each with automated labeling. Then, we will join data from different files and conditionally label them with if/else statements.
### Objectives
- create an R script
- for loops
- joining data
- if statements
<!---
- make sure your loop worked like you wanted
- if statements (conditionals)
- write message() to yourself
- list.files()
- importing and writing data
- write a local copy of gapminder data to data/ folder
- installing packages from github
**Resources**
--->
<!--- Would love to include:
When trying to see if a number or text is equal to some *single* value, use `==`. To check it against *multiple* values, use `%in%`. mine’s "%!in%" <- Negate("%in%")
list.files
file.path()
message() (with an if statement maybe)
get file extensions https://stat.ethz.ch/R-manual/R-devel/library/tools/html/fileutils.html
--->
## Analysis plan
```{r, echo=FALSE, message=FALSE, warning=FALSE}
library(tidyverse)
library(gapminder)
```
OK, here is the plan for our analysis. We want to plot the gdpPercap for each country in the gapminder data frame. So that's 142 separate plots! We will automate this, labeling each one with its name and saving it in a folder called figures. We will learn a bunch of things as we go.
## Create an R script
OK, now, we are going to create an R script. What is an R script? It's a text file with a .R extension. We've been writing R code in R Markdown files so far; R scripts are just R code without the Markdown along with it.
Go to File > New File > R Script (or click the green plus in the top left corner).
Let's start off with a few comments so that we know what it is for, and save it:
```
## gapminder-analysis.R
## analysis with gapminder data
## J Lowndes [email protected]
```
We'll be working with the gapminder data again so let's read it in here:
```{r load, message=FALSE}
## load libraries
library(tidyverse)
## read in gapminder data
gapminder <- readr::read_csv('https://raw.githubusercontent.com/OHI-Science/data-science-training/master/data/gapminder.csv')
```
Remember, like in R Markdown, hitting return does not execute this command. To execute it, we need to get what we typed in the script down into the console. Here is how we can do that:
1. copy-paste this line into the console.
2. select the line (or simply put the cursor there), and click 'Run'. This is available from
    a. the bar above the script (green arrow)
    b. the menu bar: Code > Run Selected Line(s)
    c. keyboard shortcut: command-return
3. source the script, which means running the whole thing. This is also a great way to see if there are any typos in your code that you've missed. You can do this by:
    a. clicking Source (blue arrow in the bar above the script).
    b. typing `source('gapminder-analysis.R')` in the console (or from another R file!).
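As a minimal sketch of what sourcing does, the example below writes a tiny throwaway script to a temporary folder and then sources it. The file name `tiny-script.R` and the object `x_from_script` are made up for illustration; they are not part of our analysis.

```r
## write a tiny throwaway script to a temporary folder (hypothetical example)
script_path <- file.path(tempdir(), "tiny-script.R")
writeLines(c("## tiny-script.R", "x_from_script <- 1 + 1"), script_path)

## source() runs every line of the file, top to bottom
source(script_path)

## objects created by the script are now available to us
x_from_script
```

If the script had a typo, `source()` would stop with an error at that line, which is why sourcing is a handy check.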
## Automation with for loops
Our plan is to plot gdpPercap for each country. This means that we want to do the same operation (plotting gdpPercap) on a bunch of different things (countries). Yesterday we learned dplyr's `group_by()` function, which is super powerful for automating operations across groups. But there are things that you may not want to do with `group_by()`, like plotting. So we will use a for loop.
Let's start off with what this would look like for just one country. I'm going to demonstrate with Afghanistan:
<!---TODO
For the figures, we want it to label the currency, which we have in another data file (=join). And, we'll want to add Westeros to the dataframe (=rbind) and create that figure too.
--->
```{r}
## filter the country to plot
gap_to_plot <- gapminder %>%
  filter(country == "Afghanistan")

## plot
my_plot <- ggplot(data = gap_to_plot, aes(x = year, y = gdpPercap)) +
  geom_point() +
  labs(title = "Afghanistan")
```
Let's actually give this a better title than just the country name. Let's use the `base::paste()` function from base R to paste two strings together so that the title is more descriptive. Use `?paste` to see what the `sep` argument does.
```{r}
## filter the country to plot
gap_to_plot <- gapminder %>%
  filter(country == "Afghanistan")

## plot
my_plot <- ggplot(data = gap_to_plot, aes(x = year, y = gdpPercap)) +
  geom_point() +
  ## add title and save
  labs(title = paste("Afghanistan", "GDP per capita", sep = " "))
```
And as a last step, let's save this figure.
```{r, eval = FALSE}
## filter the country to plot
gap_to_plot <- gapminder %>%
  filter(country == "Afghanistan")

## plot
my_plot <- ggplot(data = gap_to_plot, aes(x = year, y = gdpPercap)) +
  geom_point() +
  ## add title and save
  labs(title = paste("Afghanistan", "GDP per capita", sep = " "))

ggsave(filename = "Afghanistan_gdpPercap.png", plot = my_plot)
```
OK. So we can check our repo in the file pane (bottom right of RStudio) and see the generated figure:
![](img/Afghanistan_gdpPercap.png)
### Thinking ahead: cleaning up our code
Now, in our code above, we've had to write out "Afghanistan" several times. This is not only typo-prone as we type it each time, but if we wanted to plot another country, we'd have to change it in 3 places too. It is not setting us up for an easy time in our future, and thinking ahead in programming is something to keep in mind.
Instead of having "Afghanistan" written 3 times, let's create an object and assign "Afghanistan" to it. We won't name our object "country" because that's a column header in gapminder, and it would just confuse us. Let's make it distinctive: let's write cntry (country without vowels):
```{r, eval = FALSE}
## create country variable
cntry <- "Afghanistan"
```
Now, we can replace each `"Afghanistan"` with our variable `cntry`. We will have to introduce a `paste` statement here too, and we want to separate by nothing (`""`). Note: there are many ways to create the filename, and we are doing it this way for a specific reason right now.
```{r, eval = FALSE}
## create country variable
cntry <- "Afghanistan"

## filter the country to plot
gap_to_plot <- gapminder %>%
  filter(country == cntry)

## plot
my_plot <- ggplot(data = gap_to_plot, aes(x = year, y = gdpPercap)) +
  geom_point() +
  ## add title and save
  labs(title = paste(cntry, "GDP per capita", sep = " "))

## note: there are many ways to create filenames with paste() or file.path(); we are doing it this way for a reason.
ggsave(filename = paste(cntry, "_gdpPercap.png", sep = ""), plot = my_plot)
```
Let's run this. Great! It saved our figure (I can tell because the timestamp in the Files pane has updated!).
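For reference, here is a short sketch of a few of those "many ways" to build a filename. The object names (`fname_paste`, `fname_paste0`, `fname_in_folder`) are made up for illustration.

```r
cntry <- "Afghanistan"

## what we used above: paste() with nothing between the pieces
fname_paste <- paste(cntry, "_gdpPercap.png", sep = "")

## paste0() is shorthand for paste(..., sep = "")
fname_paste0 <- paste0(cntry, "_gdpPercap.png")

## file.path() joins folder and file names with "/"
fname_in_folder <- file.path("figures", fname_paste0)

fname_paste      # "Afghanistan_gdpPercap.png"
fname_in_folder  # "figures/Afghanistan_gdpPercap.png"
```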
### For loop basic structure
Now, how about if we want to plot not only Afghanistan, but other countries as well? There wasn't actually that much code needed to get us here, but we definitely do not want to copy this for every country. Even if we copy-pasted and switched out the country assigned to the `cntry` variable, it would be very typo-prone. Plus, what if you wanted to instead plot lifeExp? You'd have to remember to change it each time...it gets messy quick.
Better with a for loop. This will let us cycle through and do what we want to each thing in turn. If you want to iterate over a set of values, and perform the same operation on each, a `for` loop will do the job.
**Sit back and watch me for a few minutes while we develop the for loop.** Then we'll give you time to do this on your computers as well.
The basic structure of a `for` loop is:
```{r, eval=FALSE}
for( each item in set of items ){
  do a thing
}
```
Note the `( )` and the `{ }`. We talk about iterating through each item in the for loop; the variable that takes each item's value in turn is often called the iterator.
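Before we apply this to gapminder, here is a tiny runnable version of that structure. The vector `fruits` and the loop variable `fruit` are made up for illustration.

```r
## a set of items to loop over
fruits <- c("apple", "banana", "cherry")

## `fruit` takes the value of each item in turn
for (fruit in fruits) {
  print(paste("I like", fruit))
}
```

After the loop finishes, `fruit` still holds the last value it was given (`"cherry"`), which can be handy when you are debugging.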
So looking back at our Afghanistan code: all of this is pretty much the "do a thing" part. And we can see that there are only a few places that are specific to Afghanistan. If we could make those places not specific to Afghanistan, we would be set.
![](img/for_loop_logic.png)
Let's paste from what we had before, and modify it. I'm also going to use RStudio's indentation help to indent the lines within the for loop by highlighting the code in this chunk and going to Code > Reindent Lines (shortcut: command I)
```{r, eval=FALSE}
## create country variable
cntry <- "Afghanistan"

for (each cntry in a list of countries ) {

  ## filter the country to plot
  gap_to_plot <- gapminder %>%
    filter(country == cntry)

  ## plot
  my_plot <- ggplot(data = gap_to_plot, aes(x = year, y = gdpPercap)) +
    geom_point() +
    ## add title and save
    labs(title = paste(cntry, "GDP per capita", sep = " "))

  ggsave(filename = paste(cntry, "_gdpPercap.png", sep = ""), plot = my_plot)
}
```
### Executable for loop!
OK. So let's start with the beginning of the for loop. We want a list of countries that we will iterate through. We can do that by adding this code before the for loop.
```{r, eval=FALSE}
## create a list of countries
country_list <- c("Albania", "Fiji", "Spain")

for ( cntry in country_list ) {

  ## filter the country to plot
  gap_to_plot <- gapminder %>%
    filter(country == cntry)

  ## plot
  my_plot <- ggplot(data = gap_to_plot, aes(x = year, y = gdpPercap)) +
    geom_point() +
    ## add title and save
    labs(title = paste(cntry, "GDP per capita", sep = " "))

  ggsave(filename = paste(cntry, "_gdpPercap.png", sep = ""), plot = my_plot)
}
```
At this point, we do have a functioning for loop. For each item in the `country_list`, the for loop will iterate over the code within the `{ }`, changing `cntry` each time as it goes through the list. And we can see it works because we can see them appear in the files pane at the bottom right of RStudio!
Great! And it doesn't matter if we just use these three countries or all the countries--let's try it.
But first let's create a figure directory and make sure it saves there since it's going to get out of hand quickly. We could do this from the Finder/Windows Explorer, or from the "Files" pane in RStudio by clicking "New Folder" (green plus button). But we are going to do it in R. A folder is called a directory:
```{r, eval=FALSE}
dir.create("figures")

## create a list of countries
country_list <- unique(gapminder$country) # ?unique() returns the unique values

for( cntry in country_list ){

  ## filter the country to plot
  gap_to_plot <- gapminder %>%
    filter(country == cntry)

  ## plot
  my_plot <- ggplot(data = gap_to_plot, aes(x = year, y = gdpPercap)) +
    geom_point() +
    ## add title and save
    labs(title = paste(cntry, "GDP per capita", sep = " "))

  ## add the figures/ folder
  ggsave(filename = paste("figures/", cntry, "_gdpPercap.png", sep = ""), plot = my_plot)
}
```
So that took a little longer than just the 3, but still super fast. For loops are sometimes just the thing you need to iterate over many things in your analyses.
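It is worth checking that a loop did what you wanted. The sketch below shows one way to do that with `list.files()`. To keep it self-contained it uses a temporary folder and small placeholder files as stand-ins for the real figures; the folder name `figures_check` and the objects `saved` and `expected` are made up for illustration.

```r
## stand-ins for the figures/ folder and the country list used above
fig_dir <- file.path(tempdir(), "figures_check")
dir.create(fig_dir, showWarnings = FALSE)
country_list <- c("Albania", "Fiji", "Spain")

## stand-in for the plotting loop: write one small placeholder file per country
for (cntry in country_list) {
  writeLines(cntry, file.path(fig_dir, paste0(cntry, "_gdpPercap.png")))
}

## check: is there one file per country?
saved <- list.files(fig_dir, pattern = "_gdpPercap\\.png$")
length(saved) == length(country_list)

## which expected files, if any, are missing?
expected <- paste0(country_list, "_gdpPercap.png")
setdiff(expected, saved)
```

With the real loop, you would point `list.files()` at `"figures"` and compare against the full `country_list`.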
### Clean up our repo
OK we now have 142 figures that we just created. They exist locally on our computer, and we have the code to recreate them anytime. But, we don't really need to push them to GitHub. Let's delete the figures/ folder and see it disappear from the Git tab.
### Your turn
1. Modify our for loop so that it:
    - loops through countries in Europe only
    - plots the cumulative mean gdpPercap (Hint: Use the [Data Wrangling Cheatsheet](https://www.rstudio.com/resources/cheatsheets/)!)
    - saves them to a new subfolder inside the (recreated) figures folder called "Europe".
1. Sync to GitHub
#### Answer
No peeking!
```{r, eval=FALSE}
dir.create("figures")
dir.create("figures/Europe")

## create a list of countries. Calculations go here, not in the for loop
gap_europe <- gapminder %>%
  filter(continent == "Europe") %>%
  group_by(country) %>% # so the cumulative mean restarts for each country
  mutate(gdpPercap_cummean = dplyr::cummean(gdpPercap)) %>%
  ungroup()

country_list <- unique(gap_europe$country) # ?unique() returns the unique values

for( cntry in country_list ){ # (cntry = country_list[1])

  ## filter the country to plot
  gap_to_plot <- gap_europe %>%
    filter(country == cntry)

  ## add a print message to see what's plotting
  print(paste("Plotting", cntry))

  ## plot
  my_plot <- ggplot(data = gap_to_plot, aes(x = year, y = gdpPercap_cummean)) +
    geom_point() +
    ## add title and save
    labs(title = paste(cntry, "GDP per capita", sep = " "))

  ggsave(filename = paste("figures/Europe/", cntry, "_gdpPercap_cummean.png", sep = ""),
         plot = my_plot)
}
```
Notice how we put the calculation for `cummean()` outside the for loop. It could have gone inside, but it's an operation that can be done just once beforehand (outside the loop) rather than multiple times as you go (inside the for loop).
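One thing to watch when a cumulative mean is calculated once for a data frame that holds several countries: the running mean should restart for each country. Here is a base R sketch with made-up numbers (the data frame `toy` is purely illustrative) that shows the difference. In dplyr, the grouped version corresponds to calling `group_by(country)` before `mutate()`.

```r
## made-up toy data: two countries, three years each
toy <- data.frame(
  country   = c("A", "A", "A", "B", "B", "B"),
  gdpPercap = c(10, 20, 30, 100, 200, 300)
)

## running mean over the whole column: country B's values get mixed with A's
cummean_all <- cumsum(toy$gdpPercap) / seq_along(toy$gdpPercap)

## running mean restarted within each country
toy$cummean_by_country <- ave(
  toy$gdpPercap, toy$country,
  FUN = function(v) cumsum(v) / seq_along(v)
)

cummean_all             # 10 15 20 40 72 110
toy$cummean_by_country  # 10 15 20 100 150 200
```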
## Conditional statements with `if` and `else`
Often when we're coding we want to control the flow of our actions. This can be done
by setting actions to occur only if a condition or a set of conditions is met.
In R and other languages, these are called "if statements".
### if statement basic structure
```{r, eval=FALSE}
# if
if (condition is true) {
  do something
}

# if ... else
if (condition is true) {
  do something
} else { # that is, if the condition is false,
  do something different
}
```
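Here is a tiny runnable version of that structure. The values `temperature` and `outfit` are made up for illustration.

```r
temperature <- 12 # made-up value

if (temperature > 20) {
  outfit <- "t-shirt"
} else { # that is, if temperature is 20 or lower
  outfit <- "sweater"
}

outfit # "sweater"
```

Try changing `temperature` to 25 and rerunning to see the other branch.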
Let's bring this concept into our for loop for Europe that we've just done. What if we want to add the label "Estimated" to countries that were estimated? Here's what we'd do.
First, import a csv file with information on whether the data were estimated or reported, and join it to the gapminder dataset:
```{r}
est <- readr::read_csv('https://raw.githubusercontent.com/OHI-Science/data-science-training/master/data/countries_estimated.csv')
gapminder_est <- left_join(gapminder, est)
```
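To see what a left join does, here is a small sketch with made-up data. It uses base R's `merge()` with `all.x = TRUE` as a stand-in so that it runs without any packages; `left_join(gap_toy, est_toy, by = "country")` would be the dplyr counterpart (note that `merge()` also sorts the result by the key column). The data frames `gap_toy` and `est_toy` are hypothetical.

```r
## made-up stand-ins for gapminder and the estimated/reported table
gap_toy <- data.frame(country = c("Albania", "Fiji", "Spain", "Spain"),
                      year    = c(2007, 2007, 2002, 2007))
est_toy <- data.frame(country   = c("Albania", "Spain"),
                      estimated = c("yes", "no"))

## keep every row of gap_toy; add `estimated` wherever the country matches
joined <- merge(gap_toy, est_toy, by = "country", all.x = TRUE)
joined
```

Every row of `gap_toy` is kept, each row picks up the `estimated` value for its country, and a country with no match in `est_toy` (Fiji here) gets `NA`.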
```{r, eval=FALSE}
dir.create("figures")
dir.create("figures/Europe")

## create a list of countries
gap_europe <- gapminder_est %>% ## use instead of gapminder
  filter(continent == "Europe") %>%
  group_by(country) %>% # so the cumulative mean restarts for each country
  mutate(gdpPercap_cummean = dplyr::cummean(gdpPercap)) %>%
  ungroup()

country_list <- unique(gap_europe$country)

for( cntry in country_list ){ # (cntry = country_list[1])

  ## filter the country to plot
  gap_to_plot <- gap_europe %>%
    filter(country == cntry)

  ## add a print message
  print(paste("Plotting", cntry))

  ## plot
  my_plot <- ggplot(data = gap_to_plot, aes(x = year, y = gdpPercap_cummean)) +
    geom_point() +
    ## add title and save
    labs(title = paste(cntry, "GDP per capita", sep = " "))

  ## if estimated, add that as a subtitle.
  if (gap_to_plot$estimated == "yes") {

    ## add a print statement just to check
    print(paste(cntry, "data are estimated"))

    my_plot <- my_plot +
      labs(subtitle = "Estimated data")
  }
  # Warning message:
  # In if (gap_to_plot$estimated == "yes") { :
  # the condition has length > 1 and only the first element will be used

  ggsave(filename = paste("figures/Europe/", cntry, "_gdpPercap_cummean.png", sep = ""),
         plot = my_plot)
}
```
This worked, but we got a warning message with the if statement. (In R 4.2.0 and later, a condition with length greater than one is an error rather than a warning, so the loop stops instead.) This is because `gap_to_plot$estimated` holds many "yes"s or "no"s, one per row, and the if statement uses only the first one. We know that if any are yes, all are yes, but you can imagine that this could lead to problems down the line if you *didn't* know that. So let's be explicit:
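To see what `if` is being handed, here is a small sketch with a made-up vector standing in for `gap_to_plot$estimated`. The names `estimated` and `label` are just for this illustration.

```r
## what the filtered column looks like: one value per row (made-up example)
estimated <- c("yes", "yes", "yes")

## the comparison returns one TRUE/FALSE per element, which is too many for if()
estimated == "yes"

## any() and all() collapse that to a single TRUE or FALSE
any(estimated == "yes")
all(estimated == "yes")

if (any(estimated == "yes")) {
  label <- "Estimated data"
}
label
```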
### Executable if statement
```{r, eval=FALSE}
dir.create("figures")
dir.create("figures/Europe")

## create a list of countries
gap_europe <- gapminder_est %>% ## use instead of gapminder
  filter(continent == "Europe") %>%
  group_by(country) %>% # so the cumulative mean restarts for each country
  mutate(gdpPercap_cummean = dplyr::cummean(gdpPercap)) %>%
  ungroup()

country_list <- unique(gap_europe$country)

for( cntry in country_list ){ # (cntry = country_list[1])

  ## filter the country to plot
  gap_to_plot <- gap_europe %>%
    filter(country == cntry)

  ## add a print message
  print(paste("Plotting", cntry))

  ## plot
  my_plot <- ggplot(data = gap_to_plot, aes(x = year, y = gdpPercap_cummean)) +
    geom_point() +
    ## add title and save
    labs(title = paste(cntry, "GDP per capita", sep = " "))

  ## if estimated, add that as a subtitle.
  if (any(gap_to_plot$estimated == "yes")) { # any() will return a single TRUE or FALSE

    print(paste(cntry, "data are estimated"))

    my_plot <- my_plot +
      labs(subtitle = "Estimated data")
  }

  ggsave(filename = paste("figures/Europe/", cntry, "_gdpPercap_cummean.png", sep = ""),
         plot = my_plot)
}
```
OK so this is working as we expect! Note that we do not need an `else` statement above, because we only want to do something (add a subtitle) if one condition is met. But what if we want to add a different subtitle based on another condition, say where the data are reported, to be extra explicit about it?
### Executable if/else statement
```{r, eval=FALSE}
dir.create("figures")
dir.create("figures/Europe")

## create a list of countries
gap_europe <- gapminder_est %>% ## use instead of gapminder
  filter(continent == "Europe") %>%
  group_by(country) %>% # so the cumulative mean restarts for each country
  mutate(gdpPercap_cummean = dplyr::cummean(gdpPercap)) %>%
  ungroup()

country_list <- unique(gap_europe$country)

for( cntry in country_list ){ # (cntry = country_list[1])

  ## filter the country to plot
  gap_to_plot <- gap_europe %>%
    filter(country == cntry)

  ## add a print message
  print(paste("Plotting", cntry))

  ## plot
  my_plot <- ggplot(data = gap_to_plot, aes(x = year, y = gdpPercap_cummean)) +
    geom_point() +
    ## add title and save
    labs(title = paste(cntry, "GDP per capita", sep = " "))

  ## if estimated, add that as a subtitle.
  if (any(gap_to_plot$estimated == "yes")) { # any() will return a single TRUE or FALSE

    print(paste(cntry, "data are estimated"))

    my_plot <- my_plot +
      labs(subtitle = "Estimated data")
  } else {

    my_plot <- my_plot +
      labs(subtitle = "Reported data")

    print(paste(cntry, "data are reported"))
  }

  ggsave(filename = paste("figures/Europe/", cntry, "_gdpPercap_cummean.png", sep = ""),
         plot = my_plot)
}
```
Note that this works because we know there are only two conditions, `estimated == "yes"` and `estimated == "no"`. In the first `if` statement we asked for estimated data, and the `else` condition gives us everything else (which we know is reported). We can be explicit about setting these conditions in the `else` clause by instead using an `else if` statement. Below is how you would construct this in your for loop, similar to above:
```{r, eval=FALSE}
if (any(gap_to_plot$estimated == "yes")) { # any() will return a single TRUE or FALSE

  print(paste(cntry, "data are estimated"))

  my_plot <- my_plot +
    labs(subtitle = "Estimated data")
} else if (any(gap_to_plot$estimated == "no")){

  my_plot <- my_plot +
    labs(subtitle = "Reported data")

  print(paste(cntry, "data are reported"))
}
```
This construction is necessary if you have more than two conditions to test for.
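As a sketch of what more than two conditions can look like, the loop below runs three made-up status values through an `if` / `else if` / `else` chain. The value `"maybe"` and the "Data source unknown" label are hypothetical; they are not in our data.

```r
## collect one subtitle per status so we can inspect the result afterwards
subtitles <- character(0)

for (status in c("yes", "no", "maybe")) {

  if (status == "yes") {
    subtitle <- "Estimated data"
  } else if (status == "no") {
    subtitle <- "Reported data"
  } else { # catch-all for anything we did not anticipate
    subtitle <- "Data source unknown"
  }

  subtitles[status] <- subtitle
}

subtitles
```

A final `else` like this is a useful safety net: an unexpected value gets an explicit label instead of silently slipping through.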
## More R!
With just a little bit of time left, here are some things that you can look into more on your own.
### Importing and Installing
Remember you'll use `install.packages("package-name-in-quotes")` to install from CRAN.
Here are some really helpful packages for you to work with:
- `readr` to read in .csv files
- `readxl` to read in Excel files
- `stringr` to work with strings
- `lubridate` to work with dates
You can also install packages directly from GitHub, using the `devtools` package. Then, instead of `install.packages()`, you'll use `devtools::install_github()`. And you can create *your own* packages when you're ready. Read http://r-pkgs.had.co.nz/ to learn how!
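If you are not sure which of these packages you already have, one possible way to check is sketched below. It uses `installed.packages()` from the `utils` package that ships with R; the object names `pkgs` and `to_install` are made up for illustration, and the install line is commented out so nothing is downloaded unless you choose to run it.

```r
pkgs <- c("readr", "readxl", "stringr", "lubridate")

## which of these are not installed yet?
to_install <- setdiff(pkgs, rownames(installed.packages()))
to_install

## install only the missing ones from CRAN (uncomment to run)
# install.packages(to_install)
```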
### Organization and workflows
- set up folders for figures, intermediate analyses, and final outputs
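A minimal sketch of setting up such folders from R, reusing the for loop and `dir.create()` from earlier. The folder names and the temporary `project_dir` are assumptions for illustration; in a real project you would use your repo folder instead.

```r
## hypothetical project root; in a real project this would be your repo folder
project_dir <- file.path(tempdir(), "my-analysis")

## one folder each for figures, intermediate analyses, and final outputs
for (folder in c("figures", "intermediate", "output")) {
  dir.create(file.path(project_dir, folder), recursive = TRUE, showWarnings = FALSE)
}

list.files(project_dir)
```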
### Getting help
You'll soon have questions that are outside the scope of this workshop. How do you find answers?
- end with a ton of resources:
https://peerj.com/collections/50-practicaldatascistats/
## Ideas for Extended Analysis 2
- stringr() http://r4ds.had.co.nz/strings.html