---
title: "Spatial statistics"
author: "Amber Gigi Hoi"
output:
  pdf_document: default
  word_document: default
---
# Lesson preamble:
> ### Lesson objectives:
>
> - Understand the basics of spatial data in R
> - Learn simple spatial data manipulations
> - Fantastic rasters and how to use them
> - Incorporate spatial structure in regression models
>
> ### Lesson outline:
> Total lesson time: 1.5 hours
>
> - An introduction to spatial data and objects in R (20 min)
> - Plotting and extracting information from rasters (20 min)
> - Detecting and modelling spatial dependence (50 min)
> - Optional: making simple maps in ggplot (20 min)
---
# Introduction
## Why do we care about space?
Everything in ecology plays out in space. How heterogeneity in abiotic factors
across space drives ecological processes (e.g., species interactions) and
patterns (e.g., species distributions) has been of interest to ecologists for a
long time.
From an implicit perspective, the consideration of space is important to us all.
Whenever we have data collected across space, e.g., any sort of field survey,
the independence assumption is typically not valid, as sites closer in space
will tend to exhibit similar properties. This correlation in space can occur for
two reasons (not mutually exclusive):
1. Endogenous mechanisms: processes inherent to the organism being considered,
such as low dispersal capacity or social behaviour (e.g., schooling and
herding).
2. Exogenous mechanisms: processes unrelated to the organism being considered,
such as aggregation of resources or climate gradients.
Spatial autocorrelation may be classified as either positive or negative.
Positive autocorrelation means that nearby sites tend to have similar values,
while negative autocorrelation means that nearby sites tend to have dissimilar
values.
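To make the two cases concrete, here is a minimal sketch in base R that computes Moran's I by hand for two made-up transects. The helper `toy.moran()` and every `toy.*` object are hypothetical names used only for illustration, and the neighbour weights (1 for adjacent sites, 0 otherwise) are an assumption chosen to keep the arithmetic simple.

```{r}
# Hypothetical helper: Moran's I for values x and a weight matrix w
toy.moran <- function(x, w) {
  dev <- x - mean(x)                  # deviations from the mean
  num <- sum(w * outer(dev, dev))     # weighted cross-products of site pairs
  (length(x) / sum(w)) * num / sum(dev^2)
}

# Ten sites on a transect; only immediate neighbours get weight 1
toy.w <- (abs(outer(1:10, 1:10, "-")) == 1) * 1

toy.gradient    <- 1:10               # neighbours have similar values
toy.alternating <- rep(c(0, 1), 5)    # neighbours have opposite values

toy.moran(toy.gradient, toy.w)        # positive autocorrelation
toy.moran(toy.alternating, toy.w)     # negative autocorrelation
```

The smooth gradient gives a positive coefficient and the alternating pattern a negative one, which matches the verbal definitions above.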
Recall the mixed model lecture. When non-independence in a regression is not
properly accounted for, we increase our Type I error (false positive) rate and
risk inferring patterns that don't in fact exist. Checking for spatial
autocorrelation is therefore a crucial step in any regression involving spatial
data.
## Set up
In this lecture, we will continue our quest into vector ecology, only this time
asking a much simpler question: what is the effect of mosquito abundance on
malaria prevalence? We will explicitly consider the role of space when we answer
this question.
Here are the packages we will be using today:
```{r message=FALSE, warning=FALSE}
library(tidyverse)
library(MASS)
library(PerformanceAnalytics)
library(nlme)
library(sp)
library(ape)
library(rgdal) # R Geospatial Data Abstraction Library
library(raster)
library(maps)
```
And here is the data we will be working with:
```{r eval=FALSE}
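# Save everything under data/ so the paths match the chunks below
dir.create("data", showWarnings = FALSE)
download.file("https://uoftcoders.github.io/rcourse/data/kenya.wide",
              "data/kenya.wide.csv")
download.file("https://uoftcoders.github.io/rcourse/data/africa.wide",
              "data/africa.wide.csv")
# mode = "wb" keeps the binary GeoTIFF files intact on Windows
download.file("https://uoftcoders.github.io/rcourse/data/wc2.0_bio_10m_01",
              "data/wc2.0_bio_10m_01.tif", mode = "wb")
download.file("https://uoftcoders.github.io/rcourse/data/wc2.0_bio_10m_12",
              "data/wc2.0_bio_10m_12.tif", mode = "wb")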
```
```{r message=FALSE, warning=FALSE}
kenya.wide <- read_csv("data/kenya.wide.csv")
africa.wide <- read_csv("data/africa.wide.csv")
```
The _kenya.wide_ data was introduced in the multivariate statistics lecture, and
we will revisit it briefly to illustrate working with spatial data at different
scales (regional vs. continental). The _africa.wide_ data was originally
obtained from the Malaria Atlas Project, an open-access, everything-malaria
database created and maintained by an international consortium of malaria
experts. In brief, data for global vector occurrence and malaria prevalence were
extracted, matched in space (within 1 km^2^ of each other) and time (overlapped
in study duration), and validated against original sources. Additional data
(e.g., climate, GDP, vector abundance) were sourced via diverse means. For the
purposes of this lecture, we will be focusing on a subset of sites located in
Africa.
# A _very_ brief introduction to spatial data
Spatial data can be broadly classified into two groups: vectors and rasters.
Vectors are more "free-formed" in the information they store:
* Points -- a.k.a. vertices, a literal point position in space
* Lines -- connect vertices, such as a road or a river
* Polygons -- connect vertices into a closed area, such as a park perimeter or provincial boundaries
Rasters, a.k.a. grids, on the other hand, are much more rigid, as the name
suggests. This data type divides the landscape into identical and
regularly-spaced pixels, and stores a value in each of these pixels. Rasters can
be fine or coarse, and don't have to be squares or rectangles. The pixels can
even be so fine that they are essentially a continuous scale, representing
gradual change.
## Coordinate reference systems (CRS)
The fundamental goal of encoding spatial data is to convey where the data was
collected. These are typically given as _coordinates_, as in the latitude and
longitude we are all familiar with. In reality, however, it is a lot more
complex to describe geographic locations. There are two broad classes of
coordinate systems:
1. Unprojected (aka geographic): Latitude and Longitude locations on the ellipsoid Earth
2. Projected: location on 2D representations of Earth (e.g., maps).
_Projection_ refers to the process of projecting latitude and longitude from
the Earth's surface (which is an ellipsoid) onto a flat surface by some
standard formula known as a map projection. The main problem here is that it is
not possible to flatten a round object without distortion. This results in
trade-offs between area, direction, shape, and distance, because these features
cannot be simultaneously preserved when projecting from 3D to 2D. Unprojected
data is therefore always preferred, but alas, globes are inconvenient to use
(e.g., the distance described by one unit changes depending on where on Earth
you are), so maps still predominate. There are many projections available;
there is no _best_ projection, but we can prioritize which features are
important to preserve in each case.
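The point about one unit describing different distances can be checked with a little arithmetic. The sketch below assumes a perfectly spherical Earth with a mean radius of 6371 km (a simplification of the ellipsoid discussed above), and `toy.km.per.degree.lon()` is a hypothetical helper written for this illustration.

```{r}
# Assumption: spherical Earth with mean radius 6371 km
toy.radius.km <- 6371

# Ground distance (km) covered by one degree of longitude at a given latitude
toy.km.per.degree.lon <- function(lat.deg) {
  (pi / 180) * toy.radius.km * cos(lat.deg * pi / 180)
}

# Equator, 30 degrees, and 60 degrees of latitude
toy.km.per.degree.lon(c(0, 30, 60))
```

Under this approximation one degree of longitude spans roughly 111 km at the equator and only half of that at 60 degrees latitude, which is why distances computed directly on unprojected coordinates can mislead.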
The Universal Transverse Mercator (UTM) projection is commonly used in research
because it is more accurate at smaller scales, especially in distance
estimations. The UTM projection divides the Earth into 60 zones and uses a
different transverse Mercator projection in each zone to reduce distance
distortions. That said, one of the major disadvantages of UTM is that it is not
suitable for use over large areas.
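Since the 60 zones are each 6 degrees of longitude wide, the zone for a site can be worked out from its longitude. The sketch below uses the standard zone rule and the usual numbering of WGS84 UTM EPSG codes (326 plus zone in the north, 327 plus zone in the south); it ignores the irregular zones around Norway and Svalbard. Both helper functions are hypothetical names for illustration.

```{r}
# Standard 6-degree zone rule; ignores the special zones around Norway and Svalbard
toy.utm.zone <- function(lon.deg) {
  (floor((lon.deg + 180) / 6) %% 60) + 1
}

# EPSG codes for WGS84 / UTM: 326xx in the northern hemisphere, 327xx in the southern
toy.utm.epsg <- function(lon.deg, lat.deg) {
  ifelse(lat.deg >= 0, 32600, 32700) + toy.utm.zone(lon.deg)
}

toy.utm.zone(37)        # a longitude that falls within Kenya
toy.utm.epsg(37, -1)    # a hypothetical site just south of the equator
```

A site at 37 degrees east falls in zone 37, and a southern-hemisphere site there maps to EPSG code 32737.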
For any projection, we need a model of the shape of the Earth to work off of,
or, in geography lingo, a _datum_. The most commonly used datum is the World
Geodetic System 1984 (WGS84). There are, again, many datums available, including
a selection of local datums, which often do a better job at recording locations
for a single country or region.
The most basic way to record spatial data is therefore a pair of coordinates and
the reference datum. More sophisticated means involve a complete coordinate
reference system (CRS). CRSs are central to how software such as R and ArcGIS
read locations. However, it is not trivial to code a CRS, as any number of
parameters (other than coordinates and datum) may be required to properly
specify it. Thankfully, a lot of commonly used CRSs have been assigned
simplified EPSG (European Petroleum Survey Group) codes, which are unique IDs
that each identify a CRS. For example, the CRS for Kenya can be specified as
either
```{r eval=FALSE}
CRS("+init=epsg:32737")
```
OR
```{r eval=FALSE}
CRS("+proj=utm +zone=37 +south +datum=WGS84 +units=m +no_defs +ellps=WGS84 +towgs84=0,0,0")
```
Note that CRSs in R follow a very rigid format. The whole definition is entered
as a single string. Each piece of information (a tag) starts with +, and tags
are separated by a single space. Spaces within a tag are not allowed.
## Spatial data in practice
The data we will be using today were not created as spatial objects, which means
they don't come with defined coordinate systems (in other words, R or ArcGIS
won't know that they refer to locations). The first thing we need to do is
therefore to assign a CRS.
_CAUTION_: only do this if you know what the CRS should be!
First, we need to make our numeric points into spatial points, and then
transform them from geographic coordinates (latitude and longitude) to a map
projection for more accurate calculations later on. As an example, we will use
the UTM projection for the Kenya data we previously worked with, because all of
those points fit nicely into a single UTM zone.
```{r}
coord.kenya <- SpatialPoints(cbind(kenya.wide$long, kenya.wide$lat),
                             proj4string = CRS("+proj=longlat +ellps=WGS84"))
UTM.kenya <- spTransform(coord.kenya,
                         CRS("+init=epsg:32737"))
# EPSG:32737 (WGS84 / UTM zone 37S) for Kenya, Mozambique, and Tanzania
```
For the Africa data, however, the UTM projection is no longer appropriate as the
data now span a much larger geographic area. We will be using the _sinusoidal_
projection instead. This is an equal-area projection and does a reasonable job
at preserving distance as well, especially for locations near the equator, so it
is appropriate for our case.
```{r}
coord.africa <- SpatialPoints(cbind(africa.wide$long, africa.wide$lat),
                              proj4string = CRS("+proj=longlat +ellps=WGS84"))
sinu.africa <- spTransform(coord.africa,
                           CRS("+proj=sinu +ellps=WGS84"))
# Note the column names once converted to a data frame:
# coords.x1 (projected from longitude) and coords.x2 (projected from latitude)
```
We can now add our shiny new projected coordinates to the main data file.
```{r}
africa.wide <- bind_cols(africa.wide, as.data.frame(sinu.africa)) # Note order of rows!
```
We could further add environmental data from _rasters_ we pull from the
interwebs. Here, we extract annual temperature and annual rainfall for our sites
from a historic climate data repository, WorldClim2 (accessed Oct 2019). Note
the special function _stack()_ used to load raster files.
```{r}
temp <- raster::stack("data/wc2.0_bio_10m_01.tif")
rain <- raster::stack("data/wc2.0_bio_10m_12.tif")
```
Let's have a look at these files. The simplest way to do so is to plot them.
```{r}
plot(temp)
points(coord.africa)
plot(rain)
points(coord.africa)
```
We can extract data from raster files for our sites with a simple and
appropriately named function _extract_.
```{r}
extractions.t <- raster::extract(temp, coord.africa, df = TRUE, method="bilinear")
extractions.r <- raster::extract(rain, coord.africa, df = TRUE, method="bilinear")
```
The _df_ argument tells R to return a data frame. With _method="bilinear"_, the
value returned for each site is interpolated from the four nearest raster
cells, rather than taken from the single cell the site falls in.
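To show what interpolating from four cells means, here is a minimal base R sketch of bilinear interpolation. It illustrates the general technique rather than the internals of `raster::extract()`; the function `toy.bilinear()` and the four temperature values are made up for this example.

```{r}
# v00, v10, v01, v11: values at the four surrounding cell centres
# tx, ty: position of the point between them, each scaled to [0, 1]
toy.bilinear <- function(v00, v10, v01, v11, tx, ty) {
  bottom <- (1 - tx) * v00 + tx * v10   # interpolate along x on the lower edge
  top    <- (1 - tx) * v01 + tx * v11   # interpolate along x on the upper edge
  (1 - ty) * bottom + ty * top          # then interpolate along y
}

# Made-up temperatures at four neighbouring cell centres
toy.bilinear(20, 22, 24, 26, tx = 0.5, ty = 0.5)   # midpoint: mean of the four
toy.bilinear(20, 22, 24, 26, tx = 0,   ty = 0)     # on a cell centre: that cell's value
```

A point exactly between the four centres receives their mean, while a point sitting on a cell centre simply receives that cell's value.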
Since the resulting object is a data frame, we can easily bind this new data to our main data file.
```{r}
# Change colnames
colnames(extractions.t)[2] <- "ann.temp"
colnames(extractions.r)[2] <- "ann.rain"
extractions <- left_join(extractions.t, extractions.r, by="ID")
africa.wide <- as.data.frame(bind_cols(africa.wide, extractions)) # Note order of rows!
```
_Sidebar_: WorldClim2 is a one-stop shop for a bunch of climate variables. There
are other open-source data repositories out there, such as WorldPop for
demographics and the World Bank for economic data. There are lots of
opportunities out there for you to build custom datasets and ask lots of
interesting questions. For this lecture, however, we will focus on vector
ecology, and therefore not consider these environmental variables further.
# A _very_ brief introduction to spatial statistics
In this section, we will learn the basics of spatial statistics, including how
to assess and correct for spatial autocorrelation in our analyses.
We'll assess spatial autocorrelation with Moran's autocorrelation coefficient,
a.k.a. Moran's I. To calculate Moran's I, we first need to create a matrix of
_inverse distance weights_, which describes how close together each pair of
sites is. It is very easy to do in three simple steps in base R (sorry, this is
just so much cleaner in base!).
```{r warning=FALSE}
# Generate distance matrix
dist.matrix <- as.matrix(dist(data.frame(sinu.africa)))
# Take reciprocal of each number
inv.dist <- 1/dist.matrix
# Replace diagonal with zeros
diag(inv.dist) <- 0
```
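To see what these three steps produce, here is the same recipe applied to three hypothetical sites whose coordinates were chosen so that the distances come out as round numbers. The `toy.*` names are placeholders and are not part of the lecture data.

```{r}
# Three hypothetical sites with projected coordinates in metres
toy.sites <- data.frame(x = c(0, 3, 6), y = c(0, 4, 8))

toy.dist <- as.matrix(dist(toy.sites))   # pairwise Euclidean distances: 5, 10, 5
toy.inv  <- 1 / toy.dist                 # the diagonal becomes Inf because of 1/0
diag(toy.inv) <- 0                       # a site carries no weight on itself
toy.inv
```

Closer pairs receive larger weights (0.2 for sites 5 m apart versus 0.1 for sites 10 m apart). Note that two sites with identical coordinates would also produce an `Inf` weight off the diagonal, which the `diag()` step does not catch.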
Inverse distance weights are now ready, and we can assess whether our variables
of interest are correlated in space! We will be using the self-explanatory
function Moran.I for this test.
```{r}
Moran.I(africa.wide$PfPR, inv.dist, alternative="two.sided")
# Two sided = checks for both positive and negative spatial autocorrelation
```
The null hypothesis is that there is no spatial autocorrelation in the
variable. A significant p-value, together with an observed I greater than
expected, therefore means that sites that are closer together have more similar
malaria prevalence. Let's repeat this analysis for our predictor variable.
```{r eval=FALSE}
Moran.I(africa.wide$total.abundance, inv.dist, alternative="two.sided")
```
Turns out, sites close to one another also have similar mosquito abundances.
## Visualizing spatial autocorrelation with semivariograms
Another way to explore spatial autocorrelation is to calculate semivariance.
Semivariance is a measure of the degree of dissimilarity between pairs of points
separated by a specific distance. If the data are spatially autocorrelated, then
sampling units that are closer together in space might be expected to yield
similar responses and thus similar residuals. The _semivariogram_ (used
interchangeably with variogram) is a special kind of residual plot which
displays the variances within groups of observations, plotted as a function of
distance between the observations.
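As a rough illustration of the quantity being plotted, the sketch below computes the classical empirical semivariance (half the mean squared difference between all pairs of observations a given distance apart) on a made-up transect. It is a simplified stand-in: it uses raw values rather than model residuals and exact lags rather than distance bins, and `toy.semivar()` is a hypothetical helper.

```{r}
# Hypothetical helper: empirical semivariance of z at an exact lag distance h
toy.semivar <- function(z, pos, h) {
  d <- as.matrix(dist(pos))                               # pairwise distances
  pairs <- which(d == h & upper.tri(d), arr.ind = TRUE)   # each pair counted once
  sum((z[pairs[, 1]] - z[pairs[, 2]])^2) / (2 * nrow(pairs))
}

toy.pos <- 1:10    # site positions along a transect
toy.z   <- 1:10    # a smooth gradient, so nearby sites are similar

# Semivariance at lags of 1, 2, and 3 distance units
sapply(1:3, function(h) toy.semivar(toy.z, toy.pos, h))
```

For this gradient the semivariance grows with the lag (0.5, 2, and 4.5), meaning that pairs of sites become less alike as the distance between them increases.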
To make a semivariogram, first we fit a null model without any spatial
structure. We will be fitting this model with the function _gls()_ instead of
_lm()_ to facilitate future comparisons. These two differ in their method of
parameter estimation (ordinary least squares vs. maximum likelihood) but
otherwise achieve the same thing.
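As a quick check on that last claim, the sketch below fits the same simple regression with both functions. It uses the built-in `mtcars` data purely as a stand-in for the lecture data, and the `toy.*` object names are placeholders.

```{r}
library(nlme)

# mtcars ships with R; mpg ~ wt is only a stand-in for the lecture model
toy.lm  <- lm(mpg ~ wt, data = mtcars)
toy.gls <- gls(mpg ~ wt, data = mtcars, method = "ML")

coef(toy.lm)
coef(toy.gls)
```

Without a correlation structure the two sets of coefficients agree. The estimated residual standard deviations differ slightly, because maximum likelihood divides the residual sum of squares by the number of observations rather than by the residual degrees of freedom.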
```{r}
mod <- gls(log(PfPR)~log(total.abundance), data=africa.wide, method="ML") # ML for max likelihood
plot(Variogram(mod, form = ~coords.x1+coords.x2, resType = "normalized"))
```
_Note_: This **V**ariogram() function comes from _nlme_ package, and should not
be confused with **v**ariogram() from package _gstat_. Although they perform the
same semivariance analysis, these functions take different arguments, so R will
get confused if we are not careful.
In spatially correlated data, semivariance typically increases with increasing
distance up to a point (the sill). The span of distances over which points are
correlated is called the range. In the above variogram, the range is
approximately "2e+06". The sinusoidal projection we used on our geographic data
presents distance in meters, so we can conclude that sites within roughly
2,000 km of each other tend to have similar residuals (that is, similar malaria
prevalence after accounting for mosquito abundance). Note that semivariance
does not always plateau, but may exhibit forms of cycling. This can be
indicative of the underlying environmental conditions (e.g., types of
vegetation) showing similar patterns across a landscape.
## Accounting for spatial autocorrelation in statistical models
There are (you guessed it!) many ways to incorporate spatial autocorrelation
structure into regression analyses. One of the most straightforward ways to do
so is to think of it as a random effect. The _gls()_ function we used above
allows us to do just that. This method is simple and works really well if
normality assumptions are met.
There are (you guessed it again!) five autoregressive correlation structures
we can model. They all describe the degrading of correlation between samples as
you move away in space, with the different functional forms dictating the rate
of decay. These different functions are necessary because the causes and
consequences of spatial autocorrelation can differ across datasets. For
instance, the way that the sites are clustered (or not clustered) can have an
impact on how autocorrelation should be modelled. The five options are:
1. Exponential
2. Gaussian
3. Linear
4. Rational quadratic
5. Spherical
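To compare how quickly correlation decays under each option, the sketch below evaluates the five functions over a range of distances. The formulas are the standard forms without a nugget effect, as we understand them to be parameterised in _nlme_; treat them as an illustration and check the package documentation before relying on them. The range value and all `toy.*` names are hypothetical.

```{r}
# Correlation as a function of distance d for range r (no nugget effect)
toy.cor.fun <- list(
  exponential = function(d, r) exp(-d / r),
  gaussian    = function(d, r) exp(-(d / r)^2),
  linear      = function(d, r) ifelse(d < r, 1 - d / r, 0),
  rational    = function(d, r) 1 / (1 + (d / r)^2),
  spherical   = function(d, r) ifelse(d < r, 1 - 1.5 * (d / r) + 0.5 * (d / r)^3, 0)
)

toy.d <- seq(0, 3, by = 0.5)   # distances in arbitrary units
toy.r <- 1                     # hypothetical range parameter

# One column per structure, one row per distance
toy.cor <- sapply(toy.cor.fun, function(f) f(toy.d, toy.r))
round(cbind(distance = toy.d, toy.cor), 3)
```

All five start at a correlation of 1 at distance zero. The linear and spherical forms reach exactly zero at the range, while the other three only approach zero asymptotically.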
_Note:_ all of these assume _isotropy_, which means data are uniform (or
uniformly different) in all directions. We can imagine situations where that may
not be the case, for example, a river carrying substrate downstream, or when
wind consistently blows in one direction. We can actually check for those
assumptions with variograms, but that is beyond the scope of this lecture.
While sometimes you can tell from the variogram which structure is most
suitable, it is very common practice to fit all the alternative models and then
choose the one with the lowest AIC.
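As a reminder of what is being compared, AIC is twice the number of estimated parameters minus twice the log-likelihood, so it rewards fit and penalizes complexity. The sketch below reproduces the calculation by hand on the built-in `mtcars` data, which is only a stand-in for the lecture models; `toy.aic()` is a hypothetical helper.

```{r}
# mtcars ships with R; these two models are only stand-ins
toy.m1 <- lm(mpg ~ wt, data = mtcars)
toy.m2 <- lm(mpg ~ wt + hp, data = mtcars)

# AIC = 2k - 2 * log-likelihood, where k counts all estimated parameters
toy.aic <- function(m) {
  ll <- logLik(m)
  2 * attr(ll, "df") - 2 * as.numeric(ll)
}

c(by.hand = toy.aic(toy.m1), built.in = AIC(toy.m1))
AIC(toy.m1, toy.m2)   # the model with the lower score is preferred
```

Keep in mind that AIC values are only comparable between models fitted to the same response and the same data, which is why every model in this lecture is fitted with `method="ML"`.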
So here we go, the grand finale where we input the spatial structure into our model.
```{r}
mod.exp <- gls(log(PfPR) ~ log(total.abundance),
               data = africa.wide,
               method = "ML",
               correlation = corSpatial(form = ~coords.x1 + coords.x2, type = "exponential"))
```
Repeat that four more times, then inspect AIC scores.
```{r results="hide"}
mod.gau <- gls(log(PfPR) ~ log(total.abundance), data = africa.wide, method = "ML",
               correlation = corSpatial(form = ~coords.x1 + coords.x2, type = "gaussian"))
mod.lin <- gls(log(PfPR) ~ log(total.abundance), data = africa.wide, method = "ML",
               correlation = corSpatial(form = ~coords.x1 + coords.x2, type = "linear"))
mod.rat <- gls(log(PfPR) ~ log(total.abundance), data = africa.wide, method = "ML",
               correlation = corSpatial(form = ~coords.x1 + coords.x2, type = "rational"))
mod.sp <- gls(log(PfPR) ~ log(total.abundance), data = africa.wide, method = "ML",
              correlation = corSpatial(form = ~coords.x1 + coords.x2, type = "spherical"))
```
```{r}
AIC(mod, mod.exp, mod.gau, mod.lin, mod.rat, mod.sp)
```
The two things to take away from this analysis are:
1. mod.exp has the lowest AIC score, and is thus our best model. The exponential
autocorrelation structure was the most appropriate to use in this case.
2. The model without any spatial autocorrelation structure performed horribly.
Good thing we thought to include space!
Now, have a look at the model outputs with and without spatial structure:
```{r}
summary(mod)
summary(mod.exp)
```
Oh no! After accounting for spatial autocorrelation, the relationship between
total abundance and malaria prevalence went away! Now, that doesn't mean that
mosquito abundance doesn't have any effect on disease. This means that, across
such a large spatial scale, other processes become more important in determining
malaria prevalence. The interpretation of these results is indeed the same as
for a mixed model.
## Appendix: A _very very very_ brief introduction to making maps with ggplot
There are so many ways one can make maps in both base R and ggplot and beyond.
In this example, we will be making use of maps taken from the _maps_ package and
customizing them in ggplot. The issue with this package is that you always have
to start with a world map and then subset by country. This means that when you
want to plot Africa, you have to sit there and subset all day.
There is a more _pure_ ggplot way of making maps, called _ggmap_. This package
extracts maps from Google maps, which is awesome because it gives you access to
high quality maps from all over the world, and you can zoom in on whichever
region you want. However, you do have to register with Google to use it, and
Google asks for your credit card number when you register. So, not great.
We will therefore go with the free-all-the-way route, and use _maps_. Instead of
subsetting the map officially, we'll cheat a little and just crop the figure to
show Africa. We'll also do an example of plotting a single country (Kenya).
```{r}
map.world <- map_data("world")
# Note: you must include group=group in aes() to ensure proper plotting
ggplot() +
  geom_polygon(data = map.world, aes(x = long, y = lat, group = group),
               fill = NA, colour = "gray85") +
  geom_point(data = africa.wide, aes(x = long, y = lat, colour = total.abundance)) +
  # A fixed aspect ratio prevents the map from being distorted. Setting the
  # limits here crops the view to the approximate boundaries of Africa
  # without dropping polygon vertices that fall outside them.
  coord_fixed(xlim = c(-20, 60), ylim = c(-40, 40)) +
  theme_bw()

map.kenya <- map_data("world", region = "kenya")
africa.wide %>%
  filter(country == "Kenya") %>%
  ggplot() +
  geom_polygon(data = map.kenya, aes(x = long, y = lat, group = group),
               fill = NA, colour = "gray85") +
  coord_fixed() +
  # The points inherit the filtered Kenya data piped into ggplot()
  geom_point(aes(x = long, y = lat, colour = total.abundance))
```
## Additional readings
Fick, S.E. & Hijmans, R.J. (2017). WorldClim 2: new 1-km spatial resolution
climate surfaces for global land areas. Int J Climatol 37: 4302-4315.
Fletcher, R. & Fortin, M.-J. (2018). Spatial Ecology and Conservation Modeling.
Springer, Switzerland.
Hay, S. & Snow, R. (2006). The Malaria Atlas Project: Developing global maps of
malaria risk. PLoS Med 3: e473.