-
Notifications
You must be signed in to change notification settings - Fork 5
/
trad03-basic_modeling.qmd
266 lines (181 loc) · 11.3 KB
/
trad03-basic_modeling.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
---
title: "Basic modeling"
cache: false
---
So at this point we have point data for observation and background that have been joined with common environmental covariates (aka predictors). Here we show the basic steps taken to prepare, build and assess a model. Later, we'll try more sophisticated modeling, such as modeling by month or splitting the data into training-testing groups.
## Load data
Here we load the observation and background data points. We add a column identifying the month of the year.
```{r}
source("setup.R")
obs = sf::read_sf(file.path("data", "obs", "obs-covariates.gpkg")) |>
sf::st_set_geometry("geometry") |>
dplyr::mutate(month = factor(format(month_id, "%b"), levels = month.abb),
.before = geometry)
bkg = sf::read_sf(file.path("data", "bkg", "bkg-covariates.gpkg")) |>
sf::st_set_geometry("geometry") |>
dplyr::mutate(month = factor(format(month_id, "%b"), levels = month.abb),
.before = geometry)
```
## Prepare the input data
The input data must be formed as two parts:
+ a vector indicating which rows in the table are observations and which are background
+ a plain (non-spatial) table of covariates for both observation and background free of missing data
### The input table
Simply strip the spatial information off of `obs` and `bkg`, select just the environmental covariates, and then row bind them together
```{r}
input_obs = obs |>
sf::st_drop_geometry() |>
dplyr::select(dplyr::all_of(c("sst", "u_wind", "v_wind"))) |>
na.omit()
input_bkg = bkg |>
sf::st_drop_geometry() |>
dplyr::select(dplyr::all_of(c("sst", "u_wind", "v_wind"))) |>
na.omit()
input_table = dplyr::bind_rows(input_obs, input_bkg)
```
### The input vector
The each element of the input vector must have a 1 for each observation row, and a 0 for each background row. Since we arranged to have all of the the observations come first, we can easily make the vector with two calls to `rep()`.
```{r}
input_vector = c( rep(1, nrow(input_obs)), rep(0, nrow(input_bkg)) )
```
## Build the model
Here we pass our inputs to the `maxnet()` function, leaving all of the optional arguments to the default values. Be sure to look over the docs for model construction - try `?maxnet`
```{r}
model = maxnet::maxnet(input_vector, input_table)
```
That's it. The returned object is of `maxnet` class; it's essentially a list with all of the pertinent information required for subsequent use.
## Assess the model
So what do we know about the model? Is it any good?
One thing we can do is to plot what are called response curves. These show, for each parameter, how the model responds along the typical range of parameter values. We plot below the responses with type `cloglog` which transform the response value into the 0-1 range.
```{r}
plot(model, type = "cloglog")
```
Let's take a closer look starting with `sst`. The model has peak response to `sst` in the range (approximately) 12C-17C. `u_wind` shows peak model response in near calm (<5m/s) speeds, but interestingly eastward winds (positive) are more favorable for observation than westward winds. The north-south component of wind, `v_wind` seems to have strong discriminating power for northward winds (positive).
We can do more if we make a prediction, but, first, let's save the model to file so we can retrieve it later.
## Save the model
In this tutorial we are going to build three types of models: basic, monthly and split (between testing and training). We should organize the storage of the models in a way that makes sense. With each model we may generate one or more predictions - for example, for our basic model we might tyr to hind-cast a individual years. That's a one-to-many to relationship between model and predictions. We suggest that you start considering each model a version and store them accordingly. Let's use a simple numbering scheme...
+ `v1.0` for the basic model
+ `v2.jan, v2.feb, ..., v2.dec` for the monthly models
+ `v3.0, ...` for for the split model(s)
The [maxnetic](https://github.com/BigelowLab/maxnetic) provides some convenience functions for working with maxnet models including file storage functions.
```{r}
v1_path = file.path("data", "model", "v1", "v1.0")
ok = dir.create(v1_path, recursive = TRUE, showWarnings = FALSE)
maxnetic::write_maxnet(model, file.path(v1_path, "model_v1.0.rds"))
```
## Make a prediction
Now we can make predictions with our basic model. We'll do it two ways. First by simply feeding the input data used to create the model into the prediction. This might seems a bit circular, but it is perfectly reasonable to see how the model does on already labeled data. Second we'll make a prediction for each month in 2020 using raster data.
### Predict with a data frame
Here we provide a data frame, in our case the original input data, to the `predict()` function with type `cloglog` which transform the response value into the 0-1 range. In the histogram of predicted values below, we can see that most predicted values indicate low likelihood of observation. It's hard to know without context what that means about our model. Further investigation is required.
Note that we cast the predictors as a matrix to work with `maxnet::predict()`.
```{r}
#| width: "100%"
prediction = predict(model, input_table, type = 'cloglog')
hist(prediction, xlab = "prediction value", main = "Basic Model")
```
#### How did it do?
We can use some utilities in the [maxnetic](https://github.com/BigelowLab/maxnetic) package to help us assess the model. The `pAUC()` function will compute statistics, include a presence-only AUC value. We need to pass it two items - the universe of predictions and the predictions for just the presence points. The plot below shows the Receiver Operator Curve (ROC) and the associated presence-only AUC value.
```{r, warning = FALSE}
ix = input_vector > 0
pauc = maxnetic::pAUC(prediction, prediction[ix])
plot(pauc, title = "v1.0 Basic Model")
```
Overall, this is telling us that the model isn't especially strong as a prediction tool, but it is much better than a 50-50 guess (that's when AUC is close to 0.5, and the curve follows the light grey line). Learn more about ROC and AUC [here](https://rviews.rstudio.com/2019/01/17/roc-curves/).
### Predict with rasters
We can also predict using raster inputs using our basic model. Let's read in rasters for each month of 2019, and then run a prediction for each month.
We provide a function `read_predictors()` that will read and bind the rasters together for you given the filtered databases and paths. So, first we define the paths and filter the databases to point to just the months in 2019.
```{r}
dates = as.Date(c("2019-01-01", "2019-12-31"))
sst_path = "data/oisst"
sst_db = oisster::read_database(sst_path) |>
dplyr::arrange(date) |>
dplyr::filter(dplyr::between(date, dates[1], dates[2]))
wind_path = "data/nbs"
wind_db = nbs::read_database(wind_path) |>
dplyr::arrange(date)|>
dplyr::filter(dplyr::between(date, dates[1], dates[2]))
u_wind_db = wind_db |>
dplyr::filter(param == "u_wind")|>
dplyr::filter(dplyr::between(date, dates[1], dates[2]))
v_wind_db = wind_db |>
dplyr::filter(param == "v_wind")|>
dplyr::filter(dplyr::between(date, dates[1], dates[2]))
predictors = read_predictors(sst_db = sst_db,
u_wind_db = u_wind_db,
v_wind_db = v_wind_db)
predictors
```
You can see that we have the rasters in one object of three attributes (`sst`, `u_wind` and `v_wind`) each with 12 layers (Jan 2019 - Dec 2019). Now we can run the prediction.
```{r}
pred = predict(model, predictors, type = 'cloglog')
pred
```
Since we get a spatially mapped prediction back, we can plot it.
```{r}
coast = rnaturalearth::ne_coastline(scale = 'large', returnclass = 'sf') |>
sf::st_crop(pred)
plot_coast = function() {
plot(sf::st_geometry(coast), col = 'green', add = TRUE)
}
plot(pred, hook = plot_coast)
```
Well, that certainly looks appealing with higher likelihood of near shore observations occurring during the warmer months.
#### How did it do?
To compute an ROC and AUC for each month, we have a little bit of work to do. We need to extract the observations locations for each month from the prediction maps. These we can then plot.
:::{.callout-note}
We have to modify the date for each point to be the first date of each month. That's because our predictors are monthlies.
:::
```{r}
test_obs = obs |>
dplyr::filter(dplyr::between(month_id, dates[1], dates[2])) |>
dplyr::mutate(date = oisster::current_month(month_id))
x = stars::st_extract(pred, test_obs, time_column = 'date') |>
print()
```
Finally we can build a table that merges the prediction with the labels. We are going to add the name of the month to group by that.
```{r}
y = x |>
dplyr::mutate(month = factor(format(date, "%b"), levels = month.abb),
.before = 1) |>
dplyr::select(dplyr::all_of(c("month", "pred", "date"))) |>
dplyr::group_by(month)
dplyr::count(y, month) |>
print(n = 12)
```
Now how about one ROC plot for each month? Yikes! This requires a iterative approach, using `group_map()`, to compute the ROC for each month. We then follow with plot wrapping by the [patchwork](https://patchwork.data-imaginist.com/articles/guides/assembly.html#functional-assembly) package.
```{r, warning = FALSE}
#| width: "100%"
paucs = dplyr::group_map(y,
function(tbl, key, pred_rasters = NULL){
ix = key$month %in% month.abb
x = dplyr::slice(pred_rasters, "time", ix)
pauc = maxnetic::pAUC(x, tbl)
plot(pauc,title = key$month,
xlab = "", ylab = "")
}, pred_rasters = pred)
patchwork::wrap_plots(paucs, ncol = 4)
```
Hmmm. That's surprising, yes? Why during the summer months does our AUC go down when we have the most number of observations? That seems counter intuitive.
## Thinking about AUC
AUC is a diagnostic that provides a peek into the predictive power of a model. But what is it? An analogy is fitting a straight line to a small set of observations verses a large set of observations and then comparing the correlation coefficients. Here's a simple example using R's built-in dataset `cars` which is a data frame of 50 observations of speed and stopping distances of cars. We'll compute a linear model for the entire data set, and then a second for a small subsample of the data. (Learn more about linear models in R [here](https://rseek.org/?q=linear+models).)
```{r}
data("cars")
cars = dplyr::as_tibble(cars)
all_fit = lm(dist ~ speed, data = cars)
summary(all_fit)
```
```{r}
set.seed(5)
sub_cars = dplyr::slice(cars, c(10, 28, 50))
sub_fit = lm(dist ~ speed, data = sub_cars)
summary(sub_fit)
```
You can see that the `r-squared` value is quite high for the smaller data set, but the model may not be predictive over the full range of data. AUC is somewhat analogous to to `r-squared` in that a relatively low score does not necessarily suggest a poor model.
```{r}
ggplot2::ggplot(data = cars, ggplot2::aes(x = speed, y = dist)) +
ggplot2::geom_point(color = "blue") +
ggplot2::geom_abline(slope = coef(all_fit)[2], intercept = coef(all_fit)[1], color = "blue") +
ggplot2::geom_point(data = sub_cars, ggplot2::aes(x = speed, y = dist),
color = "orange", shape = 1, size = 3) +
ggplot2::geom_abline(slope = coef(sub_fit)[2], intercept = coef(sub_fit)[1], color = "orange")
```