trad03-basic_modeling.qmd

---
title: "Basic modeling"
cache: false
---

So at this point we have point data for observation and background that have been joined with common environmental covariates (aka predictors).  Here we show the basic steps taken to prepare, build and assess a model. Later, we'll try more sophisticated modeling, such as modeling by month or splitting the data into training-testing groups.


## Load data

Here we load the observation and background data points. We add a column identifying the month of the year. 

```{r}
source("setup.R")

obs = sf::read_sf(file.path("data", "obs", "obs-covariates.gpkg")) |>
  sf::st_set_geometry("geometry") |>
  dplyr::mutate(month = factor(format(month_id, "%b"), levels = month.abb), 
                .before = geometry)

bkg = sf::read_sf(file.path("data", "bkg", "bkg-covariates.gpkg")) |>
  sf::st_set_geometry("geometry") |>
  dplyr::mutate(month = factor(format(month_id, "%b"), levels = month.abb), 
                .before = geometry)
```


## Prepare the input data

The input data must be formed as two parts:

  + a vector indicating which rows in the table are observations and which are background
  
  + a plain (non-spatial) table of covariates for both observation and background free of missing data


### The input table

Simply strip the spatial information off of `obs` and `bkg`, select just the environmental covariates, and then row bind them together

```{r}
input_obs = obs |>
  sf::st_drop_geometry() |> 
  dplyr::select(dplyr::all_of(c("sst", "u_wind", "v_wind"))) |>
  na.omit()

input_bkg = bkg |>
  sf::st_drop_geometry() |> 
  dplyr::select(dplyr::all_of(c("sst", "u_wind", "v_wind"))) |>
  na.omit()

input_table = dplyr::bind_rows(input_obs, input_bkg)
```

### The input vector

The each element of the input vector must have a 1 for each observation row, and a 0 for each background row. Since we arranged to have all of the the observations come first, we can easily make the vector with two calls to `rep()`.

```{r}
input_vector = c( rep(1, nrow(input_obs)), rep(0, nrow(input_bkg)) )
```

## Build the model

Here we pass our inputs to the `maxnet()` function, leaving all of the optional arguments to the default values. Be sure to look over the docs for model construction - try `?maxnet`

```{r}
model = maxnet::maxnet(input_vector, input_table)
```

That's it.  The returned object is of `maxnet` class; it's essentially a list with all of the pertinent information required for subsequent use.

## Assess the model

So what do we know about the model?  Is it any good?

One thing we can do is to plot what are called response curves.  These show, for each parameter, how the model responds along the typical range of parameter values. We plot below the responses with type `cloglog` which transform the response value into the 0-1 range.

```{r}
plot(model, type = "cloglog")
```

Let's take a closer look starting with `sst`.  The model has peak response to `sst` in the range (approximately) 12C-17C.  `u_wind` shows peak model response in near calm (<5m/s) speeds, but interestingly eastward winds (positive) are more favorable for observation than westward winds.  The north-south component of wind, `v_wind` seems to have strong discriminating power for northward winds (positive).  

We can do more if we make a prediction, but, first, let's save the model to file so we can retrieve it later. 

## Save the model

In this tutorial we are going to build three types of models: basic, monthly and split (between testing and training).  We should organize the storage of the models in a way that makes sense. With each model we may generate one or more predictions - for example, for our basic model we might tyr to hind-cast a individual years.  That's a one-to-many to relationship between model and predictions.  We suggest that you start considering each model a version and store them accordingly.  Let's use a simple numbering scheme...

 + `v1.0` for the basic model
 + `v2.jan, v2.feb, ..., v2.dec` for the monthly models
 + `v3.0, ...` for for the split model(s)
 
The [maxnetic](https://github.com/BigelowLab/maxnetic) provides some convenience functions for working with maxnet models including file storage functions.

```{r}
v1_path = file.path("data", "model", "v1", "v1.0")
ok = dir.create(v1_path, recursive = TRUE, showWarnings = FALSE)
maxnetic::write_maxnet(model, file.path(v1_path, "model_v1.0.rds"))
```

## Make a prediction

Now we can make predictions with our basic model. We'll do it two ways. First by simply feeding the input data used to create the model into the prediction.  This might seems a bit circular, but it is perfectly reasonable to see how the model does on already labeled data.  Second we'll make a prediction for each month in 2020 using raster data. 

### Predict with a data frame

Here we provide a data frame, in our case the original input data, to the `predict()` function with type `cloglog` which transform the response value into the 0-1 range.  In the histogram of predicted values below, we can see that most predicted values indicate low likelihood of observation.  It's hard to know without context what that means about our model.  Further investigation is required.

Note that we cast the predictors as a matrix to work with `maxnet::predict()`.

```{r}
#| width: "100%"
prediction = predict(model, input_table, type = 'cloglog')
hist(prediction, xlab = "prediction value", main = "Basic Model")
```

#### How did it do?

We can use some utilities in the [maxnetic](https://github.com/BigelowLab/maxnetic) package to help us assess the model.  The `pAUC()` function will compute statistics, include a presence-only AUC value.  We need to pass it two items - the universe of predictions and the predictions for just the presence points. The plot below shows the Receiver Operator Curve (ROC) and the associated presence-only AUC value.

```{r, warning = FALSE}
ix = input_vector > 0
pauc = maxnetic::pAUC(prediction, prediction[ix])
plot(pauc, title = "v1.0 Basic Model")
```
Overall, this is telling us that the model isn't especially strong as a prediction tool, but it is much better than a 50-50 guess (that's when AUC is close to 0.5, and the curve follows the light grey line).  Learn more about ROC and AUC [here](https://rviews.rstudio.com/2019/01/17/roc-curves/).


### Predict with rasters

We can also predict using raster inputs using our basic model. Let's read in rasters for each month of 2019, and then run a prediction for each month.

We provide a function `read_predictors()` that will read and bind the rasters together for you given the filtered databases and paths.  So, first we define the paths and filter the databases to point to just the months in 2019.

```{r}
dates = as.Date(c("2019-01-01", "2019-12-31"))

sst_path = "data/oisst"
sst_db = oisster::read_database(sst_path) |>
  dplyr::arrange(date) |>
  dplyr::filter(dplyr::between(date, dates[1], dates[2]))

wind_path = "data/nbs"
wind_db = nbs::read_database(wind_path) |>
  dplyr::arrange(date)|>
  dplyr::filter(dplyr::between(date, dates[1], dates[2]))

u_wind_db = wind_db |>
  dplyr::filter(param == "u_wind")|>
  dplyr::filter(dplyr::between(date, dates[1], dates[2]))

v_wind_db = wind_db |>
  dplyr::filter(param == "v_wind")|>
  dplyr::filter(dplyr::between(date, dates[1], dates[2]))

predictors = read_predictors(sst_db = sst_db,
                             u_wind_db = u_wind_db,
                             v_wind_db = v_wind_db)
predictors
```

You can see that we have the rasters in one object of three attributes (`sst`, `u_wind` and `v_wind`) each with 12 layers (Jan 2019 - Dec 2019). Now we can run the prediction.

```{r}
pred = predict(model, predictors, type = 'cloglog')
pred
```

Since we get a spatially mapped prediction back, we can plot it.

```{r}
coast = rnaturalearth::ne_coastline(scale = 'large', returnclass = 'sf') |>
  sf::st_crop(pred)

plot_coast = function() {
  plot(sf::st_geometry(coast), col = 'green', add = TRUE)
}
plot(pred, hook = plot_coast)
```

Well, that certainly looks appealing with higher likelihood of near shore observations occurring during the warmer months.

#### How did it do?

To compute an ROC and AUC for each month, we have a little bit of work to do.  We need to extract the observations locations for each month from the prediction maps.  These we can then plot.

:::{.callout-note}
We have to modify the date for each point to be the first date of each month. That's because our predictors are monthlies.
:::

```{r}
test_obs = obs |>
  dplyr::filter(dplyr::between(month_id, dates[1], dates[2])) |>
  dplyr::mutate(date = oisster::current_month(month_id))

x = stars::st_extract(pred, test_obs, time_column = 'date') |>
  print()
```

Finally we can build a table that merges the prediction with the labels. We are going to add the name of the month to group by that.

```{r}
y = x |>
  dplyr::mutate(month = factor(format(date, "%b"), levels = month.abb), 
                .before = 1) |>
  dplyr::select(dplyr::all_of(c("month", "pred", "date"))) |>
  dplyr::group_by(month) 

dplyr::count(y, month) |>
  print(n = 12)
```
  
Now how about one ROC plot for each month?  Yikes!  This requires a iterative approach, using `group_map()`, to compute the ROC for each month.  We then follow with plot wrapping by the [patchwork](https://patchwork.data-imaginist.com/articles/guides/assembly.html#functional-assembly) package.
  
```{r, warning = FALSE}
#| width: "100%"
paucs = dplyr::group_map(y, 
  function(tbl, key, pred_rasters = NULL){
    ix = key$month %in% month.abb
    x = dplyr::slice(pred_rasters, "time", ix)
    pauc = maxnetic::pAUC(x, tbl)
    plot(pauc,title = key$month, 
         xlab = "", ylab = "")
  }, pred_rasters = pred)

patchwork::wrap_plots(paucs, ncol = 4)
```

Hmmm.  That's surprising, yes?  Why during the summer months does our AUC go down when we have the most number of observations?  That seems counter intuitive. 

## Thinking about AUC

AUC is a diagnostic that provides a peek into the predictive power of a model. But what is it?  An analogy is fitting a straight line to a small set of observations verses a large set of observations and then comparing the correlation coefficients.  Here's a simple example using R's built-in dataset `cars` which is a data frame of 50 observations of speed and stopping distances of cars.  We'll compute a linear model for the entire data set, and then a second for a small subsample of the data. (Learn more about linear models in R [here](https://rseek.org/?q=linear+models).)

```{r}
data("cars")
cars = dplyr::as_tibble(cars)

all_fit = lm(dist ~ speed, data = cars)
summary(all_fit)
```

```{r}
set.seed(5)
sub_cars = dplyr::slice(cars, c(10, 28, 50))
sub_fit = lm(dist ~ speed, data = sub_cars)
summary(sub_fit)
```

You can see that the `r-squared` value is quite high for the smaller data set, but the model may not be predictive over the full range of data.  AUC is somewhat analogous to to `r-squared` in that a relatively low score does not necessarily suggest a poor model.

```{r}
ggplot2::ggplot(data = cars, ggplot2::aes(x = speed, y = dist)) +
  ggplot2::geom_point(color = "blue") +
  ggplot2::geom_abline(slope = coef(all_fit)[2], intercept = coef(all_fit)[1], color = "blue") + 
  ggplot2::geom_point(data = sub_cars, ggplot2::aes(x = speed, y = dist), 
                      color = "orange", shape = 1, size = 3) +
  ggplot2::geom_abline(slope = coef(sub_fit)[2], intercept = coef(sub_fit)[1], color = "orange")
```