Code used for species distribution modelling of Chagas vectors

Species distribution models of Chagas vectors

This folder contains code used to fit species distribution models for Chagas vectors as described here: Note that the article is in revision at PLOS NTD and has been updated since.

Disclaimer: Our workflow relies on some data (particularly the satellite data) that require users to sign up to a series of separate open access licences with various organisations. We therefore created a demo (within the README of the repository) that can be run directly; its code calls functions that download presence data as well as environmental covariates. The demo illustrates the workflow for one species. All code used for the analyses in the manuscript is also available in the repository and described in the README.

If anything is unclear or not working, please don't hesitate to open up an issue.

Exemplary analysis

This section illustrates the workflow presented in Bender, Python, Lindsay, et al. (2019). Full details are given in the manuscript and in this code repository (see also the description of the folder structure below the demo analysis).

# for first run install packages from this repository
# devtools::install("mastergrids", dependencies = TRUE)
# devtools::install("tcruziutils", dependencies = TRUE)
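# libraries (inferred from the unqualified function calls used below)
library(dplyr)       # %>%, mutate(), filter(), select(), rename_all()
library(purrr)       # map_dbl()
library(httr)        # write_disk()
library(tcruziutils) # project helpers (installed above)
# viz
library(tmap)        # tm_shape(), tm_borders(), tm_raster()
# modeling
library(mgcv)        # bam()
library(scam)        # scam()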

# defaults
# this is just the maximum extent of the endemic zone
# used to crop environmental variables, etc.
extent_tcruzi   <- tcruziutils::extent_tcruzi
# country polygons
data(wrld_simpl, package = "maptools")
countries <- wrld_simpl %>% raster::crop(extent_tcruzi)
# colors
Set1   <- RColorBrewer::brewer.pal(9, "Set1")
## set seed as generation of folds is random

Data Import

  write_disk(tf <- tempfile(fileext = ".xls")))
## Response []
##   Date: 2020-05-10 16:54
##   Status: 200
##   Content-Type: binary/octet-stream
##   Size: 5.74 MB
## <ON DISK>  /tmp/Rtmpbi1Llj/file32cc611e8598.xls
df <- readxl::read_excel(tf, 1L, na = c("", " ", "NR", "NA"))
presence_vector <- df %>%
  mutate(
    area = 25L, # analysis will be performed at 5x5 square km resolution
    Start_year = as.integer(substr(year, 1L, 4L)),
    End_year   = as.integer(substr(year, 6L, 9L)))
presence_vector$reference <- df[[17]]
presence_vector[, 17] <- NULL
presence_vector <- presence_vector %>%
  mutate(
    Public_year = as.integer(stringr::str_extract(.data$reference, "[0-9]{4}")),
    Public_year = ifelse(Public_year > 2018L, NA, Public_year),
    End_year    = ifelse(is.na(End_year), Start_year, End_year)) %>%
  select(scientificName, Start_year, End_year, Public_year, area,
    individualCount, starts_with("decimal"), habitat, reference) %>%
  rename(
    species    = scientificName,
    Latitude   = decimalLatitude,
    Longitude  = decimalLongitude,
    n_observed = individualCount) %>%
  rename_all(~tolower(sub(" ", "_", .))) %>%
  • imputation of the "year" variable if missing
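presence_vector <- presence_vector %>%
  rename_all(tolower) %>%
  # assumption: drop rows where neither start year nor publication year is known
  filter(!(is.na(start_year) & is.na(public_year))) %>%
  # keep rows with consistent years; rows with missing years are kept for imputation
  filter(start_year <= end_year | is.na(start_year) | is.na(end_year)) %>%
  filter(start_year <= public_year | is.na(start_year) | is.na(public_year)) %>%
  filter(start_year >= 2000 | is.na(start_year)) %>%
  filter(public_year >= 2000 | is.na(public_year)) %>%
  filter(end_year <= public_year | is.na(end_year) | is.na(public_year)) %>%
  mutate(
    imputed = is.na(start_year), # assumption: flags rows whose start year is imputed below
    id      = row_number())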
# impute start end year
lm_start <- scam(start_year ~  s(public_year, bs = "mpi"),
  data = presence_vector)
lm_end <- scam(end_year ~ s(start_year, bs = "mpi") +
  s(I(public_year - start_year), bs = "mpi"), data = presence_vector)
presence_vector <- presence_vector %>%
  mutate(start_year = ifelse(is.na(start_year),
    as.integer(floor(predict(lm_start, .))), start_year)) %>%
  mutate(end_year = ifelse(is.na(end_year),
    as.integer(ceiling(predict(lm_end, .))), end_year)) %>%
  # few imputed end year larger than public year
  mutate(end_year = pmin(end_year, public_year))  %>%
  # set end_year to start year if still na (not imputed b/c public year = NA)
  mutate(end_year = if_else(is.na(end_year), start_year, end_year))
  • remove observations before 2000
presence_vector <- presence_vector %>%
  filter(start_year >= 2000)

Adding spatial/environmental covariates to data set

  • in the publication we used raster files available from servers of the Malaria Atlas Project
  • the environmental variables were extracted for each observation based on (imputed) year of observation
  • here, we use the raster::getData() function to obtain the environmental covariates (we use the same covariate layers for all years)
covs <- raster::getData("worldclim", var = "bio", res = 5)[[1:19]]
env_grids <- raster::crop(covs, extent_tcruzi)
# combine presence + covariate data
covs_ex <- raster::extract(env_grids, as_spatial(presence_vector))
presence_vector <- cbind(presence_vector, covs_ex)

Workflow for one species

The procedure below was iterated over all species (with enough observations) and consists of the following steps:

  1. Create presence/background column for the species of interest (here Panstrongylus megistus)
  2. Split data into
  • "training/test" (using spatial blocks/spatial CV), used later for model selection (folds 1-4) and to assess the model's "extrapolation performance"
  • "evaluation" (random subsample), used later to assess the model's "interpolation performance"
  3. The spatial blocks/spatial CV are set up using the function get_sp_folds, which is a wrapper around the package blockCV (Valavi, Elith, Lahoz-Monfort, et al., 2018)

  4. Fit the model on "training data" (spatial folds 1-4). Here, for conciseness, we perform a very narrow model comparison of only two competing models.

  5. Assess the model's performance on data not used during model selection/fit
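
To make the idea of spatial blocks more concrete, the following base R sketch assigns simulated points to square blocks and then assigns whole blocks (not single points) to folds. It is only an illustration under stated assumptions: the helper assign_block_folds, the block size and the simulated coordinates are hypothetical, and this is not how get_sp_folds or blockCV construct their folds.

```r
# Hypothetical helper: assign points to square spatial blocks, then blocks to folds
assign_block_folds <- function(lon, lat, block_size = 5, k = 5) {
  # label of the block each point falls into
  block <- paste(floor(lon / block_size), floor(lat / block_size), sep = "_")
  # randomly distribute the unique blocks over the k folds
  blocks <- unique(block)
  fold_of_block <- sample(rep_len(seq_len(k), length(blocks)))
  names(fold_of_block) <- blocks
  data.frame(lon = lon, lat = lat, block = block,
    fold = unname(fold_of_block[block]))
}

set.seed(42)
lon <- runif(500, -80, -35) # simulated coordinates, not real observations
lat <- runif(500, -35, 10)
pts <- assign_block_folds(lon, lat)

# all points of a block share the same fold
table(pts$fold)
```

Because neighbouring points end up in the same fold, the test fold is spatially separated from the training folds, which is what makes the resulting error estimate an "extrapolation" estimate.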

1. Create presence/background dummy

presence_vector <- presence_vector %>%
  mutate(presence = 1L * (species == "Panstrongylus megistus"))
table(presence_vector$presence)
##     0     1
## 13412  1184

2. Split data

it_pv <- presence_vector %>%
  rsample::initial_split(strata = "species", prop = 4 / 5)
# evaluation data (only used at the very end for the "interpolation error" estimate)
evaluation_df <- rsample::testing(it_pv)
# training/test data (split into 5 spatial folds below)
train_test_df <- rsample::training(it_pv)
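
As a rough illustration of what a stratified 4/5 split does, the base R sketch below draws 80% of the rows within each species. The data and the object names (demo, demo_train_test, demo_evaluation) are made up for this example, and the code is not the rsample implementation.

```r
set.seed(99)
demo <- data.frame(
  id      = 1:1000,
  species = sample(c("A", "B", "C"), 1000, replace = TRUE, prob = c(0.7, 0.2, 0.1)))

# draw 80% of the rows within each species (the strata)
train_id <- unlist(lapply(
  split(demo$id, demo$species),
  function(ids) ids[sample.int(length(ids), floor(0.8 * length(ids)))]))

demo_train_test <- demo[demo$id %in% train_id, ]
demo_evaluation <- demo[!(demo$id %in% train_id), ]

# species proportions are nearly identical in both parts
p_train <- prop.table(table(demo_train_test$species))
p_eval  <- prop.table(table(demo_evaluation$species))
round(rbind(train_test = p_train, evaluation = p_eval), 2)
```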

3. Create spatial folds

sp_tt <- as_spatial(train_test_df)
raster::crs(countries) <- raster::crs(sp_tt)
sp_fold_pans_meg <- get_sp_folds(
  data        = sp_tt,
  species     = "Panstrongylus megistus",
  mask        = countries,
  species_var = "species",
  n_blocks    = 50,# number of blocks
  k           = 5, # number of folds
  width       = 5, # width of extended hull (outside of observed )
  calc_range  = FALSE)
## The best folds was in iteration 84:
##   train_0 train_1 test_0 test_1
## 1    2800     795    485    144
## 2    2574     763    711    176
## 3    2505     706    780    233
## 4    2555     718    730    221
## 5    2706     774    579    165
# graphical depiction of spatial folds + presence/background
# within extended hull
  countries = countries)

[Figure: spatial folds and presence/background points within the extended hull]

4. Fit the model

  • In the paper we run a CV for different model specifications on folds 1-4 (fit on 3 folds, evaluation on the 4th fold)
  • the best setting/model w.r.t. average performance over the 4 CV runs is then refit on folds 1-4 and evaluated using data from fold 5 (this is the AUC value reported in the table)
  • this model was also used for evaluation on the random hold-out data (evaluation_df) above
  • After evaluation, the final model was obtained by refitting this model on all available data (folds 1-5 and random evaluation data)
  • Here we only compare two different models (one with smooth covariate effects, one with linear effects) to keep code and runtime short.
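
The selection scheme described in the bullets above can be sketched on simulated data as follows. This is a minimal illustration, not the code used in the paper: it uses random instead of spatial folds, plain glm fits instead of GAMs, and the helper auc as well as the two candidate formulas are hypothetical.

```r
# AUC via the rank (Mann-Whitney) formulation, base R only
auc <- function(score, label) {
  r  <- rank(score)
  n1 <- sum(label == 1)
  n0 <- sum(label == 0)
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

set.seed(123)
n <- 1000
d <- data.frame(x = runif(n, -2, 2), fold = sample(rep_len(1:5, n)))
d$presence <- rbinom(n, 1, plogis(1 - 2 * d$x^2)) # non-linear effect of x

# two hypothetical candidate specifications
candidates <- list(
  linear    = presence ~ x,
  quadratic = presence ~ x + I(x^2))

cv_folds <- 1:4 # fold 5 is held back for the final test
in_cv    <- d$fold %in% cv_folds
cv_auc <- sapply(candidates, function(f) {
  mean(sapply(cv_folds, function(k) {
    fit  <- glm(f, family = binomial(), data = d[in_cv & d$fold != k, ])
    test <- d[d$fold == k, ]
    auc(predict(fit, newdata = test), test$presence)
  }))
})
best <- names(which.max(cv_auc))

# refit the selected specification on folds 1-4, evaluate on fold 5
final    <- glm(candidates[[best]], family = binomial(), data = d[in_cv, ])
test5    <- d[d$fold == 5, ]
auc_test <- auc(predict(final, newdata = test5), test5$presence)
```

Since the AUC only depends on the ranking of the predictions, it does not matter here whether predictions are on the link or the response scale.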

Fit the GAM

  • create formula that specifies the linear/additive predictor using make_gam_formula
# create formula
# make_gam_formula: only vars with unique(var) > 20 used
# formula with smooth effects
mod_formula <-
    type = "smooth") %>%
  add_gp() # add gaussian process smooth (see ?mgcv::gp.smooth)
## presence ~ s(bio1, by = NA) + s(bio2, by = NA) + s(bio3, by = NA) +
##     s(bio4, by = NA) + s(bio5, by = NA) + s(bio6, by = NA) +
##     s(bio7, by = NA) + s(bio8, by = NA) + s(bio9, by = NA) +
##     s(bio10, by = NA) + s(bio11, by = NA) + s(bio12, by = NA) +
##     s(bio13, by = NA) + s(bio14, by = NA) + s(bio15, by = NA) +
##     s(bio16, by = NA) + s(bio17, by = NA) + s(bio18, by = NA) +
##     s(bio19, by = NA) + s(longitude, latitude, bs = "gp")
## <environment: 0x55fed13bab20>
# formula with linear effects
mod_formula2 <-
    type = "linear")
## presence ~ bio1 + bio2 + bio3 + bio4 + bio5 + bio6 + bio7 + bio8 +
##     bio9 + bio10 + bio11 + bio12 + bio13 + bio14 + bio15 + bio16 +
##     bio17 + bio18 + bio19
## <environment: 0x55fecbf27f60>
  • Fit the GAM using mgcv
  • mgcv::bam could be replaced by mgcv::gam, but it has lower memory requirements and offers a significant speed-up, especially with the discrete = TRUE option (Wood, Li, Shaddick, et al., 2017)
  • set discrete = FALSE when calling the predict function to obtain smoother predictions
# see ?mgcv::bam for references
folds <- 1:4
# fold data as a plain data frame (as.data.frame() is assumed here)
train <- as.data.frame(sp_fold_pans_meg$train)
##  create 4 models (each without one of the folds)
# smooth effects
models_smooth <- purrr::map(folds,
  ~ mgcv::bam(mod_formula, data = as.data.frame(train[train$fold != .x, ]),
      family = "binomial", method = "fREML", discrete = TRUE, gamma = 2L))
# linear effects
models_linear <- purrr::map(folds,
  ~ mgcv::bam(mod_formula2, data = as.data.frame(train[train$fold != .x, ]),
      family = "binomial", method = "fREML", gamma = 2L))
# auc for each model
auc_smooth <- purrr::map_dbl(folds, ~ {
    # predict on the fold that was left out when fitting model .x
    predicted <- predict(models_smooth[[.x]], train[train$fold == .x, ], discrete = FALSE)
    observed <- train[train$fold == .x, "presence"]
    MLmetrics::AUC(predicted, observed)
  })
# AUC linear models
auc_linear <- purrr::map_dbl(folds, ~ {
    predicted <- predict(models_linear[[.x]], train[train$fold == .x, ], discrete = FALSE)
    observed <- train[train$fold == .x, "presence"]
    MLmetrics::AUC(predicted, observed)
  })
# comparison of the average AUC over the 4 CV runs
mean(auc_smooth)
## [1] 0.8603036
mean(auc_linear)
## [1] 0.8401573
# in this case we would select the model with smooth effects of covariates

## Refit model for evaluation
mod <- mgcv::bam(
  formula  = mod_formula,
  data     = as.data.frame(sp_fold_pans_meg$train), # as.data.frame() is assumed here
  family   = binomial(),
  method   = "fREML", # fast REML
  discrete = TRUE, # speeds up computation
  gamma    = 2L)
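
The effect of the discrete option can be tried out on a small simulated example. The sketch below is only meant to show the relevant arguments of mgcv::bam and its predict method; the data and variable names are made up and unrelated to the vector data.

```r
library(mgcv)

set.seed(1)
n <- 2000
d <- data.frame(x1 = runif(n), x2 = runif(n))
d$y <- rbinom(n, 1, plogis(-1 + 2 * sin(2 * pi * d$x1) + d$x2))

# discrete = TRUE requires method = "fREML"
fit <- bam(y ~ s(x1) + s(x2), data = d, family = binomial(),
  method = "fREML", discrete = TRUE)

# predictions with and without discretisation of the covariates
p_discrete <- predict(fit, newdata = d, type = "response", discrete = TRUE)
p_exact    <- predict(fit, newdata = d, type = "response", discrete = FALSE)

# largest difference between the two sets of predictions
max(abs(p_discrete - p_exact))
```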

5. Evaluate the model

prediction_test <- predict(
  mod,
  newdata  = as.data.frame(sp_fold_pans_meg$test), # as.data.frame() is assumed here
  type     = "response",
  discrete = FALSE)

prediction_eval_df <- predict(
  mod,
  newdata  = evaluation_df,
  type     = "response",
  discrete = FALSE)

# evaluation w.r.t. to extrapolation (i.e. fold 5)
MLmetrics::AUC(prediction_test, sp_fold_pans_meg$test$presence)
## [1] 0.8935416
# evaluation w.r.t. interpolation (i.e. random hold-out data)
MLmetrics::AUC(prediction_eval_df, evaluation_df$presence)
## [1] 0.9632264

6. Refit model on all data

# extract data points within extended hull of Panstrongylus megistus
df_all <- presence_vector %>%
  as_spatial() %>%
# refit model with all data
mod_all <- update(mod, data = as.data.frame(df_all)) # as.data.frame() is assumed here

7. Visualize results

  • Final prediction:
# newdata with covariate values for each 5x5 pixel within hull of Panstrongylus megistus
env_grids <- env_grids %>%
  raster::crop(sp_fold_pans_meg$hull) %>%
  raster::mask(sp_fold_pans_meg$hull) # assumption: keep only cells within the hull
ndf <- grid_to_df(env_grids)
# calculate predictions, set discrete = FALSE to obtain smoother predictions
prediction <- predict(mod_all, newdata = ndf, type = "link", discrete = FALSE,
  se = TRUE)
ndf$prediction <- exp(prediction$fit)/(1 + exp(prediction$fit))
# calculate CI
ndf$se <- prediction$se
ci_lower <- prediction$fit - 2*prediction$se
ci_upper <- prediction$fit + 2*prediction$se
ndf$ci_lower <- exp(ci_lower)/(1 + exp(ci_lower))
ndf$ci_upper <- exp(ci_upper)/(1 + exp(ci_upper))
ndf$ci <- ndf$ci_upper - ndf$ci_lower
# retransform df to raster for plotting
pred_raster <- df_to_grid(ndf, env_grids[[1]], "prediction")
tm_shape(raster::crop(countries, raster::extent(sp_fold_pans_meg$hull))) +
  tm_borders() +
  tm_shape(pred_raster) +
  tm_raster(style = "cont", palette = viridis::magma(1e3),
    breaks = seq(0, 1, by = .2), alpha = .8)

[Figure: predicted probability of presence within the extended hull of Panstrongylus megistus]
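
The confidence interval above is computed on the link (logit) scale and then transformed to the probability scale, which keeps both limits within [0, 1]. A minimal, self-contained sketch of the same calculation with a simulated logistic regression (glm instead of the GAM, made-up data) could look as follows; plogis() is the base R equivalent of exp(x)/(1 + exp(x)).

```r
set.seed(7)
d <- data.frame(x = rnorm(300))
d$y <- rbinom(300, 1, plogis(-0.5 + d$x))
fit <- glm(y ~ x, family = binomial(), data = d)

nd <- data.frame(x = seq(-2, 2, by = 0.5))
pr <- predict(fit, newdata = nd, type = "link", se.fit = TRUE)

# transform the point prediction and the interval limits to the probability scale
nd$prediction <- plogis(pr$fit)
nd$ci_lower   <- plogis(pr$fit - 2 * pr$se.fit)
nd$ci_upper   <- plogis(pr$fit + 2 * pr$se.fit)
nd$ci         <- nd$ci_upper - nd$ci_lower # interval width on the probability scale
```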

  • Bivariate map (this is not well implemented in R at the moment; manual hacks are required. Alternatively, one could predict the upper/lower CI and plot the CI alongside the prediction)
## Note, this is just for illustration. Specific cut-offs and color palette
# for bivariate maps were used in the publication
# create map + legend
# cut points could be specified
bivar_map <- tm_bivariate(ndf, env_grids[[1]], sp_fold_pans_meg)
# draw figure, x and y control position of legend
tm_bivar_draw(bivar_map, x = .55, y = .05)

[Figure: bivariate map of the prediction and its uncertainty, with legend]


