This folder contains code used to fit species ditribution models for chagas vectors as described here: https://www.biorxiv.org/content/10.1101/738310v1. Note that the Article is in Revision at PLOS NTD and was updated since.
Disclaimer: Since our workflow relies on some data (particularly the satellite data) that require users to sign up to a series of separate open access licences with various organisations, we created a demo (within the README of the repository) that can be run directly (the code calls some functions that download presence data as well as environmental covariates). This demo illustrates the workflow for one species. Further, all code used for analyses in the manuscript is also available in the repository and described in the README.
If anything is unclear or not working, please don't hesitate to open up an issue.
This section illustrates the workflow presented in Bender, Python, Lindsay, et al. (2019). Note that full details are given in the manuscript and this code repository (see also description of the Folder structure below the demo analysis).
# for first run install packages from this repository
# devtools::install("mastergrids", dependencies = TRUE)
# devtools::install("tcruziutils", dependencies = TRUE)
# libraries
devtools::load_all("mastergrids")
devtools::load_all("tcruziutils")
library(dplyr)
library(purrr)
# viz
library(ggplot2)
theme_set(theme_bw())
# modeling
library(scam)
# defaults
# this is just the maximum extent of the endemic zone
# used to crop environmental variables, etc.
extent_tcruzi <- tcruziutils::extent_tcruzi
# country polygons
data(wrld_simpl, package = "maptools")
countries <- wrld_simpl %>% raster::crop(extent_tcruzi)
# colors
Set1 <- RColorBrewer::brewer.pal(9, "Set1")
## set seed as generation of folds is random
set.seed(101850)
- In the publication we used the data from Ceccarelli, Balsalobre, Medone, et al. (2018) and some additional data (but not much). The former is openly available (figshare download link (EXCEL file))
- The code below downloads, imports the file and performs some preprocessing
library(httr)
GET("https://ndownloader.figshare.com/files/10302303",
write_disk(tf <- tempfile(fileext = ".xls")))
## Response [https://s3-eu-west-1.amazonaws.com/pstorage-npg-968563215/10302303/SciDataData_CitationCeccarellietal.2018.xls]
## Date: 2020-05-10 16:54
## Status: 200
## Content-Type: binary/octet-stream
## Size: 5.74 MB
## <ON DISK> /tmp/Rtmpbi1Llj/file32cc611e8598.xls
df <- readxl::read_excel(tf, 1L, na = c("", " ", "NR", "NA"))
presence_vector <- df %>%
mutate(
area = 25L, # analysis will be performed at 5x5 square km resolution
Start_year = as.integer(substr(year, 1L, 4L)),
End_year = as.integer(substr(year, 6L, 9L)))
presence_vector$reference <- df[[17]]
presence_vector[, 17] <- NULL
presence_vector <- presence_vector %>%
mutate(
Public_year = as.integer(stringr::str_extract(.data$reference, "[0-9]{4}")),
Public_year = ifelse(Public_year > 2018L, NA, Public_year),
End_year = ifelse(is.na(End_year), Start_year, End_year)) %>%
select(scientificName, Start_year, End_year, Public_year, area,
individualCount, starts_with("decimal"), habitat, reference) %>%
rename(
species = scientificName,
Latitude = decimalLatitude,
Longitude = decimalLongitude,
n_observed = individualCount) %>%
rename_all(~tolower(sub(" ", "_", .))) %>%
select(-reference)
- imputation of the "year" variable if missing
presence_vector <- presence_vector %>%
rename_all(tolower) %>%
filter(!(is.na(start_year) & is.na(public_year))) %>%
filter(start_year <= end_year | is.na(start_year) | is.na(end_year)) %>%
filter(start_year <= public_year | is.na(start_year) | is.na(public_year)) %>%
filter(start_year >= 2000 | is.na(start_year)) %>%
filter(public_year >= 2000 | is.na(public_year)) %>%
filter(end_year <= public_year | is.na(end_year) | is.na(public_year)) %>%
mutate(
imputed = is.na(start_year),
id = row_number())
# impute start end year
lm_start <- scam(start_year ~ s(public_year, bs = "mpi"),
data = presence_vector)
lm_end <- scam(end_year ~ s(start_year, bs = "mpi") +
s(I(public_year - start_year), bs = "mpi"), data = presence_vector)
presence_vector <- presence_vector %>%
mutate(start_year = ifelse(is.na(start_year),
as.integer(floor(predict (lm_start, .))), start_year)) %>%
mutate(end_year = ifelse(is.na(end_year),
as.integer(ceiling(predict(lm_end, .))), end_year)) %>%
# few imputed end year larger than public year
mutate(end_year = pmin(end_year, public_year)) %>%
# set end_year to start year if still na (not imputed b/c public year = NA)
mutate(end_year = if_else(is.na(end_year), start_year, end_year))
- remove observations before 2000
presence_vector <- presence_vector %>%
filter(start_year >= 2000)
- in the publication we used raster files available from Servers of the Malaria Atlas Project
- the environmental variables were extracted for each observation based on (imputed) year of observation
- here, we use the
raster::getData()
function to obtain the environmental covariates (we use the same covariate layers for all years)
covs <- raster::getData("worldclim", var = "bio", res = 5)[[1:19]]
env_grids <- raster::crop(covs, extent_tcruzi)
# combine presence + covariate data
covs_ex <- raster::extract(env_grids, as_spatial(presence_vector))
presence_vector <- cbind(presence_vector, covs_ex)
The procedure below was iterated over all species (with enough observations) and consists of the following steps:
- Create presence/background column for the species of interest
(here
Panstrongylus megistus
) - Split data into
- "training/test" (using spatial blocks/spatial CV), used later for model selection (folds 1-4) and to asses the models "extrapolation performance"
- "evaluation" (random subsample), used later to asses the models "interpolation performance"
-
The spatial blocks/spatial CV is set up using function
get_sp_folds
which is a wrapper around the packageblockCV
(Valavi, Elith, Lahoz-Monfort, et al., 2018) -
Fit the model on "training data" (spatial folds 1-4). Here, for conciseness we perform a very narrow model comparison of only two competing models.
-
Asses the models performance on data not used during model selection/fit
presence_vector <- presence_vector %>%
mutate(presence = 1L * (species == "Panstrongylus megistus"))
table(presence_vector$presence)
##
## 0 1
## 13412 1184
it_pv <- presence_vector %>%
rsample::initial_split(strata = "species", prop = 4 / 5)
# evaluation data (only used at the very end for "interpolation error" estimate
evaluation_df <- rsample::testing(it_pv)
# train test is split in 5 spatial folds
train_test_df <- rsample::training(it_pv)
sp_tt <- as_spatial(train_test_df)
raster::crs(countries) <- raster::crs(sp_tt)
sp_fold_pans_meg <- get_sp_folds(
data = sp_tt,
species = "Panstrongylus megistus",
mask = countries,
species_var = "species",
n_blocks = 50,# number of blocks
k = 5, # number of folds
width = 5, # width of extended hull (outside of observed )
calc_range = FALSE)
## The best folds was in iteration 84:
## train_0 train_1 test_0 test_1
## 1 2800 795 485 144
## 2 2574 763 711 176
## 3 2505 706 780 233
## 4 2555 718 730 221
## 5 2706 774 579 165
# graphical depiction of spatial folds + presence/background
# within extended hull
tmap_cv(
sp_fold_pans_meg,
countries = countries)
- In the paper we run a CV for different model specifications on folds 1 - 4 (fit on 3 folds, evaluation on 4th fold)
- best setting/modell w.r.t. average performance for the 4 CV runs is then refit on folds 1-4 and evaluated using data from fold 5 (this is AUC value reported in the table)
- this model was also used for evaluation on the random hold-out data (
evaluation_df
) above - After evaluation the final model was obtained by refitting this model on all available data (folds 1 -5 and random evaluation data)
- Here we only compare two different models (one with smooth covariate effects, one with linear effects) to keep code and runtime short.
- create
formula
that specifies the linear/additive predictor usingmake_gam_formula
# create formula
# make_gam_formula: only vars with unique(var) > 20 used
# formula with smooth effects
mod_formula <-
make_gam_formula(
as.data.frame(sp_fold_pans_meg$train),
candidates=names(env_grids),
type = "smooth") %>%
add_gp() # add gaussian process smooth (see ?mgcv::gp.smooth)
mod_formula
## presence ~ s(bio1, by = NA) + s(bio2, by = NA) + s(bio3, by = NA) +
## s(bio4, by = NA) + s(bio5, by = NA) + s(bio6, by = NA) +
## s(bio7, by = NA) + s(bio8, by = NA) + s(bio9, by = NA) +
## s(bio10, by = NA) + s(bio11, by = NA) + s(bio12, by = NA) +
## s(bio13, by = NA) + s(bio14, by = NA) + s(bio15, by = NA) +
## s(bio16, by = NA) + s(bio17, by = NA) + s(bio18, by = NA) +
## s(bio19, by = NA) + s(longitude, latitude, bs = "gp")
## <environment: 0x55fed13bab20>
# formula with linear effects
mod_formula2 <-
make_gam_formula(
as.data.frame(sp_fold_pans_meg$train),
candidates=names(env_grids),
type = "linear")
mod_formula2
## presence ~ bio1 + bio2 + bio3 + bio4 + bio5 + bio6 + bio7 + bio8 +
## bio9 + bio10 + bio11 + bio12 + bio13 + bio14 + bio15 + bio16 +
## bio17 + bio18 + bio19
## <environment: 0x55fecbf27f60>
- Fit the GAM using
mgcv
mgcv::bam
could be replaced bymgcv::gam
, but has less demands w.r.t. to memory reqiurements and offers significant speed-up, especially withdiscrete = TRUE
option (Wood, Li, Shaddick, et al., 2017)- set
discrete = FALSE
when calling thepredict
function to obtain smoother predictions
# see ?mgcv::bam for references
folds <- 1:4
train <- as.data.frame(sp_fold_pans_meg$train)
## create 4 models (each without one of the folds)
# smooth effects
models_smooth <- purrr::map(folds,
~ mgcv::bam(mod_formula, as.data.frame(train[train$fold != .x, ]),
family = "binomial", method = "fREML", discrete = TRUE, gamma = 2L))
# linear effects
models_linear <- purrr::map(folds,
~ mgcv::bam(mod_formula2, as.data.frame(train[train$fold != .x, ]),
family = "binomial", method = "fREML", gamma = 2L))
# auc for each model
auc_smooth <- map_dbl(
folds,
~{
predicted <- predict(models_smooth[[.x]], train[train$fold == .x, ], discrete = FALSE)
observed <- train[train$fold == .x, "presence"]
MLmetrics::AUC(predicted, observed)
})
# AUC linear models
auc_linear <- map_dbl(
folds,
~{
predicted <- predict(models_linear[[.x]], train[train$fold == .x, ], discrete = FALSE)
observed <- train[train$fold == .x, "presence"]
MLmetrics::AUC(predicted, observed)
})
# comparison
mean(auc_smooth)
## [1] 0.8603036
mean(auc_linear)
## [1] 0.8401573
# in this case we would select the model with smooth effects of covariates
## Refit model for evaluation
mod <- mgcv::bam(
formula = mod_formula,
data = as.data.frame(sp_fold_pans_meg$train),
family = binomial(),
method = "fREML", # fast REML
discrete = TRUE, # speeds up computation
gamma = 2L)
prediction_test <- predict(
mod,
newdata = as.data.frame(sp_fold_pans_meg$test),
type = "response",
discrete = FALSE)
prediction_eval_df <- predict(
mod,
newdata = evaluation_df,
type = "response",
discrete = FALSE
)
# evaluation w.r.t. to extrapolation (i.e. fold 5)
MLmetrics::AUC(prediction_test, sp_fold_pans_meg$test$presence)
## [1] 0.8935416
# evaluation w.r.t. interpolation (i.e. random hold-out data)
MLmetrics::AUC(prediction_eval_df, evaluation_df$presence)
## [1] 0.9632264
# extract data points within extended hull of Panstrongylus megistus
df_all <- presence_vector %>%
as_spatial() %>%
raster::crop(sp_fold_pans_meg$hull)
# refit model with all data
mod_all <- update(mod, data = as.data.frame(df_all))
- Final prediction:
# newdata with covariate values for each 5x5 pixel within hull of Panstrongylus megistus
env_grids <- env_grids %>%
raster::crop(sp_fold_pans_meg$hull) %>%
raster::mask(sp_fold_pans_meg$hull)
ndf <- grid_to_df(env_grids)
# calculate predictions, set discrete = FALSE to obtain smoother predictions
prediction <- predict(mod_all, newdata = ndf, type = "link", discrete = FALSE,
se = TRUE)
ndf$prediction <- exp(prediction$fit)/(1 + exp(prediction$fit))
# calculate CI
ndf$se <- prediction$se
ci_lower <- prediction$fit - 2*prediction$se
ci_upper <- prediction$fit + 2*prediction$se
ndf$ci_lower <- exp(ci_lower)/(1 + exp(ci_lower))
ndf$ci_upper <- exp(ci_upper)/(1 + exp(ci_upper))
ndf$ci <- ndf$ci_upper - ndf$ci_lower
# retransform df to raster for plotting
pred_raster <- df_to_grid(ndf, env_grids[[1]], "prediction")
tm_shape(raster::crop(countries, raster::extent(sp_fold_pans_meg$hull))) +
tm_borders() +
tm_shape(pred_raster) +
tm_raster(style = "cont", palette = viridis::magma(1e3),
breaks = seq(0, 1, by = .2), alpha = .8)
- Bivariate map (this is not well implemented in the moment in R), manual hacks required (alternatively, could predict upper/lower CI and plot CI alongside prediction)
## Note, this is just for illustration. Specific cut-offs and color palette
# for bivariate maps were used in the publication
# create map + legend
# cut points could be specified
bivar_map <- tm_bivariate(ndf, env_grids[[1]], sp_fold_pans_meg)
# draw figure, x and y control position of legend
tm_bivar_draw(bivar_map, x = .55, y = .05)
Bender, A, A. Python, S. Lindsay, et al. (2019). "Modelling geospatial distributions of the triatomine vectors of Trypanosoma cruzi in Latin America". En. In: bioRxiv, p. 738310. DOI: 10.1101/738310. URL: https://www.biorxiv.org/content/10.1101/738310v1 (visited on Aug. 17, 2019).
Ceccarelli, S, A. Balsalobre, P. Medone, et al. (2018). "DataTri, a database of American triatomine species occurrence". En. In: Scientific Data 5, p. 180071. ISSN: 2052-4463. DOI: 10.1038/sdata.2018.71. URL: https://www.nature.com/articles/sdata201871 (visited on Nov. 14, 2018).
Valavi, R, J. Elith, J. J. Lahoz-Monfort, et al. (2018). "blockCV: An r package for generating spatially or environmentally separated folds for k-fold cross-validation of species distribution models". In: Methods in Ecology and Evolution 0.0. ISSN: 2041-210X. DOI: 10.1111/2041-210X.13107. URL: https://besjournals.onlinelibrary.wiley.com/doi/full/10.1111/2041-210X.13107 (visited on Feb. 13, 2019).
Wood, S. N, Z. Li, G. Shaddick, et al. (2017). "Generalized Additive Models for Gigadata: Modeling the U.K. Black Smoke Network Daily Data". In: Journal of the American Statistical Association 112.519, pp. 1199-1210. ISSN: 0162-1459. DOI: 10.1080/01621459.2016.1195744. URL: https://doi.org/10.1080/01621459.2016.1195744 (visited on May. 31, 2018).
sessionInfo()
## R version 4.0.0 (2020-04-24)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.4 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] sf_0.9-3 httr_1.4.1 scam_1.2-5 mgcv_1.8-31
## [5] nlme_3.1-147 ggplot2_3.3.0 purrr_0.3.4 dplyr_0.8.99.9002
## [9] tcruziutils_0.0.7 mastergrids_0.0.3 RefManageR_1.2.12 knitr_1.28
##
## loaded via a namespace (and not attached):
## [1] readxl_1.3.1 spam_2.5-1 backports_1.1.6
## [4] lwgeom_0.2-3 plyr_1.8.6 lazyeval_0.2.2
## [7] sp_1.4-1 splines_4.0.0 crosstalk_1.1.0.1
## [10] listenv_0.8.0 leaflet_2.0.3 gstat_2.0-5
## [13] usethis_1.6.1 digest_0.6.25 foreach_1.5.0
## [16] htmltools_0.4.0 viridis_0.5.1 fansi_0.4.1
## [19] magrittr_1.5 checkmate_2.0.0 memoise_1.1.0
## [22] MLmetrics_1.1.1 tensor_1.5 remotes_2.1.1
## [25] recipes_0.1.12 globals_0.12.5 gower_0.2.1
## [28] xts_0.12-0 rsample_0.0.6 prettyunits_1.1.1
## [31] colorspace_1.4-1 rgdal_1.4-8 xfun_0.13
## [34] leafem_0.1.1 callr_3.4.3 crayon_1.3.4
## [37] jsonlite_1.6.1 spatstat_1.63-3 spatstat.data_1.4-3
## [40] zoo_1.8-8 survival_3.1-12 iterators_1.0.12
## [43] glue_1.4.0 stars_0.4-1 polyclip_1.10-0
## [46] pals_1.6 gtable_0.3.0 ipred_0.9-9
## [49] pkgbuild_1.0.8 maps_3.3.0 abind_1.4-5
## [52] scales_1.1.0 mvtnorm_1.1-0 DBI_1.1.0
## [55] bibtex_0.4.2.2 Rcpp_1.0.4.6 viridisLite_0.3.0
## [58] progress_1.2.2 units_0.6-6 mapproj_1.2.7
## [61] dotCall64_1.0-0 Formula_1.2-3 intervals_0.15.2
## [64] stats4_4.0.0 lava_1.6.7 prodlim_2019.11.13
## [67] dismo_1.1-4 htmlwidgets_1.5.1 FNN_1.1.3
## [70] RColorBrewer_1.1-2 geosphere_1.5-10 ellipsis_0.3.0
## [73] farver_2.0.3 reshape_0.8.8 pkgconfig_2.0.3
## [76] XML_3.99-0.3 nnet_7.3-14 deldir_0.1-25
## [79] caret_6.0-86 tidyselect_1.0.0 rlang_0.4.6.9000
## [82] reshape2_1.4.4 tmaptools_3.0 cellranger_1.1.0
## [85] munsell_0.5.0 tools_4.0.0 xgboost_1.0.0.2
## [88] cli_2.0.2 generics_0.0.2 devtools_2.3.0
## [91] evaluate_0.14 stringr_1.4.0 goftest_1.2-2
## [94] ModelMetrics_1.2.2.2 processx_3.4.2 leafsync_0.1.0
## [97] fs_1.4.1 timereg_1.9.4 blockCV_2.1.1
## [100] pec_2019.11.03 pbapply_1.4-2 future_1.17.0
## [103] xml2_1.3.2 compiler_4.0.0 rstudioapi_0.11
## [106] curl_4.3 png_0.1-7 e1071_1.7-3
## [109] testthat_2.3.2 spatstat.utils_1.17-0 spacetime_1.2-3
## [112] tibble_3.0.1 stringi_1.4.6 highr_0.8
## [115] ps_1.3.3 desc_1.2.0 fields_10.3
## [118] rgeos_0.5-3 lattice_0.20-41 Matrix_1.2-18
## [121] classInt_0.4-3 vctrs_0.3.0 pillar_1.4.4
## [124] lifecycle_0.2.0 furrr_0.1.0 pammtools_0.2.2
## [127] cowplot_1.0.0 data.table_1.12.8 raster_3.1-5
## [130] R6_2.4.1 gridExtra_2.3 KernSmooth_2.23-17
## [133] sessioninfo_1.1.1 codetools_0.2-16 dichromat_2.0-0
## [136] MASS_7.3-51.6 assertthat_0.2.1 pkgload_1.0.2
## [139] rprojroot_1.3-2 withr_2.2.0 hms_0.5.3
## [142] parallel_4.0.0 grid_4.0.0 rpart_4.1-15
## [145] timeDate_3043.102 tidyr_1.0.3 class_7.3-17
## [148] automap_1.0-14 tmap_3.0 pROC_1.16.2
## [151] numDeriv_2016.8-1.1 lubridate_1.7.8 base64enc_0.1-3
Overview of project files and folders:
-
Preliminary notes:
- The initial steps of this project requiere access to the
mastergrids
folder at Maleria Atlas Project and will thus not be fully reproducible
- The initial steps of this project requiere access to the
-
mastergrids
: AnR
package that facilitates the import of environmental raster data from themastergrids
folder at BDI MAP and some utility functions to transform rasters to data frames and vice versa (the data import won't work outside of the BDI/without connection to mastergrids). Also contains two functionsgrid_to_df
anddf_to_grid
which convert RasterLayer/RasterBrick object to a data frame and vice versa. -
tcruziutils
: AnR
package that facilitates all steps of the analysis. Most functions are specific to this project and should not be used for general purpose projects. It can be loaded at the beginning of the script usingdevtools::load_all("<path to project>/tcruziutils")
or installed viadevtools::install("<path to project>/tcruziutils")
and then loaded as usual withlibrary(tcruziutils)
-
infection data
(not included): Contains data on infection prevalence and presence in vectors and humansExternal vector database
: Additional data set (compiled by another research group that contains presence only data on different vector species)
-
endemic zone
(not included): Contains a shape-file that defines the endemic zone of the disease ("mask"). Can be used to crop environmental grid data and other spatial objects. The spatial extent defined by this mask is stored in thetcruziutils
package asraster::extent
andsp::bbox
objects for convenience (seetcruziutils::extent_tcruzi
andtcruziutils::bbox_tcruzi
). -
polygon boundaries
(not included): Shapefiles containing polygon boundaries on administrative district levels 1 and 2. Used to extract location/area and polygon information based on the GAUL code. Cropped according to endemic zone and stored asshp_admin_1.Rds
andshp_admin_2.Rds
in thepreprocessing
folder -
preprocessing
: This folder contains the main pre-processing scripts and stores the pre-processed data sets that will be used for modeling-
import.R
: Initial data import of the presence/prevalence. For observations recorded on a polygon level, also adds the centroid and area information of the respective polygon to the data (see functionadd_spat_dat
intcruziutils
package) -
prep-for-modeling.Rmd
: Builds on theimport.R
script and pre-processes the initially imported data for modeling purposes. This includes:- application of inclusion/exclusion criteria
- imputation of (some) missing data
- addition of covariate layers to the observed presence/absence data
(based on coordinates; see
raster::extract
andtcruziutils::add_grid_data
) - Splits the data set in train/test data (and possibly evaluation
data). Additionally creates block-wise cross-validation scheme,
stored as
fold
variable in the original data. - also produces some additional visualizations (stored in
tcruzi/figures
aspdf
andpng
)
-
-
modeling
:vector_occurrence-sdm
: Folder containing Species Distribution Models (SDM) based on presense only data:- See README therein
Overview of project files and folders:
-
Preliminary notes:
- The initial steps of this project requiere access to the
mastergrids
folder at Maleria Atlas Project and will thus not be fully reproducible
- The initial steps of this project requiere access to the
-
mastergrids
: AnR
package that facilitates the import of environmental raster data from themastergrids
folder at BDI MAP and some utility functions to transform rasters to data frames and vice versa (the data import won't work outside of the BDI/without connection to mastergrids). Also contains two functionsgrid_to_df
anddf_to_grid
which convert RasterLayer/RasterBrick object to a data frame and vice versa. -
tcruziutils
: AnR
package that facilitates all steps of the analysis. Most functions are specific to this project and should not be used for general purpose projects. It can be loaded at the beginning of the script usingdevtools::load_all("<path to project>/tcruziutils")
or installed viadevtools::install("<path to project>/tcruziutils")
and then loaded as usual withlibrary(tcruziutils)
-
infection data
(not included): Contains data on infection prevalence and presence in vectors and humansExternal vector database
: Additional data set (compiled by another research group that contains presence only data on different vector species)
-
endemic zone
(not included): Contains a shape-file that defines the endemic zone of the disease ("mask"). Can be used to crop environmental grid data and other spatial objects. The spatial extent defined by this mask is stored in thetcruziutils
package asraster::extent
andsp::bbox
objects for convenience (seetcruziutils::extent_tcruzi
andtcruziutils::bbox_tcruzi
). -
polygon boundaries
(not included): Shapefiles containing polygon boundaries on administrative district levels 1 and 2. Used to extract location/area and polygon information based on the GAUL code. Cropped according to endemic zone and stored asshp_admin_1.Rds
andshp_admin_2.Rds
in thepreprocessing
folder -
preprocessing
: This folder contains the main pre-processing scripts and stores the pre-processed data sets that will be used for modeling-
import.R
: Initial data import of the presence/prevalence. For observations recorded on a polygon level, also adds the centroid and area information of the respective polygon to the data (see functionadd_spat_dat
intcruziutils
package) -
prep-for-modeling.Rmd
: Builds on theimport.R
script and pre-processes the initially imported data for modeling purposes. This includes:- application of inclusion/exclusion criteria
- imputation of (some) missing data
- addition of covariate layers to the observed presence/absence data
(based on coordinates; see
raster::extract
andtcruziutils::add_grid_data
) - Splits the data set in train/test data (and possibly evaluation
data). Additionally creates block-wise cross-validation scheme,
stored as
fold
variable in the original data. - also produces some additional visualizations (stored in
tcruzi/figures
aspdf
andpng
)
-
-
modeling
:vector_occurrence-sdm
: Folder containing Species Distribution Models (SDM) based on presense only data:- See README therein