# Machine Learning Algorithms for soil mapping {#soilmapping-using-mla}
*Edited by: T. Hengl*
## Spatial prediction of soil properties and classes using MLA's
This chapter reviews some common Machine learning algorithms (MLA's) that have
demonstrated potential for soil mapping projects i.e. for generating spatial
predictions [@brungard2015machine; @heung2016overview; @behrens2018multi].
In this tutorial we especially focus on using tree-based algorithms such as [random forest](https://en.wikipedia.org/wiki/Random_forest), [gradient boosting](https://en.wikipedia.org/wiki/Gradient_boosting) and [Cubist](https://cran.r-project.org/package=Cubist).
For a more in-depth overview of machine learning algorithms used in statistics
refer to the CRAN Task View on [Machine Learning & Statistical Learning](https://cran.r-project.org/web/views/MachineLearning.html).
As a gentle introduction to Machine and Statistical Learning we recommend:
* Irizarry, R.A., (2018) [**Introduction to Data Science: Data Analysis and Prediction Algorithms with R**](https://rafalab.github.io/dsbook/). HarvardX Data Science Series.
* Kuhn, M., Johnson, K. (2013) [**Applied Predictive Modeling**](http://appliedpredictivemodeling.com). Springer Science, ISBN: 9781461468493, 600 pages.
* Molnar, C. (2019) [**Interpretable Machine Learning: A Guide for Making Black Box Models Explainable**](https://christophm.github.io/interpretable-ml-book/), Leanpub, 251 pages.
Some other examples of how MLA's can be used to fit Pedo-Transfer-Functions can be found in section \@ref(mla-ptfs).
### Loading the packages and data
We use the following packages:
```{r}
library(plotKML)
library(sp)
library(randomForest)
library(nnet)
library(e1071)
library(GSIF)
library(plyr)
library(raster)
library(caret)
library(Cubist)
library(xgboost)
library(viridis)
```
```{r, include=FALSE}
h2o::h2o.no_progress()
```
Next, we load the [Ebergotzen](http://plotkml.r-forge.r-project.org/eberg.html) data set, which consists of point data collected using a soil auger and a stack of rasters containing all covariates:
```{r}
library(plotKML)
data(eberg)
data(eberg_grid)
coordinates(eberg) <- ~X+Y
proj4string(eberg) <- CRS("+init=epsg:31467")
gridded(eberg_grid) <- ~x+y
proj4string(eberg_grid) <- CRS("+init=epsg:31467")
```
The covariates are then converted to principal components to reduce multicollinearity and dimensionality:
```{r}
eberg_spc <- spc(eberg_grid, ~ PRMGEO6+DEMSRT6+TWISRT6+TIRAST6)
eberg_grid@data <- cbind(eberg_grid@data, eberg_spc@predicted@data)
```
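As a rough illustration of what this step achieves, the sketch below runs a principal component analysis on a small synthetic covariate table using base R `prcomp`. The data and the object names (`toy_cov`, `toy_pca`, `toy_pc`) are invented for illustration, and `prcomp` is used only as a stand-in: it is not claimed to be what `spc` calls internally.

```r
## Toy example (hypothetical data): PCA on three correlated covariates
set.seed(1)
n <- 200
dem <- rnorm(n, mean = 300, sd = 50)
toy_cov <- data.frame(
  DEM = dem,
  TWI = 15 - 0.02 * dem + rnorm(n, sd = 0.5),  # correlated with DEM
  TIR = 20 + 0.01 * dem + rnorm(n, sd = 0.3)   # correlated with DEM
)
## scale. = TRUE standardizes each covariate before the rotation
toy_pca <- prcomp(toy_cov, center = TRUE, scale. = TRUE)
toy_pc <- as.data.frame(toy_pca$x)
## the components are mutually uncorrelated (off-diagonal values ~ 0):
round(cor(toy_pc), 3)
## proportion of variance explained by each component:
summary(toy_pca)$importance["Proportion of Variance", ]
```

Because the components are uncorrelated and ordered by explained variance, one can often keep only the first few as model inputs.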
All further analysis is run using the so-called *regression matrix* (matrix produced using the overlay of points and grids), which contains values of the target variable and all covariates for all training points:
```{r}
ov <- over(eberg, eberg_grid)
m <- cbind(ov, eberg@data)
dim(m)
```
In this case the regression matrix consists of 3670 observations and has 44 columns.
### Spatial prediction of soil classes using MLA's
In the first example, we focus on mapping soil types using the auger point data. First, we need to filter out some classes that do not occur frequently enough to support statistical modelling. As a rule of thumb, a class to be modelled should have more than 5 observations:
```{r}
xg <- summary(m$TAXGRSC, maxsum=(1+length(levels(m$TAXGRSC))))
str(xg)
selg.levs <- attr(xg, "names")[xg > 5]
attr(xg, "names")[xg <= 5]
```
This shows that two classes probably have too few observations and should be excluded from further modelling:
```{r}
m$soiltype <- m$TAXGRSC
m$soiltype[which(!m$TAXGRSC %in% selg.levs)] <- NA
m$soiltype <- droplevels(m$soiltype)
str(summary(m$soiltype, maxsum=length(levels(m$soiltype))))
```
We can also remove all points that contain missing values for any combination of covariates and target variable:
```{r}
m <- m[complete.cases(m[,1:(ncol(eberg_grid)+2)]),]
m$soiltype <- as.factor(m$soiltype)
summary(m$soiltype)
```
We can now test fitting an MLA, i.e. a random forest model, using the first ten principal components derived from four covariate layers (parent material map, elevation, TWI and ASTER thermal band):
```{r}
## subset to speed-up:
s <- sample.int(nrow(m), 500)
TAXGRSC.rf <- randomForest(x=m[-s,paste0("PC",1:10)], y=m$soiltype[-s],
xtest=m[s,paste0("PC",1:10)], ytest=m$soiltype[s])
## accuracy:
TAXGRSC.rf$test$confusion[,"class.error"]
```
Note that, by specifying `xtest` and `ytest`, we run both model fitting and validation on the 500 held-out points. The results show a relatively high prediction error of about 60% i.e. a classification accuracy of about 40%.
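To make the link between per-class error and overall accuracy explicit, here is a minimal sketch using a made-up confusion matrix. The counts and object names (`toy_conf`, `toy_class_error`, `toy_accuracy`) are hypothetical and do not come from the Ebergotzen data.

```r
## Toy confusion matrix (hypothetical counts): rows = observed, columns = predicted
toy_conf <- matrix(c(30,  5,  5,
                      8, 20, 12,
                      4,  6, 10),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(observed = c("A", "B", "C"),
                                   predicted = c("A", "B", "C")))
## per-class error: share of each observed class that was misclassified
toy_class_error <- 1 - diag(toy_conf) / rowSums(toy_conf)
toy_class_error
## overall accuracy: correctly classified points over all points
toy_accuracy <- sum(diag(toy_conf)) / sum(toy_conf)
toy_accuracy
```

Note that the overall accuracy is dominated by the frequent classes, so it is worth inspecting the per-class errors as well.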
We can also test some other MLA's that are suited for this data — `multinom` from the [nnet](https://cran.r-project.org/package=nnet) package, and `svm` (Support Vector Machine) from the [e1071](https://cran.r-project.org/package=e1071) package:
```{r}
TAXGRSC.rf <- randomForest(x=m[,paste0("PC",1:10)], y=m$soiltype)
fm <- as.formula(paste("soiltype~", paste(paste0("PC",1:10), collapse="+")))
TAXGRSC.mn <- nnet::multinom(fm, m)
TAXGRSC.svm <- e1071::svm(fm, m, probability=TRUE, cross=5)
TAXGRSC.svm$tot.accuracy
```
This produces about the same accuracy levels as for random forest. Because all three methods produce comparable accuracy, we can also merge predictions by calculating a simple average:
```{r}
probs1 <- predict(TAXGRSC.mn, eberg_grid@data, type="probs", na.action = na.pass)
probs2 <- predict(TAXGRSC.rf, eberg_grid@data, type="prob", na.action = na.pass)
probs3 <- attr(predict(TAXGRSC.svm, eberg_grid@data,
probability=TRUE, na.action = na.pass), "probabilities")
```
We then derive the average prediction:
```{r}
leg <- levels(m$soiltype)
lt <- list(probs1[,leg], probs2[,leg], probs3[,leg])
probs <- Reduce("+", lt) / length(lt)
## copy and make new raster object:
eberg_soiltype <- eberg_grid
eberg_soiltype@data <- data.frame(probs)
```
Check that all predictions sum up to 100%:
```{r}
ch <- rowSums(eberg_soiltype@data)
summary(ch)
```
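The averaging step can be checked on a small made-up example. The sketch below assumes two models that return class probabilities with the columns in different orders; all values and names (`toy_p1`, `toy_p2`, `toy_probs`) are hypothetical.

```r
## Toy example (hypothetical probabilities) for 2 pixels and 3 classes
toy_leg <- c("A", "B", "C")
toy_p1 <- matrix(c(0.6, 0.3, 0.1,
                   0.2, 0.5, 0.3), nrow = 2, byrow = TRUE,
                 dimnames = list(NULL, toy_leg))
## the second model returns the classes in a different column order:
toy_p2 <- matrix(c(0.2, 0.4, 0.4,
                   0.3, 0.1, 0.6), nrow = 2, byrow = TRUE,
                 dimnames = list(NULL, c("C", "A", "B")))
## align columns by class name before averaging:
toy_lt <- list(toy_p1[, toy_leg], toy_p2[, toy_leg])
toy_probs <- Reduce("+", toy_lt) / length(toy_lt)
toy_probs
## averaged probabilities still sum to 1 per pixel:
rowSums(toy_probs)
```

Selecting columns by class name, as done with `leg` above, is what keeps the average meaningful when models order the classes differently.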
To plot the result we can use the raster package (Fig. \@ref(fig:plot-eberg-soiltype)):
```{r plot-eberg-soiltype, echo=FALSE, fig.width=9, fig.cap="Predicted soil types for the Ebergotzen case study."}
plot(raster::stack(eberg_soiltype), col=SAGA_pal[[10]], zlim=c(0,1))
```
By using the produced predictions we can further derive the Confusion Index (to map thematic uncertainty) and see if some classes should be aggregated. We can also generate a factor-type map by selecting the most probable class for each pixel, by using e.g.:
```{r}
eberg_soiltype$cl <- as.factor(apply(eberg_soiltype@data,1,which.max))
levels(eberg_soiltype$cl) = attr(probs, "dimnames")[[2]][as.integer(levels(eberg_soiltype$cl))]
summary(eberg_soiltype$cl)
```
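The Confusion Index mentioned above is not computed in this chapter. A minimal sketch follows, assuming one common definition: one minus the difference between the highest and the second-highest class probability. The function name `confusion_index` and the toy probabilities are invented for illustration.

```r
## Toy class probabilities (hypothetical): rows = pixels, columns = classes
toy_prob <- rbind(c(0.90, 0.05, 0.05),   # clear winner
                  c(0.40, 0.38, 0.22),   # two competing classes
                  c(0.34, 0.33, 0.33))   # almost no information
## Confusion Index = 1 - (highest probability - second highest probability)
confusion_index <- function(p) {
  s <- sort(p, decreasing = TRUE)
  1 - (s[1] - s[2])
}
toy_ci <- apply(toy_prob, 1, confusion_index)
round(toy_ci, 2)
```

Values close to 1 flag pixels where the two most probable classes are nearly indistinguishable; under the same assumption, the function could be applied row-wise to the probability columns of `eberg_soiltype@data`.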
### Modelling numeric soil properties using h2o
Random forest is suited for both classification and regression problems (it is one of the most popular MLA's for soil mapping). Consequently, we can also use it for modelling numeric soil properties i.e. to fit models and generate predictions. However, because the randomForest package in R is not suited for large data sets, we can instead use a parallelized (more scalable) version of random forest, i.e. the one implemented in the [h2o package](http://www.h2o.ai/) [@richter2015multi]. h2o is a Java-based implementation, so installing the package requires Java libraries (the package is about 80MB in size, so it might take some time to download and install), and all computing is, in principle, run outside of R i.e. within the JVM (Java Virtual Machine).
In the following example we look at mapping sand content for the upper horizons. To initiate h2o we run:
```{r, message=FALSE}
library(h2o)
localH2O = h2o.init(startH2O=TRUE)
```
This shows that multiple cores will be used for computing (to control the number of cores you can use the `nthreads` argument). Next, we need to prepare the regression matrix and prediction locations using the `as.h2o` function so that they are visible to h2o:
```{r, message=FALSE}
eberg.hex <- as.h2o(m, destination_frame = "eberg.hex")
eberg.grid <- as.h2o(eberg_grid@data, destination_frame = "eberg.grid")
```
We can now fit a random forest model by using all the computing power available to us:
```{r}
RF.m <- h2o.randomForest(y = which(names(m)=="SNDMHT_A"),
x = which(names(m) %in% paste0("PC",1:10)),
training_frame = eberg.hex, ntrees = 50)
RF.m
```
This shows that the model fitting R-square is about 50%. This is also indicated by the predicted vs observed plot:
```{r}
library(scales)
library(lattice)
SDN.pred <- as.data.frame(h2o.predict(RF.m, eberg.hex, na.action=na.pass))$predict
plt1 <- xyplot(m$SNDMHT_A ~ SDN.pred, asp=1,
par.settings=list(
plot.symbol = list(col=scales::alpha("black", 0.6),
fill=scales::alpha("red", 0.6), pch=21, cex=0.8)),
ylab="measured", xlab="predicted (machine learning)")
```
```{r obs-pred-snd, echo=FALSE, fig.cap="Measured vs predicted sand content based on the Random Forest model.", out.width="40%"}
knitr::include_graphics("figures/Measured_vs_predicted_SAND_plot.png")
```
To produce a map based on these predictions we use:
```{r}
eberg_grid$RFx <- as.data.frame(h2o.predict(RF.m, eberg.grid, na.action=na.pass))$predict
```
```{r map-snd, echo=FALSE, fig.width=8, out.width="80%", fig.cap="Predicted sand content based on random forest."}
eberg.pts = list("sp.points", eberg, pch = 21, cex = .7, col="black")
spplot(eberg_grid["RFx"], col.regions=rev(viridis(20)), sp.layout = list(eberg.pts))
```
h2o has another MLA of interest for soil mapping called *deep learning* (a feed-forward multilayer artificial neural network). Fitting the model uses the same syntax as for random forest:
```{r}
DL.m <- h2o.deeplearning(y = which(names(m)=="SNDMHT_A"),
x = which(names(m) %in% paste0("PC",1:10)),
training_frame = eberg.hex)
DL.m
```
This delivers performance comparable to the random forest model. The output prediction map does, however, show somewhat different patterns than the random forest predictions (compare Fig. \@ref(fig:map-snd) and Fig. \@ref(fig:map-snd-dl)).
```{r}
## predictions:
eberg_grid$DLx <- as.data.frame(h2o.predict(DL.m, eberg.grid, na.action=na.pass))$predict
```
```{r map-snd-dl, echo=FALSE, fig.width=8, out.width="80%", fig.cap="Predicted sand content based on deep learning."}
spplot(eberg_grid["DLx"], col.regions=rev(viridis(20)), sp.layout = list(eberg.pts))
```
Which of the two methods should we use? Since they both have comparable performance, the most logical option is to generate ensemble (merged) predictions i.e. to produce a map that shows patterns averaged between the two methods (note: many sophisticated MLA's such as random forest, neural nets, SVM and similar will often produce comparable results i.e. they are often equally applicable and there is no clear *winner*). As a simple approach, we can merge the predictions using a weighted average, with the R-square of each model as its weight:
```{r}
rf.R2 <- RF.m@model$training_metrics@metrics$r2
dl.R2 <- DL.m@model$training_metrics@metrics$r2
eberg_grid$SNDMHT_A <- rowSums(cbind(eberg_grid$RFx*rf.R2,
eberg_grid$DLx*dl.R2), na.rm=TRUE)/(rf.R2+dl.R2)
```
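Written out, the merging step in the code above corresponds to a weighted average of the individual predictions. The notation below ($\hat{y}_k$ for the prediction of model $k$, $w_k$ for its weight) is introduced here for illustration only:

\begin{equation}
\hat{y}({\bf s}) = \frac{\sum_{k=1}^{K} w_k \, \hat{y}_k({\bf s})}{\sum_{k=1}^{K} w_k}
\end{equation}

with $K = 2$ and $w_k$ set to the R-square of each model as reported in its training metrics. Note that this formula assumes both predictions are available at location ${\bf s}$: because the code uses `na.rm=TRUE` but always divides by the full sum of weights, a pixel where one of the two predictions is missing would receive a value that is too low.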
```{r map-snd-ensemble, echo=FALSE, fig.width=8, out.width="80%", fig.cap="Predicted sand content based on ensemble predictions."}
spplot(eberg_grid["SNDMHT_A"], col.regions=rev(viridis(20)), sp.layout = list(eberg.pts))
```
Indeed, the output map now shows patterns of both methods and is likely slightly more accurate than either of the individual MLA's [@krogh1996learning].
### Spatial prediction of 3D (numeric) variables {#prediction-3D}
In the final exercise, we look at another two ML-based packages that are also of interest for soil mapping projects: cubist [@kuhn2012cubist; @kuhn2013applied] and xgboost [@2016arXiv160302754C]. The objective is now to fit models and predict continuous soil properties in 3D. To fine-tune some of the models we will also use the [caret](http://topepo.github.io/caret/) package, which is highly recommended for optimizing model fitting and cross-validation. Read more about how to derive soil organic carbon stock using 3D soil mapping in section \@ref(ocs-3d-approach).
We will use another soil mapping data set from Australia called [“Edgeroi”](http://gsif.r-forge.r-project.org/edgeroi.html), which is described in detail in @Malone2009Geoderma. We can load the profile data and covariates by using:
```{r}
data(edgeroi)
edgeroi.sp <- edgeroi$sites
coordinates(edgeroi.sp) <- ~ LONGDA94 + LATGDA94
proj4string(edgeroi.sp) <- CRS("+proj=longlat +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +no_defs")
edgeroi.sp <- spTransform(edgeroi.sp, CRS("+init=epsg:28355"))
load("extdata/edgeroi.grids.rda")
gridded(edgeroi.grids) <- ~x+y
proj4string(edgeroi.grids) <- CRS("+init=epsg:28355")
```
Here we are interested in modelling soil organic carbon content in g/kg for different depths. We again start by producing the regression matrix:
```{r}
ov2 <- over(edgeroi.sp, edgeroi.grids)
ov2$SOURCEID <- edgeroi.sp$SOURCEID
str(ov2)
```
Because we will run 3D modelling, we also need to add the depth of horizons. We use a small function to assign depth values as the center depth of each horizon (as shown in Fig. \@ref(fig:hor-3d-scheme)). Because we know where the horizons start and stop, we can also copy the values of the target variable to the upper and lower boundary of each sufficiently thick horizon, so that the model knows at which depths the values of properties change.
```{r}
## Convert soil horizon data to x,y,d regression matrix for 3D modeling:
hor2xyd <- function(x, U="UHDICM", L="LHDICM", treshold.T=15){
x$DEPTH <- x[,U] + (x[,L] - x[,U])/2
x$THICK <- x[,L] - x[,U]
sel <- x$THICK < treshold.T
## begin and end of the horizon:
x1 <- x[!sel,]; x1$DEPTH = x1[,L]
x2 <- x[!sel,]; x2$DEPTH = x2[,U]
y <- do.call(rbind, list(x, x1, x2))
return(y)
}
```
```{r hor-3d-scheme, echo=FALSE, fig.cap="Training points assigned to a soil profile with 3 horizons. Using the function from above, we assign a total of 7 training points i.e. about 2 times more training points than there are horizons.", out.width="60%"}
knitr::include_graphics("figures/horizon_depths_for_3d_modeling_scheme.png")
```
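To see what this depth assignment produces, the sketch below mirrors the logic of `hor2xyd` on a single made-up profile with three horizons. The values and the object names (`toy_hor`, `toy_xyd`, etc.) are hypothetical; the 15 cm thickness limit is the function's default.

```r
## Toy profile (hypothetical values) with 3 horizons:
toy_hor <- data.frame(SOURCEID = "P1",
                      UHDICM = c(0, 20, 50),
                      LHDICM = c(20, 50, 100),
                      ORCDRC = c(12, 6, 2))
## mid-depth for every horizon ...
toy_mid <- transform(toy_hor, DEPTH = UHDICM + (LHDICM - UHDICM) / 2)
## ... plus upper and lower boundary for horizons at least 15 cm thick
toy_thick <- toy_mid[(toy_mid$LHDICM - toy_mid$UHDICM) >= 15, ]
toy_top <- transform(toy_thick, DEPTH = UHDICM)
toy_bot <- transform(toy_thick, DEPTH = LHDICM)
toy_xyd <- rbind(toy_mid, toy_top, toy_bot)
toy_xyd[order(toy_xyd$DEPTH), c("DEPTH", "ORCDRC")]
## distinct depths at which the model receives training values:
sort(unique(toy_xyd$DEPTH))
```

In this toy case the three horizons yield nine rows at seven distinct depths, since the lower boundary of one horizon coincides with the upper boundary of the next.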
```{r}
h2 <- hor2xyd(edgeroi$horizons)
## regression matrix:
m2 <- plyr::join_all(dfs = list(edgeroi$sites, h2, ov2))
## spatial prediction model:
formulaStringP2 <- ORCDRC ~ DEMSRT5+TWISRT5+PMTGEO5+
EV1MOD5+EV2MOD5+EV3MOD5+DEPTH
mP2 <- m2[complete.cases(m2[,all.vars(formulaStringP2)]),]
```
Note that `DEPTH` is used as a covariate, which makes this model 3D as one can predict anywhere in 3D space. To improve random forest modelling, we use the caret package, which also tries to identify the optimal `mtry` parameter based on the cross-validation performance:
```{r}
library(caret)
ctrl <- trainControl(method="repeatedcv", number=5, repeats=1)
sel <- sample.int(nrow(mP2), 500)
tr.ORCDRC.rf <- train(formulaStringP2, data = mP2[sel,],
method = "rf", trControl = ctrl, tuneLength = 3)
tr.ORCDRC.rf
```
In this case, `mtry = 12` seems to achieve the best performance. Note that we sub-set the initial matrix to speed up fine-tuning of the parameters (otherwise the computing time could easily become too great). Next, we can fit the final model by using all data (this time we also turn cross-validation off):
```{r}
ORCDRC.rf <- train(formulaStringP2, data=mP2,
method = "rf", tuneGrid = tr.ORCDRC.rf$bestTune,
trControl=trainControl(method="none"))
w1 <- 100*max(tr.ORCDRC.rf$results$Rsquared)
```
The variable importance plot indicates that DEPTH is by far the most important predictor:
```{r varimp-plot-edgeroi, echo=FALSE, fig.cap="Variable importance plot for predicting soil organic carbon content (ORC) in 3D.", out.width="70%"}
varImpPlot(ORCDRC.rf$finalModel, cex.axis = .7, main = "")
```
We can also try fitting models using the Cubist and xgboost packages:
```{r}
tr.ORCDRC.cb <- train(formulaStringP2, data=mP2[sel,],
method = "cubist", trControl = ctrl, tuneLength = 3)
ORCDRC.cb <- train(formulaStringP2, data=mP2,
method = "cubist",
tuneGrid=data.frame(committees = 1, neighbors = 0),
trControl=trainControl(method="none"))
w2 <- 100*max(tr.ORCDRC.cb$results$Rsquared)
## "XGBoost" package:
ORCDRC.gb <- train(formulaStringP2, data=mP2, method = "xgbTree", trControl=ctrl)
w3 <- 100*max(ORCDRC.gb$results$Rsquared)
c(w1, w2, w3)
```
At the end of the statistical modelling process, we can merge the predictions by using the CV R-square estimates:
```{r}
edgeroi.grids$DEPTH <- 2.5
edgeroi.grids$Random_forest <- predict(ORCDRC.rf, edgeroi.grids@data,
na.action = na.pass)
edgeroi.grids$Cubist <- predict(ORCDRC.cb, edgeroi.grids@data, na.action = na.pass)
edgeroi.grids$XGBoost <- predict(ORCDRC.gb, edgeroi.grids@data, na.action = na.pass)
edgeroi.grids$ORCDRC_5cm <- (edgeroi.grids$Random_forest*w1 +
edgeroi.grids$Cubist*w2 +
edgeroi.grids$XGBoost*w3)/(w1+w2+w3)
```
```{r maps-soc-edgeroi, echo=FALSE, fig.width=8, out.width="90%", fig.cap="Comparison of three MLA's and the final ensemble prediction (ORCDRC 5cm) of soil organic carbon content for 2.5 cm depth."}
edgeroi.pts = list("sp.points", edgeroi.sp, pch = 21, cex = .7, col="black")
spplot(edgeroi.grids[c("Random_forest","Cubist","XGBoost","ORCDRC_5cm")],
col.regions=rev(viridis(20)), sp.layout = list(edgeroi.pts))
```
The final plot shows that xgboost possibly over-predicts and that cubist possibly under-predicts values of `ORCDRC`, while random forest is somewhere in-between the two. Again, merged predictions are probably the safest option considering that all three MLA's have similar measures of performance.
We can quickly test the overall performance using a script on GitHub prepared for testing the performance of merged predictions:
```{r}
source_https <- function(url, ...) {
require(RCurl)
if(!file.exists(paste0("R/", basename(url)))){
cat(getURL(url, followlocation = TRUE,
cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")),
file = paste0("R/", basename(url)))
}
source(paste0("R/", basename(url)))
}
wdir = "https://raw.githubusercontent.com/ISRICWorldSoil/SoilGrids250m/"
source_https(paste0(wdir, "master/grids/cv/cv_functions.R"))
```
We can hence run 5-fold cross validation:
```{r}
mP2$SOURCEID = paste(mP2$SOURCEID)
test.ORC <- cv_numeric(formulaStringP2, rmatrix=mP2,
nfold=5, idcol="SOURCEID", Log=TRUE)
str(test.ORC)
```
This shows that the R-squared based on cross-validation is about 65% i.e. the average error of predicting soil organic carbon content using the ensemble method is about $\pm 4$ g/kg. The final observed-vs-predicted plot shows that the model is unbiased and that the predictions generally match cross-validation points:
```{r}
plt0 <- xyplot(test.ORC[[1]]$Observed ~ test.ORC[[1]]$Predicted, asp=1,
par.settings = list(plot.symbol = list(col=scales::alpha("black", 0.6), fill=scales::alpha("red", 0.6), pch=21, cex=0.6)),
scales = list(x=list(log=TRUE, equispaced.log=FALSE), y=list(log=TRUE, equispaced.log=FALSE)),
ylab="measured", xlab="predicted (machine learning)")
```
```{r plot-measured-predicted, echo=FALSE, fig.cap="Predicted vs observed plot for soil organic carbon ML-based model (Edgeroi data set).", out.width="40%"}
knitr::include_graphics("figures/Predicted_vs_observed_plot_for_SOC_edgeroi.png")
```
### Ensemble predictions using h2oEnsemble
Ensemble models often outperform single models. There is certainly opportunity for increasing mapping accuracy by combining the power of 3–4 MLA's. The h2o environment for ML offers automation of ensemble model fitting and predictions [@ledell2015scalable].
```{r, echo=FALSE}
## download from: http://h2o-release.s3.amazonaws.com/h2o/latest_stable.html
library(h2o)
#devtools::install_github("h2oai/h2o-3/h2o-r/ensemble/h2oEnsemble-package")
library(h2oEnsemble)
```
We first specify all learners (MLA methods) of interest:
```{r, message=FALSE}
k.f = dismo::kfold(mP2, k=4)
summary(as.factor(k.f))
## split data into training and validation:
edgeroi_v.hex = as.h2o(mP2[k.f==1,], destination_frame = "edgeroi_v.hex")
edgeroi_t.hex = as.h2o(mP2[k.f!=1,], destination_frame = "edgeroi_t.hex")
learner <- c("h2o.randomForest.wrapper", "h2o.gbm.wrapper")
fit <- h2o.ensemble(x = which(names(mP2) %in% all.vars(formulaStringP2)[-1]),
y = which(names(mP2)=="ORCDRC"),
training_frame = edgeroi_t.hex, learner = learner,
cvControl = list(V = 5))
perf <- h2o.ensemble_performance(fit, newdata = edgeroi_v.hex)
perf
```
This shows that, in this specific case, the ensemble model is only slightly better than a single model. Note that we would need to repeat testing the ensemble modelling several times before we can be certain of any actual gain in accuracy.
We can also test ensemble predictions using the cookfarm data set [@Gasch2015SPASTA]. This data set consists of 183 profiles, each consisting of multiple soil horizons (1050 in total). To create a regression matrix we use:
```{r}
data(cookfarm)
cookfarm.hor <- cookfarm$profiles
str(cookfarm.hor)
cookfarm.hor$depth <- cookfarm.hor$UHDICM +
(cookfarm.hor$LHDICM - cookfarm.hor$UHDICM)/2
sel.id <- !duplicated(cookfarm.hor$SOURCEID)
cookfarm.xy <- cookfarm.hor[sel.id,c("SOURCEID","Easting","Northing")]
str(cookfarm.xy)
coordinates(cookfarm.xy) <- ~ Easting + Northing
grid10m <- cookfarm$grids
coordinates(grid10m) <- ~ x + y
gridded(grid10m) = TRUE
ov.cf <- over(cookfarm.xy, grid10m)
rm.cookfarm <- plyr::join(cookfarm.hor, cbind(cookfarm.xy@data, ov.cf))
```
Here, we are interested in predicting soil pH in 3D, hence we will use a model of form:
```{r}
fm.PHI <- PHIHOX~DEM+TWI+NDRE.M+Cook_fall_ECa+Cook_spr_ECa+depth
rc <- complete.cases(rm.cookfarm[,all.vars(fm.PHI)])
mP3 <- rm.cookfarm[rc,all.vars(fm.PHI)]
str(mP3)
```
We can again test fitting an ensemble model, this time using four MLA's:
```{r, message=FALSE}
k.f3 <- dismo::kfold(mP3, k=4)
## split data into training and validation:
cookfarm_v.hex <- as.h2o(mP3[k.f3==1,], destination_frame = "cookfarm_v.hex")
cookfarm_t.hex <- as.h2o(mP3[k.f3!=1,], destination_frame = "cookfarm_t.hex")
learner3 = c("h2o.glm.wrapper", "h2o.randomForest.wrapper",
"h2o.gbm.wrapper", "h2o.deeplearning.wrapper")
fit3 <- h2o.ensemble(x = which(names(mP3) %in% all.vars(fm.PHI)[-1]),
y = which(names(mP3)=="PHIHOX"),
training_frame = cookfarm_t.hex, learner = learner3,
cvControl = list(V = 5))
perf3 <- h2o.ensemble_performance(fit3, newdata = cookfarm_v.hex)
perf3
```
In this case the ensemble performance (MSE) seems to be no better than that of the single best predictor (random forest in this case). This illustrates that ensemble predictions are not always beneficial.
```{r, message=FALSE}
h2o.shutdown()
```
### Ensemble predictions using SuperLearner package
Another interesting package to generate ensemble predictions of soil properties and classes is the SuperLearner package [@polley2010super]. This package has many more options than `h2o.ensemble` considering the number of methods available for consideration:
```{r}
library(SuperLearner)
# List available models:
listWrappers()
```
Here the `SL.` prefix indicates a wrapper around a method imported from another package, e.g. `"SL.ranger"` is the SuperLearner wrapper for the ranger package.
A useful functionality of the SuperLearner package is that it displays how model average weights are estimated and which methods can safely be excluded from predictions. When using SuperLearner, however, it is highly recommended to use the parallelized / multicore version, otherwise the computing time might be quite excessive. For example, to prepare ensemble predictions using the five standard prediction techniques used in this tutorial we would run:
```{r}
## detach snowfall package otherwise possible conflicts
#detach("package:snowfall", unload=TRUE)
library(parallel)
sl.l = c("SL.mean", "SL.xgboost", "SL.ksvm", "SL.glmnet", "SL.ranger")
cl <- parallel::makeCluster(detectCores())
x <- parallel::clusterEvalQ(cl, library(SuperLearner))
sl <- snowSuperLearner(Y = mP3$PHIHOX,
X = mP3[,all.vars(fm.PHI)[-1]],
cluster = cl,
SL.library = sl.l)
sl
```
This shows that `SL.xgboost_All` outperforms the competition by a large margin. Since this is a relatively small data set, RMSE produced by `SL.xgboost_All` is probably unrealistically small. If we only use the top three models (XGboost, ranger and ksvm) in comparison we get:
```{r}
sl.l2 = c("SL.xgboost", "SL.ranger", "SL.ksvm")
sl2 <- snowSuperLearner(Y = mP3$PHIHOX,
X = mP3[,all.vars(fm.PHI)[-1]],
cluster = cl,
SL.library = sl.l2)
sl2
```
Again `SL.xgboost` dominates the ensemble model, which is most likely unrealistic because most of the training data is spatially clustered and hence XGBoost is probably over-fitting. To estimate the actual accuracy of predicting soil pH using this ensemble we can run cross-validation where entire profiles are taken out of the training dataset:
```{r}
str(rm.cookfarm$SOURCEID)
cv_sl <- CV.SuperLearner(Y = mP3$PHIHOX,
X = mP3[,all.vars(fm.PHI)[-1]],
parallel = cl,
SL.library = sl.l2,
V=5, id=rm.cookfarm$SOURCEID[rc],
verbose=TRUE)
summary(cv_sl)
```
where `V=5` specifies the number of folds, and `id=rm.cookfarm$SOURCEID[rc]` ensures that entire profiles are assigned to either training or validation within each fold. This gives a more realistic RMSE of about ±0.35. Note that this time `SL.xgboost_All` is even somewhat worse than the random forest model, and the ensemble model (`Super Learner`) is slightly better than each individual model. This matches our previous results with `h2o.ensemble`.
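The idea of leaving out entire profiles can be sketched in base R as follows. This is only an illustration of grouped fold assignment, not a description of how SuperLearner implements the `id` argument internally; the profile IDs and object names are hypothetical.

```r
## Toy data (hypothetical): 6 profiles with 3 horizons each
set.seed(42)
toy_id <- rep(paste0("P", 1:6), each = 3)
## assign folds per profile, not per row:
toy_profiles <- unique(toy_id)
toy_fold_by_profile <- sample(rep(1:3, length.out = length(toy_profiles)))
names(toy_fold_by_profile) <- toy_profiles
## look up the fold of each horizon via its profile ID:
toy_fold <- toy_fold_by_profile[toy_id]
## every profile falls into exactly one fold:
table(toy_id, toy_fold)
```

If folds were instead assigned per row, horizons from the same profile would end up in both training and validation, and the accuracy estimate would likely be too optimistic.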
To produce predictions of soil pH at several depths (10, 30, 50, 70 and 90 cm) we can finally use:
```{r}
sl2 <- snowSuperLearner(Y = mP3$PHIHOX,
X = mP3[,all.vars(fm.PHI)[-1]],
cluster = cl,
SL.library = sl.l2,
id=rm.cookfarm$SOURCEID[rc],
cvControl=list(V=5))
sl2
new.data <- grid10m@data
pred.PHI <- list(NULL)
depths = c(10,30,50,70,90)
for(j in 1:length(depths)){
new.data$depth = depths[j]
pred.PHI[[j]] <- predict(sl2, new.data[,sl2$varNames])
}
str(pred.PHI[[1]])
```
This yields two outputs:
* ensemble prediction in the `pred` matrix,
* list of individual predictions in the `library.predict` matrix.
To visualize the predictions (at five depths) we can run:
```{r ph-cookfarm, echo=TRUE, fig.width=8, fig.cap="Predicted soil pH using 3D ensemble model."}
for(j in 1:length(depths)){
grid10m@data[,paste0("PHI.", depths[j],"cm")] <- pred.PHI[[j]]$pred[,1]
}
spplot(grid10m, paste0("PHI.", depths,"cm"),
col.regions=R_pal[["pH_pal"]], as.table=TRUE)
```
The second prediction matrix can be used to determine *model uncertainty*:
```{r ph-cookfarm-var, echo=TRUE, fig.width=7, out.width="75%", fig.cap="Example of variance of prediction models for soil pH."}
library(matrixStats)
grid10m$PHI.10cm.sd <- rowSds(pred.PHI[[1]]$library.predict, na.rm=TRUE)
pts = list("sp.points", cookfarm.xy, pch="+", col="black", cex=1.4)
spplot(grid10m, "PHI.10cm.sd", sp.layout = list(pts), col.regions=rev(bpy.colors()))
```
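What the row-wise standard deviation measures can be seen on a small made-up prediction matrix. The sketch uses base R `apply` with `sd` in place of `matrixStats::rowSds`; the values and names (`toy_lib`, `toy_sd`) are hypothetical.

```r
## Toy predictions (hypothetical): rows = pixels, columns = learners
toy_lib <- cbind(xgboost = c(5.2, 6.1, 7.0),
                 ranger  = c(5.3, 6.0, 5.8),
                 ksvm    = c(5.1, 6.2, 6.4))
## row-wise sample standard deviation across the learners
toy_sd <- apply(toy_lib, 1, sd, na.rm = TRUE)
round(toy_sd, 2)
```

The third pixel, where the learners disagree most, gets the largest value. Note that this quantity only reflects disagreement between models and is not a full prediction error.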
This highlights the especially problematic areas, in this case most likely correlated with extrapolation in feature space. Before we stop computing, we need to close the cluster session by using:
```{r}
stopCluster(cl)
```
## A generic framework for spatial prediction using Random Forest
We have seen, in the above examples, that MLA's can be used efficiently to
map soil properties and classes. Most currently used MLA's, however, ignore the spatial
locations of the observations and hence overlook any spatial autocorrelation in
the data not accounted for by the covariates. Spatial auto-correlation,
especially if it remains visible in the cross-validation residuals, indicates
that the predictions are perhaps biased, and this is sub-optimal.
To account for this, @Hengl2018RFsp describe a framework for using Random Forest
(as implemented in the ranger package) in combination with geographical
distances to sampling locations (which provide measures of relative spatial location)
to fit models and predict values (RFsp).
### General principle of RFsp
RF is, in essence, a non-spatial approach to spatial prediction, as
the sampling locations and general sampling pattern are both ignored during
the estimation of MLA model parameters. This can potentially lead to
sub-optimal predictions and possibly systematic over- or
under-prediction, especially where the spatial autocorrelation in the
target variable is high and where point patterns show clear sampling
bias. To overcome this problem @Hengl2018RFsp propose the following generic *“RFsp”*
system:
\begin{equation}
Y({{\bf s}}) = f \left( {{\bf X}_G}, {{\bf X}_R}, {{\bf X}_P} \right)
(\#eq:rf-BUGP)
\end{equation}
where ${{\bf X}_G}$ are covariates accounting for geographical proximity
and spatial relations between observations (to mimic spatial correlation
used in kriging):
\begin{equation}
{{\bf X}_G} = \left( d_{p1}, d_{p2}, \ldots , d_{pN} \right)
\end{equation}
where $d_{pi}$ is the buffer distance (or any other complex proximity
upslope/downslope distance, as explained in the next section) to the
observed location $pi$ from ${\bf s}$ and $N$ is the total number of
training points. ${{\bf X}_R}$ are surface reflectance covariates, i.e.
usually spectral bands of remote sensing images, and ${{\bf X}_P}$ are
process-based covariates. For example, the Landsat infrared band is a
surface reflectance covariate, while the topographic wetness index and
soil weathering index are process-based covariates. Geographic
covariates are often smooth and reflect geometric composition of points,
reflectance-based covariates can exhibit a significant amount of noise and
usually provide information only about the surface of objects. Process-based
covariates require specialized knowledge and rethinking of how to
best represent processes. Assuming that the RFsp is fitted only using the
${{\bf X}_G}$, the predictions would resemble ordinary kriging (OK). If all covariates
in Eq. \@ref(eq:rf-BUGP) are used, RFsp would resemble regression-kriging (RK).
A similar framework, in which distances to the center and edges of the study area
are used for prediction, has also been proposed by @Behrens2018EJSS.
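
As a minimal sketch of what ${{\bf X}_G}$ looks like in practice, the chunk below builds Euclidean buffer distances from every cell of a small grid to each observation point, using base R only. The objects `obs.xy` and `grid.xy` are hypothetical toy data, not part of the examples in this chapter:

```{r}
# Hypothetical toy data: 3 observation points and a 4 x 4 prediction grid
obs.xy <- cbind(x = c(1, 3, 4), y = c(1, 4, 2))
grid.xy <- as.matrix(expand.grid(x = 1:4, y = 1:4))
# One column per observation point: d_p1, d_p2, ..., d_pN
X_G <- sapply(seq_len(nrow(obs.xy)), function(i) {
  sqrt((grid.xy[, "x"] - obs.xy[i, "x"])^2 +
       (grid.xy[, "y"] - obs.xy[i, "y"])^2)
})
colnames(X_G) <- paste0("d_p", seq_len(nrow(obs.xy)))
# Rows are grid cells, columns are distances to each training point
dim(X_G)
head(X_G, 3)
```

Each grid cell is thus described by $N$ distances, so the number of geographical covariates grows with the number of training points.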
### Geographical covariates {#geographical-covariates}
One of the key principles of geography is that *“everything is related
to everything else, but near things are more related than distant
things”* [@miller2004tobler]. This principle forms the basis of
geostatistics, which converts this rule into a mathematical model, i.e.,
through spatial autocorrelation functions or variograms. The key to
making RF applicable to spatial statistics problems, therefore, lies also in
preparing geographical (spatial) measures of proximity and connectivity between
observations, so that spatial autocorrelation can be accounted for. There
are multiple options for variables that quantify proximity and geographical
connection (Fig. \@ref(fig:distances-examples)):
1. Geographical coordinates $s_1$ and $s_2$, i.e., easting
and northing.
2. Euclidean distances to reference points in the study area. For
example, distance to the center and edges of the study area,
etc [@Behrens2018EJSS].
3. Euclidean distances to sampling locations, i.e., distances from
observation locations. Here one buffer distance map can be generated
per observation point or group of points. These are essentially the same distance
measures as used in geostatistics.
4. Downslope distances, i.e., distances within a watershed: for each
sampling point one can derive upslope/downslope distances to the
ridges and hydrological network and/or downslope or upslope areas
[@GRUBER2009171]. This requires, in addition to using a Digital Elevation
Model, implementing a hydrological analysis of the terrain.
5. Resistance distances or weighted buffer distances, i.e., distances
of the cumulative effort derived using terrain ruggedness and/or
natural obstacles.
The [gdistance](https://cran.r-project.org/package=gdistance) package, for example, provides a framework to derive complex
distances based on terrain complexity [@vanEtten2017r]. Here additional
inputs required to compute complex distances are the Digital Elevation Model (DEM)
and DEM-derivatives, such as slope (Fig. \@ref(fig:distances-examples)b).
SAGA GIS [@gmd-8-1991-2015] offers a wide variety of DEM derivatives
that can be derived per location of interest.
```{r distances-examples, echo=FALSE, fig.cap="Examples of distance maps to some location in space (yellow dot) based on different derivation algorithms: (a) simple Euclidean distances, (b) complex speed-based distances based on the gdistance package and Digital Elevation Model (DEM), and (c) upslope area derived based on the DEM in SAGA GIS. Image source: Hengl et al. (2018) doi: 10.7717/peerj.5518.", out.width="100%"}
knitr::include_graphics("figures/Fig_distances_examples.png")
```
Here, we only illustrate predictive performance using Euclidean buffer distances
(to all sampling points), but the code could be adapted to
include other families of geographical covariates (as shown in
Fig. \@ref(fig:distances-examples)). Note also that RF tolerates a high
number of covariates and multicollinearity [@Biau2016], hence multiple
types of geographical covariates (Euclidean buffer distances, upslope
and downslope areas) could be considered concurrently.
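
The first two options in the list above need nothing beyond the grid coordinates themselves. The following sketch, which assumes a hypothetical rectangular study area of 100 m by 60 m (`geo.grid` and `ref.pts` are illustrative names), adds the coordinates and the distances to the center and corners as covariates:

```{r}
# Hypothetical prediction grid; s1 and s2 are easting and northing (option 1)
geo.grid <- expand.grid(s1 = seq(0, 100, by = 20), s2 = seq(0, 60, by = 20))
# Option 2: distances to reference points (center and the four corners)
ref.pts <- rbind(center = c(50, 30), SW = c(0, 0), SE = c(100, 0),
                 NW = c(0, 60), NE = c(100, 60))
for (ref.k in rownames(ref.pts)) {
  geo.grid[[paste0("dist_", ref.k)]] <-
    sqrt((geo.grid$s1 - ref.pts[ref.k, 1])^2 +
         (geo.grid$s2 - ref.pts[ref.k, 2])^2)
}
str(geo.grid)
```

Unlike buffer distances to sampling points, the number of these covariates does not depend on the number of observations.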
### Spatial prediction 2D continuous variable using RFsp
To run these examples, it is recommended to install [ranger](https://github.com/imbs-hl/ranger) [@wright2017ranger] directly from github:
```{r, eval=FALSE, echo=TRUE}
if(!require(ranger)){ devtools::install_github("imbs-hl/ranger") }
```
Quantile regression random forest and derivation of standard errors using Jackknifing are available in ranger versions >0.9.4. Other packages that we use here include:
```{r, echo=TRUE}
library(GSIF)
library(rgdal)
library(raster)
library(geoR)
library(ranger)
```
```{r, echo=FALSE, warning=FALSE}
library(gstat)
library(plyr)
library(plotKML)
library(scales)
library(parallel)
library(lattice)
library(gridExtra)
```
If no other information is available, we can use buffer distances to all points as covariates to predict values of some continuous or categorical variable in the RFsp framework. These can be derived with the help of the [raster](https://cran.r-project.org/package=raster) package [@raster]. Consider for example the meuse data set from the sp package:
```{r meuse}
demo(meuse, echo=FALSE)
```
We can derive buffer distance by using:
```{r bufferdist}
grid.dist0 <- GSIF::buffer.dist(meuse["zinc"], meuse.grid[1], as.factor(1:nrow(meuse)))
```
which requires a few seconds, as it generates 155 individual gridded maps (one per sampling point). The value of the target variable `zinc` can now be modeled as a function of these computed buffer distances:
```{r}
dn0 <- paste(names(grid.dist0), collapse="+")
fm0 <- as.formula(paste("zinc ~ ", dn0))
fm0
```
Subsequent analysis is similar to any regression analysis using the [ranger package](https://github.com/imbs-hl/ranger). First we overlay points and grids to create a regression matrix:
```{r}
ov.zinc <- over(meuse["zinc"], grid.dist0)
rm.zinc <- cbind(meuse@data["zinc"], ov.zinc)
```
To also estimate the prediction error variance, i.e. prediction intervals, we set `quantreg=TRUE`, which initiates the Quantile Regression RF approach [@meinshausen2006quantile]:
```{r}
m.zinc <- ranger(fm0, rm.zinc, quantreg=TRUE, num.trees=150, seed=1)
m.zinc
```
This shows that using only buffer distances explains almost 50% of the variation in the target variable. To generate predictions for the `zinc` variable using the RFsp model, we use:
```{r}
q <- c((1-.682)/2, 0.5, 1-(1-.682)/2)
zinc.rfd <- predict(m.zinc, grid.dist0@data,
type="quantiles", quantiles=q)$predictions
str(zinc.rfd)
```
this will estimate the lower and upper limits of the 68% probability interval and the median value. Note that the “median” can often differ from the “mean”, so if you prefer to derive the mean, then `quantreg=FALSE` needs to be used, as the Quantile Regression Forests approach returns quantiles (here the median) and not the mean.
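
The probabilities `(1-.682)/2` and `1-(1-.682)/2` used above bound the central interval that, for a normal distribution, spans one standard deviation on either side of the mean. A short base R check of this, together with an illustration of how far the median and mean can drift apart for a skewed variable (`toy.z` is a made-up log-normal sample, not the zinc data), could look like this:

```{r}
# Probability mass within +/- 1 standard deviation of a normal distribution
p1sd <- pnorm(1) - pnorm(-1)
round(p1sd, 3)
# Lower quantile, median and upper quantile for that central interval
q.1sd <- c((1 - p1sd)/2, 0.5, 1 - (1 - p1sd)/2)
round(q.1sd, 3)
# Deterministic, right-skewed (log-normal) toy sample
toy.z <- exp(qnorm(ppoints(99), mean = 5, sd = 1))
# For skewed data the mean is pulled towards the long tail
c(median = median(toy.z), mean = mean(toy.z))
```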
To be able to plot or export the predicted values as maps, we add them to the spatial pixels object:
```{r}
meuse.grid$zinc_rfd = zinc.rfd[,2]
meuse.grid$zinc_rfd_range = (zinc.rfd[,3]-zinc.rfd[,1])/2
```
We can compare the RFsp approach with the model-based geostatistics approach (see e.g. [geoR package](http://leg.ufpr.br/geoR/geoRdoc/geoRintro.html)), where we first decide about the transformation, then fit the variogram of the target variable [@Diggle2007Springer; @Brown2014JSS]:
```{r}
zinc.geo <- as.geodata(meuse["zinc"])
ini.v <- c(var(log1p(zinc.geo$data)),500)
zinc.vgm <- likfit(zinc.geo, lambda=0, ini=ini.v, cov.model="exponential")
zinc.vgm
```
where the `likfit` function fits a likelihood-based variogram model. Note that here we need to manually specify the log-transformation via the `lambda` parameter (`lambda=0`). To generate predictions and kriging variance using geoR we run:
```{r}
locs <- meuse.grid@coords
zinc.ok <- krige.conv(zinc.geo, locations=locs, krige=krige.control(obj.model=zinc.vgm))
meuse.grid$zinc_ok <- zinc.ok$predict
meuse.grid$zinc_ok_range <- sqrt(zinc.ok$krige.var)
```
in this case geoR automatically back-transforms values to the original scale, which is a recommended feature. Comparison of predictions and prediction error maps produced using geoR (ordinary kriging) and RFsp (with buffer distances and using just coordinates) is given in Fig. \@ref(fig:comparison-OK-RF-zinc-meuse).
```{r comparison-OK-RF-zinc-meuse, echo=FALSE, dpi = 300, fig.cap="Comparison of predictions based on ordinary kriging as implemented in the geoR package (left) and random forest (right) for Zinc concentrations, Meuse data set: (first row) predicted concentrations in log-scale and (second row) standard deviation of the prediction errors for OK and RF methods. Image source: Hengl et al. (2018) doi: 10.7717/peerj.5518.", out.width="100%"}
knitr::include_graphics("figures/Fig_comparison_OK_RF_zinc_meuse.png")
```
From the plot above, it can be concluded that RFsp yields very similar results to those produced using ordinary kriging via geoR. There are differences between geoR and RFsp, however. These are:
- RF requires no transformation, i.e. it works equally well with skewed and normally distributed variables; in general, RF requires fewer statistical assumptions than model-based geostatistics,
- RF prediction error variance on average shows somewhat stronger contrast than the OK variance map, i.e. it emphasizes isolated, less probable, local points much more than geoR,
- RFsp is significantly more computationally demanding, as distances need to be derived from each sampling point to all new prediction locations,
- geoR uses global model parameters and, as such, its prediction patterns are relatively uniform; RFsp, on the other hand (being tree-based), will produce patterns that match the data as closely as possible.
### Spatial prediction 2D variable with covariates using RFsp
Next, we can also consider adding additional covariates that describe soil forming processes or characteristics of the land to the list of buffer distances. For example, we can add covariates for surface water occurrence [@pekel2016high] and elevation ([AHN](http://ahn.nl)):
```{r}
f1 = "extdata/Meuse_GlobalSurfaceWater_occurrence.tif"
f2 = "extdata/ahn.asc"
meuse.grid$SW_occurrence <- readGDAL(f1)$band1[meuse.grid@grid.index]
meuse.grid$AHN = readGDAL(f2)$band1[meuse.grid@grid.index]
```
To convert all covariates to numeric values and fill in all missing pixels, we use the Principal Component transformation:
```{r}
grids.spc <- GSIF::spc(meuse.grid, as.formula("~ SW_occurrence + AHN + ffreq + dist"))
```
so that we can fit a ranger model using both geographical covariates (buffer distances) and environmental covariates imported previously:
```{r}
nms <- paste(names(grids.spc@predicted), collapse = "+")
fm1 <- as.formula(paste("zinc ~ ", dn0, " + ", nms))
fm1
ov.zinc1 <- over(meuse["zinc"], grids.spc@predicted)
rm.zinc1 <- do.call(cbind, list(meuse@data["zinc"], ov.zinc, ov.zinc1))
```
this finally gives:
```{r}
m1.zinc <- ranger(fm1, rm.zinc1, importance="impurity",
quantreg=TRUE, num.trees=150, seed=1)
m1.zinc
```
which demonstrates that there is a slight improvement relative to using only buffer distances as covariates.
We can further evaluate this model to see which specific points and covariates are
most important for spatial predictions:
```{r rf-variableImportance, fig.width=5, out.width="65%", fig.cap="Variable importance plot for mapping zinc content based on the Meuse data set."}
xl <- as.list(ranger::importance(m1.zinc))
par(mfrow=c(1,1),oma=c(0.7,2,0,1), mar=c(4,3.5,1,0))
plot(vv <- t(data.frame(xl[order(unlist(xl), decreasing=TRUE)[10:1]])), 1:10,
type = "n", ylab = "", yaxt = "n", xlab = "Variable Importance (Node Impurity)",
cex.axis = .7, cex.lab = .7)
abline(h = 1:10, lty = "dotted", col = "grey60")
points(vv, 1:10)
axis(2, 1:10, labels = dimnames(vv)[[1]], las = 2, cex.axis = .7)
```
which shows, for example, that locations 54, 59 and 53 are the most influential points,
and these are almost equally as important as the environmental covariates (PC2–PC4).
This type of modeling can be best compared to using Universal Kriging or Regression-Kriging in the geoR package:
```{r}
zinc.geo$covariate = ov.zinc1
sic.t = ~ PC1 + PC2 + PC3 + PC4 + PC5
zinc1.vgm <- likfit(zinc.geo, trend = sic.t, lambda=0,
ini=ini.v, cov.model="exponential")
zinc1.vgm
```
this time geostatistical modeling produces an estimate of beta (regression coefficients) and variogram parameters (all estimated at once). Predictions using this Universal Kriging model can be generated by:
```{r}
KC = krige.control(trend.d = sic.t,
trend.l = ~ grids.spc@predicted$PC1 +
grids.spc@predicted$PC2 + grids.spc@predicted$PC3 +
grids.spc@predicted$PC4 + grids.spc@predicted$PC5,
obj.model = zinc1.vgm)
zinc.uk <- krige.conv(zinc.geo, locations=locs, krige=KC)
meuse.grid$zinc_UK = zinc.uk$predict
```
```{r RF-covs-bufferdist-zinc-meuse, echo=FALSE, dpi = 300, fig.cap="Comparison of predictions (median values) produced using random forest and covariates only (left), and random forest with combined covariates and buffer distances (right).", out.width="80%"}
knitr::include_graphics("figures/Fig_RF_covs_bufferdist_zinc_meuse.png")
```
again, overall predictions (the spatial patterns) look fairly similar (Fig. \@ref(fig:RF-covs-bufferdist-zinc-meuse)).
The difference between using geoR and RFsp is that, in the case of RFsp, there are fewer choices
and fewer assumptions required. Also, RFsp permits the relationship between covariates
and geographical distances to be fitted concurrently. This makes RFsp, in general, less
cumbersome than model-based geostatistics, but then also more of a “black-box” system
to a geostatistician.
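
For readers unfamiliar with the principal component step used earlier in this section, the sketch below shows the general idea in base R: factors are expanded to indicator columns, gaps are filled, and `prcomp` returns complete, numeric and uncorrelated components. This is only an illustration under simplifying assumptions (mean-filling of gaps, a made-up table `toy.covs`) and is not meant to reproduce the internals of `GSIF::spc`:

```{r}
# Hypothetical covariate table: two numeric layers and one factor layer
toy.covs <- data.frame(elev = c(5.2, 6.1, NA, 7.4, 8.0, 6.6),
                       water = c(80, 60, 40, NA, 5, 20),
                       flood = factor(c("1", "1", "2", "2", "3", "3")))
# Fill missing numeric values with the column mean (an assumption; other
# gap-filling rules are possible)
for (cov.v in c("elev", "water")) {
  toy.covs[[cov.v]][is.na(toy.covs[[cov.v]])] <-
    mean(toy.covs[[cov.v]], na.rm = TRUE)
}
# Expand the factor into 0/1 indicator columns, then run a scaled PCA
toy.num <- model.matrix(~ elev + water + flood - 1, toy.covs)
toy.pca <- prcomp(toy.num, center = TRUE, scale. = TRUE)
# The component scores are numeric, complete and mutually uncorrelated
round(cor(toy.pca$x[, 1:3]), 2)
```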
### Spatial prediction of binomial variables
RFsp can also be used to predict (map the distribution of) binomial variables, i.e. variables having only two states (TRUE or FALSE). In model-based geostatistics, the equivalent methods are indicator kriging and similar. Consider for example soil type 1 from the meuse data set:
```{r}
meuse@data = cbind(meuse@data, data.frame(model.matrix(~soil-1, meuse@data)))
summary(as.factor(meuse$soil1))
```
in this case class `soil1` is the dominant soil type in the area. To produce a map of `soil1` using RFsp we now have two options:
- _Option 1_: treat the binomial variable as a numeric variable with 0 / 1 values (thus a regression problem),
- _Option 2_: treat the binomial variable as a factor variable, i.e. presence or absence of a single class (thus a classification problem).
In the case of Option 1, we model `soil1` as:
```{r}
fm.s1 <- as.formula(paste("soil1 ~ ", paste(names(grid.dist0), collapse="+"),
" + SW_occurrence + dist"))
rm.s1 <- do.call(cbind, list(meuse@data["soil1"],
over(meuse["soil1"], meuse.grid),
over(meuse["soil1"], grid.dist0)))
m1.s1 <- ranger(fm.s1, rm.s1, mtry=22, num.trees=150, seed=1, quantreg=TRUE)
m1.s1
```
which results in a model that explains about 75% of variability in the `soil1` values.
We set `quantreg=TRUE` so that we can also derive lower and upper prediction
intervals following the quantile regression random forest [@meinshausen2006quantile].
In the case of Option 2, we treat the binomial variable as a factor variable:
```{r}
fm.s1c <- as.formula(paste("soil1c ~ ",
paste(names(grid.dist0), collapse="+"),
" + SW_occurrence + dist"))
rm.s1$soil1c = as.factor(rm.s1$soil1)
m2.s1 <- ranger(fm.s1c, rm.s1, mtry=22, num.trees=150, seed=1,
probability=TRUE, keep.inbag=TRUE)
m2.s1
```
which shows that the Out of Bag prediction error (classification error) is (only)
0.06 (in the probability scale). Note that it is not easy to compare the
regression and classification OOB errors, as these are conceptually different.
Also note that we turn on `keep.inbag = TRUE` so that ranger can estimate the
classification errors using the Jackknife-after-Bootstrap method [@wager2014confidence].
`quantreg=TRUE` obviously would not work here since it is a classification and not a regression problem.
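
To see why an error “in the probability scale” is hard to compare with a share of misclassified points, consider the sketch below. It assumes that the error of a probability forest is summarized as a Brier-type score (mean squared difference between predicted probabilities and 0 / 1 outcomes); the vectors `toy.obs` and `toy.prob` are invented for illustration:

```{r}
# Hypothetical observed 0/1 outcomes and predicted probabilities of class 1
toy.obs  <- c(1, 1, 0, 1, 0, 0, 1, 1)
toy.prob <- c(0.9, 0.8, 0.2, 0.6, 0.4, 0.1, 0.7, 0.4)
# Brier score: mean squared difference between probabilities and outcomes
brier <- mean((toy.prob - toy.obs)^2)
# Misclassification rate after converting probabilities to hard classes
misclass <- mean((toy.prob > 0.5) != (toy.obs == 1))
c(brier = brier, misclassification = misclass)
```

The two numbers measure different things: the first penalizes uncertain probabilities even when the hard class is correct, while the second only counts wrong classes.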
To produce predictions using the two options we use:
```{r}
pred.regr <- predict(m1.s1, cbind(meuse.grid@data, grid.dist0@data), type="response")
pred.clas <- predict(m2.s1, cbind(meuse.grid@data, grid.dist0@data), type="se")
```
in principle, the two options for predicting the distribution of the binomial variable are mathematically equivalent and should lead to the same predictions (also shown in the map below). In practice, there can be some small numerical differences due to rounding or random-start effects.
```{r comparison-uncertainty-Binomial, echo=FALSE, dpi=300, fig.cap="Comparison of predictions for soil class “1” produced using (left) regression and prediction of the median value, (middle) regression and prediction of response value, and (right) classification with probabilities.", out.width="90%"}
knitr::include_graphics("figures/Fig_comparison_uncertainty_Binomial_variables_meuse.png")
```
This shows that predicting binomial variables using RFsp can be implemented either as a classification or as a regression problem. Both can be fitted using the ranger package, and both should lead to nearly the same results.
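
One intuitive reason for this agreement is that the mean of a 0 / 1 indicator within a group of observations equals the proportion of that class in the group. The toy example below (hypothetical labels in `toy.soil`, standing in for the observations that end up in one terminal node of a tree) illustrates this in base R; it is a sketch of the principle, not of how ranger is implemented:

```{r}
# Hypothetical class labels for observations falling in one terminal node
toy.soil <- factor(c("1", "1", "2", "1", "3", "1", "2", "1"))
# Option 1: 0/1 indicator for class "1", averaged as in regression
toy.ind <- as.numeric(toy.soil == "1")
p.regr <- mean(toy.ind)
# Option 2: proportion of class "1" among the node's observations
p.clas <- unname(prop.table(table(toy.soil))["1"])
c(regression = p.regr, classification = p.clas)
```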
### Spatial prediction of soil types
Spatial prediction of a categorical variable using ranger is a form of classification problem. The target variable contains multiple states (3 in this case), but the model still follows the same formulation:
```{r}
fm.s = as.formula(paste("soil ~ ", paste(names(grid.dist0), collapse="+"),
" + SW_occurrence + dist"))
fm.s
```
to produce probability maps per soil class, we need to turn on the `probability=TRUE` option:
```{r}
rm.s <- do.call(cbind, list(meuse@data["soil"],
over(meuse["soil"], meuse.grid),
over(meuse["soil"], grid.dist0)))
m.s <- ranger(fm.s, rm.s, mtry=22, num.trees=150, seed=1,
probability=TRUE, keep.inbag=TRUE)
m.s
```
this shows that the model is successful with an OOB prediction error of about 0.09. This number is rather abstract so we can also check the actual classification accuracy using hard classes:
```{r}
m.s0 <- ranger(fm.s, rm.s, mtry=22, num.trees=150, seed=1)
m.s0
```
which shows that the classification or mapping accuracy for hard classes is about 90%. We can produce predictions of probabilities per class by running:
```{r}
pred.soil_rfc = predict(m.s, cbind(meuse.grid@data, grid.dist0@data), type="se")