The goal of this repo is to demonstrate spatio-temporal prediction models to estimate levels of air pollution.
The input dataset is an Excel file provided as part of the OpenGeoHub Summer School 2019.
We’ll use these packages
suppressPackageStartupMessages({
library(dplyr)
library(sf)
})
And read-in the input data as follows
train = readxl::read_excel("SpatialPrediction.xlsx", sheet = 1)
covar = readxl::read_excel("SpatialPrediction.xlsx", sheet = 2)
locat = readxl::read_excel("SpatialPrediction.xlsx", sheet = 3)
# times = readxl::read_excel("SpatialPrediction.xlsx", sheet = 4) # what is this?
targt = readxl::read_excel("SpatialPrediction.xlsx", sheet = 5)
The objective is to fill the NA values in the targt
data:
targt[1:3]
#> # A tibble: 5,004 x 3
#> id time PM10
#> <chr> <dttm> <lgl>
#> 1 5a5da3c80aa2a900127f895a 2019-04-06 18:00:00 NA
#> 2 590752d15ba9e500112b21db 2019-04-09 06:00:00 NA
#> 3 5a58cb80999d43001b7c4ecb 2019-04-03 22:00:00 NA
#> 4 5a5da3c80aa2a900127f895a 2019-04-03 00:00:00 NA
#> 5 5a636a22411a790019bdcafd 2019-04-07 10:00:00 NA
#> 6 5c49b10c35acab0019e6ce19 2019-04-03 16:00:00 NA
#> 7 5a1b3c7d19991f0011b83054 2019-04-14 04:00:00 NA
#> 8 5c57147435809500190ef1fd 2019-04-06 12:00:00 NA
#> 9 5978e8fbfe1c74001199fa2a 2019-04-06 07:00:00 NA
#> 10 5909d039dd09cc001199a6bf 2019-04-09 15:00:00 NA
#> # … with 4,994 more rows
Let’s do some data cleaning and plot the data:
d = inner_join(train, covar)
#> Joining, by = c("id", "time")
d = inner_join(d, locat)
#> Joining, by = "id"
dsf = sf::st_as_sf(d, coords = c("X", "Y"), crs = 4326)
summary(dsf)
#> id time PM10
#> Length:23719 Min. :2019-04-01 00:00:00 Min. : 0.00
#> Class :character 1st Qu.:2019-04-03 21:00:00 1st Qu.: 8.75
#> Mode :character Median :2019-04-06 19:00:00 Median :14.97
#> Mean :2019-04-07 12:57:52 Mean :19.78
#> 3rd Qu.:2019-04-11 07:00:00 3rd Qu.:25.25
#> Max. :2019-04-14 23:00:00 Max. :99.87
#> humidity temperature geometry
#> Min. : 0.00 Min. :-140.760 POINT :23719
#> 1st Qu.: 60.70 1st Qu.: 6.480 epsg:4326 : 0
#> Median : 87.65 Median : 9.100 +proj=long...: 0
#> Mean : 77.98 Mean : 8.051
#> 3rd Qu.: 99.90 3rd Qu.: 12.688
#> Max. :100.00 Max. : 50.000
mapview::mapview(dsf %>% sample_n(1000))
A simple model:
m = lm(PM10 ~ humidity + temperature, data = d)
p = predict(object = m, newdata = d)
plot(d$PM10, p)
cor(d$PM10, p)^2
#> [1] 0.02936257
A simple linear model can explain ~3% of the variability in PM10 levels, not great!