---
output: rmarkdown::github_document
---
# Introduction

The RIVM datacube is a data repository aimed at spatial grid-based data,
although it can manage any kind of data. It is used in our data-science
projects involving spatial analyses, modelling and prediction. It is
suitable for combining field measurements, e.g. from monitoring networks,
with other spatial data (e.g. soil types, groundwater levels, land use,
altitude, crops, emission data, etc.). All spatial data in the
datacube is georeferenced to the same extent, so maps can be stacked
easily. These stacks are the actual datacubes we use in our machine
learning models.
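Because all layers share the same grid, stacking them is straightforward
with the raster package. A minimal sketch (the file pattern is
illustrative):

```{r,eval=FALSE}
library(raster)

# stack all co-registered grids in the data directory into one cube
# (illustrative file pattern; any set of rasters on the same grid works)
cube <- stack(list.files("data", pattern = "\\.grd$", full.names = TRUE))
```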
This package contains functions to work with GIS data and PostGIS. It
manages data by storing it in the repository, generating meta-data,
creating audit trails or data lineage paths, and storing versioning
info. It is aimed at small teams working together on the same data. The
datacube package makes it possible to work on projects which are
reusable, reproducible and auditable.
# Important

*Please note:* this is work in progress. This package needs git and
PostgreSQL/PostGIS to work properly. It assumes a Linux OS (it might
work under Windows, but we have never tried).

Installation in R, using the devtools package:

`devtools::install_github("jspijker/datacube")`

Please make sure you already have the fasterize and here packages
installed.
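For example, a minimal install sequence (assuming none of the packages
are present yet):

```{r,eval=FALSE}
# install CRAN prerequisites, then datacube itself from GitHub
install.packages(c("devtools", "fasterize", "here"))
devtools::install_github("jspijker/datacube")
```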
## Future development

I presented this package at the useR!2019 conference. During the
discussions I noticed that people really like the idea of a data
repository and the possibility to create an audit trail for their data.
However, the name of this package, datacube, is confusing: we use it
for datacubes, but it is not limited to them and can be used for any
data or data workflow.

To make this package more useful for a general R audience, we will split
it into two parts. We keep everything about datacubes and spatial
rasters in our datacube package. It's the stuff we love and use, and it
'works for us'. The parts about the data repository, audit trails, etc.
will go into a separate package, so they can be of use to others.
# Dependencies

The datacube package uses three other related packages:

1. pgobjects is a package to store R objects (variables, functions, or complete environments) in a PostgreSQL database.
2. The pgblobs package is an extension of pgobjects. If objects are too big to store in the database, like raster grids or spatial data, only the meta data is stored in the database and the file, or blob, is stored on a shared disk location.
3. The localoptions package is used to read an options file with the datacube configuration.

These packages can be found on GitHub:

* [pgobjects](https://github.com/jspijker/pgobjects)
* [pgblobs](https://github.com/jspijker/pgblobs)
* [localoptions](https://github.com/jspijker/localoptions)
# Configuration

The system is configured with the localoptions package, which reads an
option file containing the database configuration and file locations.
The default location of this option file is ~/.R.options, and it looks
like this:
```
# database host, database name, user, and password
datacube.host localhost
datacube.dbname datacube
datacube.user username
datacube.password verysecret
# database schema for pgobjects tables (default is public)
datacube.schema datacube
# location to store file blobs (shared network location)
datacube.blobs /datacube/blobs
```
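The option file is a plain whitespace-separated key-value list. For
illustration, a minimal base-R parser could look like this (this is not
the localoptions API itself, just a sketch of the format):

```{r,eval=FALSE}
# read the two-column key-value file, skipping '#' comment lines
# (illustrative only; in practice localoptions reads this file)
opts <- read.table("~/.R.options", comment.char = "#",
                   col.names = c("key", "value"),
                   stringsAsFactors = FALSE)
setNames(opts$value, opts$key)
```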
After a correct configuration, you can initialize the database and
create the necessary tables:
```{r,eval=FALSE}
library(pgobjects)
createPgobjTables()
```
# Initialization

After the database is set up, one can load the necessary packages.
We prefer to use pacman for that.
```{r}
# load packages
if (!require("pacman")) install.packages("pacman")
pacman::p_load(parallel, raster, ggplot2, sp, maptools, RCurl,
               RPostgreSQL, rgdal, gdalUtils, sf, fasterize, foreign,
               tidyverse, here)

# packages on GitHub to use
pacman::p_load_gh("jspijker/localoptions", "jspijker/pgobjects",
                  "jspijker/pgblobs")

# load datacube
devtools::load_all()
```
After setting up the configuration, you can create your first project.
Projects using the datacube package are organised within git
repositories. Each repository can contain multiple projects, and for
each project the user creates a separate directory at the root of the
repository.

Then the datacube is initialized, using the name of the project
directory (workdir) and the name of the script. Part of the
initialization is the setup of the database connection. A data directory
is also created, meta data about the git repository is collected, and
the working directory is changed to the project directory. Don't you
dare use `setwd()` in your scripts.
```{r}
# initialize datacube
datacubeInit(script="README.Rmd",workdir=".")
```
# Import, tidy and transform data

To demonstrate the datacube we import data from a source location, and
then tidy and transform it. For this demonstration we use the
'groenbeleving' indicator from the Dutch Health Atlas. This indicator
gives the percentage of people within a municipality who are satisfied
with the amount of green area in their living environment. The data is
published as a WFS service in the RIVM geoservice.
We'll import the spatial vector data and then transform it to a
georeferenced 25x25m raster. The same georeference is used for all
raster layers, so they can be stacked into a multidimensional raster
stack. This stack is subsequently used in our machine learning models.
By default, the Dutch 'Rijksdriehoeksmeting' projection (EPSG:28992) is
used; this can be changed by the user.
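To make the reference raster concrete: a 25x25m grid in EPSG:28992 could
be defined as below. The extent is illustrative (roughly covering the
Netherlands); `createPixid()`, used further on, produces the actual
reference raster.

```{r,eval=FALSE}
library(raster)

# empty 25x25 m reference grid in the Dutch RD projection
# (illustrative extent, not necessarily the one createPixid() uses)
ref <- raster(xmn = 0, xmx = 280000, ymn = 300000, ymx = 625000,
              resolution = 25, crs = "+init=epsg:28992")
```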
In the next section we set our variables and download the data:
```{r}
# variables for data source, map layer name, and attribute
wfsuri <- "http://geodata.rivm.nl/geoserver/wfs?SERVICE=WFS&VERSION=1.0.0&REQUEST=GetFeature&TYPENAME=rivm:zorgatlas_gem_groen_2006&SRSNAME=EPSG:28992"
layername <- "zorgatlas_gem_groen_2006"
layerattribute <- "p_tevree" # name of attribute of interest

# datacube object names
objname <- "groenbeleving" # name of raster object
objname.attr <- paste(objname, "_attr", sep = "") # name of attribute table

# check layers
layers <- ogrListLayers(wfsuri)

# filename to store data
fname.gpkg <- datafile("groenbeleving.gpkg") # see ?datacube::datafile

# get map layer, only if the data does not exist yet (so as not to
# waste network bandwidth). Store data as GeoPackage
if (!file.exists(fname.gpkg)) {
    ogr2ogr(wfsuri, fname.gpkg,
            layer = layername,
            f = "GPKG")
}
```
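As a quick sanity check, the layers in the downloaded GeoPackage can be
listed with the sf package:

```{r,eval=FALSE}
# inspect the layers in the GeoPackage we just downloaded
st_layers(fname.gpkg)
```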
Now that we have our vector data, we can create a raster. We use the
datacube `createPixid` function to create a georeferenced reference
raster. Then we use the datacube `dcrasterize` function to rasterize our
vector data:
```{r}
# create pixid reference raster
pixid <- createPixid()
writeRaster(pixid, datafile("pixid.grd"), overwrite = TRUE)

# read groenbeleving vector data, using the sf package
m <- st_read(fname.gpkg, stringsAsFactors = FALSE)

# rasterize using pixid as reference
rastfile <- dcrasterize(obj = m,
                        layername,
                        attribute = layerattribute,
                        refraster = pixid)

# dcrasterize returns the names of the raster file and attribute table
fname.attr <- rastfile$attrfile
fname.grid <- rastfile$gridfile

# since we don't like the standard naming, we tidy the grid and
# attribute filenames.

# create raster filename
objname.grd <- datafile(paste(objname, "grd", sep = "."))
x <- raster(fname.grid)
writeRaster(x, objname.grd, overwrite = TRUE)

# the attribute table is a single file, so file.rename will do
fname.attr.new <- datafile(paste(objname.attr, ".rds", sep = ""))
file.rename(fname.attr, fname.attr.new)
```
Next, we store our results in the datacube data repository. We add
extra meta data using the `kv` option.
```{r}
# write attribute table as blob object
f <- dcstore(filename = fname.attr.new,
             obj = objname.attr,
             kv = list(type = "rds", # meta data key-value pairs
                       year = "2006",
                       map = objname,
                       description = "groenbeleving nationale zorgatlas 2006, attribute table"))

dcstoreraster(gridfile = objname.grd,
              blobname = objname,
              kv = list(type = "GTiff",
                        year = "2006",
                        attributetable = objname.attr))
```
Using `dcget` and `dcgetraster` we can retrieve our data again:
```{r}
file.remove(fname.attr.new)
b <- dcget(objname.attr)
```
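The raster itself can be retrieved in the same way with `dcgetraster`
(a sketch; see the package documentation for the exact return value):

```{r,eval=FALSE}
# retrieve the stored raster blob by name (illustrative call)
r <- dcgetraster(objname)
```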
After we got our object using `dcget`, the variable `b` contains all
our meta data, including information about the script which created
the data:
```{r}
print(b$audit$parent$script)
print(b$audit$parent$project)
print(b$audit$parent$repo)
```
# Data directory

The datacube creates a distinct `data` directory in the current
workdir to store all the data. Since this markdown script is part of a
package, we remove the data directory:
```{r}
unlink("./data/",recursive=TRUE)
```