---
output: rmarkdown::github_document
---
# Introduction

The RIVM datacube is a data repository aimed at spatial grid-based data,
although it can manage any kind of data. It is used in our data-science
projects involving spatial analyses, modelling and prediction. It is
suitable for combining field measurements, e.g. from monitoring networks,
with other spatial data (e.g. soil types, groundwater levels, land use,
altitude, crops, emission data, etc.). All spatial data in the
datacube is georeferenced to the same extent, so maps can be stacked
easily. These stacks are the actual datacubes we use in our machine
learning models.
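Because all layers share the same grid, stacking them is straightforward
with the raster package. A minimal sketch (the file pattern is
illustrative):

```{r,eval=FALSE}
library(raster)

# stack all co-registered grids in the data directory into one cube
# (illustrative file pattern; any set of rasters on the same grid works)
cube <- stack(list.files("data", pattern = "\\.grd$", full.names = TRUE))
```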
This package contains functions to work with GIS data and PostGIS. It
manages data by storing it in the repository, generating meta-data,
creating audit trails or data lineage paths, and storing versioning
info. It is aimed at small teams working together on the same data. The
datacube package makes it possible to work on projects which are
reusable, reproducible and auditable.
# Important

*Please note:* this is work in progress. This package needs git and
PostgreSQL/PostGIS to work properly. It assumes a Linux OS (it might
work under Windows, but we have never tried).

Installation in R, using the devtools package:

`devtools::install_github("jspijker/datacube")`

Please make sure you already have the fasterize and here packages
installed.
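For example, a minimal install sequence (assuming none of the packages
are present yet):

```{r,eval=FALSE}
# install CRAN prerequisites, then datacube itself from GitHub
install.packages(c("devtools", "fasterize", "here"))
devtools::install_github("jspijker/datacube")
```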
## Future development

I presented this package at the useR!2019 conference. During the
discussions I noticed that people really like the idea of a data
repository and the possibility to create an audit trail for their data.
However, the name of this package, datacube, is confusing: we use it
for datacubes, but it is not limited to them and can be used for any
data or data workflow.

To make this package more useful for a general R audience, we will split
it into two parts. We keep everything about datacubes and spatial
rasters in our datacube package. It's the stuff we love and use, and it
'works for us'. The parts about the data repository, audit trails, etc.
will go into a separate package, so they can be of use to others.
# Dependencies

The datacube package uses three other related packages:

1. pgobjects is a package to store R objects (variables, functions, or complete environments) in a PostgreSQL database.
2. The pgblobs package is an extension of pgobjects. If objects are too big to store in the database, like raster grids or spatial data, only the meta data is stored in the database and the file, or blob, is stored on a shared disk location.
3. The localoptions package is used to read an options file with the datacube configuration.

These packages can be found on GitHub:

* [pgobjects](https://github.com/jspijker/pgobjects)
* [pgblobs](https://github.com/jspijker/pgblobs)
* [localoptions](https://github.com/jspijker/localoptions)
# Configuration

The system is configured with the localoptions package, which reads an
option file containing the database configuration and file locations.
The default location of this option file is ~/.R.options, and it looks
like this:
```
# database host, database name, user, and password
datacube.host localhost
datacube.dbname datacube
datacube.user username
datacube.password verysecret
# database schema for pgobjects tables (default is public)
datacube.schema datacube
# location to store file blobs (shared network location)
datacube.blobs /datacube/blobs
```
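The option file is a plain whitespace-separated key-value list. For
illustration, a minimal base-R parser could look like this (this is not
the localoptions API itself, just a sketch of the format):

```{r,eval=FALSE}
# read the two-column key-value file, skipping '#' comment lines
# (illustrative only; in practice localoptions reads this file)
opts <- read.table("~/.R.options", comment.char = "#",
                   col.names = c("key", "value"),
                   stringsAsFactors = FALSE)
setNames(opts$value, opts$key)
```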
After a correct configuration, you can initialize the database and
create the necessary tables:
```{r,eval=FALSE}
library(pgobjects)
createPgobjTables()
```
# Initialization

After the database is set up, one can load the necessary packages.
We prefer to use pacman for that.
```{r}
# load packages
if (!require("pacman")) install.packages("pacman")
pacman::p_load(parallel, raster, ggplot2, sp, maptools, RCurl,
               RPostgreSQL, rgdal, gdalUtils, sf, fasterize, foreign,
               tidyverse, here)

# packages on GitHub to use
pacman::p_load_gh("jspijker/localoptions", "jspijker/pgobjects",
                  "jspijker/pgblobs")

# load datacube
devtools::load_all()
```
After setting up the configuration, you can create your first project.
Projects using the datacube package are organised within git
repositories. Each repository can contain multiple projects, and for
each project the user creates a separate directory at the root of the
repository.

Then the datacube is initialized, using the name of the project
directory (workdir) and the name of the script. Part of the
initialization is the setup of the database connection. A data directory
is also created, meta data about the git repository is collected, and
the working directory is changed to the project directory. Don't you
dare use `setwd()` in your scripts.
```{r}
# initialize datacube
datacubeInit(script="README.Rmd",workdir=".")
```
# Import, tidy and transform data

To demonstrate the datacube we import data from a source location, and
then tidy and transform it. For this demonstration we use the
'groenbeleving' indicator from the Dutch Health Atlas. This indicator
gives the percentage of people within a municipality who are satisfied
with the amount of green area in their living environment. The data is
published as a WFS service in the RIVM geoservice.
We'll import the spatial vector data and then transform it to a
georeferenced 25x25m raster. The same georeference is used for all
raster layers, so they can be stacked into a multidimensional raster
stack. This stack is subsequently used in our machine learning models.
By default, the Dutch 'Rijksdriehoeksmeting' projection (EPSG:28992) is
used; this can be changed by the user.
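To make the reference raster concrete: a 25x25m grid in EPSG:28992 could
be defined as below. The extent is illustrative (roughly covering the
Netherlands); `createPixid()`, used further on, produces the actual
reference raster.

```{r,eval=FALSE}
library(raster)

# empty 25x25 m reference grid in the Dutch RD projection
# (illustrative extent, not necessarily the one createPixid() uses)
ref <- raster(xmn = 0, xmx = 280000, ymn = 300000, ymx = 625000,
              resolution = 25, crs = "+init=epsg:28992")
```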
In the next section we set our variables and download the data:
```{r}
# variables for data source, map layer name, and attribute
wfsuri <- "http://geodata.rivm.nl/geoserver/wfs?SERVICE=WFS&VERSION=1.0.0&REQUEST=GetFeature&TYPENAME=rivm:zorgatlas_gem_groen_2006&SRSNAME=EPSG:28992"
layername <- "zorgatlas_gem_groen_2006"
layerattribute <- "p_tevree" # name of attribute of interest

# datacube object names
objname <- "groenbeleving" # name of raster object
objname.attr <- paste(objname, "_attr", sep = "") # name of attribute table

# check layers
layers <- ogrListLayers(wfsuri)

# filename to store data
fname.gpkg <- datafile("groenbeleving.gpkg") # see ?datacube::datafile

# get map layer, only if the data does not exist yet (so as not to
# waste network bandwidth). Store data as GeoPackage
if (!file.exists(fname.gpkg)) {
    ogr2ogr(wfsuri, fname.gpkg,
            layer = layername,
            f = "GPKG")
}
```
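As a quick sanity check, the layers in the downloaded GeoPackage can be
listed with the sf package:

```{r,eval=FALSE}
# inspect the layers in the GeoPackage we just downloaded
st_layers(fname.gpkg)
```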
Now that we have our vector data, we can create a raster. We use the
datacube `createPixid` function to create a georeferenced reference
raster. Then we use the datacube `dcrasterize` function to rasterize our
vector data:
```{r}
# create pixid reference raster
pixid <- createPixid()
writeRaster(pixid, datafile("pixid.grd"), overwrite = TRUE)

# read groenbeleving vector data, using the sf package
m <- st_read(fname.gpkg, stringsAsFactors = FALSE)

# rasterize using pixid as reference
rastfile <- dcrasterize(obj = m,
                        layername,
                        attribute = layerattribute,
                        refraster = pixid)

# dcrasterize returns the names of the raster file and attribute table
fname.attr <- rastfile$attrfile
fname.grid <- rastfile$gridfile

# since we don't like the standard naming, we tidy the grid and
# attribute filenames.

# create raster filename
objname.grd <- datafile(paste(objname, "grd", sep = "."))
x <- raster(fname.grid)
writeRaster(x, objname.grd, overwrite = TRUE)

# the attribute table is a single file, so file.rename will do
fname.attr.new <- datafile(paste(objname.attr, ".rds", sep = ""))
file.rename(fname.attr, fname.attr.new)
```
Next, we store our results in the datacube data repository. We add
extra meta data using the `kv` option.
```{r}
# write attribute table as blob object
f <- dcstore(filename = fname.attr.new,
             obj = objname.attr,
             kv = list(type = "rds", # meta data key-value pairs
                       year = "2006",
                       map = objname,
                       description = "groenbeleving nationale zorgatlas 2006, attribute table"))

dcstoreraster(gridfile = objname.grd,
              blobname = objname,
              kv = list(type = "GTiff",
                        year = "2006",
                        attributetable = objname.attr))
```
Using `dcget` and `dcgetraster` we can retrieve our data again:
```{r}
file.remove(fname.attr.new)
b <- dcget(objname.attr)
```
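The raster itself can be retrieved in the same way with `dcgetraster`
(a sketch; see the package documentation for the exact return value):

```{r,eval=FALSE}
# retrieve the stored raster blob by name (illustrative call)
r <- dcgetraster(objname)
```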
After we got our object using `dcget`, the variable `b` contains all
our meta data, including information about the script which created
the data:
```{r}
print(b$audit$parent$script)
print(b$audit$parent$project)
print(b$audit$parent$repo)
```
# Data directory

The datacube creates a distinct `data` directory in the current
workdir to store all the data. Since this markdown script is part of a
package, we remove the data directory:
```{r}
unlink("./data/",recursive=TRUE)
```