README.Rmd

---
title: "Read, filter and transform spectra and metadata using R data structures"
author: "Philipp Baumann // philipp.baumann@usys.ethz.ch"
date: "July 25, 2018"
output:
  github_document:
    toc: true
    toc_depth: 2
  html_document:
    fig_caption: yes
    number_sections: yes
    toc: true
    toc_depth: 2
  html_notebook:
    fig_caption: yes
    number_sections: yes
    toc: true
    toc_depth: 2
  pdf_document:
    fig_caption: yes
    number_sections: yes
    toc: true
    toc_depth: 2
    latex_engine: xelatex
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

---

For a html version of this tutorial, see [**here**](). You can also
[download this tutorial as pdf]().

# Getting started  

## Prerequisites

* First, you need to be familar how to work into R.
  [Here](https://github.com/philipp-baumann/simplerspec-teaching/blob/master/00_R-basics-spectro.md) 
  I give some recommendations to start mastering spectroscopy analysis tasks.
* You need to need to now that there are R basic data structures used for 
  analysis. Have a first read [here](http://adv-r.had.co.nz/Data-structures.html)
  in the *Advanced R* book of Hadley Wickham. 

## Topics and goals of this section

* You will learn how to use different R base data structures and basic
  operations such as subsetting to explore and transform spectral data.
  
* This hands-on tutorial teaches you the technical skills how to work with data
  structures in R by the example of manipulating spectral data.
  
## How to interactively go throught the tutorial

1. Use the `Clone or Download` button and clone/download this tutorial
repository from github to your computer.
2. Unzip the folder.
3. Double-click the `.RProj` file and the tutorial is loaded as R project
   (no need to use `setwd()` and 
   `path = "yourverylonganduniquedirectoryonyourpc"` or other nasty hard-coded
   approaches to set the working directory, which makes your projects not 
   reproducible for others; kindly follow these 
   [instructions](http://r4ds.had.co.nz/workflow-projects.html#rstudio-projects)
   to learn how to work with RStudio self-contained projects.
4. For working with this R markdown notebooks for reproducible and interactive
   analysis, see [here](https://rmarkdown.rstudio.com/index.html).
5. Have fun reproducing.

  
---
  
# Reading spectra from OPUS spectrometer files: prerequisites

Spectroscopy modeling requires that we first organize our spectra well. 
In particular, a proper and reproducible data management of spectral data,
metadata, and data from reference chemical analyses is key for all the 
subsequent data processing and modeling workflow.

The Sustainable Agroecosystems group at ETH relies on Diffuse Reflectance
Fourier Transform (DRIFT) infrared spectrometers manufactured by the company
*Bruker* (see Figure \ref{alpha_eth}). The manufacturer relies on a proprietary
binary format called *OPUS* to store an extensive amount of data that includes
different types of intermediary spectra. For each sample that was measured a 
single *OPUS* file is produced.

![Bruker ALPHA mid-IR spectrometer (diffuse reflectance Fourier transform infrared) of the Sustainable Agroecosystems group at ETH Zürich with a sample cup filled with soil. \label{alpha_eth}](figures/alpha_eth.jpg){width=200px}


# Reading spectrometer data into the R environment

First, we load the set of packages of `tidyverse` (see
[**here**](http://tidyverse.org/) for details) and the `simplerspec` package
(see [**here**](https://github.com/philipp-baumann/simplerspec/)). Simplerspec
contains a universal file reader that allows to  read selected parameters (e.g.
instrument, optic and acquisition parameters) and all types of spectra from a
single *OPUS* binary file or a list of files.

```{r}
# Load collection of packages that work together seamlessly for efficient
# analysis workflows
library("tidyverse")
# Package that facilitates spectral data handling, processing and modeling
library("simplerspec")
```

I recommended that you set up a self-contained directory where all R scripts,
data (spectra and chemical reference data), models, outputs (figures and text
files of data and model summaries), and predictions live. Further, you
can use the folder structure depicted in figure \ref{project_structure}
to organize your spectroscopy-related research projects.


![Recommended directory structure for spectroscopy modeling projects
\label{project_structure}](figures/project_folder_structure.png){width=70%}

When you have spectra that cover separate experiments and/or different locations
and times, you might prefer to organize your spectra as sub-folders within
`data/spectra`. This hands-on is based on spectral data that were used 
to build and evaluate the YAMSYS spectroscopy reference models. Besides these
reference spectra measured with a Bruker ALPHA mid-IR spectrometer at the 
Sustainable Agroecosystems group at ETH Zürich, there are other spectra that
have been acquired to test different questions such as spectrometer
cross-comparisons. Therefore, other comparison spectra are in separate paths,
e.g.  `data/spectra/soilspec_eth_bin`.

In the Figure below you can see a file explorer screenshot showing
*OPUS* files of three replicate scans for each of the first three reference soil
samples. *OPUS* have the extension `.n` where `n` represents an integer of
repeated sample measurements starting from 0.

![Screenshot showing replicate scans of first three samples reading example.
\label{spectra_files}](figures/spectra_files_to_read.png){width=400px}

We aim to read all the reference spectra contained within this folder. First, we
get the full path names of the file names, which we subsequently assign to the
object `files`:

```{r}
# Extract data from OPUS binary files; list of file paths
files <- list.files("data/spectra", full.names = TRUE)
```

Note that you need to set the `full.names` argument to `TRUE` (default is
`FALSE` to get the path of all *OPUS* spectra files contained within the target
directory, otherwise R will not be able to find the files when using the
universal `simplerspec` *OPUS* reader. 

You can compactly display the internal structure of the `files` object:

```{r}
str(files)
```

The object `files` has the data structure *atomic vector*. *Atomic vectors*
have six possible basic (*atomic*) vector types. These are *logical*, *integer*,
*real*, *complex*, *string* (or *character*) and *raw*. Vector types can be
returned by the R base function `typeof(x)`, which returns the type or internal
storage mode an object `x`. For the `files` object it is

```{r}
# Check type of files object
typeof(files)
```

We get the length of the vector or the number of elements by

```{r}
# How many files are listed to read? length of vector
length(files)
```

Base R has subsetting operations that allow you to extract pieces of data
structures you are interested in. One of the three base subsetting operators is
`[`.

We subset the character vector `files` as follows:

```{r}
# Use character subsetting to return the first element
# Subsetting can be seen as complement to str()
# (1) Subsetting with positive integers (position)
files[1:3]
# (2) Subsetting with negative integers (remove values)
head(files[-c(1:3)], n = 5L) # show only first 5 values
# The first three elements of the character vector are removed
```

## Spectral measurement data

Bruker FTIR spectrometers produce binary files in the OPUS format that can
contain different types of spectra and many parameters such as instrument type
and settings that were used at the time of data acquisition and internal
processing (e.g. Fourier transform operations). Basically, the entire set of
*Setup Measurement Parameters*, selected spectra, supplementary metadata such as
the time of measurement are written into *OPUS* binary files. In contrast to
simple text files that contain only plain text with a defined character
encoding, binary files can contain any type of data represented as sequences of
bytes (a single byte is sequence of 8 bits and 1 bit either represents 0 or 1).

Figure \ref{fig_instr_par} shows graphical representation from the *OPUS* viewer
software to get familiarize with types of parameters *OPUS* files may contain.

![Instrument parameters during sample measurement shown for an example YAMSYS
soil reference spectroscopy sample. Spectra and parameters can be shown by the
dialogue *Window* > *New Report Window* within the OPUS viewer software.
\label{fig_instr_par}](figures/opus_instrument_parameters_crop.png){width=65%}

You can download the *OPUS viewer* software from [**this Bruker
webpage**](https://www.bruker.com/products/infrared-near-infrared-and-raman-spectroscopy/opus-spectroscopy-software/downloads/opus-downloads.html)
for free. However, Bruker only provides a Windows version and the free version
is limited to visualize only final spectra. The remaining spectral blocks can be
checked choosing the menu *Window* > *New Report Window* and opening *OPUS* by
the menu *File* > *Load File*.

The types of spectra and associated data parameters that are saved after a
single measurement depend on the options that are selected in the *OPUS*
software. For data acquisition, the values under the tab *Advanced* of the
*Setup Measurement Parameters* menu window in the *OPUS* software.

Depending on the standard of a binary file, different regions in a file can be
interpreted differently by a program. For example, some information at some
block positions need to be interpreted as a certain type of number
representation whereas others are text. Hence, the interpretation of different
bit positions in the file requires either a priori knowledge provided by some
file specifications or extensive reverse-engineering.

Instead of sharing the full binary file specification, Bruker ships the *OPUS*
macro programming language or Microsoft Visual Basic scripts for automated data
acquisition and processing. However, this approaches are very inflexible and not
transparent, and therefore not reproducible. Hence, the idea of implementing
a file reader that is integrated in the R statistical programming environment
was targeted first in the `soil.spec` R package created by Andrew Sila (ICRAF,
Nairobi), Tomislav Hengl (ISRIC -- World Soil Information) and Thomas
Terhoeven-Urselmans (former member of ICRAF, Nairobi). `soil.spec` was created
based on the African Soil Information Services (AfSIS) project (see [here for
more information](http://africasoils.net/)). Because this reader worked only
when applying a restricted set of settings and procedures in OPUS, the idea came
up to modify and extend the previously mentioned `soil.spec::read.opus()`
function. This restriction is mainly due to the fact that positions where
spectra occur are not fixed and there is no evident accessible information about
the sequence of spectra and data parameters and the type of present spectra.
Therefore, I have been working extensively on a universal Bruker OPUS format
file reader that can correctly assign and read out different spectra types from
any type of Bruker FTIR spectrometer with different blocks saved and with and
without atmospheric compensation.

Simplerspec comes with reader function written in R, that is intended to be a
universal Bruker OPUS file reader that extract spectra and key metadata from
files. Usually, one is mostly interested to extract the final absorbance spectra
(shown as `AB` in the *OPUS viewer* software).

## Read the spectral data (OPUS files) as list into R

```{r, message = FALSE, cache=TRUE}
## Register parallel backend for using multiple cores

# Allows to tune the models using parallel processing (e.g. use all available
# cores of a CPU); caret package automatically detects the registered backend
library("doParallel")
# Make a cluster with all possible threads (more than physical cores)
cl <- makeCluster(detectCores())
# Register backend
registerDoParallel(cl)
# Return number of parallel workers
getDoParWorkers() # 8 threads on MacBook Pro (Retina, 15-inch, Mid 2015);
# Quadcore processor

# Read spectra and metadata of all binary OPUS files into a list
spc_list <- read_opus_univ(fnames = files, extract = c("spc"), parallel = TRUE)
```

## Use R functions to gain an overview of the spectral data

The extracted spectra and metadata data are within a list. A list is a very
flexible R data structure that can contain any other type of R objects. You can
think of lists as containers that help to save and transform objects. Lists can
contain hierarchically nested elements, e.g. a list can contain lists. In this
case, the list contains the following elements:

```{r}
# Return the names of a list; names() returns a character vector
# of the elemelement names and `[` extracts the first 10 names
names(spc_list)[1:10]
# The names from the spectral data list are identical to the
# files that were read (file names without path of folder where data
# are contained)
files[1:10]
```

**List subsetting/extraction of components:** Lists can be subsetted similar to
atomic vectors by the `[` and `[[` operators. The most important difference is
that `[` returns a list (sub-list of list) and `[[` returns the content of a
single component of the list (note that a single components can still contain
sub-lists). Based on the `[` operator we can extract spectra and metadata from a
single replicate measurement of a sample:

```{r}
# Display structure
str(spc_list["BF_lo_01_soil_cal.0"])
```

The above code extracts a list with one element that is named
`BF_lo_01_soil_cal.0`, whereas using `[[` shows the content of the subsetted
list and the name is not shown anymore. The content contains again 9 elements,
as `str()` reveals:

```{r}
str(spc_list[["BF_lo_01_soil_cal.0"]])
```

```{r}
head(spc_list[["BF_lo_01_soil_cal.0"]][["wavenumbers"]])
```


Using the function `ls()` returns a vector of character strings giving the
names of the list:

```{r}
# Show names of first hierarchy of list
ls(spc_list[["BF_lo_01_soil_cal.0"]])
```

To extract nested elements from a list, you can repeatedly apply subsetting
operators. Besides using name subsetting for named data structures that contain
a name attribute you can also use integers as index or logical vectors (TRUE and
FALSE).

```{r}
names(spc_list["BF_lo_01_soil_cal.0"]) # subset by name of first element
# subset by integer index, result is identical
names(spc_list[1]) == names(spc_list["BF_lo_01_soil_cal.0"])
```

For logical subsetting we create a new vector containing TRUE or FALSE that is
of same length as the spectra list `spc_list`. Usually logical type vectors are
returned when testing conditions using binary operators that allow the
comparison of values in atomic vectors (see R help for *relational operators* by
entering `?Comparision` in the R console; e.g. `<`, `>`, `==`). Here, we create
a logical vector `logical_subset` manually in order to illustrate that
subsetting also works with vectors of type logical:

```{r}
# repeat FALSE length(spc_list) times
logical_subset <- rep(FALSE, length(spc_list))


# Print subsetting vector
logical_subset
# Check type
typeof(logical_subset)
```

Subsetting and assignment can be combined to replace the third
element with `FALSE`:

```{r}
# Replace the 3rd element with TRUE; use subsetting and assignment
logical_subset[3] <- TRUE
```

Subsequently, we can use the newly created logical vector for subsetting
the spectral data list

```{r}
# Extract list with `[`; use str() to show a compact output that
# is nicely printed
str(spc_list[logical_subset]) # Returns positions that are TRUE, element 3
```


We can also test if the spectral list contains certain characters in the file
name by using pattern matching functions. If one has used the string `"_tb_"` as
part of the sample identifier to specify the sampling region in the file names,
we might be interested in selecting only spectra and metadata of these region
(`"_tb_"` stands for the site Tieningboué in Côte d'Ivoire for YAMSYS
spectroscopy reference samples).

```{r}
# Samples from site abbreviation "tb" (Tieningboué)
contains_tb <- grepl(pattern = "tb", x = names(spc_list))
# Show names of spectral data list elements that are returned by
# looking for "CI"
names(spc_list[contains_tb])
```

As the above example illustrates, only spectral data from files
containing the string `"tb"` are selected.


## data.table data frames for storing spectral data

Data frames are one of the basic R data structures.

When first reading spectral data from binary OPUS files, simplerspec returns
`̀data.tables`s of final spectra (`AB` block in OPUS viewer software).

```{r}
# Extract spectrum from file "BF_lo_01_soil_cal.0"
# and get overview of the data structure
str(spc_list[["BF_lo_01_soil_cal.0"]][["spc"]])
```

You can test if the above output has the class `data.frame` with

```{r}
is.data.frame(spc_list[["BF_lo_01_soil_cal.0"]][["spc"]])
```

As the output `TRUE` indicates, the selected spectrum from the list is a
data frame.

You can get the number of rows and columns of a data.table by

```{r}
# Assign data.table to object
spc_dt <- spc_list[["BF_lo_01_soil_cal.0"]][["spc"]]
nrow(spc_dt)
ncol(spc_dt)
```

The spectral data.table within the file `"BF_lo_01_soil_cal.0"` has 
`r nrow(spc_dt)` rows and `r ncol(spc_dt)` columns. The columns correspond to 
wavenumber variables.

Data frames have a `dimnames` attribute that names columns and
rows:

```{r}
# Show row name and only first and last 10 column names
rownames(spc_dt)
idx_firstandlast10 <- c(1:10, seq(from = ncol(spc_dt) - 10, ncol(spc_dt), 1))
colnames(spc_dt)[idx_firstandlast10]
```

You can also get dimension names in list form
```{r}
# Show row and column names as list
str(dimnames(spc_dt))
```

**Subsetting data frames :** Data frame subsetting operations allow you to 
extract parts of values stored within a data frame that you are interested in. 
The basic syntax is that you can use `[` and supply a 1D index for both rows and
columns, separated by a comma. Blank subsetting without an index value keeps all
rows or columns:

```{r}
# Show columns 2, 3 and 6
spc_dt[, c(2, 3, 6)]
```

The first index before the comma is the row index and the second the column
index. Omitting the row index shows all rows. In the case of data.table `spc_dt`
there is only one row. For exemplifying the subsetting behavior of matrices
we can duplicate the data.table `spc_dt` and copy the same content into a second
row using `rbind()`, which is a generic function to combine objects by rows
(equivalent for columns is `cbind()`):


```{r}
spc_dt2 <- rbind(spc_dt, spc_dt)
# Check dimensions
dim(spc_dt2)
```

Now we can e.g. replace the first value in the second row by first selecting the
value in the second row of the first column by 1 and then assigning the number 1
to it. This will modify the value at the selection position in the previous
data frame in place.

```{r}
spc_dt2[2, 1] # extract 2nd row and first column
# Subset and modify by assignment
spc_dt2[2, 1] <- 1
# Check if value at selected position has been replaced
spc_dt2[2, 1]
```

The above code shows that the selected value has been correctly replaced. It is
also possible to only show the first row of `spc_dt2`, by leaving the column
index empty:

```{r}
head(spc_dt[1, ]) # Show only first 10 values (default of head())
```

The first column of the first row has still the value `r spc_dt[1, 1]`.

```{r}
# Check if dimnames attribute is present
str(attributes(spc_dt))
```

We can e.g. also select the first column by its name, which is commonly known in
R as **name subsetting**

```{r}
spc_dt[, "3997.4"] # Select first column with wavenumber variable "3997.4"
spc_dt[, "3997.4"] == spc_dt[, 1] # both are equivalent
```


## Advantages of lists

**Applying the same function to all elements of a list:** Lists are data 
structures that allow to store complex, hierarchical objects. Lists are
fundamental units when applying functions on each elements using apply family
functions. Apply family functions are a specific type of *functionals* that take
functions and other objects as input and return lists or atomic vectors. Note
that lists are also vectors. Functionals are an elegant way to solve common data
manipulation tasks. A often used functional is `lapply()`. The functional
`lapply(X, FUN, ...)` applies a function `FUN` to each of the corresponding
elements of `X` and returns the result as a list of the same length as its input
`X`. The argument `...` can be other arguments passed to the function.  Let us
explore the behavior of `lapply()` on a simple example. A list shall contain
three different the numeric vectors named `"a"`, `"b"`, and `"c"`.

```{r}
x <- list(
  "a" = 1:10,
  "b" = c(0.5, 2.3, 5),
  "c" = seq(0.1, 1, 0.1)
)
x
```

We can simply calculate the mean value for all the elements (vectors) in the
list `x` by using `lapply()`:

```{r}
lapply(x, mean) # First argument is X, second is FUN;
# you can also supply arguments explicitly with lapply(X = x, FUN = mean)
```

As you can see, the mean value is computed for the elements `'a`, `'b'`, and
`'c'` and returned as list.

We can also remove the hierarchy from the list and returning it as named numeric
vector using `unlist()`. `unlist()` concatenates all elements of all components
into a single vector:

```{r}
mean_v <- unlist(lapply(x, mean))
str(mean_v)
```


# Session info

```{r}
sessionInfo()
```