-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathREADME.Rmd
563 lines (439 loc) · 21 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
---
title: "Read, filter and transform spectra and metadata using R data structures"
author: "Philipp Baumann // [email protected]"
date: "July 25, 2018"
output:
github_document:
toc: true
toc_depth: 2
html_document:
fig_caption: yes
number_sections: yes
toc: true
toc_depth: 2
html_notebook:
fig_caption: yes
number_sections: yes
toc: true
toc_depth: 2
pdf_document:
fig_caption: yes
number_sections: yes
toc: true
toc_depth: 2
latex_engine: xelatex
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
---
For a html version of this tutorial, see [**here**](). You can also
[download this tutorial as pdf]().
# Getting started
## Prerequisites
* First, you need to be familar how to work into R.
[Here](https://github.com/philipp-baumann/simplerspec-teaching/blob/master/00_R-basics-spectro.md)
I give some recommendations to start mastering spectroscopy analysis tasks.
* You need to need to now that there are R basic data structures used for
analysis. Have a first read [here](http://adv-r.had.co.nz/Data-structures.html)
in the *Advanced R* book of Hadley Wickham.
## Topics and goals of this section
* You will learn how to use different R base data structures and basic
operations such as subsetting to explore and transform spectral data.
* This hands-on tutorial teaches you the technical skills how to work with data
structures in R by the example of manipulating spectral data.
## How to interactively go throught the tutorial
1. Use the `Clone or Download` button and clone/download this tutorial
repository from github to your computer.
2. Unzip the folder.
3. Double-click the `.RProj` file and the tutorial is loaded as R project
(no need to use `setwd()` and
`path = "yourverylonganduniquedirectoryonyourpc"` or other nasty hard-coded
approaches to set the working directory, which makes your projects not
reproducible for others; kindly follow these
[instructions](http://r4ds.had.co.nz/workflow-projects.html#rstudio-projects)
to learn how to work with RStudio self-contained projects.
4. For working with this R markdown notebooks for reproducible and interactive
analysis, see [here](https://rmarkdown.rstudio.com/index.html).
5. Have fun reproducing.
---
# Reading spectra from OPUS spectrometer files: prerequisites
Spectroscopy modeling requires that we first organize our spectra well.
In particular, a proper and reproducible data management of spectral data,
metadata, and data from reference chemical analyses is key for all the
subsequent data processing and modeling workflow.
The Sustainable Agroecosystems group at ETH relies on Diffuse Reflectance
Fourier Transform (DRIFT) infrared spectrometers manufactured by the company
*Bruker* (see Figure \ref{alpha_eth}). The manufacturer relies on a proprietary
binary format called *OPUS* to store an extensive amount of data that includes
different types of intermediary spectra. For each sample that was measured a
single *OPUS* file is produced.
{width=200px}
# Reading spectrometer data into the R environment
First, we load the set of packages of `tidyverse` (see
[**here**](http://tidyverse.org/) for details) and the `simplerspec` package
(see [**here**](https://github.com/philipp-baumann/simplerspec/)). Simplerspec
contains a universal file reader that allows to read selected parameters (e.g.
instrument, optic and acquisition parameters) and all types of spectra from a
single *OPUS* binary file or a list of files.
```{r}
# Load collection of packages that work together seamlessly for efficient
# analysis workflows
library("tidyverse")
# Package that facilitates spectral data handling, processing and modeling
library("simplerspec")
```
I recommended that you set up a self-contained directory where all R scripts,
data (spectra and chemical reference data), models, outputs (figures and text
files of data and model summaries), and predictions live. Further, you
can use the folder structure depicted in figure \ref{project_structure}
to organize your spectroscopy-related research projects.
{width=70%}
When you have spectra that cover separate experiments and/or different locations
and times, you might prefer to organize your spectra as sub-folders within
`data/spectra`. This hands-on is based on spectral data that were used
to build and evaluate the YAMSYS spectroscopy reference models. Besides these
reference spectra measured with a Bruker ALPHA mid-IR spectrometer at the
Sustainable Agroecosystems group at ETH Zürich, there are other spectra that
have been acquired to test different questions such as spectrometer
cross-comparisons. Therefore, other comparison spectra are in separate paths,
e.g. `data/spectra/soilspec_eth_bin`.
In the Figure below you can see a file explorer screenshot showing
*OPUS* files of three replicate scans for each of the first three reference soil
samples. *OPUS* have the extension `.n` where `n` represents an integer of
repeated sample measurements starting from 0.
{width=400px}
We aim to read all the reference spectra contained within this folder. First, we
get the full path names of the file names, which we subsequently assign to the
object `files`:
```{r}
# Extract data from OPUS binary files; list of file paths
files <- list.files("data/spectra", full.names = TRUE)
```
Note that you need to set the `full.names` argument to `TRUE` (default is
`FALSE` to get the path of all *OPUS* spectra files contained within the target
directory, otherwise R will not be able to find the files when using the
universal `simplerspec` *OPUS* reader.
You can compactly display the internal structure of the `files` object:
```{r}
str(files)
```
The object `files` has the data structure *atomic vector*. *Atomic vectors*
have six possible basic (*atomic*) vector types. These are *logical*, *integer*,
*real*, *complex*, *string* (or *character*) and *raw*. Vector types can be
returned by the R base function `typeof(x)`, which returns the type or internal
storage mode an object `x`. For the `files` object it is
```{r}
# Check type of files object
typeof(files)
```
We get the length of the vector or the number of elements by
```{r}
# How many files are listed to read? length of vector
length(files)
```
Base R has subsetting operations that allow you to extract pieces of data
structures you are interested in. One of the three base subsetting operators is
`[`.
We subset the character vector `files` as follows:
```{r}
# Use character subsetting to return the first element
# Subsetting can be seen as complement to str()
# (1) Subsetting with positive integers (position)
files[1:3]
# (2) Subsetting with negative integers (remove values)
head(files[-c(1:3)], n = 5L) # show only first 5 values
# The first three elements of the character vector are removed
```
## Spectral measurement data
Bruker FTIR spectrometers produce binary files in the OPUS format that can
contain different types of spectra and many parameters such as instrument type
and settings that were used at the time of data acquisition and internal
processing (e.g. Fourier transform operations). Basically, the entire set of
*Setup Measurement Parameters*, selected spectra, supplementary metadata such as
the time of measurement are written into *OPUS* binary files. In contrast to
simple text files that contain only plain text with a defined character
encoding, binary files can contain any type of data represented as sequences of
bytes (a single byte is sequence of 8 bits and 1 bit either represents 0 or 1).
Figure \ref{fig_instr_par} shows graphical representation from the *OPUS* viewer
software to get familiarize with types of parameters *OPUS* files may contain.
{width=65%}
You can download the *OPUS viewer* software from [**this Bruker
webpage**](https://www.bruker.com/products/infrared-near-infrared-and-raman-spectroscopy/opus-spectroscopy-software/downloads/opus-downloads.html)
for free. However, Bruker only provides a Windows version and the free version
is limited to visualize only final spectra. The remaining spectral blocks can be
checked choosing the menu *Window* > *New Report Window* and opening *OPUS* by
the menu *File* > *Load File*.
The types of spectra and associated data parameters that are saved after a
single measurement depend on the options that are selected in the *OPUS*
software. For data acquisition, the values under the tab *Advanced* of the
*Setup Measurement Parameters* menu window in the *OPUS* software.
Depending on the standard of a binary file, different regions in a file can be
interpreted differently by a program. For example, some information at some
block positions need to be interpreted as a certain type of number
representation whereas others are text. Hence, the interpretation of different
bit positions in the file requires either a priori knowledge provided by some
file specifications or extensive reverse-engineering.
Instead of sharing the full binary file specification, Bruker ships the *OPUS*
macro programming language or Microsoft Visual Basic scripts for automated data
acquisition and processing. However, this approaches are very inflexible and not
transparent, and therefore not reproducible. Hence, the idea of implementing
a file reader that is integrated in the R statistical programming environment
was targeted first in the `soil.spec` R package created by Andrew Sila (ICRAF,
Nairobi), Tomislav Hengl (ISRIC -- World Soil Information) and Thomas
Terhoeven-Urselmans (former member of ICRAF, Nairobi). `soil.spec` was created
based on the African Soil Information Services (AfSIS) project (see [here for
more information](http://africasoils.net/)). Because this reader worked only
when applying a restricted set of settings and procedures in OPUS, the idea came
up to modify and extend the previously mentioned `soil.spec::read.opus()`
function. This restriction is mainly due to the fact that positions where
spectra occur are not fixed and there is no evident accessible information about
the sequence of spectra and data parameters and the type of present spectra.
Therefore, I have been working extensively on a universal Bruker OPUS format
file reader that can correctly assign and read out different spectra types from
any type of Bruker FTIR spectrometer with different blocks saved and with and
without atmospheric compensation.
Simplerspec comes with reader function written in R, that is intended to be a
universal Bruker OPUS file reader that extract spectra and key metadata from
files. Usually, one is mostly interested to extract the final absorbance spectra
(shown as `AB` in the *OPUS viewer* software).
## Read the spectral data (OPUS files) as list into R
```{r, message = FALSE, cache=TRUE}
## Register parallel backend for using multiple cores
# Allows to tune the models using parallel processing (e.g. use all available
# cores of a CPU); caret package automatically detects the registered backend
library("doParallel")
# Make a cluster with all possible threads (more than physical cores)
cl <- makeCluster(detectCores())
# Register backend
registerDoParallel(cl)
# Return number of parallel workers
getDoParWorkers() # 8 threads on MacBook Pro (Retina, 15-inch, Mid 2015);
# Quadcore processor
# Read spectra and metadata of all binary OPUS files into a list
spc_list <- read_opus_univ(fnames = files, extract = c("spc"), parallel = TRUE)
```
## Use R functions to gain an overview of the spectral data
The extracted spectra and metadata data are within a list. A list is a very
flexible R data structure that can contain any other type of R objects. You can
think of lists as containers that help to save and transform objects. Lists can
contain hierarchically nested elements, e.g. a list can contain lists. In this
case, the list contains the following elements:
```{r}
# Return the names of a list; names() returns a character vector
# of the elemelement names and `[` extracts the first 10 names
names(spc_list)[1:10]
# The names from the spectral data list are identical to the
# files that were read (file names without path of folder where data
# are contained)
files[1:10]
```
**List subsetting/extraction of components:** Lists can be subsetted similar to
atomic vectors by the `[` and `[[` operators. The most important difference is
that `[` returns a list (sub-list of list) and `[[` returns the content of a
single component of the list (note that a single components can still contain
sub-lists). Based on the `[` operator we can extract spectra and metadata from a
single replicate measurement of a sample:
```{r}
# Display structure
str(spc_list["BF_lo_01_soil_cal.0"])
```
The above code extracts a list with one element that is named
`BF_lo_01_soil_cal.0`, whereas using `[[` shows the content of the subsetted
list and the name is not shown anymore. The content contains again 9 elements,
as `str()` reveals:
```{r}
str(spc_list[["BF_lo_01_soil_cal.0"]])
```
```{r}
head(spc_list[["BF_lo_01_soil_cal.0"]][["wavenumbers"]])
```
Using the function `ls()` returns a vector of character strings giving the
names of the list:
```{r}
# Show names of first hierarchy of list
ls(spc_list[["BF_lo_01_soil_cal.0"]])
```
To extract nested elements from a list, you can repeatedly apply subsetting
operators. Besides using name subsetting for named data structures that contain
a name attribute you can also use integers as index or logical vectors (TRUE and
FALSE).
```{r}
names(spc_list["BF_lo_01_soil_cal.0"]) # subset by name of first element
# subset by integer index, result is identical
names(spc_list[1]) == names(spc_list["BF_lo_01_soil_cal.0"])
```
For logical subsetting we create a new vector containing TRUE or FALSE that is
of same length as the spectra list `spc_list`. Usually logical type vectors are
returned when testing conditions using binary operators that allow the
comparison of values in atomic vectors (see R help for *relational operators* by
entering `?Comparision` in the R console; e.g. `<`, `>`, `==`). Here, we create
a logical vector `logical_subset` manually in order to illustrate that
subsetting also works with vectors of type logical:
```{r}
# repeat FALSE length(spc_list) times
logical_subset <- rep(FALSE, length(spc_list))
# Print subsetting vector
logical_subset
# Check type
typeof(logical_subset)
```
Subsetting and assignment can be combined to replace the third
element with `FALSE`:
```{r}
# Replace the 3rd element with TRUE; use subsetting and assignment
logical_subset[3] <- TRUE
```
Subsequently, we can use the newly created logical vector for subsetting
the spectral data list
```{r}
# Extract list with `[`; use str() to show a compact output that
# is nicely printed
str(spc_list[logical_subset]) # Returns positions that are TRUE, element 3
```
We can also test if the spectral list contains certain characters in the file
name by using pattern matching functions. If one has used the string `"_tb_"` as
part of the sample identifier to specify the sampling region in the file names,
we might be interested in selecting only spectra and metadata of these region
(`"_tb_"` stands for the site Tieningboué in Côte d'Ivoire for YAMSYS
spectroscopy reference samples).
```{r}
# Samples from site abbreviation "tb" (Tieningboué)
contains_tb <- grepl(pattern = "tb", x = names(spc_list))
# Show names of spectral data list elements that are returned by
# looking for "CI"
names(spc_list[contains_tb])
```
As the above example illustrates, only spectral data from files
containing the string `"tb"` are selected.
## data.table data frames for storing spectral data
Data frames are one of the basic R data structures.
When first reading spectral data from binary OPUS files, simplerspec returns
`̀data.tables`s of final spectra (`AB` block in OPUS viewer software).
```{r}
# Extract spectrum from file "BF_lo_01_soil_cal.0"
# and get overview of the data structure
str(spc_list[["BF_lo_01_soil_cal.0"]][["spc"]])
```
You can test if the above output has the class `data.frame` with
```{r}
is.data.frame(spc_list[["BF_lo_01_soil_cal.0"]][["spc"]])
```
As the output `TRUE` indicates, the selected spectrum from the list is a
data frame.
You can get the number of rows and columns of a data.table by
```{r}
# Assign data.table to object
spc_dt <- spc_list[["BF_lo_01_soil_cal.0"]][["spc"]]
nrow(spc_dt)
ncol(spc_dt)
```
The spectral data.table within the file `"BF_lo_01_soil_cal.0"` has
`r nrow(spc_dt)` rows and `r ncol(spc_dt)` columns. The columns correspond to
wavenumber variables.
Data frames have a `dimnames` attribute that names columns and
rows:
```{r}
# Show row name and only first and last 10 column names
rownames(spc_dt)
idx_firstandlast10 <- c(1:10, seq(from = ncol(spc_dt) - 10, ncol(spc_dt), 1))
colnames(spc_dt)[idx_firstandlast10]
```
You can also get dimension names in list form
```{r}
# Show row and column names as list
str(dimnames(spc_dt))
```
**Subsetting data frames :** Data frame subsetting operations allow you to
extract parts of values stored within a data frame that you are interested in.
The basic syntax is that you can use `[` and supply a 1D index for both rows and
columns, separated by a comma. Blank subsetting without an index value keeps all
rows or columns:
```{r}
# Show columns 2, 3 and 6
spc_dt[, c(2, 3, 6)]
```
The first index before the comma is the row index and the second the column
index. Omitting the row index shows all rows. In the case of data.table `spc_dt`
there is only one row. For exemplifying the subsetting behavior of matrices
we can duplicate the data.table `spc_dt` and copy the same content into a second
row using `rbind()`, which is a generic function to combine objects by rows
(equivalent for columns is `cbind()`):
```{r}
spc_dt2 <- rbind(spc_dt, spc_dt)
# Check dimensions
dim(spc_dt2)
```
Now we can e.g. replace the first value in the second row by first selecting the
value in the second row of the first column by 1 and then assigning the number 1
to it. This will modify the value at the selection position in the previous
data frame in place.
```{r}
spc_dt2[2, 1] # extract 2nd row and first column
# Subset and modify by assignment
spc_dt2[2, 1] <- 1
# Check if value at selected position has been replaced
spc_dt2[2, 1]
```
The above code shows that the selected value has been correctly replaced. It is
also possible to only show the first row of `spc_dt2`, by leaving the column
index empty:
```{r}
head(spc_dt[1, ]) # Show only first 10 values (default of head())
```
The first column of the first row has still the value `r spc_dt[1, 1]`.
```{r}
# Check if dimnames attribute is present
str(attributes(spc_dt))
```
We can e.g. also select the first column by its name, which is commonly known in
R as **name subsetting**
```{r}
spc_dt[, "3997.4"] # Select first column with wavenumber variable "3997.4"
spc_dt[, "3997.4"] == spc_dt[, 1] # both are equivalent
```
## Advantages of lists
**Applying the same function to all elements of a list:** Lists are data
structures that allow to store complex, hierarchical objects. Lists are
fundamental units when applying functions on each elements using apply family
functions. Apply family functions are a specific type of *functionals* that take
functions and other objects as input and return lists or atomic vectors. Note
that lists are also vectors. Functionals are an elegant way to solve common data
manipulation tasks. A often used functional is `lapply()`. The functional
`lapply(X, FUN, ...)` applies a function `FUN` to each of the corresponding
elements of `X` and returns the result as a list of the same length as its input
`X`. The argument `...` can be other arguments passed to the function. Let us
explore the behavior of `lapply()` on a simple example. A list shall contain
three different the numeric vectors named `"a"`, `"b"`, and `"c"`.
```{r}
x <- list(
"a" = 1:10,
"b" = c(0.5, 2.3, 5),
"c" = seq(0.1, 1, 0.1)
)
x
```
We can simply calculate the mean value for all the elements (vectors) in the
list `x` by using `lapply()`:
```{r}
lapply(x, mean) # First argument is X, second is FUN;
# you can also supply arguments explicitly with lapply(X = x, FUN = mean)
```
As you can see, the mean value is computed for the elements `'a`, `'b'`, and
`'c'` and returned as list.
We can also remove the hierarchy from the list and returning it as named numeric
vector using `unlist()`. `unlist()` concatenates all elements of all components
into a single vector:
```{r}
mean_v <- unlist(lapply(x, mean))
str(mean_v)
```
# Session info
```{r}
sessionInfo()
```