---
title: "rmacroRDM"
author:
date:
output: md_document
---
### WARNING!! Functionality in the package has been significantly developed, rendering parts of this vignette deprecated. However, much of the background information remains relevant. See [temporary vignette](https://rawgit.com/annakrystalli/rmacroRDM/master/temp_vignette.nb.html) for a partial demo of current functionality. Updates to the vignette to follow [\#](https://github.com/annakrystalli/rmacroRDM/issues/15)
The rmacroRDM package contains functions to help with the compilation of macroecological datasets. It compiles datasets into a **master long database of individual observations**, matched to a specified **master species list**. It also *checks, separates and stores taxonomic and metadata information* on the *observations*, *variables* and *datasets* contained in the data. It therefore aims to ensure full traceability of datapoints and robust quality control, all the way through to the extracted analytical datasets.
The idea is to enforce a basic level of data management and quality control, and to bundle the data with important metadata, both for each stage of data processing and compilation and for the **`[[master]]` outputs**. Managing data in such a way makes validating, understanding, analysing, visualising and communicating much easier.
Standardisation allows data to be shared and built upon more easily and more robustly. It also allows more interactivity, as apps can be built around **rmacroRDM data outputs** to facilitate data exploration, validation, access and reporting.
### **rmacroRDM `[master]` dataset**
The overall purpose of the functions in the package is to compile macroecological trait datasets into **a master database of observations (T1)**. This allows information to be stored with individual datapoints, allowing for better quality control and traceability.
Metadata information on individual data points stored in the **long `master` dataset** is defined by assigning ***observation meta-variables `{meta.vars}`***. Information on the taxonomic matching of datapoints through synonyms is also stored and is defined through **match variables `{match.vars}`**. In the example below, "*species*", "*var*" and "*value*" identify and contain each observation, "*qc*", "*observer*", "*ref*", "*n*" and "*notes*" are the default **`meta.vars`**, and "*synonyms*" and "*data.status*" are the default **`match.vars`**.
```{r, warning=F, message=FALSE, echo=FALSE}
### SETTINGS ##############################################################
options(stringsAsFactors = F)
output.folder <- "~/Documents/workflows/rmacroRDM/data/output/"
input.folder <- "~/Documents/workflows/rmacroRDM/data/input/"
script.folder <- "~/Documents/workflows/rmacroRDM/R/"
# Functions & Packages
require(knitr)
require(dplyr)
# source rmacroRDM functions
source(paste(script.folder, "functions.R", sep = ""))
source(paste(script.folder, "wideData_function.R", sep = ""))
# FILES
D1 <- read.csv(paste(input.folder, "csv/D1.csv", sep = ""), fileEncoding = "mac")
metadata <- read.csv(paste(input.folder, "metadata/metadata.csv", sep = ""), fileEncoding = "mac") %>% apply(2, FUN = trimws) %>% data.frame(stringsAsFactors = F)
synonyms <- read.csv(paste(input.folder,"taxo/synonyms.csv", sep = ""), stringsAsFactors=FALSE)
syn.links <- synonyms[!duplicated(t(apply(synonyms[,1:2], 1, sort))),1:2]
master <- read.csv(paste(output.folder, "master.csv", sep = ""), fileEncoding = "mac")[1:8,]
master$ref <- paste(substr(master$ref, 1, 10), "...")
meta.vars = c("qc", "observer", "ref", "n", "notes")
taxo.var <- c("species", "order","family")
var.vars <- c("var", "value", "data")
var.omit <- c("no_sex_maturity_d", "adult_svl_cm", "male_maturity_d")
match.vars <- c("synonyms", "data.status")
master.vars <- c("species", match.vars, var.vars, meta.vars)
kable(master,caption = "T1: example master data sheet")
```
This framework can also handle multiple intraspecific datapoints for individual `{vars}`, allowing users to build up information on intraspecific trait variation across observations. Observation metadata enable quality control, so that data supplied to analytical datasets can be filtered.
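As a rough illustration of the long format (a sketch with made-up values, not package output), repeated observations of the same `var` for one species simply occupy separate rows, and observation metadata such as `qc` can be used to filter them. The species names, values and the `qc` scoring scheme below are all invented for the example.

```r
# Toy long-format table: one row per observation (all values are invented)
long <- data.frame(
  species = c("Aus bus", "Aus bus", "Cus dus"),
  var     = c("body.mass", "body.mass", "body.mass"),
  value   = c(12.1, 13.4, 250),
  qc      = c(1, 3, 1),            # hypothetical quality score, lower = better
  ref     = c("Ref A", "Ref B", "Ref A"),
  stringsAsFactors = FALSE
)

# Intraspecific variation: number of observations and mean value per species/var
n.obs    <- aggregate(value ~ species + var, data = long, FUN = length)
mean.val <- aggregate(value ~ species + var, data = long, FUN = mean)

# Quality control: keep only observations with an acceptable qc score
long.qc <- long[long$qc <= 2, ]
```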
***
<br>
## match objects {`m`}
Functions in the rmacroRDM package have been designed to receive and update a match object (**`m`**). This keeps all the information relating to the matching of a particular dataset together, so that it is updated in one place and remains available at each stage.
Match objects `[[m]]` are defined by the function **`matchObj()`**:
```{r, eval=FALSE}
m <- matchObj(data.ID, spp.list, data, status = "unmatched",
sub = "data", meta, filename)
```
and have the following elements:
```{r, warning=F, message=FALSE, echo=FALSE}
load(paste(output.folder, "D1m.RData", sep = ""))
m$unmatched <- NULL
names(m)
```
***
### **`[[m]]` structure**
#### **`"data.ID"`**
a character vector of the dataset code *(eg. `"D1"`)*
#### **`[spp.list]`**
dataframe containing the master species list to which all datasets are to be matched. It also tracks any additions (if allowed) during the matching process.
#### **`[data]`**
dataframe containing the dataset to be added
#### **`"sub"`**
character string, either `"spp.list"` or `"data"`. Specifies which `[[m]]` element contains the smaller set of species. The match functions attempt to match unmatched species in this subset, through synonyms, to species in the `[[m]]` element holding the larger species set.
#### `"set"`
character string, either `"spp.list"` or `"data"`. Specifies which `[[m]]` element contains the larger set of species. Automatically determined from `m$sub`.
#### `"status"`
character string. Records the status of `[[m]]` within the match process. Takes one of `{"unmatched", "full_match", "incomplete_match: (n unmatched)"}`
#### **`[[meta]]`**
list, of the same length as `{meta.vars}`, containing observation metadata for each `"meta.var"`. `meta$meta.var` can be a `[dataframe]` or `"character_string"`. The default `meta.vars` represent observation metadata commonly associated with macroecological datasets. `NULL` elements are added as NA.
- `"ref"`: the reference from which observation has been sourced. This is the only `meta.var` that *MUST* be correctly supplied for matching to proceed.
- `"qc"`: any quality control information regarding individual datapoints. Ideally, a consistent scoring system used across compiled datasets.
- `"observer"`: The name of the observer of data points. Used to allow assessment of observer bias, particularly in the case of data sourced manually from literature.
- `"n"`: if value is based on a summary of multiple observations, the number of original observations value is based on.
- `"notes"`: any notes associated with individual observations.
#### **`"filename"`**
character string, name of the dataset file. Using the filename consistently throughout the file system enables automated sourcing of data.
#### **`"unmatched"`**
stores details of unmatched species if species matching is incomplete.
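To make the structure above concrete, here is a hand-built stand-in for a match object as a plain R list. It is only a sketch of the elements described in this section, not the output of `matchObj()`, and all data in it are invented; the real object may order or populate its elements differently.

```r
# Hypothetical stand-in for a match object; NOT the output of matchObj()
spp.list <- data.frame(species = c("Aus bus", "Cus dus", "Eus fus"),
                       stringsAsFactors = FALSE)
data     <- data.frame(species = c("Aus bus", "Cus dus"),
                       body.mass = c(12.1, 250),
                       stringsAsFactors = FALSE)

sub <- "data"
m <- list(
  data.ID  = "D1",
  spp.list = spp.list,
  data     = data,
  sub      = sub,
  # `set` is whichever of the two elements `sub` is not
  set      = setdiff(c("spp.list", "data"), sub),
  status   = "unmatched",
  # one element per meta.var; only `ref` is filled in here
  meta     = list(qc = NULL, observer = NULL, ref = "Ref A", n = NULL, notes = NULL),
  filename = "D1"
)
```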
***
<br>
## additional `[metadata]`
### - variable `[metadata]`
Metadata on `{vars}` are stored on a separate sheet. Completeness of the metadata sheet is not only checked for but is also required by many of the functions. It is also extremely useful downstream, at the data analysis and presentation stages.
**`[metadata]`** contains information on coded variables. Typical information includes:
- **`desc`**: a longer description of vars,
- **`cat`**: var category
- **`units`**
- **`type`**:
    - `"bin"`: binary
    - `"cat"`: categorical
    - `"con"`: continuous
    - `"int"`: integer
- **`scores`** & **`levels`** if the variable is categorical `"cat"` or binary `"bin"`.
- **`notes`** any textual information supplied with the data.
- **`log`** `T` or `F`. Often useful to be able to assign whether a variable should be logged for exploration and analysis.
```{r, warning=F, message=FALSE, echo=FALSE}
md <- metadata[c(2, 4:10),-c(2, 5, 4, 11, 13)]
kable(md,caption = "T2: example variable metadata sheet")
```
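A minimal sketch of the kind of completeness check this implies, assuming the metadata sheet identifies variables in a column called `code` (an assumption made for this example). The variable names, descriptions and units are invented, and this is not the package's own checking code.

```r
# Toy variable metadata sheet (codes, descriptions and units are invented)
metadata <- data.frame(
  code  = c("body.mass", "clutch.size", "migratory"),
  desc  = c("adult body mass", "mean clutch size", "is the species migratory"),
  units = c("g", "eggs", NA),
  type  = c("con", "int", "bin"),
  log   = c(TRUE, FALSE, FALSE),
  stringsAsFactors = FALSE
)

# Variables present in a dataset to be compiled
data.vars <- c("body.mass", "clutch.size", "wing.length")

# Variables lacking a metadata row would need documenting before compilation
missing.meta <- setdiff(data.vars, metadata$code)

# Variable types outside the four codes listed above
bad.type <- metadata$code[!metadata$type %in% c("bin", "cat", "con", "int")]
```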
### - [syn.links]
Another important aspect of matching macroecological variables is using known synonym links across taxonomies to match species names across datasets. Synonyms are supplied as a two-column dataframe containing unique pairs of synonyms.
In my example, I provide `[syn.links]` which **contains unique synonym links I have compiled** throughout the projects I've worked on and **only pertains to birds**.
This is by no means complete and often some manual matching (eg through [avibase]()) is required. The hope is to integrate rmacroRDM with a package like [`taxize`](), linking the process to official repositories, automating as much as possible and enabling better tracking of the network of synonyms through taxonomies used to match species across datasets. [**ISSUE**]()
```{r, warning=F, message=FALSE, echo=FALSE}
print(head(syn.links, 8))
```
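The setup code of this vignette derives `syn.links` from a longer synonym table by dropping links that repeat in reverse order. The same step is sketched below on a toy table; the species names and the column names `species` and `synonyms` are invented for the illustration.

```r
# Toy synonym table containing the same link in both directions (names invented)
synonyms <- data.frame(
  species  = c("Aus bus", "Aus bes", "Cus dus"),
  synonyms = c("Aus bes", "Aus bus", "Cus des"),
  stringsAsFactors = FALSE
)

# Sort each pair alphabetically so that A-B and B-A become identical rows
sorted.pairs <- t(apply(synonyms[, 1:2], 1, sort))

# Keep only the first occurrence of each unordered pair
syn.links <- synonyms[!duplicated(sorted.pairs), 1:2]
```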
***
<br>
# {file.system} management
Many of the functions in the rmacroRDM package are set up to allow for automatic loading and processing of data from appropriately named folders. This allows quick and consistent processing of data. However, it does depend on data and metadata being correctly labelled and saved in the appropriate folders. This tutorial will guide you through correct setup and walk through an example of adding a dataset to a master macro dataset.
<br>
## **setup**
The first thing to do is to specify the input, output and script folders, set up the data input folder and populate it with the appropriately named data in the appropriate folders.
### settings
```{r, warning=F, message=FALSE}
### SETTINGS ##############################################################
options(stringsAsFactors = F)
output.folder <- "~/Documents/workflows/rmacroRDM/data/output/"
input.folder <- "~/Documents/workflows/rmacroRDM/data/input/"
script.folder <- "~/Documents/workflows/rmacroRDM/R/"
```
### **setup `input.folder`**
Once initial settings have been made, you will need to set up the input folder. The easiest way to do this is to use **`setupInputFolder()`**. This defaults to meta.vars: `"qc"`, `"observer"`, `"ref"`, `"n"`, `"notes"`. The function is, however, flexible, so the meta.vars can be customised to meet users' observation metadata needs. The basic folders are created as follows:
```{r, warning=F, message=FALSE, eval=FALSE}
# custom meta.variables can be assigned by supplying a vector of character strings to the meta.vars argument in function `setupInputFolder()`
meta.vars = c("qc", "observer", "ref", "n", "notes")
setupInputFolder(input.folder, meta.vars)
```
## **Populate input folders**
<br>
### **data**
- **`raw/`** : a folder to collect all raw data. These files are to be treated as *read only*.
- **`csv/`** : raw data files should be saved as .csv files in this folder. This is the folder from which most functions will source data to be compiled.
#### **`csv/`**
Because it is the most common form encountered, data in the `csv` folder is usually given in a wide format (ie species rows and variable columns). For traceability, it is good practice to name the `.csv` files after the **original raw data file name** from which they were created.
- eg. in our example the datasheet being added is saved in folder `csv/` as **`D1.csv`**. Any meta.var data associated with this dataset should also be saved in the appropriate meta.var folder as **`D1.csv`**.
##### Pre-processing
Some pre-processing might be required. In particular, the column containing species data should be labelled **`species`**. In a pre-processing stage, you might also need to match **master `code`** and **data** variable names representing the same variable. I recommend this be done in a scripted pre-processing step, using a variable lookup table to keep track of raw variable names across datasets, eg:
```{r, warning=F, message=FALSE, echo=FALSE}
kable(read.csv(paste(input.folder, "metadata/vnames.csv", sep = ""))[22:30, -2],
caption = "Correspondence of D0 & D1 dataset variable names to master variable codes")
```
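A scripted renaming step of this kind could look roughly as follows. The lookup table layout (a `code` column plus one column per dataset), the raw column names and the values are assumptions made for this sketch, not the actual contents of `vnames.csv`.

```r
# Hypothetical lookup table linking raw column names to master variable codes
vnames <- data.frame(
  code = c("body.mass", "clutch.size"),
  D1   = c("Mass_g", "ClutchSize"),
  stringsAsFactors = FALSE
)

# Raw dataset with its original column names (values invented)
D1.raw <- data.frame(Species = c("Aus bus", "Cus dus"),
                     Mass_g = c(12.1, 250),
                     ClutchSize = c(4, 2),
                     stringsAsFactors = FALSE)

# Label the species column as required, then swap raw names for master codes
names(D1.raw)[names(D1.raw) == "Species"] <- "species"
idx <- match(names(D1.raw), vnames$D1)
names(D1.raw)[!is.na(idx)] <- vnames$code[idx[!is.na(idx)]]
```

Columns without an entry in the lookup table keep their original names, which makes unmapped variables easy to spot.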
If meta.var data are included in the dataset, the relevant columns should be labelled with, or have appended, the appropriate meta.var suffix (*eg. `ref` if meta.var* **ref** *relates to all variables, or `body.mass_ref` if meta.var* **ref** *refers to a particular variable, in this case body mass*). Correct naming of meta.var columns will allow **`separateMeta()`** to identify and extract meta.var data from the dataset. Also ensure taxonomic variables (*eg. class, order etc*) are removed from the data to be added. Datasets should therefore contain a `species` column and any *`variable`* data columns to be added, and can additionally contain appropriately appended *`meta.var`* columns.
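The naming convention can be illustrated with a simple pattern match on column names. This is only a sketch of the convention described above, not the package's own implementation, and the column names are invented.

```r
meta.vars <- c("qc", "observer", "ref", "n", "notes")

# Column names of a pre-processed dataset (invented for illustration)
cols <- c("species", "body.mass", "clutch.size", "body.mass_ref", "qc")

# A column counts as meta.var data if it is named exactly as a meta.var
# or ends in "_<meta.var>"
pattern <- paste0("(^|_)(", paste(meta.vars, collapse = "|"), ")$")
is.meta <- grepl(pattern, cols)

meta.cols <- cols[is.meta]
data.cols <- cols[!is.meta]
```

One consequence of a suffix rule like this is that ordinary variable names ending in, say, `_n` would be mistaken for meta.var columns, so variable codes are best chosen to avoid those endings.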
<br>
## **{meta.vars}**
### Supplying meta.vars.
Meta.vars can either be supplied directly to the appropriate functions, by attaching them to the appropriate element of the **`meta` list object**, or saved as data files in appropriately named folders. Files should be named the same as the data sheet being compiled.
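Under this convention the location of every meta.var file follows from the dataset filename alone. A minimal sketch, using a placeholder input path and assuming one folder per meta.var inside the input folder:

```r
input.folder <- "data/input/"          # placeholder path
meta.vars    <- c("qc", "observer", "ref", "n", "notes")
filename     <- "D1"

# One candidate file per meta.var folder, named after the data sheet
meta.files <- paste0(input.folder, meta.vars, "/", filename, ".csv")
names(meta.files) <- meta.vars

# Only files that actually exist would be read in
available <- meta.files[file.exists(meta.files)]
```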
### meta.var data formats
There are a number of formats meta variable data can be supplied in.
<br>
##### **single value across all species and variables**
If a **single value relates to all data** in the data file (eg all data sourced from a single reference), then meta.var can be supplied as a **single value or character string** (eg a character string of the reference from which the data has been sourced).
<br>
##### **single value across all variables, but not species**
If a **single value** relates to **all variables** in the data but **varies across species**, metavariable data should be supplied as a two column dataframe with columns named `species` and `all`, eg:
```{r, warning=F, message=FALSE, echo=FALSE}
kable(read.csv(paste(input.folder, "ref/all demo.csv", sep = ""))[1:5,],
caption = "Example ref meta.var data where reference is same across variables but varies across species")
```
<br>
##### **Value varies across variables and species**
There are two ways in which meta.var data that vary across both variables and species can be formatted. The simplest is a **species** x **var** dataframe where meta.var columns relating to specific variables are named according to the variables in the data they relate to. If meta.var columns correspond to groups of variables (eg different sources for groups of variables), two dataframes need to be supplied:
- One containing the group meta.var data with column names indicating variable group names eg:
```{r, warning=F, message=FALSE, echo=FALSE}
kable(read.csv(paste(input.folder, "ref/BirdFuncDat.csv", sep = ""))[1:5,],
caption = "Example ref meta.var data where reference is same across groups of variables but varies across species")
```
- A separate two column dataframe, with columns named **`var`** and **`grp`** linking individual variables to group meta.var names in the first dataframe, eg:
```{r, warning=F, message=FALSE, echo=FALSE}
dd <- read.csv(paste(input.folder, "ref/BirdFuncDat_ref_group.csv", sep = ""))
kable(dd[c(2:3, 12:13, 22),], caption = "Example ref meta.var data group to variable cross-reference table")
```
Note that variable names for most meta.vars can be assigned `NA` under `grp` in the `_group` data.frame in which case NA will be assigned for that meta.var for each variable observation. However, references MUST be provided for all variables and matching will not proceed until this condition has been met.
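To show how the two tables relate, the sketch below expands group-level `ref` data into a **species** x **var** table using the `var`/`grp` cross-reference. The group names, variable names and references are invented, handling of `NA` groups is left out, and this is not the package's own code.

```r
# Group-level ref data: one column per variable group (contents invented)
ref.grp <- data.frame(species = c("Aus bus", "Cus dus"),
                      diet    = c("Ref A", "Ref B"),
                      size    = c("Ref C", "Ref C"),
                      stringsAsFactors = FALSE)

# Cross-reference table linking each variable to its group
ref.group <- data.frame(var = c("diet.fruit", "diet.seed", "body.mass"),
                        grp = c("diet", "diet", "size"),
                        stringsAsFactors = FALSE)

# Build a species x var table by copying each group column to its variables
ref.var <- data.frame(species = ref.grp$species, stringsAsFactors = FALSE)
for (i in seq_len(nrow(ref.group))) {
  ref.var[[ref.group$var[i]]] <- ref.grp[[ref.group$grp[i]]]
}
```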
## **`syn.links`**
`syn.links` needs to be a two-column data.frame of unique synonym links.
<br>
##### **metadata/**
- `[metadata]`: should contain a **"metadata.csv"** file with information on all variables in the master datasheet.
- `[vnames]`: table of variable name correspondence across datasets.
<br>
##### **taxo/**
- `[taxo.table]`: table containing taxonomic information
<br>
## **example workflow**
In this example we will demonstrate the use of ***rmacroRDM functions*** to merge datasets **`D0`** and **`D1`** into a **`master`** datasheet.
We will use `D0` to set up the `master` and then merge dataset `D1` to it.
First set the location of the **input/**, **output/** and **script/** folders. Make sure the [**`functions.R`**](https://github.com/annakrystalli/rmacroRDM/blob/master/R/functions.R) and [**`wideData_function.R`**](https://github.com/annakrystalli/rmacroRDM/blob/master/R/wideData_function.R) scripts are saved in the **script/** folder and `source` them.
```{r, warning=F, message=FALSE}
### SETTINGS ##############################################################
options(stringsAsFactors = F)
output.folder <- "~/Documents/workflows/rmacroRDM/data/output/"
input.folder <- "~/Documents/workflows/rmacroRDM/data/input/"
script.folder <- "~/Documents/workflows/rmacroRDM/R/"
# Functions & Packages
require(dplyr)
# source rmacroRDM functions
source(paste(script.folder, "functions.R", sep = ""))
source(paste(script.folder, "wideData_function.R", sep = ""))
```
<br>
Also we set a number of parameters which will configure the master and spp.list setup.
```{r, warning=F, message=FALSE}
# master settings
var.vars <- c("var", "value", "data.ID")
match.vars <- c("synonyms", "data.status")
meta.vars = c("qc", "observer", "ref", "n", "notes")
master.vars <- c("species", match.vars, var.vars, meta.vars)
# spp.list settings
taxo.vars <- c("genus", "family", "order")
```
```{r, warning=F, message=FALSE, eval=FALSE}
# custom meta.variables can be assigned by supplying a vector of character strings to the meta.vars
# argument in function setupInputFolder()
setupInputFolder(input.folder, meta.vars)
```
Once folders are correctly populated, load D0.
```{r, warning=F, message=FALSE}
D0 <- read.csv(file = paste(input.folder, "csv/D0.csv", sep = "") ,fileEncoding = "mac")
```
```{r, warning=F, message=FALSE, echo=FALSE}
D0dat <- data.frame(matrix(NA, ncol = length(c(master.vars, taxo.vars)),
                           nrow = nrow(D0)))
names(D0dat) <- c("species", taxo.vars, master.vars[-1])
keep <- names(D0)[names(D0) %in% names(D0dat)]
D0dat[match(keep, names(D0dat))] <- D0[,keep]
D0dat$synonyms <- D0dat$species
D0dat$data.status = "original"
D0dat$ref <- paste(substr(D0dat$ref, 1, 10), "...")
D0 <- D0dat
```
```{r, warning=F, message=FALSE, echo=FALSE}
kable(head(D0, 8), caption = "D0")
```
#### create **`[spp.list]`** object
The next step in setting up our master datasheet is to assign the species list to which all other data are to be matched. In our example, we are using the **species list** in **dataset `D0`**, to which we will then add dataset **`D1.csv`**. We can also store taxonomic information on the spp.list by supplying a `[taxo.dat]` dataframe containing taxonomic information on all species in `species`, along with `{taxo.vars}`.
Columns `master.spp` and `rel.spp` keep track of any species added during the matching process in order to retain data points, rather than discard duplicate datapoints in the dataset to be merged that might be matching to the same individual species on the master species list. In the case of an added species, the value in `master.spp` will be **FALSE** and `rel.spp` will contain the name of the single species in master species list that the matching function identified duplicate matches with. This allows all possible data to be retained but the information in the spp.list allows such datapoints to be removed from analyses if required.
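For example, with a species list in this form, added species can be identified and excluded downstream. The table below is invented to illustrate the two tracking columns and is not the output of `createSpp.list()`.

```r
# Hypothetical spp.list after matching; the last species was added during matching
spp.list <- data.frame(
  species    = c("Aus bus", "Cus dus", "Cus des"),
  master.spp = c(TRUE, TRUE, FALSE),
  rel.spp    = c(NA, NA, "Cus dus"),
  stringsAsFactors = FALSE
)

# Species on the original master list only
original.spp <- spp.list$species[spp.list$master.spp]

# Added species and the master species they duplicate
added <- spp.list[!spp.list$master.spp, c("species", "rel.spp")]
```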
If there is taxonomic data, this can be included in the **`spp.list`** data.frame. For example, `D0` contains further taxonomic data on **genus**, **family**, **order**. We add this to the **`spp.list`** dataframe:
```{r, warning=F, message=FALSE}
# Create taxo.table
taxo.dat <- unique(D0[,c("species", taxo.vars)])
spp.list <- createSpp.list(species = taxo.dat$species,
taxo.dat = taxo.dat,
taxo.vars)
head(spp.list)
```
#### load **`[metadata]`**
```{r, warning=F, message=FALSE, echo=FALSE}
kable(md,caption = "T2: example variable metadata sheet")
```
#### load **`syn.links`**
```{r, warning=F, message=FALSE, echo=FALSE}
# Load match data.....................................................................
head(syn.links)
```
<br>
#### create master object
Finally create the **`[[master]]`** object.
```{r, warning=F, message=FALSE}
# create master shell
master <- list(data = newMasterData(master.vars), spp.list = spp.list, metadata = metadata)
```
`D0` is almost in the master data format; we just need to remove all taxonomic information. In this case, I subset `D0` to variables not in `{taxo.vars}`. Now that `D0` is in the master data format, I can update the empty `[[master]]` object with the data. Also, because the species list was generated from `D0`, we do not need to update the `spp.list`, although the function checks for species matching anyway.
```{r, warning=F, message=FALSE}
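# drop taxonomic columns so that D0 matches the master data format
D0 <- D0[,!names(D0) %in% taxo.vars]
# add D0 to the empty master; the species list was built from D0, so none is supplied
master <- updateMaster(master, data = D0, spp.list = NULL)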
str(master)
```
#### create `[[m]]` object
Next, assign the dataset filename, which is used to automate data loading, and create the match object.
```{r, warning=F, message=FALSE, echo=F}
filename <- "D1"
m <- matchObj(data.ID = "D1", spp.list = spp.list, status = "unmatched",
data = read.csv(paste(input.folder, "csv/", filename, ".csv", sep = ""),
stringsAsFactors=FALSE, fileEncoding = "mac"),
sub = "spp.list", filename = filename,
meta = createMeta(meta.vars)) # use addMeta function to manually add metadata.
```
```{r, eval=FALSE}
filename <- "D1"
m <- matchObj(data.ID = "D1", spp.list = spp.list, status = "unmatched",
data = read.csv(paste(input.folder, "csv/", filename, ".csv", sep = ""),
stringsAsFactors=FALSE),
sub = "spp.list", filename = filename,
meta = createMeta(meta.vars)) # use addMeta function to manually add metadata.
```
Here, we use the filename to load the data into `matchObj()`. We define `"spp.list"` as the sub dataset. The match functions will therefore identify and attempt to match unmatched species names in the `{spp.list$species}`. We also create a `[[meta]]` object using function `createMeta(meta.vars)` and supplying `{meta.vars}`.
The resulting **`[[m]]`** object has the following structure:
```{r}
str(m)
```
<br>
#### process `[[m]]` object
Once the `[[m]]` object is created, I pipe it through a number of the **`rmacroRDM`** processing functions:
```{r, warning=F, message=FALSE, eval=FALSE}
m <- processDat(m, input.folder, var.omit) %>%
separateDatMeta() %>%
compileMeta(input.folder = input.folder) %>%
checkVarMeta(master$metadata) %>%
dataMatchPrep()
```
<br>
#### let's take a closer look
**`processDat()`** cleans the data and removes unwanted variables.
```{r}
m <- processDat(m, input.folder, var.omit = NULL)
str(m$data)
```
<br>
**`separateDatMeta()`** separates from the data any columns correctly labelled as `{meta.vars}`. In this case, the data column `qc` is separated, processed into a valid `meta.var` element and appended to `[[meta]]$qc`.
```{r}
m <- separateDatMeta(m)
str(m$data)
str(m$meta)
```
<br>
**`compileMeta()`** automates the process of sourcing, checking and setting up metadata to be compiled into the long data format. It will check through the input.folder `{file.system}` for correctly labelled and filed `{meta.vars}` data and compile it into missing `m$[[meta]]` elements.
In this case it appends data automatically loaded from the `meta.var` folders. Only data in **ref/** and **n/** have been supplied. Data in **ref/** contains reference data for all species and variables in a single `.csv` file: **`D1.csv`**.
```{r, warning=F, message=FALSE, echo=F}
ref.meta <- read.csv(paste(input.folder, "ref/", filename, ".csv", sep = ""))
for(i in 2:length(ref.meta)){
ref.meta[!is.na(ref.meta[,i]),i] <- paste(substr(ref.meta[!is.na(ref.meta[,i]),i], 1, 15), "...")}
str(ref.meta)
```
Data in **n/** are given again in **`D1.csv`**:
```{r, warning=F, message=FALSE, echo=F}
n.meta <- read.csv(paste(input.folder, "n/", filename, ".csv", sep = ""))
str(n.meta)
```
However, `n` data are missing for some `vars`, so a group cross-reference table **`D1_n_group.csv`** is also supplied. This table is used to check variable matches and confirm missing meta.var data as NA. Note: NAs are not allowed for `meta.var == "ref"`.
```{r, warning=F, message=FALSE, echo=F}
n.group <- read.csv(paste(input.folder, "n/", "D1_n_group", ".csv", sep = ""))
str(n.group)
```
```{r}
m <- compileMeta(m, input.folder = input.folder)
```
```{r, warning=F, message=FALSE, echo=F}
for(i in 2:length(m$meta$ref)){
m$meta$ref[!is.na(m$meta$ref[,i]),i] <- paste(substr(m$meta$ref[!is.na(m$meta$ref[,i]),i], 1, 15), "...")}
```
```{r}
str(m$meta)
```
As can be seen, data for `{meta.vars}` `"ref"` and `"n"` have been processed and appended to the appropriate `m$[[meta]]` element.
<br>
**`checkVarMeta()`** checks that all `vars` in `m$[data]` have valid metadata information in `[metadata]`.
```{r}
m <- checkVarMeta(m, master$metadata)
```
All good.
<br>
**`dataMatchPrep()`** prepares `m$[data]` to track synonym matching.
```{r}
m <- dataMatchPrep(m)
str(m$data)
```
<br>
***
#### match `[[m]]` object
```{r, warning=F, message=FALSE}
m <- dataSppMatch(m, ignore.unmatched = T,
syn.links = syn.links, addSpp = T)
str(m)
```
When `ignore.unmatched = T`, **sub** species that have not automatically been matched to **set** species are ignored and omitted from the dataset. When `ignore.unmatched = F`, the function halts and appends the `{unmatched}` species list to `[[m]]`.
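The general idea of matching through synonyms, and of the two `ignore.unmatched` behaviours, can be sketched in a few lines of base R. This is a simplified illustration with invented names and a hypothetical helper `synsOf()`; it is not the algorithm used by `dataSppMatch()`, and where that function appends unmatched species to `[[m]]`, the sketch simply raises an error.

```r
# Toy inputs (all names invented)
set.spp   <- c("Aus bus", "Cus dus")              # larger species set
sub.spp   <- c("Aus bus", "Cus des", "Eus fus")   # species to be matched
syn.links <- data.frame(species  = c("Cus dus"),
                        synonyms = c("Cus des"),
                        stringsAsFactors = FALSE)

# Hypothetical helper: look up a name's synonyms in either column of syn.links
synsOf <- function(x) {
  unique(c(syn.links$synonyms[syn.links$species == x],
           syn.links$species[syn.links$synonyms == x]))
}

# Direct matches first, then the first synonym found in the larger set
matched <- vapply(sub.spp, function(x) {
  if (x %in% set.spp) return(x)
  hit <- intersect(synsOf(x), set.spp)
  if (length(hit) > 0) hit[1] else NA_character_
}, character(1))

unmatched <- names(matched)[is.na(matched)]

ignore.unmatched <- TRUE
if (length(unmatched) > 0 && !ignore.unmatched) {
  stop("unmatched species: ", paste(unmatched, collapse = ", "))
}
matched <- matched[!is.na(matched)]   # unmatched species are dropped
```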
#### compile data to master format
```{r, warning=F, message=FALSE}
output <- masterDataFormat(m, meta.vars, match.vars, var.vars)
kable(head(output$data))
```
```{r, warning=F, message=FALSE}
master <- updateMaster(master, output)
str(master)
```