---
title: "Null Forecasts for the EFI NEON Community Ecology Challenge"
output:
github_document:
df_print: tibble
---
This document illustrates the sequential steps for posing, producing, and scoring a forecast for the community ecology challenge:

1. Download NEON beetle data.
2. Clean and process the data into (a) observed richness and (b) a proxy for relative abundance (counts/trapnight), the data products whose future values teams will seek to predict.
3. Generate a dummy (null) probabilistic forecast at each site, using historical means and standard deviations.
4. Score the dummy forecast.
This document also shows one approach to capturing, sharing, and 'registering' the products associated with each step (raw data, processed data, forecast, scores) using content-based identifiers. These identifiers act like DOIs but can be computed directly from the data, and can thus be generated locally at no cost, with no authentication and no lock-in to a specific storage provider.
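A content-based identifier is simply the sha256 hash of a file's bytes, expressed as a URI. As a minimal illustration (the path shown is hypothetical; the file is not created until later in this workflow):

```r
## Compute the content-based identifier of any local file; returns a string
## of the form "hash://sha256/<64 hex characters>".
# contentid::content_id("products/richness.csv")
```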
```{r setup, message=FALSE, warning=FALSE}
library(tidyverse)
# Helper libraries for one way of managing downloads and registering products; not essential.
library(neonstore) # remotes::install_github("cboettig/neonstore")
library(contentid) # remotes::install_github("cboettig/contentid")
## Two helper functions are provided in external scripts
source("R/resolve_taxonomy.R")
source("R/publish.R")
## neonstore can cache raw data files for future reference
Sys.setenv("NEONSTORE_HOME" = "cache/")
## Set the year of the prediction. This will be excluded from the training data and used in the forecast
forecast_year <- 2019
```
## Download data
This example uses `neonstore` to download and manage raw files locally.
```{r}
## full API queries take ~ 2m.
## full download takes ~ 5m (on Jetstream)
start_date <- NA # Update to most recent download, or NA to check all.
neonstore::neon_download(product="DP1.10022.001", start_date = start_date)
```
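On later runs, setting `start_date` to the date of the most recent download keeps the API check incremental; the date below is purely illustrative:

```r
## Illustrative only: restrict the check to data released after this date.
# neonstore::neon_download(product = "DP1.10022.001", start_date = "2021-01-01")
```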
## Load data
```{r message=FALSE}
library(neonstore)
sorting <- neon_read("bet_sorting", altrep = FALSE) %>% distinct()
para <- neon_read("bet_parataxonomistID", altrep = FALSE) %>% distinct()
expert <- neon_read("bet_expertTaxonomistIDProcessed", altrep = FALSE) %>% distinct()
field <- neon_read("bet_fielddata", altrep = FALSE) %>% distinct()
# vroom's altrep is faster, but we have too many files here!
# NEON sometimes provides duplicate files with different filename metadata (timestamps), so we currently use `distinct()` to deal with that.
```
Publish the index of all raw data files we have used, including their `md5sum`, for future reference. While these files remain available from NEON and in the local cache, this index should help us detect any changes to the raw data, should the need arise.
```{r}
raw_file_index <- neon_index(product="DP1.10022.001", hash = "md5")
readr::write_csv(raw_file_index, "products/raw_file_index.csv")
publish("products/raw_file_index.csv")
```
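The `publish()` helper comes from `R/publish.R`. A minimal sketch of what such a helper might boil down to, assuming it simply computes and records each file's content identifier (the real helper may also register the files with remote registries such as hash-archive.org):

```r
## Minimal sketch only, not the actual implementation in R/publish.R.
publish_sketch <- function(files) {
  vapply(files, contentid::content_id, character(1))
}
```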
## Process data
First, we resolve the taxonomy using the expert and parataxonomist classifications, where available. Because these products lag behind the initial identifications in the `sorting` table, and because the sorting technicians do not pin all samples (either because they are already confident in the classification, or because the sample is damaged, etc.), not all beetles will have an expert identification. Those with taxonomist identification will have an `individualID` assigned; those with expert identification will also name the expert. Samples identified as non-carabids (by either the technicians or the taxonomists) are then excluded from the dataset.
For convenience, we also add month and year columns derived from the `collectDate`, allowing for easy grouping.
```{r message = FALSE}
beetles <- resolve_taxonomy(sorting, para, expert) %>%
mutate(month = lubridate::month(collectDate, label=TRUE),
year = lubridate::year(collectDate))
```
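For orientation, here is a conceptual sketch of the logic just described; the join keys and column names are assumptions, not the actual implementation in `R/resolve_taxonomy.R`:

```r
## Conceptual sketch only. Expert IDs take precedence over parataxonomist IDs,
## which take precedence over the sorting-table IDs; non-carabids are dropped.
resolve_taxonomy_sketch <- function(sorting, para, expert) {
  sorting %>%
    left_join(select(para,   individualID, para_taxonID   = taxonID),
              by = "individualID") %>%
    left_join(select(expert, individualID, expert_taxonID = taxonID),
              by = "individualID") %>%
    mutate(taxonID = coalesce(expert_taxonID, para_taxonID, taxonID)) %>%
    filter(grepl("carabid", sampleType)) # assumption: bycatch flagged in sampleType
}
```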
## Generate derived richness product
This example focuses on `taxonID` as the unit of taxonomy, which corresponds to the best-resolved scientific name. Use `morphospecies` instead to focus on species-level (binomial) names only, taking the morphospecies where available and where the official classification was resolved only to a higher rank. The latter results in a higher observed richness.
```{r message = FALSE}
richness <- beetles %>%
select(taxonID, siteID, collectDate, month, year) %>%
distinct() %>%
count(siteID, collectDate, month, year)
richness
```
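For comparison, the `morphospecies` variant mentioned above would look like this (assuming `resolve_taxonomy()` provides a `morphospecies` column):

```r
## Species-level richness using morphospecies; yields a higher observed richness.
richness_morpho <- beetles %>%
  select(morphospecies, siteID, collectDate, month, year) %>%
  distinct() %>%
  count(siteID, collectDate, month, year)
```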
## Generate derived abundance product
We target a catch-per-unit-effort (CPUE) metric for abundance, which avoids requiring contestants to predict how many trap nights there will be (quite a warranted concern for 2020! Overall variability in trap nights exceeds 30%, versus 22% for 2018 and 2019). This does assume that trap nights are exchangeable, but teams accounting for things like the weather on each night could still forecast raw counts and then convert their forecast to this simpler metric.
```{r message = FALSE}
effort <- field %>%
group_by(siteID, collectDate) %>%
summarize(trapnights = as.integer(sum(collectDate - setDate)))
#summarize(trapnights = sum(trappingDays)) ## has a bunch of NAs this way
counts <- sorting %>%
mutate(month = lubridate::month(collectDate, label=TRUE),
year = lubridate::year(collectDate)) %>%
group_by(collectDate, siteID, year, month) %>%
summarize(count = sum(individualCount, na.rm = TRUE))
abund <- counts %>%
  left_join(effort, by = c("siteID", "collectDate")) %>%
mutate(abund = count / trapnights) %>% ungroup()
abund
```
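As a rough check on the trap-night variability quoted above (the grouping behind the quoted 30% / 22% figures is an assumption on our part):

```r
## Coefficient of variation in trap nights per year (rough check only).
effort %>%
  mutate(year = lubridate::year(collectDate)) %>%
  group_by(year) %>%
  summarize(cv = sd(trapnights, na.rm = TRUE) / mean(trapnights, na.rm = TRUE))
```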
## Publish the derived data products
Our first products are the derived data for `richness` and `abund`. We write the files to disk and publish them under content-based identifiers. Using <https://hash-archive.org> or `contentid::resolve()`, we can later resolve these identifiers back to the data.
```{r}
readr::write_csv(richness, "products/richness.csv")
readr::write_csv(abund, "products/abund.csv")
publish(c("products/richness.csv", "products/abund.csv"))
```
## Compute (null model) Forecasts
### Baseline forecast
For groups with only one data point we cannot compute `sd`; as a guess, we use the average `sd` of all the other data instead.
Note that some months may wind up catching beetles in the future even though we have no catch in the data to date. These will end up as `NA` scores unless we include a mechanism to convert them to estimates (e.g. we should probably forecast a value of 0 for all months in which we have never had a catch).
To mimic scoring our forecast, we will remove data from `r forecast_year` or later. The actual null forecast should of course omit that filter.
```{r}
null_richness <- richness %>%
  filter(year < forecast_year) %>%
  group_by(month, siteID) %>%
  summarize(mean = mean(n, na.rm = TRUE),
            sd = sd(n, na.rm = TRUE)) %>%
  ungroup() %>% # drop the residual month grouping so the fallback sd is the overall average
  mutate(sd = replace_na(sd, mean(sd, na.rm = TRUE))) %>%
  mutate(year = forecast_year)
null_richness
```
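A hedged sketch of the zero-fill mechanism mentioned above, using `tidyr::complete()` (not applied here; the fallback `sd` of 1 is an arbitrary assumption):

```r
## Give site-months never observed in training an explicit zero forecast.
# null_richness <- null_richness %>%
#   complete(month, siteID,
#            fill = list(mean = 0, sd = 1, year = forecast_year))
```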
```{r}
null_abund <- abund %>%
  filter(year < forecast_year) %>%
  group_by(month, siteID) %>%
  summarize(mean = mean(abund, na.rm = TRUE),
            sd = sd(abund, na.rm = TRUE)) %>%
  ungroup() %>% # as above, use the overall average sd as the fallback
  mutate(sd = replace_na(sd, mean(sd, na.rm = TRUE))) %>%
  mutate(year = forecast_year)
```
## Publish the forecast products
```{r}
readr::write_csv(null_richness, "products/richness_forecast.csv")
readr::write_csv(null_abund, "products/abund_forecast.csv")
publish(c("products/richness_forecast.csv", "products/abund_forecast.csv"))
```
## Score the forecast
```{r}
## `predicted_df` must have columns `mean` and `sd`, plus any grouping variables (e.g. siteID, month).
## `true_df` must have a column `true` and the same grouping variables, with matching names and types.
score <- function(predicted_df,
                  true_df,
                  scoring_fn = function(x, mu, sigma) { -(mu - x)^2 / sigma^2 - log(sigma) }
                  ){
  true_df %>%
    left_join(predicted_df) %>%
    mutate(score = scoring_fn(true, mean, sd))
}
```
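The default `scoring_fn` is closely related to the Gaussian log density (it drops the constant term and doubles the quadratic penalty), so higher (less negative) scores are better. A quick sanity check:

```r
## A perfectly calibrated mean with sd = 1 scores 0; wider or more biased
## forecasts score lower.
f <- function(x, mu, sigma) -(mu - x)^2 / sigma^2 - log(sigma)
f(x = 10, mu = 10, sigma = 1) # 0: exact mean, unit sd
f(x = 10, mu = 10, sigma = 2) # -log(2) ~ -0.69: exact mean, wider sd
f(x = 12, mu = 10, sigma = 1) # -4: observation 2 sd units from the mean
```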
Extract the true richness values for `r forecast_year` and compute the score:
```{r message=FALSE}
true_richness <- richness %>%
filter(year >= forecast_year) %>%
select(month, siteID, true = n)
richness_score <- score(null_richness, true_richness)
```
Extract the observed abundance measure (counts/trapnight) for `r forecast_year` and compute the score:
```{r message=FALSE}
true_abund <- abund %>%
filter(year >= forecast_year) %>%
select(month, siteID, true = abund)
abund_score <- score(null_abund, true_abund)
```
Note that simply dropping `NA`s when summing scores is unfair, since a score of 0 reflects a perfect forecast: missing entries would count the same as perfect ones. To avoid this, one option is to compute the mean score across sites instead:
```{r}
richness_score %>% summarize(mean_score = mean(score, na.rm = TRUE))
abund_score %>% summarize(mean_score = mean(score, na.rm = TRUE))
```
## Publish the scores
```{r}
readr::write_csv(richness_score, "products/richness_score.csv")
readr::write_csv(abund_score, "products/abund_score.csv")
publish(c("products/richness_score.csv", "products/abund_score.csv"))
```
---
## Retrieving products by identifier
Note that publishing a product generates a content-based identifier (simply the sha256 hash of the file). If the file ever changes, so will the identifier.
We can access any of the files published above using its identifier. Notably, this approach is agnostic to the _location_ where the file is stored, and works well when the same file is stored in many locations. In particular, we may want to copy these files over to a permanent archive later; by registering that archive as just another location for the same content, we can keep using the same identifier. In this way, the approach is compatible with DOI-based archival repositories, but lets us separate the step of generating and registering files from the step of permanently archiving them. We can generate millions of local copies (e.g. while debugging our forecast) without worrying about writing junk to permanent storage before we are ready.
```{r message=FALSE}
## Now anyone can later resolve these identifiers to download the content, e.g.
richness_forecast_csv <- contentid::resolve("hash://sha256/92af71bd4837a6720794582b1e7b8970d0f57bf491be4a51e67c835802005960")
richness_forecast <- read_csv(richness_forecast_csv)
```
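If a file is later deposited in a permanent archive, that copy can be registered as an additional location for the same identifier (the URL below is purely hypothetical):

```r
## Hypothetical archive URL; registering it lets the same hash://sha256/...
## identifier resolve to the archived copy as well.
# contentid::register("https://archive.example.org/richness_forecast.csv")
```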