-
Notifications
You must be signed in to change notification settings - Fork 4
/
ebird.qmd
786 lines (627 loc) · 50.5 KB
/
ebird.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
---
output: html_document
editor_options:
chunk_output_type: console
---
# eBird Data {#sec-ebird}
## Introduction {#sec-ebird-intro}
eBird data are collected and organized around the concept of a checklist, representing observations from a single birding event, such as a 1 km walk through a park or 15 minutes observing bird feeders in your backyard. Each checklist contains a list of species observed, counts of the number of individuals seen of each species, the location and time of the observations, information on the type of survey performed, and measures of the effort expended while collecting these data. The following image depicts a typical eBird checklist as viewed [on the eBird website](https://ebird.org/checklist/S46908897):
![](images/ebird-data_checklist.png)
There are three key characteristics that distinguish eBird from many other citizen science projects and facilitate robust ecological analyses. First, observers specify the survey protocol they used, whether it's traveling, stationary, incidental (i.e., if the observations were collected when birding was not the primary activity), or one of the other protocols. These protocols are designed to be flexible, allowing observers to collect data during their typical birding outings. Second, in addition to typical information on when and where the observations were made, observers record effort information specifying how long they searched, how far they traveled, and the total number of observers in their party. Collecting this data facilitates robust analyses by allowing researchers to account for variation in the observation process [@lasorteOpportunitiesChallengesBig2018; @kellingFindingSignalNoise2018]. Finally, observers are asked to indicate whether they are reporting all the birds they were able to detect and identify. Checklists with all species reported, known as **complete checklists**, enable researchers to infer counts of zero individuals for the species that were not reported. If checklists are not complete, it's not possible to ascertain whether the absence of a species on a list was a non-detection or the result of a participant not recording the species.
Citizen science projects occur on a spectrum from those with predefined sampling structures that resemble more traditional survey designs (such as the [Breeding Bird Survey](https://www.pwrc.usgs.gov/bbs/) in the United States), to those that are unstructured and collect observations opportunistically (such as [iNaturalist](https://www.inaturalist.org/)). We refer to eBird as a **semi-structured** project [@kellingFindingSignalNoise2018], having flexible, easy to follow protocols that attract many participants, but also collecting data on the observation process and allowing non-detections to be inferred on complete checklists.
In this chapter, we'll highlight some of the challenges associated with using eBird data. Then we'll demonstrate how to download eBird data for a given region and species. Next, we'll show how to import the data into R, apply filters to it, and use complete checklists to produce detection/non-detection data suitable for modeling species distribution and abundance. Finally, we'll perform some pre-processing steps required to ensure proper analysis of the data.
::: callout-tip
## Tip
We use the terms **detection** and **non-detection** rather than the more common terms **presence** and **absence** throughout this guide to reflect the fact that an inferred count of zero does not necessarily mean that a species is absent, only that it was not detected on the checklist in question.
:::
## Challenges associated with eBird data {#sec-ebird-challenges}
Despite the strengths of eBird data, species observations collected through citizen science projects present a number of challenges that are not found in conventional scientific data. The following are some of the primary challenges associated these data; challenges that will be addressed throughout this guide:
- **Taxonomic bias:** participants often have preferences for certain species, which may lead to preferential recording of some species over others [@greenwoodCitizensScienceBird2007; @tullochBehaviouralEcologyApproach2012]. Restricting analyses to complete checklists largely mitigates this issue.
- **Spatial bias:** most participants in citizen science surveys sample near their homes [@luckAlleviatingSpatialConflict2004], in easily accessible areas such as roadsides [@kadmonEffectRoadsideBias2004], or in areas and habitats of known high biodiversity [@prendergastCorrectingVariationRecording1993]. A simple method to reduce the spatial bias that we describe is to create an equal area grid over the region of interest, and sample a given number of checklists from within each grid cell.
- **Temporal bias:** participants preferentially sample when they are available, such as weekends [@courterWeekendBiasCitizen2013], and at times of year when they expect to observe more birds, notably in the United States there is a large increase in eBird submissions during spring migration [@sullivanEBirdEnterpriseIntegrated2014]. Furthermore, eBird has steadily increased in popularity over time, leading a strong bias towards more data in recent years. To address the weekend bias, we recommend using a temporal scale of a week or multiple weeks for most analyses. Temporal biases at longer time scales can be addressed by subsampling the data to produce a more even temporal distribution.
- **Class imbalance:** bird species that are rare or hard to detect may have data with high class imbalance, with many more checklists with non-detections than detections. For these species, a distribution model predicting that the species is absent everywhere will have high accuracy, but no ecological value. We'll follow the methods for addressing class imbalance proposed by Robinson et al. [-@robinsonUsingCitizenScience2018], sampling the data to artifically increase the prevalence of detections prior to modeling.
- **Spatial precision:** the spatial location of an eBird checklist is given as a single latitude-longitude point; however, this may not be precise for two main reasons. First, for traveling checklists, this location represents just one point on the journey. Second, eBird checklists are often assigned to a **hotspot** (a common location for all birders visiting a popular birding site) rather than their true location. For these reasons, it's not appropriate to align the eBird locations with very precise habitat variables, and we recommend summarizing variables within a neighborhood around the checklist location.
- **Variation in detectability/effort:** detectability describes the probability of a species that is present in an area being detected and identified. Detectability varies by season, habitat, and species [@johnstonSpeciesTraitsExplain2014; @johnstonEstimatesObserverExpertise2018]. Furthermore, eBird data are collected with high variation in effort, time of day, number of observers, and external conditions such as weather, all of which can affect the detectability of species [@ellisEffectsWeatherTime2018; @oliveiraObservationDiurnalSoaring2018]. Therefore, detectability is particularly important to consider when comparing between seasons, habitats or species. Since eBird uses a semi-structured protocol, that collects data on the observation process, we'll be able to account for a larger proportion of this variation in our analyses.
The remainder of this guide will demonstrate how to address these challenges using real data from eBird to produce reliable estimates of species distributions. In general, we'll take a two-pronged approach to dealing with unstructured data and maximizing the value of citizen science data: imposing more structure onto the data via data filtering and including predictor variables describing the obsservation process in our models to account for the remaining variation.
## Downloading data {#sec-ebird-download}
eBird data are typically distributed in two parts: observation data and checklist data. In the observation dataset, each row corresponds to the sighting of a single species on a checklist, including the count and any other species-level information (e.g. age, sex, species comments, etc.). In the checklist dataset, each row corresponds to a checklist, including the date, time, location, effort (e.g. distance traveled, time spent, etc.), and any additional checklist-level information (e.g. whether this is a complete checklist or not). The two datasets can be joined together using a unique checklist identifier (sometimes referred to as the sampling event identifier).
The observation and checklist data are released as tab-separated text files referred to as the *eBird Basic Dataset (EBD)* and the *Sampling Event Data (SED)*, respectively. These files are released monthly and contain all validated bird sightings in the eBird database at the time of release. Both of these datasets can be downloaded in their entirety or a subset for a given species, region, or time period can be requested via the *Custom Download* form. We strongly recommend against attempting to download the complete EBD since it's well over 100GB at the time of writing. Instead, we will demonstrate a workflow using the Custom Download approach. In what follows, we will assume you have followed the instructions for requesting access to eBird data outlined in [the previous chapter](intro.qmd#sec-intro-setup-ebird).
![Wood Thrush © Veronica Araya Garcia, Macaulay Library (ML60255811)](images/woothr_60255811.jpg)
In the interest of making examples concrete, throughout this guide, we'll focus on moedeling the distribution of [Wood Thrush](https://ebird.org/species/woothr) in Georgia (the US state, not the country) in June. Wood Thrush breed in deciduous forests of the eastern United States. We'll start by downloading the corresponding eBird observation (EBD) and checklist (SED) data by visiting the [eBird Basic Dataset](https://ebird.org/data/download/ebd) download page and filling out the Custom Download form to request Wood Thrush observations from Georgia. **Make sure you check the box "Include sampling event data"**, which will include the SED in the data download in addition to the EBD.
![](images/ebird-data_download.png)
Once the data are ready, you will receive an email with a download link. The downloaded data will be in a compressed .zip format, and should be unarchived. The resulting directory will contain a two text files: one for the EBD (e.g. `ebd_US-GA_woothr_smp_relOct-2023.txt`) containing all the Wood Thrush observations from Georgia and one for the SED (e.g. `ebd_US-GA_woothr_smp_relOct-2023_sampling.txt`) containing all checklists from Georgia. The `relOct-2023` component of the file name describes which version of the EBD this dataset came from; in this case it's the October 2023 release.
::: callout-tip
## Tip
Since the EBD is updated monthly, you will likely recieve a different version of the data than the October 2023 version used throughout the rest of this lesson. Provided you update the filenames of the downloaded files accordingly, the difference in versions will not be an issue. However, if you want to download and use exactly the same files used in this lesson, you can [download the corresponding EBD zip file](https://cornell.box.com/shared/static/lwd1rm163r2pfe389fi0n084hmyr0jbv.zip).
:::
## Importing eBird data into R {#sec-ebird-import}
The previous step left us with two tab separated text files, one for the EBD (i.e. observation data) and one for the SED (i.e. checklist data). For this example, we've placed the downloaded text files in the `data-raw/` sub-directory of our working directory. Feel free to put these files in a place that's convenient to you, but make sure to update the file paths in the following code blocks.
The `auk` functions [`read_ebd()`](https://cornelllabofornithology.github.io/auk/reference/read_ebd.html) or [`read_sampling()`](https://cornelllabofornithology.github.io/auk/reference/read_ebd.html) are designed to import the EBD and SED, respectively, into R. First let's import the checklist data (SED).
```{r}
#| label: ebird-import-sed
library(auk)
library(dplyr)
library(ggplot2)
library(gridExtra)
library(lubridate)
library(readr)
library(sf)
f_sed <- "data-raw/ebd_US-GA_woothr_smp_relOct-2023_sampling.txt"
checklists <- read_sampling(f_sed)
glimpse(checklists)
```
::: callout-important
## Checkpoint
Take some time to explore the variables in the checklist dataset. If you're unsure about any of the variables, consult the metadata document that came with the data download (`eBird_Basic_Dataset_Metadata_v1.15.pdf`).
:::
For some applications, only the checklist data are required. For example, the checklist data can be used to investigate the spatial and temporal distribution of eBird data within a region. This dataset can also be used to explore how much variation there is in the effort variables and identify checklists that have low spatial or temporal precision.
::: {.callout-caution icon="false"}
## Exercise
Make a histogram of the distribution of distance traveling for traveling protocol checklists.
:::
::: {.callout-note icon="false" collapse="true"}
## Solution
More than 95% of checklists are less than 10 km in length; however, some checklists are as long as 80 km in length. Long traveling checklists have lower spatial precision so they should generally be removed prior to analysis.
```{r}
#| label: ebird-import-distance-sol
checklists_traveling <- filter(checklists, protocol_type == "Traveling")
ggplot(checklists_traveling) +
aes(x = effort_distance_km) +
geom_histogram(binwidth = 1,
aes(y = after_stat(count / sum(count)))) +
scale_y_continuous(limits = c(0, NA), labels = scales::label_percent()) +
labs(x = "Distance traveled [km]",
y = "% of eBird checklists",
title = "Distribution of distance traveled on eBird checklists")
```
:::
Now let's import the observation data.
```{r}
#| label: ebird-import-ebd
f_ebd <- "data-raw/ebd_US-GA_woothr_smp_relOct-2023.txt"
observations <- read_ebd(f_ebd)
glimpse(observations)
```
::: callout-important
## Checkpoint
Take some time to explore the variables in the observation dataset. Notice that the EBD duplicates many of the checklist-level variables from the SED.
:::
When any of the read functions from `auk` are used, three important processing steps occur by default behind the scenes.
1. **Variable name and type cleanup**: The read functions assign clean variable names (in `snake_case`) and correct data types to all variables in the eBird datasets.
2. **Collapsing shared checklist**: eBird allows [sharing of checklists](https://support.ebird.org/en/support/solutions/articles/48000625567-checklist-sharing-and-group-accounts) between observers part of the same birding event. These checklists lead to duplication or near duplication of records within the dataset and the function [`auk_unique()`](https://cornelllabofornithology.github.io/auk/reference/auk_unique.html), applied by default by the `auk` read functions, addresses this by only keeping one independent copy of each checklist.
3. **Taxonomic rollup**: eBird observations can be made at levels below species (e.g. subspecies) or above species (e.g. a bird that was identified as a duck, but the species could not be determined); however, for most uses we'll want observations at the species level. [`auk_rollup()`](https://cornelllabofornithology.github.io/auk/reference/auk_rollup.html) is applied by default when [`read_ebd()`](https://cornelllabofornithology.github.io/auk/reference/read_ebd.html) is used. It drops all observations not identifiable to a species and rolls up all observations reported below species to the species level.
Before proceeding, we'll briefly demonstrate how shared checklists are collapsed and taxonomic rollup is performed. In practice, the `auk` read functions apply these processing steps by default and most data users will not have to worry about them.
### Shared checklists {#sec-ebird-import-shared}
eBird allows users to [share checklists](https://support.ebird.org/en/support/solutions/articles/48000625567-checklist-sharing-and-group-accounts#anchorShareChecklists) with other eBirders in their group, for example [this checklist](https://ebird.org/checklist/S133864820) is shared by 8 observers. These checklists can be identified by looking at the `group_identifier` variable, which assigns an ID connecting all checklists in the group. To demonstrate this, we'll read the checklist data in again, but with the argument `unique = FALSE` used to prevent `read_sampling()` from collapsing the shared checklists.
```{r}
#| label: ebird-import-shared-example
checklists_shared <- read_sampling(f_sed, unique = FALSE)
# identify shared checklists
checklists_shared |>
filter(!is.na(group_identifier)) |>
arrange(group_identifier) |>
select(sampling_event_identifier, group_identifier)
```
::: callout-tip
## Tip
Sometimes it's useful to inspect an eBird checklist online. You can view a checklist on the eBird website by appending the `sampling_event_identifier` to the URL `https://ebird.org/checklist/`. For example, to look at the checklist with ID `S133864820`, visit [https://ebird.org/checklist/S19814680](https://ebird.org/checklist/S133864820).
:::
Checklists with the same `group_identifier` provide duplicate information on the same birding event in the eBird database. For most analyses, it's important to collapse these shared checklists down into a single checklist to avoid pseudoreplication. This can be accomplished with the function [`auk_unique()`](https://cornelllabofornithology.github.io/auk/reference/auk_unique.html), which retains only one independent copy of each checklist.
```{r}
#| label: ebird-import-shared-unique
checklists_unique <- auk_unique(checklists_shared, checklists_only = TRUE)
nrow(checklists_shared)
nrow(checklists_unique)
```
Notice that a new variable, `checklist_id`, was created that is set to `group_identifier` for shared checklists and `sampling_event_identifier` for non-shared checklists.
```{r}
#| label: ebird-import-shared-group
head(checklists_unique$checklist_id)
tail(checklists_unique$checklist_id)
```
::: callout-tip
## Tip
Curious what checklists and observers contributed to a shared checklist after it has been collapsed? The `sampling_event_identifier` and `observer_id` contain comma-separated lists of all checklists and observers that went into the shared checklists.
```{r}
#| label: ebird-import-shared-tip
checklists_unique |>
filter(checklist_id == "G7637089") |>
select(checklist_id, group_identifier, sampling_event_identifier, observer_id)
```
:::
### Taxonomic rollup {#sec-ebird-import-rollup}
eBird observations can be made at levels below species (e.g. subspecies) or above species (e.g. a bird that was identified as a duck, but the species could not be determined); however, for most uses we'll want observations at the species level. This is especially true if we want to produce detection/non-detection data from complete checklists because "complete" only applies at the species level.
In the example dataset used for this workshop, these taxonomic issues don't apply. We have specifically requested Wood Thrush observations, so we haven't received any observations for taxa above species, and Wood Thrush only no subspecies reportable in eBird. However, in many other situations, these taxonomic issues can be important. For example, [this checklist](https://ebird.org/checklist/S100099262) has 10 Yellow-rumped Warblers, 5 each of two Yellow-rumped Warbler subspecies, and one hybrid between the two subspecies.
The function [`auk_rollup()`](https://cornelllabofornithology.github.io/auk/reference/auk_rollup.html) drops all observations not identifiable to a species and "rolls up" all observations reported below species to the species level.
::: callout-tip
## Tip
If multiple taxa on a single checklist roll up to the same species, `auk_rollup()` attempts to combine them intelligently. If each observation has a count, those counts are added together, but if any of the observations is missing a count (i.e. the count is "X") the combined observation is also assigned an "X". In the [example checklist](https://ebird.org/checklist/S100099262) from the previous tip, with four taxa all rolling up to Yellow-rumped Warbler, `auk_rollup()` will add the four counts together to get 21 Yellow-rumped Warblers (10 + 5 + 5 + 1).
:::
To demonstrate how taxonomic rollup works, we'll use a small example dataset provided with the `auk` package.
```{r}
#| label: ebird-import-shared-auk
# import one of the auk example datasets without rolling up taxonomy
obs_ex <- system.file("extdata/ebd-rollup-ex.txt", package = "auk") |>
read_ebd(rollup = FALSE)
# rollup taxonomy
obs_ex_rollup <- auk_rollup(obs_ex)
# identify the taxonomic categories present in each dataset
unique(obs_ex$category)
unique(obs_ex_rollup$category)
# without rollup, there are four observations
obs_ex |>
filter(common_name == "Yellow-rumped Warbler") |>
select(checklist_id, category, common_name, subspecies_common_name,
observation_count)
# with rollup, they have been combined
obs_ex_rollup |>
filter(common_name == "Yellow-rumped Warbler") |>
select(checklist_id, category, common_name, observation_count)
```
## Filtering to study region and season {#sec-ebird-filter}
The Custom Download form allowed us to apply some basic filters to the eBird data we downloaded: we requested only Wood Thrush observations and only those on checklists from Georgia. However, in most cases you'll want to apply additional spatial and/or temporal filters to the data that are specific to your study. For the examples used throughout this guide we'll only want observations from June for the last 10 years (2014-2023). In addition, we'll only use complete checklists (i.e., those for which all birds seen or heard were reported), which will allow us to produce detection/non-detection data. We can apply these filters using the `filter()` function from `dplyr`.
```{r}
#| label: ebird-filter-time
# filter the checklist data
checklists <- checklists |>
filter(all_species_reported,
between(year(observation_date), 2014, 2023),
month(observation_date) == 6)
# filter the observation data
observations <- observations |>
filter(all_species_reported,
between(year(observation_date), 2014, 2023),
month(observation_date) == 6)
```
The data we requested for Georgia using the Custom Download form will include checklists falling in the ocean off the coast of Georgia. Although these oceanic checklists are typically rare, it's usually best to remove them when modeling a terrestrial species like Wood Thrush. We'll using a boundary polygon for Georgia in the `data/gis-data.gpkg` file, buffered by 1 km, to filter our checklist data. A similar approach can be used if you're interested in a custom region, for example, a national park for which you may have a shapefile defining the boundary.
```{r}
#| label: ebird-filter-region
# convert checklist locations to points geometries
checklists_sf <- checklists |>
select(checklist_id, latitude, longitude) |>
st_as_sf(coords = c("longitude", "latitude"), crs = 4326)
# boundary of study region, buffered by 1 km
study_region_buffered <- read_sf("data/gis-data.gpkg", layer = "ne_states") |>
filter(state_code == "US-GA") |>
st_transform(crs = st_crs(checklists_sf)) |>
st_buffer(dist = 1000)
# spatially subset the checklists to those in the study region
in_region <- checklists_sf[study_region_buffered, ]
# join to checklists and observations to remove checklists outside region
checklists <- semi_join(checklists, in_region, by = "checklist_id")
observations <- semi_join(observations, in_region, by = "checklist_id")
```
::: callout-tip
## Tip
**It's absolutely critical that we filter the observation and checklist data in exactly the same way** to produce exactly the same population of checklists. Otherwise, the zero-filling we do in the next section will fail.
:::
Finally, there are rare situations in which some observers in a group of shared checklists quite drastically change there checklists, for example, changing the location or switching their checklist from complete to incomplete. In these cases, it's possible to end up with a mismatch between the checklists in the observation dataset and the checklist dataset. We can resolve this very rare issue by removing any checklists from the observation dataset not appearing in the checklist dataset.
```{r}
#| label: ebird-filter-bug
# remove observations without matching checklists
observations <- semi_join(observations, checklists, by = "checklist_id")
```
## Zero-filling {#sec-ebird-zf}
To a large degree, the power of eBird for rigorous analyses comes from the ability to transform the data to produce detection/non-detection data (also referred to as presence/absence data). With presence-only data, but no information of the amount of search effort expended to produce that data, it's challenging even to estimate how rare or common a species is. For example, consider the 10 detections presented in the top row of the figure below, and ask yourself how common is this species? The bottom row of the figure presents three possible scenarios for the total number of checklists that generated the detections, from left to right
- 50 checklists: in this case the species is fairly common with 20% of checklists reporting the species.
- 250 checklists: in this case the species is uncommon with 4% of checklists reporting the species.
- 1,000 checklists: in this case the species is rare with 1% of checklists reporting the species.
```{r}
#| label: ebird-zf-motivation
#| echo: false
#| warning: false
library(patchwork)
set.seed(1)
# generate random points
grid_size <- 0.125
pts <- data.frame(x = runif(1000, 0, 1) |> sqrt(),
y = runif(1000, 0, 1)^2) |>
mutate(obs = c(rep("detection", 10), rep("non-detection", 990)) |>
factor(levels = c("detection", "non-detection")),
x_grid = x %/% grid_size,
y_grid = y %/% grid_size,
grid_cell = paste(x_grid, y_grid, sep = "-"))
# only detections
pts_detections <- filter(pts, obs == "detection")
gg_detections <- ggplot(pts_detections) +
aes(x = x, y, color = obs, size = obs) +
geom_point(show.legend = FALSE) +
scale_color_manual(values = c("#4daf4a", "#55555599")) +
scale_size_discrete(range = c(3, 1)) +
labs(x = NULL, y = NULL) +
coord_fixed(xlim = c(0, 1), ylim = c(0, 1)) +
theme_minimal() +
theme(axis.text = element_blank(),
panel.background = element_rect(),
panel.grid = element_blank())
# high frequency
pts_high <- pts |>
filter(obs == "non-detection") |>
slice_sample(n = 40) |>
bind_rows(pts_detections)
gg_high <- ggplot(pts_high) +
aes(x = x, y, color = obs, size = obs) +
geom_point(show.legend = FALSE) +
scale_color_manual(values = c("#4daf4a", "#55555577")) +
scale_size_discrete(range = c(3, 0.5)) +
labs(x = NULL, y = NULL) +
coord_fixed(xlim = c(0, 1), ylim = c(0, 1)) +
theme_minimal() +
theme(axis.text = element_blank(),
panel.background = element_rect(),
panel.grid = element_blank())
# medium frequency
pts_medium <- pts |>
filter(obs == "non-detection") |>
slice_sample(n = 240) |>
bind_rows(pts_detections)
gg_medium <- ggplot(pts_medium) +
aes(x = x, y, color = obs, size = obs) +
geom_point(show.legend = FALSE) +
scale_color_manual(values = c("#4daf4a", "#55555577")) +
scale_size_discrete(range = c(3, 0.5)) +
labs(x = NULL, y = NULL) +
coord_fixed(xlim = c(0, 1), ylim = c(0, 1)) +
theme_minimal() +
theme(axis.text = element_blank(),
panel.background = element_rect(),
panel.grid = element_blank())
# low frequency
gg_low <- ggplot(pts) +
aes(x = x, y, color = obs, size = obs) +
geom_point(show.legend = FALSE) +
scale_color_manual(values = c("#4daf4a", "#55555577")) +
scale_size_discrete(range = c(3, 0.5)) +
labs(x = NULL, y = NULL) +
coord_fixed(xlim = c(0, 1), ylim = c(0, 1)) +
theme_minimal() +
theme(axis.text = element_blank(),
panel.background = element_rect(),
panel.grid = element_blank())
# arrange layout
design <- "
#1#
234
"
gg_detections + labs(title = "Presence-only",
subtitle = "10 detections") +
gg_high + labs(subtitle = "50 checklists") +
gg_medium + labs(title = "Detection/Non-detection",
subtitle = "250 checklists") +
gg_low + labs(subtitle = "1,000 checklists") +
plot_layout(design = design)
```
::: {.callout-caution icon="false"}
## Exercise
Think of some real world scenarios where presence-only data could be a poor representation of the prevalence of a species.
:::
::: {.callout-note icon="false" collapse="true"}
## Solution
There are many cases of this phenomenon. A couple possible examples include:
- Rare or particularly charismatic species will often have an inflated number of observations because observers specifically seek them out.
- Heavily populated (e.g. cities) or easily accessible areas (e.g. near roads) typically have more detections simply because more people visit them. In contrast, more remote areas may have better habitat for certain species, but have fewer observations because fewer birders are visiting these sites.
:::
::: callout-tip
## Tip
Remember that the prevalence of a species on eBird checklists (e.g. 10% of checklists detected a species) is only a relative measure of the actual occupancy probability of that species. To appear on an eBird checklist, a species must occur in an area and be detected by the observer. That detectability always plays a role in determining prevalence and it can vary drastically between regions, seasons, and species.
:::
The EBD alone is a source of presence-only data, with one record for each taxon (typically species) reported. For complete checklists, information about non-detections can be inferred from the SED: if there is a record in the SED but no record for a species in the EBD, then a count of zero individuals of that species can be inferred. This process is referred to a "zero-filling" the data. We can use [`auk_zerofill()`](https://cornelllabofornithology.github.io/auk/reference/auk_zerofill.html) to combine the checklist and observation data together to produce zero-filled, detection/non-detection data.
```{r}
#| label: ebird-zf-zf
zf <- auk_zerofill(observations, checklists, collapse = TRUE)
```
By default [`auk_zerofill()`](https://cornelllabofornithology.github.io/auk/reference/auk_zerofill.html) returns a compact representation of the data, consisting of a list of two data frames, one with checklist data and the other with observation data; the use of `collapse = TRUE` combines these into a single data frame, which will be easier to work with.
Before continuing, we'll transform some of the variables to a more useful form for modelling. We convert time to a decimal value between 0 and 24, force the distance traveled to 0 for stationary checklists, and create a new variable for speed. Notably, eBirders have the option of entering an "X" rather than a count for a species, to indicate that the species was present, but they didn't keep track of how many individuals were observed. During the modeling stage, we'll want the `observation_count` variable stored as an integer and we'll convert "X" to `NA` to allow for this.
```{r}
#| label: ebird-zf-transform
# function to convert time observation to hours since midnight
time_to_decimal <- function(x) {
x <- hms(x, quiet = TRUE)
hour(x) + minute(x) / 60 + second(x) / 3600
}
# clean up variables
zf <- zf |>
mutate(
# convert count to integer and X to NA
# ignore the warning "NAs introduced by coercion"
observation_count = as.integer(observation_count),
# effort_distance_km to 0 for stationary counts
effort_distance_km = if_else(protocol_type == "Stationary",
0, effort_distance_km),
# convert duration to hours
effort_hours = duration_minutes / 60,
# speed km/h
effort_speed_kmph = effort_distance_km / effort_hours,
# convert time to decimal hours since midnight
hours_of_day = time_to_decimal(time_observations_started),
# split date into year and day of year
year = year(observation_date),
day_of_year = yday(observation_date)
)
```
## Accounting for variation in effort {#sec-ebird-effort}
As discussed in the [Introduction](intro.qmd#sec-intro-intro), variation in effort between checklists makes inference challenging, because it is associated with variation in detectability. When working with semi-structured datasets like eBird, one approach to dealing with this variation is to impose some more consistent structure on the data by filtering observations on the effort variables. This reduces the variation in detectability between checklists. Based on our experience working with these data in the context of eBird Status and Trends, we suggest restricting checklists to traveling or stationary counts less than 6 hours in duration and 10 km in length, at speeds below 100km/h, and with 10 or fewer observers.
```{r}
#| label: ebird-effort-filter
# additional filtering
zf_filtered <- zf |>
filter(protocol_type %in% c("Stationary", "Traveling"),
effort_hours <= 6,
effort_distance_km <= 10,
effort_speed_kmph <= 100,
number_observers <= 10)
```
Note that these filtering parameters are based on making predictions at weekly spatial resolution and 3 km spatial resolution. For applications requiring higher spatial precision, stricter filtering on `effort_distance_km` should be used.
::: {.callout-caution icon="false"}
## Exercise
Pick one of the four effort variables we filtered on above and explore how much variation remains.
:::
::: {.callout-note icon="false" collapse="true"}
## Solution
Let's pick the checklist duration and make a histogram. The large majority of checklists are well under the 6 hour cutoff, with more than half being less than an hour in duration.
```{r}
#| label: ebird-effort-sol
ggplot(zf_filtered) +
aes(x = effort_hours) +
geom_histogram(binwidth = 0.5,
aes(y = after_stat(count / sum(count)))) +
scale_y_continuous(limits = c(0, NA), labels = scales::label_percent()) +
labs(x = "Duration [hours]",
y = "% of eBird checklists",
title = "Distribution of eBird checklist duration")
```
:::
### Spatial precision {#sec-ebird-effort-precision}
For each eBird checklists we are provided a single location (longitude/latitude coordinates) specifying where the bird observations occurred. Ideally we want that checklist location to closely match where an observed bird was located. However, there are three main reasons why that may not be the case:
1. The bird and observer may not overlap in space. For example, raptors may be observed soaring at a great distance from the observer.
2. The checklist location may correspond to an [eBird hotspot](https://support.ebird.org/en/support/solutions/articles/48001009443-ebird-hotspot-faqs) location rather than the location of the observer. eBird hotspots are public birding locations used to aggregate eBird data, for example, so all checklists at a park or wetland can be grouped together. If observers assign their checklists to an eBird hotspot, the coordinates in the ERD and SED will be those of the hotspot. In some cases, these locations will be a good representation of where the observations occurred, in others they may not be a good representation, for example, if the hotspot location is at the center of a large park but the observations occurred at the edge.
3. Traveling checklists survey an area rather than a single point. The amount of area covered will depend on the distance traveled and the compactness of the route taken. For example, a 1 km checklist in a straight line may survey areas further from the checklist coordinates than a 10 km checklist that takes a very circuitous route.
All three of these factors impact the spatial precision of eBird data. Fortunately, many eBird checklists have GPS tracks associated with them. Although these tracks are not available for public use, we can use them to quantify the spatial precision of eBird data. In analyses to inform eBird Status and Trends, we calculated the centroid of all GPS tracks and estimated the distance between that centroid and the reported checklist location. For different maximum checklist distances, we can look at the cumulative distribution of checklists as a function of location error.
![](images/precision_location-error.png)
As a proxy for checklist compactness, we calculated the minimum radius of a circle fully containing the GPS track of each checklist. Again, we plot the cumulative distribution of this compactness measure for different maximum checklist distances.
![](images/precision_compactness.png)
For the specific example of checklists with distances less than 10 km (the cutoff applied above), 94% of traveling checklists are contained within a 1.5 km radius circle and 74% of traveling checklists have location error less than 500 m. This will inform the spatial scale that we use to make predictions in the [next chapter](envvar.qmd#sec-envvar-pred). Depending on the precision required for your application, you can use the above plots to adjust the checklist effort filters to control spatial precision.
## Test-train split {#sec-ebird-testtrain}
For the modeling exercises used in this guide, we'll hold aside a portion of the data from training to be used as an independent test set to assess the predictive performance of the model. Specifically, we'll randomly split the data into 80% of checklists for training and 20% for testing. To facilitate this, we create a new variable `type` that will indicate whether the observation falls in the test set or training set.
```{r}
#| label: ebird-testtrain-split
zf_filtered$type <- if_else(runif(nrow(zf_filtered)) <= 0.8, "train", "test")
# confirm the proportion in each set is correct
table(zf_filtered$type) / nrow(zf_filtered)
```
Finally, there are a large number of variables in the EBD that are redundant (e.g. both state names *and* codes are present) or unnecessary for most modeling exercises (e.g. checklist comments and Important Bird Area codes). These can be removed at this point, keeping only the variables we want for modelling. Then we'll save the resulting zero-filled observations for use in later chapters.
```{r}
#| label: ebird-testtrain-clean
checklists <- zf_filtered |>
select(checklist_id, observer_id, type,
observation_count, species_observed,
state_code, locality_id, latitude, longitude,
protocol_type, all_species_reported,
observation_date, year, day_of_year,
hours_of_day,
effort_hours, effort_distance_km, effort_speed_kmph,
number_observers)
write_csv(checklists, "data/checklists-zf_woothr_jun_us-ga.csv", na = "")
```
If you'd like to ensure you're using exactly the same data as was used to generate this guide, download the [data package](https://github.com/ebird/ebird-best-practices/raw/main/data/ebird-best-practices-data.zip) mentioned in the [setup instructions](intro.qmd#sec-intro-setup-data). Unzip this data package and place the contents in your RStudio project folder.
## Exploratory analysis and visualization {#sec-ebird-explore}
Before proceeding to training species distribution models with these data, it's worth exploring the dataset to see what we're working with. Let's start by making a simple map of the observations. This map uses GIS data available for [download in the data package](https://github.com/ebird/ebird-best-practices/raw/master/data-raw/ebird-best-practices-data.zip). Unzip this data package and place the contents in your RStudio project folder.
```{r}
#| label: ebird-explore-map
#| fig.asp: 1.15
# load gis data
ne_land <- read_sf("data/gis-data.gpkg", "ne_land") |>
st_geometry()
ne_country_lines <- read_sf("data/gis-data.gpkg", "ne_country_lines") |>
st_geometry()
ne_state_lines <- read_sf("data/gis-data.gpkg", "ne_state_lines") |>
st_geometry()
study_region <- read_sf("data/gis-data.gpkg", "ne_states") |>
filter(state_code == "US-GA") |>
st_geometry()
# prepare ebird data for mapping
checklists_sf <- checklists |>
# convert to spatial points
st_as_sf(coords = c("longitude", "latitude"), crs = 4326) |>
select(species_observed)
# map
par(mar = c(0.25, 0.25, 4, 0.25))
# set up plot area
plot(st_geometry(checklists_sf),
main = "Wood Thrush eBird Observations\nJune 2014-2023",
col = NA, border = NA)
# contextual gis data
plot(ne_land, col = "#cfcfcf", border = "#888888", lwd = 0.5, add = TRUE)
plot(study_region, col = "#e6e6e6", border = NA, add = TRUE)
plot(ne_state_lines, col = "#ffffff", lwd = 0.75, add = TRUE)
plot(ne_country_lines, col = "#ffffff", lwd = 1.5, add = TRUE)
# ebird observations
# not observed
plot(filter(checklists_sf, !species_observed),
pch = 19, cex = 0.1, col = alpha("#555555", 0.25),
add = TRUE)
# observed
plot(filter(checklists_sf, species_observed),
pch = 19, cex = 0.3, col = alpha("#4daf4a", 1),
add = TRUE)
# legend
legend("bottomright", bty = "n",
col = c("#555555", "#4daf4a"),
legend = c("eBird checklist", "Wood Thrush sighting"),
pch = 19)
box()
```
In this map, the spatial bias in eBird data becomes immediately obvious, for example, notice the large number of checklists in areas around Atlanta, the largest city in Georgia, in the northern part of the state.
Exploring the effort variables is also a valuable exercise. For each effort variable, we'll produce both a histogram and a plot of frequency of detection as a function of that effort variable. The histogram will tell us something about birder behavior. For example, what time of day are most people going birding, and for how long? We may also want to note values of the effort variable that have very few observations; predictions made in these regions may be unreliable due to a lack of data. The detection frequency plots tell us how the probability of detecting a species changes with effort.
### Time of day {#sec-ebird-explore-time}
The chance of an observer detecting a bird when present can be highly dependent on time of day. For example, many species exhibit a peak in detection early in the morning during dawn chorus and a secondary peak early in the evening. With this in mind, the first predictor of detection that we'll explore is the time of day at which a checklist was started. We'll summarize the data in 1 hour intervals, then plot them. Since estimates of detection frequency are unreliable when only a small number of checklists are available, we'll only plot hours for which at least 100 checklists are present.
```{r}
#| label: ebird-explore-time
#| asp: 1
# summarize data by hourly bins
breaks <- seq(0, 24)
labels <- breaks[-length(breaks)] + diff(breaks) / 2
checklists_time <- checklists |>
mutate(hour_bins = cut(hours_of_day,
breaks = breaks,
labels = labels,
include.lowest = TRUE),
hour_bins = as.numeric(as.character(hour_bins))) |>
group_by(hour_bins) |>
summarise(n_checklists = n(),
n_detected = sum(species_observed),
det_freq = mean(species_observed))
# histogram
g_tod_hist <- ggplot(checklists_time) +
aes(x = hour_bins, y = n_checklists) +
geom_segment(aes(xend = hour_bins, y = 0, yend = n_checklists),
color = "grey50") +
geom_point() +
scale_x_continuous(breaks = seq(0, 24, by = 3), limits = c(0, 24)) +
scale_y_continuous(labels = scales::comma) +
labs(x = "Hours since midnight",
y = "# checklists",
title = "Distribution of observation start times")
# frequency of detection
g_tod_freq <- ggplot(checklists_time |> filter(n_checklists > 100)) +
aes(x = hour_bins, y = det_freq) +
geom_line() +
geom_point() +
scale_x_continuous(breaks = seq(0, 24, by = 3), limits = c(0, 24)) +
scale_y_continuous(labels = scales::percent) +
labs(x = "Hours since midnight",
y = "% checklists with detections",
title = "Detection frequency")
# combine
grid.arrange(g_tod_hist, g_tod_freq)
```
As expected, Wood Thrush detectability is highest early in the morning and quickly falls off as the day progresses. In later chapters, we'll make predictions at the peak time of day for detectability to limit the effect of this variation. The majority of checklist submissions also occurs in the morning; however, there are reasonable numbers of checklists between 6am and 9pm. It's in this region that our model estimates will be most reliable.
### Checklist duration {#sec-ebird-explore-duration}
When we filtered the eBird data in @sec-ebird-effort, we restricted observations to those from checklists 6 hours in duration or shorter to reduce variability. Let's see what sort of variation remains in checklist duration.
```{r}
#| label: ebird-explore-duration
#| asp: 1
# summarize data by hour long bins
breaks <- seq(0, 6)
labels <- breaks[-length(breaks)] + diff(breaks) / 2
checklists_duration <- checklists |>
mutate(duration_bins = cut(effort_hours,
breaks = breaks,
labels = labels,
include.lowest = TRUE),
duration_bins = as.numeric(as.character(duration_bins))) |>
group_by(duration_bins) |>
summarise(n_checklists = n(),
n_detected = sum(species_observed),
det_freq = mean(species_observed))
# histogram
g_duration_hist <- ggplot(checklists_duration) +
aes(x = duration_bins, y = n_checklists) +
geom_segment(aes(xend = duration_bins, y = 0, yend = n_checklists),
color = "grey50") +
geom_point() +
scale_x_continuous(breaks = breaks) +
scale_y_continuous(labels = scales::comma) +
labs(x = "Checklist duration [hours]",
y = "# checklists",
title = "Distribution of checklist durations")
# frequency of detection
g_duration_freq <- ggplot(checklists_duration |> filter(n_checklists > 100)) +
aes(x = duration_bins, y = det_freq) +
geom_line() +
geom_point() +
scale_x_continuous(breaks = breaks) +
scale_y_continuous(labels = scales::percent) +
labs(x = "Checklist duration [hours]",
y = "% checklists with detections",
title = "Detection frequency")
# combine
grid.arrange(g_duration_hist, g_duration_freq)
```
The majority of checklists are an hour or shorter and there is a rapid decline in the frequency of checklists with increasing duration. In addition, longer searches yield a higher chance of detecting a Wood Thrush. In many cases, there is a saturation effect, with searches beyond a given length producing little additional benefit; however, here there appears to be a drop off in detection for checklists longer than 3.5 hours.
### Distance traveled {#sec-ebird-explore-distance}
As with checklist duration, we expect *a priori* that the greater the distance someone travels, the greater the probability of encountering at least one Wood Thrush. Let's see if this expectation is met. Note that we have already truncated the data to checklists less than 10 km in length.
```{r}
#| label: ebird-explore-distance
#| asp: 1
# summarize data by 1 km bins
breaks <- seq(0, 10)
labels <- breaks[-length(breaks)] + diff(breaks) / 2
checklists_dist <- checklists |>
mutate(dist_bins = cut(effort_distance_km,
breaks = breaks,
labels = labels,
include.lowest = TRUE),
dist_bins = as.numeric(as.character(dist_bins))) |>
group_by(dist_bins) |>
summarise(n_checklists = n(),
n_detected = sum(species_observed),
det_freq = mean(species_observed))
# histogram
g_dist_hist <- ggplot(checklists_dist) +
aes(x = dist_bins, y = n_checklists) +
geom_segment(aes(xend = dist_bins, y = 0, yend = n_checklists),
color = "grey50") +
geom_point() +
scale_x_continuous(breaks = breaks) +
scale_y_continuous(labels = scales::comma) +
labs(x = "Distance travelled [km]",
y = "# checklists",
title = "Distribution of distance travelled")
# frequency of detection
g_dist_freq <- ggplot(checklists_dist |> filter(n_checklists > 100)) +
aes(x = dist_bins, y = det_freq) +
geom_line() +
geom_point() +
scale_x_continuous(breaks = breaks) +
scale_y_continuous(labels = scales::percent) +
labs(x = "Distance travelled [km]",
y = "% checklists with detections",
title = "Detection frequency")
# combine
grid.arrange(g_dist_hist, g_dist_freq)
```
As with duration, the majority of observations are from short checklists (less than half a kilometer). One fortunate consequence of this is that most checklists will be contained within a small area within which habitat is not likely to show high variability. In [Chapter -@sec-envvar], we will summarize land cover data within circles 3 km in diameter, centered on each checklist, and it appears that the vast majority of checklists will stay contained within this area.
### Number of observers {#sec-ebird-explore-observers}
Finally, let's consider the number of observers whose observation are being reported in each checklist. We expect that at least up to some number of observers, reporting rates will increase; however, in working with these data we have found cases of declining detection rates for very large groups. With this in mind we have already restricted checklists to those with 30 or fewer observers, thus removing the very largest groups (prior to filtering, some checklists had as many as 180 observers!).
```{r}
#| label: ebird-explore-observers
#| fig-asp: 1
# summarize data
breaks <- seq(0, 10)
labels <- seq(1, 10)
checklists_obs <- checklists |>
mutate(obs_bins = cut(number_observers,
breaks = breaks,
label = labels,
include.lowest = TRUE),
obs_bins = as.numeric(as.character(obs_bins))) |>
group_by(obs_bins) |>
summarise(n_checklists = n(),
n_detected = sum(species_observed),
det_freq = mean(species_observed))
# histogram
g_obs_hist <- ggplot(checklists_obs) +
aes(x = obs_bins, y = n_checklists) +
geom_segment(aes(xend = obs_bins, y = 0, yend = n_checklists),
color = "grey50") +
geom_point() +
scale_x_continuous(breaks = breaks) +
scale_y_continuous(labels = scales::comma) +
labs(x = "# observers",
y = "# checklists",
title = "Distribution of the number of observers")
# frequency of detection
g_obs_freq <- ggplot(checklists_obs |> filter(n_checklists > 100)) +
aes(x = obs_bins, y = det_freq) +
geom_line() +
geom_point() +
scale_x_continuous(breaks = breaks) +
scale_y_continuous(labels = scales::percent) +
labs(x = "# observers",
y = "% checklists with detections",
title = "Detection frequency")
# combine
grid.arrange(g_obs_hist, g_obs_freq)
```
The majority of checklists have one or two observers and there appears to be an increase in detection frequency with more observers. However, it's hard to distinguish a discernible pattern in the noise here, likely because there are so few checklists with more than 3 observers.