This is a companion piece to the GBIF filtering checklist blog post here.
Here I will take you through some additional filters that you might want to add. These filters are a bit more difficult and involve more judgment calls, so I have put them in this repository.
Metagenomics (see previous blog post) is a new publishing area for GBIF. Without going into too many details, metagenomics samples the environment for DNA and then matches the sequences against an existing reference database. Especially for non-microorganisms, these matches can often be incorrect or suspicious.
Currently, there is not a great way to filter for only metagenomics datasets. There have been some discussions about additional dataset categories, but these have not been implemented yet.
As a researcher you might want to check records with basisOfRecord MATERIAL_SAMPLE or those published by MGnify.
# here is a script that will remove *most* metagenomics records
gbif_download %>%
  filter(basisOfRecord != "MATERIAL_SAMPLE") %>%
  filter(publishingOrgKey != "ab733144-7043-4e88-bd4f-fca7bf858880") # MGnify
Some datasets use images to automatically identify occurrence records. Often these identifications can be very accurate. As a user, however, you might want to be aware that these datasets exist:
- Pl@ntNet automatically identified occurrences
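If you want to exclude these, one option is to look up the title of each dataset in your download through the public GBIF registry API and drop datasets whose title mentions Pl@ntNet. The sketch below is not from the original post; it assumes the registry endpoint https://api.gbif.org/v1/dataset/{datasetKey} and that your download contains a datasetKey column, so treat it as a starting point rather than a recipe.
library(dplyr)
library(purrr)
library(jsonlite)
# look up the registry title of every dataset present in the download
dataset_titles = gbif_download %>%
  distinct(datasetKey) %>%
  pull(datasetKey) %>%
  map_dfr(~ tibble(
    datasetKey = .x,
    datasetTitle = fromJSON(paste0("https://api.gbif.org/v1/dataset/", .x))$title
  ))
# drop records from datasets whose title mentions Pl@ntNet
gbif_download %>%
  left_join(dataset_titles, by = "datasetKey") %>%
  filter(!grepl("Pl@ntNet", datasetTitle, ignore.case = TRUE))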
CoordinateCleaner also includes a function for filtering based on expert distribution polygons.
Here you must download your polygons first. Not all taxa will have a reliable range polygon.
In my experience, unless your group is well-studied (Mammals, Birds, Reptiles; see list here) it might be hard to get complete enough coverage for this to be worthwhile, but the coverage is always improving.
Here is an example of filtering using the range of the damselfly Calopteryx xanthostoma. You can [download](/post/2020-12-08-typical-user-gbif-data-cleaning_files/Calopteryx xanthostoma/) the range shapefile for this species for the example below, as well as the [gbif_download](/post/2020-12-08-typical-user-gbif-data-cleaning_files/Calopteryx xanthostoma.csv) used in this example.
library(dplyr)
library(CoordinateCleaner)
# read in the directory where you downloaded the shapefile
range_shp = sf::st_read("Calopteryx xanthostoma")
gbif_download = readr::read_tsv("Calopteryx xanthostoma.csv") %>%
  filter(!is.na(decimalLongitude) & !is.na(decimalLatitude)) # need to remove missing coordinates
spdf = range_shp %>% as("Spatial")
class(spdf) # should be SpatialPolygonsDataFrame
gbif_download %>%
  cc_iucn(
    range = spdf,
    lon = "decimalLongitude",
    lat = "decimalLatitude",
    species = "species",
    buffer = 5, # buffer in decimal degrees
    value = "clean"
  )
It is usually good to add a buffer to the polygons, to catch any occurrences that might be just outside of the range. I added a 5 decimal degree buffer in this example.
Also beware that some IUCN polygons are very large and might run slowly or crash your session.
Rasterized or gridded datasets are common on GBIF. These are datasets where location information is pinned to a low-resolution grid.
GBIF has an experimental API for identifying datasets which exhibit a certain amount of "griddyness". You can read more here.
The API will give you responses like this:
[
{
"key": 52366,
"totalCount": 91,
"minDist": 1.0,
"minDistCount": 86,
"percent": 0.9451,
"maxPercent": 0.9451
}
]
The API is experimental and might change in the future.
- key : id key for record
- totalCount : the count of unique lat-lon points in dataset
- minDist : the most common nearest neighbor distance in decimal degrees
- minDistCount : the number of unique lat-lon points with that most common nearest-neighbor distance
- percent : the fraction of unique lat-lon points that have the same minDist
- maxPercent : the same as percent (will probably be removed).
You could filter out gridded datasets by removing those with a percent higher than around 20%-50%. You might also want to pick a minDist cut-off that fits your needs, as in the sketch below.
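Here is a rough sketch of how that filter could look. It is not from the original post: it assumes the experimental endpoint https://api.gbif.org/v1/dataset/{datasetKey}/gridded, takes the first grid statistic reported per dataset, and uses an illustrative 50% cut-off; since the API is experimental, check the GBIF documentation before relying on it.
library(dplyr)
library(purrr)
library(jsonlite)
# fetch the "griddyness" statistics for every dataset in the download
griddedness = gbif_download %>%
  distinct(datasetKey) %>%
  pull(datasetKey) %>%
  map_dfr(function(key) {
    res = fromJSON(paste0("https://api.gbif.org/v1/dataset/", key, "/gridded"))
    if (length(res) == 0) return(tibble(datasetKey = key, percent = NA_real_, minDist = NA_real_))
    tibble(datasetKey = key, percent = res$percent[1], minDist = res$minDist[1]) # first entry reported
  })
# keep records from datasets that are less than ~50% gridded (or have no gridded statistics)
gbif_download %>%
  left_join(griddedness, by = "datasetKey") %>%
  filter(is.na(percent) | percent < 0.5)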
Most publishers of gridded datasets actually fill in one of the following columns:
- coordinateUncertaintyInMeters
- coordinatePrecision
- footprintWKT
- locationID
So filtering by these columns can also be a good way to remove gridded datasets.
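A minimal sketch of this approach, assuming the camelCase column names used in GBIF downloads; the thresholds (10,000 m and 0.01) are illustrative and you would want to tune them for your own application.
# drop records whose stated uncertainty or precision suggests a coarse grid
gbif_download %>%
  filter(is.na(coordinateUncertaintyInMeters) | coordinateUncertaintyInMeters < 10000) %>% # uncertainty under 10 km
  filter(is.na(coordinatePrecision) | coordinatePrecision < 0.01) # keep reasonably precise coordinates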
Sometimes a range polygon is not available, or you want a more general-purpose way to flag suspicious points.
I have found DBSCAN to be an effective way to remove spatial outliers in patchy GBIF data. You can read more about DBSCAN from a previous post.
The following could be put at the end of a pipeline. Note that you would need to split by species, if you had multiple species.
library(dplyr)
library(dbscan)
gbif_download = readr::read_tsv("Calopteryx xanthostoma.csv") %>%
  filter(!is.na(decimalLongitude) & !is.na(decimalLatitude)) # need to remove missing coordinates
gbif_download %>%
  mutate(cluster =
    dbscan::dbscan(
      as.matrix(.[, c("decimalLatitude", "decimalLongitude")]),
      eps = 15, minPts = 3)$cluster %>%
      as.factor()
  ) %>%
  mutate(dbscan_outlier = cluster == 0) %>% # dbscan labels noise points as cluster 0
  filter(!dbscan_outlier)
Removing environmental outliers can also be important for certain applications. This can be done using reverse jackknifing. In this case you must also download the environmental data you are interested in using. I will be using bioclim data and the R package biogeo for the biogeo::rjack() function.
# The following could be put at the end of a pipeline.
# Note that you would need to split by species, if you had multiple species.
library(sp)
library(raster)
library(dplyr)
library(purrr)
path = "" # where you want to save the raster data
r = raster::getData('worldclim', var='bio',res=10,path=path)
gbif_download = readr::read_tsv("Calopteryx xanthostoma.csv") %>%
  filter(!is.na(decimalLongitude) & !is.na(decimalLatitude)) # need to remove missing coordinates
bioclim_data = sf::st_as_sf(
  gbif_download,
  coords = c("decimalLongitude", "decimalLatitude"),
  crs = 4326
) %>%
  raster::extract(r, .) %>% # extract climate values at the occurrence points
  as.data.frame()
gbif_bioclim = cbind(gbif_download, bioclim_data) %>%
  mutate(row_number = row_number()) # use to match index
gbif_download_with_outliers = bioclim_data %>%
  na.omit() %>% # remove missing climate values
  map(~ biogeo::rjack(.x)) %>% # run the reverse jackknife outlier search on each bioclim variable
  compact() %>% # remove variables with no outliers
  map(~ gbif_bioclim$row_number %in% .x) %>% # flag which rows are outliers for each variable
  bind_rows() %>%
  setNames(paste0("rjack_outlier_", names(.))) %>%
  mutate(number_of_outliers = rowSums(.)) %>% # count how many variables flag each record
  cbind(gbif_bioclim, .) %>%
  glimpse()
# should find two environmental outliers
gbif_download_with_outliers %>%
  filter(number_of_outliers > 5)
You will want to do any outlier detection at the end of your cleaning pipeline, since noise can reduce its effectiveness.