This is a companion piece to the GBIF filtering checklist blog post here.
Here I will take you through some additional filters that you might want to add. These filters are a bit more difficult and involve more judgment calls, so I have put them in this repository.
Metagenomics (see previous blog post) is a new publishing area for GBIF. Without going into too many details, metagenomics samples the environment for DNA and then matches the sequences against an existing reference database. Especially for non-microorganisms, these matches can often be incorrect or suspicious.
Currently, there is not a great way to filter for only metagenomics datasets. There have been some discussions about additional dataset categories, but these have not been implemented yet.
As a researcher you might want to check records with basisOfRecord MATERIAL_SAMPLE or those published by MGnify.
# here is a script that will remove *most* metagenomics records
gbif_download %>%
  filter(basisOfRecord != "MATERIAL_SAMPLE") %>%
  filter(publishingOrgKey != "ab733144-7043-4e88-bd4f-fca7bf858880") # MGnify
Some datasets use images to automatically identify occurrence records. Often these identifications can be very accurate. As a user, however, you might want to be aware that these datasets exist:
- Pl@ntNet automatically identified occurrences
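If you want to exclude these, one option is to look up the title of each dataset in your download through the public GBIF registry API and drop datasets whose title mentions Pl@ntNet. The sketch below is not from the original post; it assumes the registry endpoint https://api.gbif.org/v1/dataset/{datasetKey} and that your download contains a datasetKey column, so treat it as a starting point rather than a recipe.
library(dplyr)
library(purrr)
library(jsonlite)
# look up the registry title of every dataset present in the download
dataset_titles = gbif_download %>%
  distinct(datasetKey) %>%
  pull(datasetKey) %>%
  map_dfr(~ tibble(
    datasetKey = .x,
    datasetTitle = fromJSON(paste0("https://api.gbif.org/v1/dataset/", .x))$title
  ))
# drop records from datasets whose title mentions Pl@ntNet
gbif_download %>%
  left_join(dataset_titles, by = "datasetKey") %>%
  filter(!grepl("Pl@ntNet", datasetTitle, ignore.case = TRUE))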
CoordinateCleaner also includes a function for filtering based on expert distribution polygons.
Here you must download your polygons first. Not all taxa will have a reliable range polygon.
In my experience, unless your group is well-studied (Mammals, Birds, Reptiles; see list here) it might be hard to get complete enough coverage for this to be worthwhile, but the coverage is always improving.
Here is an example of filtering using the range of the damselfly Calopteryx xanthostoma. You can [download](/post/2020-12-08-typical-user-gbif-data-cleaning_files/Calopteryx xanthostoma/) the range shapefile for this species for the example below, as well as the [gbif_download](/post/2020-12-08-typical-user-gbif-data-cleaning_files/Calopteryx xanthostoma.csv) used in this example.
library(dplyr)
library(CoordinateCleaner)
# read in the directory where you downloaded the shapefile
range_shp = sf::st_read("Calopteryx xanthostoma")
gbif_download = readr::read_tsv("Calopteryx xanthostoma.csv") %>%
  filter(!is.na(decimalLongitude) & !is.na(decimalLatitude)) # need to remove missing coordinates
spdf = range_shp %>% as("Spatial")
class(spdf) # should be SpatialPolygonsDataFrame
gbif_download %>%
  cc_iucn(
    range = spdf,
    lon = "decimalLongitude",
    lat = "decimalLatitude",
    species = "species",
    buffer = 5, # buffer in decimal degrees
    value = "clean"
  )
It is usually good to add a buffer to the polygons, to catch any occurrences that might be just outside of the range. I added a 5 decimal degree buffer in this example.
Also beware that some IUCN polygons are very large and might run slowly or crash your session.
Rasterized or gridded datasets are common on GBIF. These are datasets where location information is pinned to a low-resolution grid.
GBIF has an experimental API for identifying datasets which exhibit a certain amount of "griddyness". You can read more here.
The API will give you responses like this:
[
{
"key": 52366,
"totalCount": 91,
"minDist": 1.0,
"minDistCount": 86,
"percent": 0.9451,
"maxPercent": 0.9451
}
]
The API is experimental and might change in the future.
- key : id key for record
- totalCount : the count of unique lat-lon points in dataset
- minDist : the most common nearest neighbor distance in decimal degrees
- minDistCount : the number of unique lat-lon points with that most common nearest-neighbor distance
- percent : the fraction of unique lat-lon points that have the same minDist
- maxPercent : the same as percent (will probably be removed).
You could filter out gridded datasets by removing those with a percent higher than around 20%-50%. You might also want to pick a minDist cut-off that fits your needs, as in the sketch below.
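Here is a rough sketch of how that filter could look. It is not from the original post: it assumes the experimental endpoint https://api.gbif.org/v1/dataset/{datasetKey}/gridded, takes the first grid statistic reported per dataset, and uses an illustrative 50% cut-off; since the API is experimental, check the GBIF documentation before relying on it.
library(dplyr)
library(purrr)
library(jsonlite)
# fetch the "griddyness" statistics for every dataset in the download
griddedness = gbif_download %>%
  distinct(datasetKey) %>%
  pull(datasetKey) %>%
  map_dfr(function(key) {
    res = fromJSON(paste0("https://api.gbif.org/v1/dataset/", key, "/gridded"))
    if (length(res) == 0) return(tibble(datasetKey = key, percent = NA_real_, minDist = NA_real_))
    tibble(datasetKey = key, percent = res$percent[1], minDist = res$minDist[1]) # first entry reported
  })
# keep records from datasets that are less than ~50% gridded (or have no gridded statistics)
gbif_download %>%
  left_join(griddedness, by = "datasetKey") %>%
  filter(is.na(percent) | percent < 0.5)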
Most publishers of gridded datasets actually fill in one of the following columns:
- coordinateUncertaintyInMeters
- coordinatePrecision
- footprintWKT
- locationID
So filtering by these columns can also be a good way to remove gridded datasets.
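A minimal sketch of this approach, assuming the camelCase column names used in GBIF downloads; the thresholds (10,000 m and 0.01) are illustrative and you would want to tune them for your own application.
# drop records whose stated uncertainty or precision suggests a coarse grid
gbif_download %>%
  filter(is.na(coordinateUncertaintyInMeters) | coordinateUncertaintyInMeters < 10000) %>% # uncertainty under 10 km
  filter(is.na(coordinatePrecision) | coordinatePrecision < 0.01) # keep reasonably precise coordinates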
Sometimes a range polygon is not available, or you want a more general-purpose way to flag suspicious points.
I have found DBSCAN to be an effective way to remove spatial outliers in patchy GBIF data. You can read more about DBSCAN from a previous post.
The following could be put at the end of a pipeline. Note that you would need to split by species, if you had multiple species.
library(dplyr)
library(dbscan)
gbif_download = readr::read_tsv("Calopteryx xanthostoma.csv") %>%
  filter(!is.na(decimalLongitude) & !is.na(decimalLatitude)) # need to remove missing coordinates
gbif_download %>%
  mutate(cluster =
    dbscan::dbscan(
      as.matrix(.[, c("decimalLatitude", "decimalLongitude")]),
      eps = 15, minPts = 3)$cluster %>%
      as.factor()
  ) %>%
  mutate(dbscan_outlier = cluster == 0) %>% # dbscan labels noise points as cluster 0
  filter(!dbscan_outlier)
Removing environmental outliers can also be important for certain applications. This can be done using reverse jackknifing. In this case you must also download the environmental data you are interested in using. I will be using bioclim data and the R package biogeo for the biogeo::rjack() function.
# The following could be put at the end of a pipeline.
# Note that you would need to split by species, if you had multiple species.
library(sp)
library(raster)
library(dplyr)
library(purrr)
path = "" # where you want to save the raster data
r = raster::getData('worldclim', var='bio',res=10,path=path)
gbif_download = readr::read_tsv("Calopteryx xanthostoma.csv") %>%
  filter(!is.na(decimalLongitude) & !is.na(decimalLatitude)) # need to remove missing coordinates
bioclim_data = sf::st_as_sf(
  gbif_download,
  coords = c("decimalLongitude", "decimalLatitude"),
  crs = 4326
) %>%
  raster::extract(r, .) %>% # extract climate values at the occurrence points
  as.data.frame()
gbif_bioclim = cbind(gbif_download, bioclim_data) %>%
  mutate(row_number = row_number()) # use to match index
gbif_download_with_outliers = bioclim_data %>%
  na.omit() %>% # remove missing climate values
  map(~ biogeo::rjack(.x)) %>% # run the reverse jackknife outlier search on each bioclim variable
  compact() %>% # remove variables with no outliers
  map(~ gbif_bioclim$row_number %in% .x) %>% # flag which rows are outliers for each variable
  bind_rows() %>%
  setNames(paste0("rjack_outlier_", names(.))) %>%
  mutate(number_of_outliers = rowSums(.)) %>% # count how many variables flag each record
  cbind(gbif_bioclim, .) %>%
  glimpse()
# should find two environmental outliers
gbif_download_with_outliers %>%
  filter(number_of_outliers > 5)
You will want to do any outlier detection at the end of your cleaning pipeline, since noise can reduce its effectiveness.