How to search the GrSciColl Collection descriptors #558

MortenHofft · 2024-03-08T13:46:49Z

@ManonGros here is an attempt to describe what we would like to do with those collection descriptors. This is what I have understood from our conversations, but I've tried to make it slightly closer to something that can be implemented.

Could you please check whether it makes sense or whether I'm misguided?

A collection roughly looks like

title
description
other free text fields
collectionDescriptors: 
  title: title of the csv
  body: free text attached to the csv
  rows: [{scientificName, countryCode, sex, preparations, hasTypes, individualCount}]
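To make the shape above concrete, here is a minimal sketch as Python dataclasses. The class names are hypothetical and not taken from the GBIF codebase, and I'm assuming a collection may carry several descriptor tables, so they are held in a list.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DescriptorRow:
    # One row of an uploaded CSV; every field is optional.
    scientific_name: Optional[str] = None
    country_code: Optional[str] = None
    sex: Optional[str] = None
    preparations: Optional[str] = None
    has_types: Optional[bool] = None
    individual_count: Optional[int] = None


@dataclass
class DescriptorSet:
    # One uploaded CSV with its title and free-text body.
    title: str
    body: str = ""
    rows: List[DescriptorRow] = field(default_factory=list)


@dataclass
class Collection:
    # Core free-text fields plus any number of descriptor sets (assumption).
    title: str
    description: str = ""
    descriptor_sets: List[DescriptorSet] = field(default_factory=list)


example = Collection(
    title="Mammal collection",
    descriptor_sets=[
        DescriptorSet(
            title="Holdings by taxon",
            rows=[DescriptorRow(scientific_name="Puma concolor", country_code="MX")],
        )
    ],
)
```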

We would like to have search

We should be able to search the same fields as now: freetext, institutionKey, code, country, numberOfSpecimens, etc

And then some new fields that are based on the collection descriptors (CD)

scientificName: Only one value per row in the CSVs. Interpret the same way as for occurrences.
Just like for occurrences, we add the higher ranks so that users can search by higher taxa.

country: Only one value per row in the CSVs. Interpret the same way as for occurrences.
We could possibly infer continent or other regions. But for a first pass just having country would be fine.

individualCount: Not the same interpretation as in the occurrence index, I would assume. A missing value doesn't imply exactly 1 specimen.

identifiedBy: Pipe-separated. Interpret the same way as for occurrences.
recordedBy: Pipe-separated. Interpret the same way as for occurrences.
typeStatus: @ManonGros this is normally a pipe-separated field. That means that we cannot use it for charts, at least not with specimen counts.
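As a rough illustration of the two parsing rules above (pipe-separated values, and a missing individualCount that must not default to 1), here is a small sketch. The function names are hypothetical, and this is not how the occurrence pipelines actually implement it.

```python
from typing import List, Optional


def split_multivalue(raw: Optional[str], sep: str = "|") -> List[str]:
    """Split a pipe-separated cell into trimmed, non-empty values."""
    if raw is None:
        return []
    return [part.strip() for part in raw.split(sep) if part.strip()]


def parse_individual_count(raw: Optional[str]) -> Optional[int]:
    """Return the count as an int, or None when the cell is empty.

    Unlike the occurrence interpretation, an empty cell is kept as
    'unknown' instead of being read as exactly one specimen.
    """
    if raw is None or not raw.strip():
        return None
    return int(raw)
```

With this, a row with `recordedBy` set to `A. Smith | B. Jones` yields two collectors, and a row without a count stays out of any specimen totals.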

There are probably other fields such as year (range) and discipline, but for deciding on an implementation the above might be enough?

Results

I imagine the results would be displayed as collections as the entity.
So a search result for filter taxonKey=puma & country=MX & q=male & hl=true would provide a result like

{
  limit: 20,
  offset: 0,
  total: 32,
  results: [
    {
      title: collection title
      description: bla bla bla <em>male</em> pumas, ...
      highlightedCollectionDescriptors: [ // CDs if any that were matched. Say the first 10?
        {
          scientificName: puma concolor
          countryCode: MX
          year: 1980
          sex: male
          specimens: 12
        },
        ...
      ]
    }
  ]
}
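A client could walk a response of that shape roughly like this. This is only a sketch against the proposed structure above; the field names (`highlightedCollectionDescriptors`, `specimens`) come from this proposal and may not match what ends up being implemented.

```python
from typing import Any, Dict, List


def summarize(response: Dict[str, Any]) -> List[str]:
    """Return one line per collection plus one line per matched descriptor."""
    lines = []
    for coll in response.get("results", []):
        lines.append(coll.get("title", ""))
        for cd in coll.get("highlightedCollectionDescriptors", []):
            lines.append(
                f"  {cd.get('scientificName')} ({cd.get('countryCode')}): "
                f"{cd.get('specimens')} specimens"
            )
    return lines


# Hand-written sample in the proposed shape, not real API output.
sample = {
    "limit": 20,
    "offset": 0,
    "total": 1,
    "results": [
        {
            "title": "collection title",
            "highlightedCollectionDescriptors": [
                {"scientificName": "puma concolor", "countryCode": "MX", "specimens": 12}
            ],
        }
    ],
}
```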

Possibilities of conflicts

Now that we introduce CDs with specimen counts, those counts could clash with the specimenCount on the core record, just like specimensInGbif can.
At some point we might want a flag for that and for other oddities, but it is my impression that we can ignore issues like that for now.

Aggregation options would be nice

We cannot do much in terms of roll-ups across collections and CSVs unless there is some agreement on how to count. We cannot tell from the numbers alone. So perhaps we should include some flags the collection owners can set to indicate that their CDs support counting and comparison. noDoubleCounting:true could mean that the same specimen is not included in more than one row (not in 2 rows in the same CSV and not in 2 distinct CSVs).

We could then do aggregations for individual collections, and possibly across collections, with some caveats.
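A sketch of how such a flag could gate a roll-up for a single collection. The dictionary layout and the `descriptorSets` key are assumptions made for illustration; only `noDoubleCounting` and `individualCount` come from the text above.

```python
from typing import Any, Dict, Tuple


def countable_total(collection: Dict[str, Any]) -> Tuple[int, int]:
    """Sum individualCount over descriptor sets flagged noDoubleCounting.

    Returns (total, skipped_sets). Sets without the flag are skipped,
    because their rows may describe the same specimens more than once.
    Rows without a count are ignored rather than counted as 1.
    """
    total = 0
    skipped = 0
    for descriptor_set in collection.get("descriptorSets", []):
        if not descriptor_set.get("noDoubleCounting", False):
            skipped += 1
            continue
        for row in descriptor_set.get("rows", []):
            count = row.get("individualCount")
            if count is not None:
                total += count
    return total, skipped


collection = {
    "title": "Mammal collection",
    "descriptorSets": [
        {"noDoubleCounting": True, "rows": [{"individualCount": 12}, {"individualCount": 3}, {}]},
        {"rows": [{"individualCount": 100}]},  # no flag: not safe to sum
    ],
}
```

Reporting the number of skipped sets alongside the total keeps the caveat visible to whoever reads the aggregate.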

aggregation example: facet=specimenCountryCode & facet=countryCode & facet=kingdomKey & facet=decade & facet=preparation & facet=discipline & facet=hasTypes

Ideally it would be nice to have cardinalities for those as well. That isn't something we normally have in our APIs, but for hosted portals we get that directly from Elasticsearch.

What do we count

What do we count in those facets? It could be collections, collection descriptors or specimens.
I'm guessing that counting collection descriptors is uninteresting, but that both specimen and collection counts are interesting. Collection counts might be interesting when comparing across institutions? (E.g. give me a breakdown of countries and list how many collections have data about each - "Ohh that is interesting - there is only one collection in the world that states it has information about butterflies in Pakistan").

But normally when we do facets for an endpoint we count the entries, so this would be different. Or we could have 2 types of facets: specimenFacets and (collection)Facets.
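To show the difference between the two kinds of facet, here is a small in-memory sketch. The input layout is hypothetical; a real implementation would presumably do this in the search index rather than in application code.

```python
from collections import defaultdict
from typing import Any, Dict, List, Tuple


def facet_counts(
    collections: List[Dict[str, Any]], field: str
) -> Tuple[Dict[str, int], Dict[str, int]]:
    """Return (specimen_facets, collection_facets) for one descriptor field.

    specimen_facets sums individualCount per facet value;
    collection_facets counts each collection once per facet value.
    """
    specimen_facets: Dict[str, int] = defaultdict(int)
    collection_facets: Dict[str, int] = defaultdict(int)
    for coll in collections:
        seen = set()
        for row in coll.get("rows", []):
            value = row.get(field)
            if value is None:
                continue
            specimen_facets[value] += row.get("individualCount") or 0
            seen.add(value)
        for value in seen:
            collection_facets[value] += 1
    return dict(specimen_facets), dict(collection_facets)


collections = [
    {"rows": [
        {"countryCode": "PK", "individualCount": 10},
        {"countryCode": "PK", "individualCount": 5},
        {"countryCode": "MX", "individualCount": 2},
    ]},
    {"rows": [{"countryCode": "MX", "individualCount": 7}]},
]
```

In this sample, PK has 15 specimens but only one collection, which is exactly the "only one collection in the world" signal that a specimen count alone would hide.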

@ManonGros - what are your thoughts?

ManonGros · 2024-03-11T15:19:10Z

Thanks @MortenHofft, this is really nice! I have been wanting to write this down for 2 weeks now (good you did it).

A collection roughly looks like

title
description
other free text fields
collectionDescriptors: 
  title: title of the csv
  body: free text attached to the csv
  rows: [{scientificName, countryCode, sex, preparations, hasTypes, individualCount}]

Yes! And it would have several tables of descriptors which can have (or not) some of the same fields (in that case numbers aren't comparable).

And then some new fields that are based on the collection descriptors (CD)

Yes to all the fields! I think when it comes to multivalued fields, the easiest would probably be to align with how we process the same fields for occurrences. It will be easier to explain and make more sense.
If institutions want to provide more specific counts, they could just put one line per value.

There are more fields than those that I would like to index, including some Latimer Core terms that would require controlled values. This doesn't have to happen in the first implementation.
Here are the fields I had in mind:

dwc:scientificName
dwc:country or dwc:countryCode
dwc:individualCount (estimated number of specimens)
dwc:identifiedBy
dwc:dateIdentified
dwc:typeStatus
dwc:recordedBy
ltc:discipline (which would use a controlled vocabulary)
ltc:objectClassificationName (which would use a controlled vocabulary)
MIDS (Minimum Information about a Digital Specimen), see also this TDWG -> This is modelled in Latimer Core as a measurement or fact with a specific unit. It seems like many institutions (in Europe) are using those to keep track of digitisation progress. It would help the community to be able to filter collections based on MIDS levels. I will make an issue specifically for that and we can explore more. -> Not for the first implementation; I would like MIDS to be defined by standards first.

Results

Yes! that would be perfect!

Possibilities of conflicts

I think you are right and we have to ignore this for now.

Aggregation options would be nice

I agree that aggregation would be nice and that we would likely want both specimens and collections. Note that if this is too tricky to implement, the priority is to make the collections discoverable, not to make metrics.

Also linking to this issue containing examples: #557

marcos-lg · 2024-07-12T07:42:56Z

Deployed a first version to dev2:

List all descriptor sets of a collection:
https://api.gbif-dev2.org/v1/grscicoll/collection/10466f44-bf11-4856-912c-b105d66ec08a/descriptorSet

These params can be used for searching: https://github.com/gbif/gbif-api/blob/dev/src/main/java/org/gbif/api/model/collections/request/DescriptorSetSearchRequest.java

Lists all descriptors of a descriptor set:
https://api.gbif-dev2.org/v1/grscicoll/collection/10466f44-bf11-4856-912c-b105d66ec08a/descriptorSet/3/descriptor

These params can be used for searching: https://github.com/gbif/gbif-api/blob/dev/src/main/java/org/gbif/api/model/collections/request/DescriptorSearchRequest.java

The docs are not ready yet, but for example this is a request to import a descriptors file:

curl --location 'https://api.gbif-dev2.org/v1/grscicoll/collection/10466f44-bf11-4856-912c-b105d66ec08a/descriptorSet' \
--header 'Authorization: Basic XXXX' \
--form 'descriptorsFile=@"/path/descriptors2.csv"' \
--form 'title="test title"' \
--form 'description="description"' \
--form 'format="TSV"'

New search endpoints:
https://api.gbif-dev2.org/v1/grscicoll/institution/search
https://api.gbif-dev2.org/v1/grscicoll/collection/search

The endpoint https://api.gbif-dev2.org/v1/grscicoll/search keeps working as before and it doesn't support descriptors search.

The same parameters apply as in the normal searches, but you can also use the hl param to highlight the matches when using the q param, e.g.:

https://api.gbif-dev2.org/v1/grscicoll/collection/search?q=incert&hl=true
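For scripted use, the same URL can be assembled with the Python standard library. This is a minimal sketch that only covers the `q` and `hl` params shown above; the helper name is hypothetical, and it builds the URL without sending the request.

```python
from urllib.parse import urlencode

# dev2 endpoint from this comment; other environments use a different host.
COLLECTION_SEARCH = "https://api.gbif-dev2.org/v1/grscicoll/collection/search"


def build_search_url(q: str, highlight: bool = True) -> str:
    """Build a collection search URL with optional match highlighting."""
    params = {"q": q}
    if highlight:
        params["hl"] = "true"
    # urlencode takes care of escaping spaces and special characters in q.
    return f"{COLLECTION_SEARCH}?{urlencode(params)}"


url = build_search_url("incert")
```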

The new search params related to the descriptors can be found here:
https://github.com/gbif/gbif-api/blob/dev/src/main/java/org/gbif/api/model/collections/request/CollectionDescriptorsSearchRequest.java

Suggestions for descriptors have not been implemented yet.

marcos-lg · 2024-07-26T11:35:24Z

I've also deployed it to UAT2 and imported some descriptors for these collections:

https://api.gbif-uat2.org/v1/grscicoll/collection/e5097454-2826-473a-b610-05e15ccd7ad2/descriptorSet
https://api.gbif-uat2.org/v1/grscicoll/collection/d7aa9664-36da-4bfa-a07a-7f4e48ea4cc0/descriptorSet

MortenHofft · 2024-08-01T06:31:33Z

@marcos-lg writes

Suggestions for descriptors have not been implemented yet.

Is that worth implementing? I had hoped that if a collection owner was engaged enough to create CSVs with holdings, they would also have edit access. And casual users that just want to correct typos etc. wouldn't know the content of the collection anyhow. How is it used, Marie? Is it worth spending energy on suggestions for CDs?

ManonGros · 2024-08-01T07:59:21Z

Thanks @MortenHofft this is a good question.

I think that people who would have the tables wouldn't necessarily have an account on GBIF and/or permission to edit the entry on GRSciColl. Most people who update their (own) entries don't have an account.
The suggestion system was key to getting the community to update GRSciColl. I think this is because it is easy: you don't need an account, and there is some safety (someone checks before applying the change).

I think that having a suggestion system to upload a table would be helpful. It removes the extra steps of creating an account on GBIF and sending an email to ask for editing permissions.
Ideally, I would like anyone to be able to upload tables (which wouldn't be actually on GRSciColl until the editor or mediator approves it).

With that in mind, we could have a first phase where only editors/mediators upload the tables. We could work on the suggestion system later if we get the sense that this would really facilitate getting descriptors in.
Would that make more sense?
Also tagging @marcos-lg

ManonGros · 2024-08-05T11:55:44Z

Synchronisation with IH

IH (Index Herbariorum) makes available collection summaries, which contain a breakdown of each collection. I think it would be great to make these into collection descriptor tables.
However, there are a few things that we need to do in order to make these tables into descriptors.

Num. of Specimens should be mapped to dwc:individualCount.
Num. Databased and Num. Imaged don't really have any Latimer Core or Darwin Core equivalent (other than measurements or facts, but this is another story), so we should leave them unmapped.
The IH breakdowns don't always correspond to monophyletic taxa, which means that they can't all be mapped to scientific names. This means that we should add an additional column with scientific names for some groups, so the collection can be found by users who search by scientific names. The original groups should also be mapped to ltc:objectClassificationName.
Groups with no information should be excluded from the table so that the collection doesn't appear in any search. For example, this entry: https://sweetgum.nybg.org/science/ih/herbarium-details/?irn=126868 shouldn't have a table. Should it?

Mapping in practice

Here is an example of a table from NY (https://sweetgum.nybg.org/science/ih/herbarium-details/?irn=125525):

	Num. of Specimens	Num. Databased	Num. Imaged
Algae	221000	155365	110822
Bryophytes	700000	448769	452485
Fungi/Lichens	700000	685635	441587
Pteridophytes	300000	223911	206980
Seed Plants	6000000	2686457	2228503

This is how I would like to see it mapped

`ltc:objectClassificationName`	`dwc:scientificName`	`dwc:individualCount`	Num. Databased	Num. Imaged
Algae		221000	155365	110822
Bryophytes	Bryophyta	700000	448769	452485
Fungi/Lichens	Fungi	700000	685635	441587
Pteridophytes	Pteridophyta	300000	223911	206980
Seed Plants		6000000	2686457	2228503
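The mapping above could be scripted along these lines. This is a sketch under a few assumptions: the IH summary is available as tab-separated text, the group-to-name lookup contains only the three pairs from the table above, and the function name is hypothetical.

```python
import csv
import io
from typing import Any, Dict, List

# Only the pairs shown in the table above; other groups stay without a name.
GROUP_TO_SCIENTIFIC_NAME = {
    "Bryophytes": "Bryophyta",
    "Fungi/Lichens": "Fungi",
    "Pteridophytes": "Pteridophyta",
}

IH_TSV = (
    "\tNum. of Specimens\tNum. Databased\tNum. Imaged\n"
    "Algae\t221000\t155365\t110822\n"
    "Bryophytes\t700000\t448769\t452485\n"
    "Fungi/Lichens\t700000\t685635\t441587\n"
    "Pteridophytes\t300000\t223911\t206980\n"
    "Seed Plants\t6000000\t2686457\t2228503\n"
)


def map_ih_summary(ih_tsv: str) -> List[Dict[str, Any]]:
    """Turn an IH summary table into descriptor rows."""
    rows = list(csv.reader(io.StringIO(ih_tsv), delimiter="\t"))
    mapped = []
    for group, specimens, databased, imaged in rows[1:]:  # skip header row
        if not specimens.strip():
            continue  # groups with no information are left out
        mapped.append({
            "ltc:objectClassificationName": group,
            "dwc:scientificName": GROUP_TO_SCIENTIFIC_NAME.get(group, ""),
            "dwc:individualCount": int(specimens),
            "Num. Databased": databased,  # kept as-is, unmapped
            "Num. Imaged": imaged,        # kept as-is, unmapped
        })
    return mapped


descriptors = map_ih_summary(IH_TSV)
```

Keeping the lookup as an explicit dictionary makes it easy to review which non-monophyletic groups (here Algae and Seed Plants) are deliberately left without a scientific name.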

ManonGros added the GRSciColl Issues related to institutions, collections and staff label Mar 11, 2024

ManonGros added this to the GRSciColl Roadmap - Support structured collection descriptors milestone Mar 11, 2024

MortenHofft mentioned this issue Jul 31, 2024

Collection descriptors gbif/gbif-web#614

Open

MortenHofft mentioned this issue Aug 2, 2024

Add Collection descriptors gbif/registry-console#578

Closed

marcos-lg mentioned this issue Sep 5, 2024

Implement suggestions for collection descriptors #612

Open
