How to search the GrSciColl Collection descriptors #558

Open
MortenHofft opened this issue Mar 8, 2024 · 6 comments
Labels
GRSciColl Issues related to institutions, collections and staff

Comments

@MortenHofft
Member

@ManonGros here is an attempt to describe what we would like to do with those collection descriptors. This is what I have understood from our conversations, but I've tried to make it slightly closer to something that can be implemented.

Could you please check whether it makes sense or whether I'm misguided?

A collection roughly looks like

title
description
other free text fields
collectionDescriptors: 
  title: title of the csv
  body: free text attached to the csv
  rows: [{scientificName, countryCode, sex, preparations, hasTypes, individualCount}]
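
To make the shape above a bit more concrete, here is a minimal sketch of it as Python dataclasses. The class names are hypothetical and only for illustration; it also assumes that every row field is optional and that a collection can carry more than one descriptor table, neither of which is decided here.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DescriptorRow:
    # One row of an uploaded CSV. All fields are optional (assumption),
    # because different tables may fill in different columns.
    scientificName: Optional[str] = None
    countryCode: Optional[str] = None
    sex: Optional[str] = None
    preparations: Optional[str] = None
    hasTypes: Optional[bool] = None
    individualCount: Optional[int] = None


@dataclass
class CollectionDescriptors:
    title: str  # title of the csv
    body: str = ""  # free text attached to the csv
    rows: List[DescriptorRow] = field(default_factory=list)


@dataclass
class Collection:
    title: str
    description: str = ""
    # Assumption: a collection may hold several descriptor tables.
    collectionDescriptors: List[CollectionDescriptors] = field(default_factory=list)
```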

We would like to have search

We should be able to search the same fields as now: freetext, institutionKey, code, country, numberOfSpecimens, etc.

And then some new fields that are based on the collection descriptors (CD)

scientificName: Only one value per row in the CSVs. Interpret the same as for occurrences.
Just like for occurrences, we add the higher ranks so that users can search with higher taxa.

country: Only one value per row in the CSVs. Interpret the same as for occurrences.
We could possibly infer continent or other regions, but for a first pass just having country would be fine.

individualCount: Not the same interpretation as in the occurrence index, I would assume: no value doesn't imply exactly 1 specimen.

identifiedBy: pipe-separated. Interpret the same as for occurrences.
recordedBy: pipe-separated. Interpret the same as for occurrences.
typeStatus: @ManonGros this is normally a pipe-separated field. That means that we cannot use it for charts, at least not with specimen counts.
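
As a rough illustration of the two parsing rules above (pipe-separated fields, and an empty individualCount meaning "unknown" rather than 1), something like the following could work. The function names are made up for this sketch and this is not how the occurrence pipelines actually do it.

```python
from typing import List, Optional


def split_pipe(value: Optional[str]) -> List[str]:
    """Split a pipe-separated cell into trimmed, non-empty values."""
    if not value:
        return []
    return [part.strip() for part in value.split("|") if part.strip()]


def parse_individual_count(value: Optional[str]) -> Optional[int]:
    """Return the count as an int, or None when the cell is empty.

    None means "not stated"; it is deliberately not defaulted to 1.
    """
    if value is None or not value.strip():
        return None
    return int(value.strip())
```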

There are probably other fields such as year (range) and discipline, but for deciding on an implementation the above might be enough?

Results

I imagine the results would be displayed with collections as the entity.
So a search with the filters taxonKey=puma & country=MX & q=male & hl=true would provide a result like

{
  limit: 20,
  offset: 0,
  total: 32,
  results: [
    {
      title: collection title
      description: bla bla bla <em>male</em> pumas, ...
      highlightedCollectionDescriptors: [ // CDs if any that were matched. Say the first 10?
        {
          scientificName: puma concolor
          countryCode: MX
          year: 1980
          sex: male
          specimens: 12
        },
        ...
      ]
    }
  ]
}

Possibilities of conflicts

Now that we introduce CDs with specimen counts, those could clash with the specimenCount on the core record, just like the specimensInGbif can. At some point we might want a flag for that, and for other oddities, but it is my impression that we can ignore issues like that for now.

Aggregation options would be nice

We cannot do much for roll-ups across collections and CSVs unless there is some agreement on how to count. We cannot tell from the numbers alone. So perhaps we should include some flags that the collection owners can set to indicate that their CDs support counting and comparison. noDoubleCounting:true could mean that the same specimen is not included in more than one row (not in 2 rows in the same csv and not in 2 distinct csvs).

We could then do aggregations for individual collections. And possibly across collections, with some caveats.
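
A small sketch of how such a flag could gate a per-collection roll-up. It assumes the flag sits on the collection itself and that collections and rows are plain dicts with the field names used above; both are assumptions for illustration only.

```python
from collections import defaultdict


def specimens_per_country(collection):
    """Sum individualCount per countryCode for one collection.

    Returns None when the collection has not set noDoubleCounting,
    because rows may then overlap and the sums would be meaningless.
    """
    if not collection.get("noDoubleCounting", False):
        return None
    totals = defaultdict(int)
    for descriptor_set in collection.get("collectionDescriptors", []):
        for row in descriptor_set.get("rows", []):
            country = row.get("countryCode")
            count = row.get("individualCount")
            if country is None or count is None:
                continue  # nothing countable in this row
            totals[country] += count
    return dict(totals)
```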

aggregation example: facet=specimenCountryCode & facet=countryCode & facet=kingdomKey & facet=decade & facet=preparation & facet=discipline & facet=hasTypes

Ideally it would be nice to have cardinalities for those as well. That isn't something we normally have in our APIs, but for hosted portals we get it directly from Elasticsearch.

What do we count

What do we count in those facets? It could be collections, collection descriptors or specimens.
I'm guessing that counting collection descriptors is uninteresting, but that both specimen and collection counts are interesting. Collection counts might be interesting when comparing across institutions? (E.g. give me a breakdown of countries and list how many collections have data about each - "Ohh that is interesting - there is only one collection in the world that states it has information about butterflies in Pakistan").

But normally for an endpoint when we do facets we count the entries, so this would be different. Or we could have 2 types of facets. specimenFacets and (collection)Facets.
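
To show the difference between the two kinds of counts, here is a minimal in-memory sketch for a country facet. The function name and the dict layout are hypothetical, and in practice this would be done in the search index rather than in application code.

```python
from collections import defaultdict


def country_facets(collections):
    """Return (collection_counts, specimen_counts), both keyed by countryCode."""
    collection_counts = defaultdict(int)
    specimen_counts = defaultdict(int)
    for collection in collections:
        seen = set()  # countries mentioned by this collection
        for descriptor_set in collection.get("collectionDescriptors", []):
            for row in descriptor_set.get("rows", []):
                country = row.get("countryCode")
                if country is None:
                    continue
                seen.add(country)
                # A missing individualCount adds nothing to the specimen total.
                specimen_counts[country] += row.get("individualCount") or 0
        for country in seen:
            collection_counts[country] += 1  # each collection counts once
    return dict(collection_counts), dict(specimen_counts)
```

The collection facet answers "how many collections say anything about this country", while the specimen facet sums the stated counts, so the two can diverge a lot.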

@ManonGros - what are your thoughts?

@ManonGros
Contributor

ManonGros commented Mar 11, 2024

Thanks @MortenHofft, this is really nice! I have been wanting to write this down for 2 weeks now (good you did it).

A collection roughly looks like

title
description
other free text fields
collectionDescriptors: 
  title: title of the csv
  body: free text attached to the csv
  rows: [{scientificName, countryCode, sex, preparations, hasTypes, individualCount}]

Yes! And it would have several tables of descriptors which can have (or not) some of the same fields (in that case numbers aren't comparable).

And then some new fields that are based on the collection descriptors (CD)

Yes to all the fields! I think when it comes to multivalued fields, the easiest would probably be to align with how we process the same fields for occurrences. It will be easier to explain and make more sense.
If institutions want to provide more specific counts, they could just put one line per value.

There are more fields than those that I would like to index, including some Latimer Core terms that would require controlled values. This doesn't have to happen in the first implementation.
Here are the fields I had in mind:

  • dwc:scientificName
  • dwc:country or dwc:countryCode
  • dwc:individualCount (estimated number of specimens)
  • dwc:identifiedBy
  • dwc:dateIdentified
  • dwc:typeStatus
  • dwc:recordedBy
  • ltc:discipline (which would use a controlled vocabulary)
  • ltc:objectClassificationName (which would use a controlled vocabulary)
  • MIDS (Minimum Information about a Digital Specimen), see also this TDWG. This is modelled in Latimer Core as a measurement or fact with a specific unit. It seems like many institutions (in Europe) are using it to keep track of digitisation progress, so it would help the community to be able to filter collections based on MIDS levels. I will make an issue specifically for that and we can explore more. Not for the first implementation: I would like MIDS to be defined by standards first.

Results

Yes! that would be perfect!

Possibilities of conflicts

I think you are right and we have to ignore this for now.

Aggregation options would be nice

I agree that aggregation would be nice and that we would likely want both specimens and collections. Note that if this is too tricky to implement, the priority is to make the collections discoverable, not to make metrics.

Also linking to this issue containing examples: #557

@ManonGros ManonGros added the GRSciColl Issues related to institutions, collections and staff label Mar 11, 2024
@marcos-lg
Contributor

marcos-lg commented Jul 12, 2024

Deployed a first version to dev2:

These params can be used for searching descriptor sets: https://github.com/gbif/gbif-api/blob/dev/src/main/java/org/gbif/api/model/collections/request/DescriptorSetSearchRequest.java

These params can be used for searching descriptors: https://github.com/gbif/gbif-api/blob/dev/src/main/java/org/gbif/api/model/collections/request/DescriptorSearchRequest.java

The docs are not ready yet, but for example this is a request to import a descriptors file:

curl --location 'https://api.gbif-dev2.org/v1/grscicoll/collection/10466f44-bf11-4856-912c-b105d66ec08a/descriptorSet' \
  --header 'Authorization: Basic XXXX' \
  --form 'descriptorsFile=@"/path/descriptors2.csv"' \
  --form 'title="test title"' \
  --form 'description="description"' \
  --form 'format="TSV"'

The endpoint https://api.gbif-dev2.org/v1/grscicoll/search keeps working as before and it doesn't support descriptors search.

For the collection search, the same parameters apply as in the normal searches, but you can also use the hl param to highlight the matches when using the q param, e.g.:

https://api.gbif-dev2.org/v1/grscicoll/collection/search?q=incert&hl=true
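
For anyone scripting against dev2, a request URL like the one above can be assembled as in this small sketch. Only q and hl are taken from this thread; the helper name is hypothetical, and any other parameter passed through extra_params should be checked against the request classes linked in this comment.

```python
from urllib.parse import urlencode

# Dev endpoint quoted above; not a production URL.
BASE_URL = "https://api.gbif-dev2.org/v1/grscicoll/collection/search"


def build_search_url(q, highlight=True, **extra_params):
    """Build a collection search URL with an optional hl=true flag."""
    params = {"q": q}
    if highlight:
        params["hl"] = "true"
    # extra_params are passed through unchecked (assumption: the caller
    # knows which parameter names the API accepts).
    params.update(extra_params)
    return BASE_URL + "?" + urlencode(params)
```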

The new search params related to the descriptors can be found here:
https://github.com/gbif/gbif-api/blob/dev/src/main/java/org/gbif/api/model/collections/request/CollectionDescriptorsSearchRequest.java

Suggestions for descriptors have not been implemented yet.

@marcos-lg
Contributor

@MortenHofft
Member Author

@marcos-lg writes

Suggestions for descriptors have not been implemented yet.

Is that worth implementing? I had hoped that if a collection owner was engaged enough to create CSVs with holdings, they would also have edit access. And casual users who just want to correct typos etc. wouldn't know the content of the collection anyhow. How is it used, Marie? Is it worth spending energy on suggestions for CDs?

@ManonGros
Contributor

Thanks @MortenHofft this is a good question.

I think that the people who would have the tables wouldn't necessarily have an account on GBIF and/or permission to edit the entry on GRSciColl. Most people who update their (own) entries don't have an account.
The suggestion system was key to getting the community to update GRSciColl. I think because it is easy, you don't need an account and there is some safety (someone checks before applying the change).

I think that having a suggestion system to upload a table would be helpful. It removes the extra steps of creating an account on GBIF and sending an email to ask for editing permissions.
Ideally, I would like anyone to be able to upload tables (which wouldn't be actually on GRSciColl until the editor or mediator approves it).

With that in mind, we could have a first phase where only editors/mediators upload the tables. We could work on the suggestion system later if we get the sense that this would really facilitate getting descriptors in.
Would that make more sense?
Also tagging @marcos-lg

@ManonGros
Contributor

Synchronisation with IH

IH makes available Collections summaries, which contain breakdowns of collections. I think it would be great to make these into collection descriptor tables.
However, there are a few things that we need to do in order to turn these tables into descriptors.

  1. Map Num. of Specimens to dwc:individualCount.
  2. Num. Databased and Num. Imaged don't really have any Latimer core or Darwin Core equivalent (other than measurements or facts but this is another story) so we should leave them unmapped.
  3. The IH breakdowns don't always correspond to monophyletic taxa, which means that they can't all be mapped to scientific names. This means that we should add an additional column with scientific names for some groups so the collection can be found by users who search by scientific names. The original groups should also be mapped to ltc:objectClassificationName.
  4. Groups with no information should be excluded from the table so that the collection doesn't appear in any search. For example, this entry: https://sweetgum.nybg.org/science/ih/herbarium-details/?irn=126868 shouldn't have a table. Should it?

Mapping in practice

Here is an example of a table from NY (https://sweetgum.nybg.org/science/ih/herbarium-details/?irn=125525):

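|               | Num. of Specimens | Num. Databased | Num. Imaged |
|---------------|-------------------|----------------|-------------|
| Algae         | 221000            | 155365         | 110822      |
| Bryophytes    | 700000            | 448769         | 452485      |
| Fungi/Lichens | 700000            | 685635         | 441587      |
| Pteridophytes | 300000            | 223911         | 206980      |
| Seed Plants   | 6000000           | 2686457        | 2228503     |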

This is how I would like to see it mapped

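| ltc:objectClassificationName | dwc:scientificName | dwc:individualCount | Num. Databased | Num. Imaged |
|------------------------------|--------------------|---------------------|----------------|-------------|
| Algae                        |                    | 221000              | 155365         | 110822      |
| Bryophytes                   | Bryophyta          | 700000              | 448769         | 452485      |
| Fungi/Lichens                | Fungi              | 700000              | 685635         | 441587      |
| Pteridophytes                | Pteridophyta       | 300000              | 223911         | 206980      |
| Seed Plants                  |                    | 6000000             | 2686457        | 2228503     |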
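
A rough sketch of that mapping for one IH row. The lookup only contains the three pairs from the table above, and the function name and the rule for "no information" (all three numbers missing) are assumptions made for illustration.

```python
# Group -> scientific name, taken only from the example table above.
GROUP_TO_SCIENTIFIC_NAME = {
    "Bryophytes": "Bryophyta",
    "Fungi/Lichens": "Fungi",
    "Pteridophytes": "Pteridophyta",
}


def map_ih_row(group, num_specimens, num_databased=None, num_imaged=None):
    """Map one IH summary row to a descriptor row, or None if it is empty."""
    if num_specimens is None and num_databased is None and num_imaged is None:
        return None  # groups with no information are left out of the table
    return {
        "ltc:objectClassificationName": group,
        # None when the group has no scientific name in the lookup.
        "dwc:scientificName": GROUP_TO_SCIENTIFIC_NAME.get(group),
        "dwc:individualCount": num_specimens,
        # Kept under their IH names because they stay unmapped.
        "Num. Databased": num_databased,
        "Num. Imaged": num_imaged,
    }
```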
