

Simple API for resources #1286

Open · wants to merge 2 commits into base: master
Conversation


@diogok diogok commented Oct 7, 2016

Hello all,

Issue #1249 got my attention these days, and while I agree that indexing the data and providing an API on the data itself is out of scope for IPT, I believe that a proper way to consume resource information (such as resource links, files and metadata) as a web service would have value and would ease the development of tools on top of IPT, such as the needed harvesters and public interfaces.

This is mostly a working idea to discuss and maybe build upon.

It includes two endpoints: a list of resources and resource details, with proper linking and data exposure.

Gist of how the data is exposed.
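To make the idea concrete, here is a minimal sketch of how a client might consume the two proposed endpoints. The paths (`/api/resources`, `/api/resource/{name}`) and field names are illustrative assumptions, not the actual payloads from this PR; only the linking pattern (a list document pointing at per-resource detail documents) is what the PR describes.

```python
import json

# Hypothetical payload for the proposed resource-list endpoint
# (field names are assumptions for illustration only).
resource_list = json.loads("""
{
  "resources": [
    {"name": "herbarium", "url": "/api/resource/herbarium"}
  ]
}
""")

# Hypothetical payload for the proposed resource-detail endpoint,
# linking to the resource's DwC-A and EML as the PR suggests.
resource_detail = json.loads("""
{
  "name": "herbarium",
  "title": "Herbarium specimens",
  "dwca": "/archive.do?r=herbarium",
  "eml": "/eml.do?r=herbarium",
  "lastPublished": "2016-10-07"
}
""")

# A client walks the list and follows each link for details.
for r in resource_list["resources"]:
    print(r["name"], "->", r["url"])
```

The point of the two-level structure is that harvesters can poll the cheap list endpoint and only fetch details for resources they care about.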


kbraak commented Oct 12, 2016

Thank you for taking the time to contribute this pull request @diogok.

Please review my latest comment on issue #1249

Specifically please take note of all the functionality offered by the GBIF API as well as the Registered dataset inventory.

If you still believe this request is needed and the above services don't satisfy your needs, please create a new issue to raise its visibility. When doing so, please be sure to explain the rationale and use cases for it as well. That way it is more visible and understandable to the wider community and can be voted on. Thanks


dgasl commented Feb 18, 2017

I very much agree with both points of view, and I don't see any incompatibilities between them.
Wouldn't this be possible with an optional IPT API plugin, so whoever needs API access to IPT can install it? (Something like the already existing IPT geoserver plugin.)

The GBIF API is my preferred way of accessing IPT-published data for most purposes. I am more than happy with it, so I am not really in a rush to have alternative direct access through an IPT API. But if somebody (@diogok, @tigreped, @dvdscripter) wants to create a new issue to raise this need as @kbraak suggests, I will leave some ideas here and explain the rationale and use cases. Feel free to use them:

The GBIF API certainly fails when you want to provide easy access to IPT datasets that you do NOT (yet) want to be visible on the GBIF data portal. This is not uncommon:

  • A dataset could be just part of a bigger one which is already published. You don't want those data to be published twice on the GBIF portal, but you still want them to be accessible from IPT: a set of data from your museum related to a certain project (which needs its own web page) would be a common use case.
  • You have a dataset which needs particular/private access (e.g., letting local authorities and/or certain researchers access your data about endangered populations, which might be better kept hidden from the general public).
  • Your dataset might not yet be ready for GBIF publication, but you want to make it accessible to the people who are digitizing/using/updating it (no matter whether these are research data, citizen-science data, student data, ...).

OK. All these are things that you could do by creating a web service on your own, using whatever technologies you want. But that forces you to know how to do that, or to have other people available who can do it for you. Why not take advantage of IPT for doing it, without needing to develop anything else to provide your data?
I would also say this could make more people willing to provide data to GBIF, as the convenience of using IPT for their personal projects will put them just a few clicks away from sharing them on GBIF (sooner or later).

And the other way round: sometimes the GBIF API also fails to provide easy/correct/quick access to some of the IPT-provided info which you DO want to be visible. This is not uncommon:

  • You might want to make your original data accessible while avoiding GBIF interpretation issues. For example, I found no way to use the GBIF API to search taxon names which are not listed in the GBIF backbone (but this would be possible searching at the IPT level).
    This is common with recently described taxa, which can take years to become visible in GBIF. Example:
    This is a type specimen of a new species (already 3.5 years old) described as Peltula lobata, but as the name is still not in the backbone, GBIF exposes its name as Peltula (matching it to the higher taxon rank). It is still visible in the verbatim version of the record, but it is not searchable. Because of this, many interesting type specimens' names stay hidden from (re)searchers. Look how frequent this is: 95% of the specimens returned on the first page of a simple typeStatus search are exposed by GBIF with a wrong name (due to the "Taxon match higher rank" GBIF interpretation issue).

  • Naming biological specimens is a subjective task, but the GBIF API can only expose one name per specimen. Many times there is no agreement on the correct name which should be applied to a specimen, since it carries several identification labels applied by different researchers. In those cases you would like to expose them all using the Darwin Core Identification History extension, but this will not make them searchable anyway. You'll have to select one of them as the "current one", and AFAIK that will be the only name searchable through the GBIF API: this is not fair to those researchers whose taxonomic opinion (perhaps the most correct one) is kept hidden.
    You might argue this kind of "taxonomically unresolved" specimen should not be published to GBIF. But I would say this lack of resolution could make them even more interesting to certain researchers.
    The only other way to expose them all is to publish them as separate occurrences, but this would add confusion and noise (a fake increase in the number of specimens and the diversity of taxa).
    So we either go back to the point of "let my IPT have an API to expose unpublished datasets" (to avoid repeated occurrences being published to GBIF), or we publish them to GBIF with multiple names per specimen (DwC extension), but the GBIF API should then be able to make them all searchable and usable.
    I have the feeling that in a single IPT API, this "make-all-names-of-a-single-occurrence-searchable" issue would be much easier to solve.

  • Your dataset is published, but GBIF does not always show updated information, and you want to give an alternative, immediate access to it without waiting for GBIF harvesting/reindexing. Here an IPT API becomes handy again.
    From my previous experience, when I republished a dataset adding new records, these used to be visible on the GBIF portal in just a few minutes. But GBIF is not always as immediate as you want. Some examples:
    1) Delays due to huge inventory datasets being indexed, as @timrobertson100 commented here. In that case, 12 days of expected delay! Perhaps this is uncommon, but it might break your workflow.
    2) When an IPT resource is updated but record counts do not change. I am not sure what happens in this case. Nothing?
    Look at @kbraak's last comment on issue API for exploring local ipt resources #1249, where he mentions the registered dataset inventory. This hidden IPT access is quite useful. But its description says "GBIF uses this inventory to monitor whether it is properly indexing resources by comparing the target and indexed record counts".
    Does this mean GBIF will not reharvest an updated IPT resource if record counts do not change?
    This is not unusual in my institution (for example, when some already published specimens are updated with new georeference information, or their taxonomic identifications are reviewed by a taxonomist - OMG, now we have to make a decision: should this opinion be raised to GBIF instead of the previous one?). The resource record count will not change at all. What does GBIF do in those cases? If the registered dataset inventory is checked in the proposed way, I guess these changes will not be harvested by GBIF.
    Trying to check this, I updated a dataset a couple of days ago (Thu, 16 Feb 2017 23:07:42 +0100), adding some new DwC terms (recordedBy & eventDate) that were previously unmapped. But NO new records were added, so the record count did not change at all:
    http://www.gbif.org/occurrence/search?datasetKey=df3eab30-0837-11d9-acb2-b8a03c50a862
    The GBIF portal resource page is still not showing those fields (Mon, 20 Feb 2017 01:43:04 +0100).
    I can't say whether the delay is because the record count did not change; it might be due to other reasons. But, anyway, there is a delay of several days.
    Just in case, I suggest that GBIF should not be using the registered dataset inventory "records" count to decide whether a certain dataset should be reindexed. The most important thing is to check the "lastPublished" value.
    And I take the opportunity to request an improvement to that value: why not make it a DateTime value? (Currently, it only exposes a Date.)
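The reindex-decision logic being argued for here can be sketched as follows. The field names (`lastPublished`, `records`, `lastIndexed`) are illustrative, not the actual inventory schema; the point is that a newer publication date must trigger reharvesting even when the record count is unchanged.

```python
from datetime import datetime

def needs_reindex(registry_entry, indexed_state):
    """Decide whether a dataset should be reharvested.

    Sketch of the behaviour suggested above: comparing record counts
    alone misses updates that change mapped terms or field values but
    not row counts, so lastPublished should drive the decision.
    """
    published = datetime.fromisoformat(registry_entry["lastPublished"])
    indexed = datetime.fromisoformat(indexed_state["lastIndexed"])
    # A count mismatch is a sufficient reason to reindex...
    if registry_entry["records"] != indexed_state["records"]:
        return True
    # ...but not a necessary one: a newer publication date alone must
    # also trigger reindexing (e.g. new terms mapped, georeferences fixed).
    return published > indexed

# Record count unchanged, but republished after last indexing:
print(needs_reindex(
    {"lastPublished": "2017-02-16", "records": 1000},
    {"lastIndexed": "2017-01-01", "records": 1000},
))  # -> True
```

This also illustrates why exposing `lastPublished` as a full DateTime rather than a bare Date matters: with same-day republications, a Date alone cannot order the two events.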

I will update the list if I figure out any other issues.

Thanks a lot
David


kbraak commented Feb 20, 2017

Thanks David,

All your comments and ideas have been noted. Below I address a few of your points with some information that is hopefully useful to you.

@mdoering is adding all type specimen names from occurrences to the backbone:
gbif/checklistbank#10

You should already be able to do an occurrence search on the original scientific name in the upcoming version of GBIF.org: https://demo.gbif.org/occurrence/search Here you can also try doing a free-text search on the original scientific name.

GBIF exposes the identifications included in the Identification History Extension, e.g. https://demo.gbif.org/occurrence/1211192227 They should be searchable, at least using the free-text search. Please feel free to do more testing and provide additional feedback.
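For anyone wanting to script this kind of check, the public GBIF occurrence search endpoint accepts a free-text `q` parameter alongside filters like `datasetKey`. The sketch below only constructs the request URL (no network call), so the exact result for any given name is left to the reader to verify.

```python
from urllib.parse import urlencode

# Public GBIF API v1 occurrence search endpoint.
BASE = "https://api.gbif.org/v1/occurrence/search"

def free_text_search_url(query, dataset_key=None, limit=20):
    """Build a free-text occurrence search URL.

    The q parameter performs a full-text search, which is the route
    suggested above for finding names that the backbone match has
    demoted to a higher rank.
    """
    params = {"q": query, "limit": limit}
    if dataset_key:
        params["datasetKey"] = dataset_key
    return BASE + "?" + urlencode(params)

print(free_text_search_url("Peltula lobata"))
```

Fetching the URL returns a JSON document whose `results` array holds the matching occurrences, so the verbatim name can be compared against the interpreted one.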

I definitely agree GBIF.org should do better at warning users when indexing is off and explaining how indexing happens. Thanks for supporting gbif/portal16#6 that aims to address this.

In case it helps, please see this page created by @dnoesgaard that gives an easy to read overview of the data publishing workflow: https://demo.gbif.org/tools/gbif-software

When browsing https://demo.gbif.org, please feel free to log issues for anything that is missing (e.g. more fine-grained documentation explaining when GBIF reindexes datasets), or when you notice a feature isn't working properly (e.g. occurrence search filtered by original scientific name). In case you weren't aware, you can submit feedback easily using the message box in the top right-hand corner of each page.

Ultimately we aim to have the correct processes and documentation in place to reduce load on the GBIF Helpdesk, however, when in doubt about something please don't hesitate to write [email protected] with your questions.

Thanks again for your help.

@dvdscripter

Hi @diogok and @dgasl, nice to see this discussion going on. Sadly, I already made an in-house solution for this problem at SiBBr. Our proxy API (we didn't change IPT) can generate meta.xml, handle publishing on IPT, and report from IPT. So my problem is solved, but I still see this feature as a gain for the GBIF community.
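Generating a DwC-A `meta.xml` descriptor, as the proxy API described above does, can be sketched like this. This is a deliberately minimal illustration of the descriptor's shape (one core with an id column and a few mapped terms), not SiBBr's actual implementation or the full Darwin Core text schema.

```python
import xml.etree.ElementTree as ET

# Namespace of the Darwin Core text (archive descriptor) schema.
NS = "http://rs.tdwg.org/dwc/text/"

def build_meta(core_file, term_uris):
    """Build a minimal meta.xml: one Occurrence core, tab-separated,
    with column 0 as the record id and one <field> per mapped term."""
    archive = ET.Element("archive", {"xmlns": NS})
    core = ET.SubElement(archive, "core", {
        "rowType": "http://rs.tdwg.org/dwc/terms/Occurrence",
        "fieldsTerminatedBy": "\\t",
        "ignoreHeaderLines": "1",
    })
    files = ET.SubElement(core, "files")
    ET.SubElement(files, "location").text = core_file
    ET.SubElement(core, "id", {"index": "0"})
    for i, term in enumerate(term_uris):
        ET.SubElement(core, "field", {"index": str(i), "term": term})
    return ET.tostring(archive, encoding="unicode")

meta = build_meta("occurrence.txt", [
    "http://rs.tdwg.org/dwc/terms/occurrenceID",
    "http://rs.tdwg.org/dwc/terms/scientificName",
])
print(meta)
```

A real descriptor would also declare extensions, encodings and line terminators, but the core/files/field structure above is the part every DwC-A tool has to produce.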


diogok commented Feb 23, 2017

Thank you all for the comments.

@kbraak The API proposed here indeed overlaps with the "registered dataset inventory", and I was not aware of that at the time of writing.

The main addition is that it exposes more complete data and metadata, and also exposes resources not registered/published on GBIF (local only resources).

My use case overlapped with some of the points raised by @dgasl:

  • We had rapidly changing datasets (ongoing work)
  • Some were part of a bigger set
  • As such, most were not intended for publication on GBIF

Since our data was already on IPT for documentation, publishing and archival, it would be great to also use IPT as an integration tool, combined with the ability to download files from inside a resource's DwC-A.
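Reading a single file out of a resource's DwC-A, as mentioned above, is straightforward because a DwC-A is just a zip archive containing `meta.xml` plus the data files. A small sketch (the archive is built in memory here so the example is self-contained; in practice the bytes would come from the IPT's archive download URL):

```python
import io
import zipfile

# Build a toy DwC-A in memory to stand in for a downloaded archive.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("meta.xml", "<archive/>")
    z.writestr("occurrence.txt", "id\tscientificName\n1\tPeltula lobata\n")

def read_member(archive_bytes, member):
    """Return the text of one named file inside a DwC-A zip."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as z:
        return z.read(member).decode("utf-8")

# Pull just the core data file out of the archive.
print(read_member(buf.getvalue(), "occurrence.txt").splitlines()[0])
```

An API endpoint exposing individual archive members would save clients from downloading and unpacking the whole DwC-A when they only need one file.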

@abubelinha

Hi @diogok
This looks very interesting to me, although I am not a programmer so I do not really get the idea of what a GitHub pull request is.

Does all this conversation mean that you have implemented a local API for your IPT server, but GBIF has not included it in the official version yet?
If so, are there any chances to see how it works in your IPT installation?

Thanks a lot in advance
