Index / Reindex institutions from OpenGeoMetadata into EarthWorks #591

Closed
6 tasks done
mejackreed opened this issue Feb 25, 2020 · 17 comments

Comments

@mejackreed
Contributor

mejackreed commented Feb 25, 2020

Currently this is handled using a variety of scripts.

From OGP data: https://github.com/sul-dlss/earthworks/tree/master/scripts/indexing/ogp

And in rake tasks: https://github.com/sul-dlss/earthworks/blob/master/lib/tasks/earthworks.rake#L115-L153

To complete this ticket we should probably do the following:

  • validate that the current data sources are ones we want to continue using
  • add any new data sources
  • design a new tool/process if we need to handle this indexing (punting on this for now)
  • index in local dev
  • index in stage & test it
  • index in prod
@mejackreed
Contributor Author

mejackreed commented Jun 30, 2020

@jkeck has already done some work to reindex stage and prod for OpenGeoMetadata records (not OGP). Let's scope OGP out of this ticket.

Maybe the next step is to add these to stage and evaluate them:

ca.frdr.geodisy
edu.cornell
edu.uarizona

And get feedback from the GIS team.

@mejackreed mejackreed self-assigned this Jul 2, 2020
@mejackreed
Contributor Author

There were some issues with indexing the edu.uarizona data, so I think maybe only edu.cornell is what we should be looking at.

@thatbudakguy
Member

via @kimdurante on #836:

Many of the records in EarthWorks from OpenGeoMetadata are out of date. External institutions have added/updated their metadata since our last harvest, resulting in data showing as unavailable. Example from Harvard's collection:

https://earthworks.stanford.edu/catalog/harvard-g8964-a2-1876-l3

This is also true for other institutions (NYU, Princeton, etc.). This was reported to me by Harvard staff who were hearing from end users about not being able to access content through EarthWorks. They said it was not extremely urgent.

A regularly-scheduled harvest of content from OpenGeoMetadata would be ideal. However, a one-time refresh would suffice for now. We will be looking at changes in the coming year related to the metadata schema, so perhaps we can look into any changes related to the harvesting process at that time as well.

Thanks! Please let me know if you need any other information.

@thatbudakguy
Member

currently we're set up to index these institutions:

edu.berkeley
edu.columbia
edu.nyu
edu.princeton.arks
edu.cornell
big-ten
edu.virginia

@kimdurante are we interested in indexing from any other institutions? it looks like harvard has a lot of records available, and i also see listings in OGM for penn and penn state along with several other institutions.
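
A minimal sketch of how the configured list above could be compared against what OpenGeoMetadata offers. It assumes each institution's records live in a GitHub repository of the same name under the OpenGeoMetadata organization; the `available` names in the usage note are hypothetical placeholders, not a verified listing, and none of this is the actual EarthWorks rake task.

```python
# Sketch only: repository names are taken from the list above; the helper
# names (clone_url, missing_repos) are hypothetical, not part of EarthWorks.
OGM_ORG_URL = "https://github.com/OpenGeoMetadata"

CONFIGURED_REPOS = [
    "edu.berkeley",
    "edu.columbia",
    "edu.nyu",
    "edu.princeton.arks",
    "edu.cornell",
    "big-ten",
    "edu.virginia",
]


def clone_url(repo_name: str) -> str:
    """Build the git clone URL for one OpenGeoMetadata repository."""
    return f"{OGM_ORG_URL}/{repo_name}.git"


def missing_repos(available, configured=CONFIGURED_REPOS):
    """Return repositories that exist upstream but are not yet configured."""
    configured_set = set(configured)
    return sorted(name for name in available if name not in configured_set)
```

For example, `missing_repos(["edu.harvard", "edu.nyu"])` would report only `edu.harvard`, since `edu.nyu` is already in the configured list.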

@kimdurante
Collaborator

Hi @thatbudakguy
Is it possible to index all institutions?
We should refresh Harvard's metadata for sure, what we have currently is outdated.
Thanks so much for doing this!

@thatbudakguy thatbudakguy pinned this issue Jun 16, 2022
@thatbudakguy
Member

@kimdurante definitely! would you want to just update the institutions we currently hold, or also pull in newer ones from OGM like Iowa, George Mason, Arizona, etc.?

@kimdurante
Collaborator

I'd say all of them - old & new.
I'm happy to look over the reharvested metadata once it's done.

@kimdurante
Collaborator

[Image: index-all-the-things-1200x630]

@thatbudakguy
Member

@kimdurante ok, I've reharvested everything (except stanford records) and the result is visible in staging at https://earthworks-stage.stanford.edu/. let me know how it looks.

@thatbudakguy
Member

@kimdurante did you get a chance to look at staging? maybe this notification got lost

@kimdurante
Collaborator

Hi @thatbudakguy! Sorry I did not respond sooner. I am out until Thursday and will report back then. Thanks!

@kimdurante
Collaborator

Looks great to me @thatbudakguy
Thanks so much for doing this! There are some records from outside institutions which are still showing as not available, but I do not think this is on our end. I will give a more detailed update at our meeting on Wednesday.

@thatbudakguy thatbudakguy moved this from Ready to Blocked in Geo Workcycles 2024 Nov 9, 2022
@thatbudakguy
Member

blocked by OpenGeoMetadata/GeoCombine#130 because we don't want to index aardvark metadata and get a ton of errors while reindexing.
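
One way to picture the filtering this blocker is about, as a rough sketch rather than GeoCombine's actual logic: walk a cloned repository and keep only GeoBlacklight 1.0 records. It assumes 1.0 records carry a `geoblacklight_version` key and Aardvark records carry `gbl_mdVersion_s`; the function names are hypothetical.

```python
import json
from pathlib import Path


def is_aardvark(record: dict) -> bool:
    """Aardvark records declare their schema in gbl_mdVersion_s."""
    return "gbl_mdVersion_s" in record


def iter_v1_records(root):
    """Yield GeoBlacklight 1.0 records found in JSON files under root.

    Files that are not valid JSON, and JSON documents that are not
    GeoBlacklight 1.0 records (including Aardvark ones), are skipped.
    """
    for path in sorted(Path(root).rglob("*.json")):
        try:
            data = json.loads(path.read_text(encoding="utf-8"))
        except (json.JSONDecodeError, UnicodeDecodeError):
            continue  # unreadable file: skip rather than abort the whole run
        # A file may hold a single record or a list of records.
        records = data if isinstance(data, list) else [data]
        for record in records:
            if not isinstance(record, dict):
                continue
            if "geoblacklight_version" in record and not is_aardvark(record):
                yield record
```

Skipping unsupported records up front, instead of letting the indexer reject them one by one, is what keeps the reindex log free of a flood of schema errors.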

@thatbudakguy thatbudakguy moved this from Blocked to Ready in Geo Workcycles 2024 Jan 26, 2023
@thatbudakguy
Member

making some notes for myself as I test indexing in staging:

  • can we just get rid of OGP records?
  • how do we fix long institution/provenance names? e.g. "UW-Madison Robinson Map Library" -> "Wisconsin"
  • how do we handle the "fill in your values here" PolicyMap records in the shared repository (see also Reharvest Metadata for Web Maps #847)?
  • how do we make sure to get records that are only in people's GeoBlacklight instances (e.g. MIT)?
  • what to do about MassGIS?
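
For the long provenance names question above, one possible approach is an alias table applied before indexing. This is only a sketch: it assumes the institution name is stored in the GeoBlacklight 1.0 field `dct_provenance_s`, and the alias table and function name are hypothetical, seeded with the single example from the notes.

```python
# Hypothetical alias table; only the one mapping mentioned in the notes.
PROVENANCE_ALIASES = {
    "UW-Madison Robinson Map Library": "Wisconsin",
}


def normalize_provenance(record: dict, aliases=PROVENANCE_ALIASES) -> dict:
    """Return a copy of the record with a shortened provenance name.

    Records whose provenance has no alias are returned unchanged.
    """
    name = record.get("dct_provenance_s")
    if name in aliases:
        return {**record, "dct_provenance_s": aliases[name]}
    return record
```

Returning a new dict rather than mutating the input keeps the harvested metadata on disk untouched, so the mapping can be revised and re-applied later.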

@thatbudakguy thatbudakguy moved this from Ready to In Progress in Geo Workcycles 2024 Jan 26, 2023
@thatbudakguy
Member

thatbudakguy commented Jan 27, 2023

notes from meeting with @kimdurante:

@thatbudakguy
Member

thatbudakguy commented Mar 29, 2023

update: MassGIS has been contacted; they are looking into the possibility of hosting on OGM. for now, we will live without their records.

remaining tasks:

  • merge and cut a release for GeoCombine to ensure our harvesting/indexing logic is bug-free
  • point earthworks at released GeoCombine (and see if this workflow step can be removed again?):
      # required to avoid https://github.com/actions/runner-images/issues/37
      # because faraday depends on patron, which requires curl headers to build
      - name: Install cURL Headers
        run: sudo apt-get update && sudo apt-get install libcurl4-openssl-dev
  • create a new production core in sul-dlss/solr-configs (Earthworks config updates sul-solr-configs#258)
  • index Stanford records into the new production core
  • index OGM records into the new production core
  • point staging at the new production core temporarily so the GIS team can confirm it looks good
  • point UAT and prod at the new production core. profit!

thatbudakguy added a commit to sul-dlss/sul-solr-configs that referenced this issue Mar 29, 2023
This is preparation for building a new production core for EW.
The core will be populated with Stanford data and a fresh reindex
of OpenGeoMetadata records, then swapped into prod and UAT.

Ref sul-dlss/earthworks#591
cbeer pushed a commit to sul-dlss/sul-solr-configs that referenced this issue Mar 29, 2023
@thatbudakguy
Member

updates are live!
