Curate NCBI geolocation metadata #3105

Open
Tracked by #2788
anna-parker opened this issue Oct 29, 2024 · 1 comment

@anna-parker
Contributor

anna-parker commented Oct 29, 2024

This curation was also requested by users.

NCBI’s geolocation field has the format country:subdivision. According to their documentation, it should be possible to split subdivision further into geoLocAdmin1 and geoLocAdmin2. However, for a large number of samples the subdivision field does not follow this format. Currently we map country to our geoLocCountry field and the entire subdivision field to geoLocAdmin1. We would like to be able to at least split this into geoLocAdmin1 and geoLocAdmin2.
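As a rough illustration of the split described above, here is a minimal sketch. It assumes the subdivision separates admin levels with a comma (for example `USA: California, San Diego`), which is an assumption and not something confirmed in this issue. The function name `split_ncbi_geo_location` and the `GeoLocation` type are hypothetical. Values that do not follow the assumed format simply end up whole in geoLocAdmin1, which matches the current behaviour.

```python
from typing import NamedTuple, Optional


class GeoLocation(NamedTuple):
    """Hypothetical container for the three target fields."""
    geoLocCountry: Optional[str]
    geoLocAdmin1: Optional[str]
    geoLocAdmin2: Optional[str]


def split_ncbi_geo_location(raw: Optional[str]) -> GeoLocation:
    """Split 'country:subdivision' and, where possible, the subdivision.

    Assumption: admin levels inside the subdivision are comma-separated.
    Only the first comma is used, so anything after it lands in geoLocAdmin2.
    """
    if raw is None or not raw.strip():
        return GeoLocation(None, None, None)

    # Everything before the first colon is the country.
    country, _, subdivision = raw.partition(":")
    country = country.strip() or None

    # Split the subdivision once; a missing part becomes None.
    parts = [part.strip() for part in subdivision.split(",", 1)]
    admin1 = parts[0] or None
    admin2 = None
    if len(parts) > 1:
        admin2 = parts[1] or None

    return GeoLocation(country, admin1, admin2)


if __name__ == "__main__":
    print(split_ncbi_geo_location("USA: California, San Diego"))
    print(split_ncbi_geo_location("Switzerland"))
```

A purely syntactic split like this cannot tell whether the first part really is an admin level 1 entity, so it would only produce suggestions for a curator to review.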

Discussion

From discussion in the group, these changes should actually be made by a curator bot:
https://docs.google.com/document/d/1TQiE66Hk6WjgkMvvMMhu8uMSEhHFP2AwINMAIsyODCw/edit

Proof of concept of suggestions in ingest

#3015
#3026

@anna-parker anna-parker changed the title from "At the moment we just split ncbi's ncbiGeoLocation field into division = geoLocAdmin1 and country=geoLocCountry. The division could still be further split into geoLocAdmin1 and geoLocAdmin2, e.g." to "Curate NCBI geolocation metadata" Oct 29, 2024
@anna-parker anna-parker self-assigned this Oct 29, 2024
@corneliusroemer
Contributor

The longest-term, highest-level decision is what to normalize to. Ideally we normalize to an established gazetteer that unambiguously allows pulling things like coordinates, and potentially other metadata, for free. This means we'd possibly want to normalize to an id instead of our current four location fields (these would then be derived semi-automatically from the entity the id points to).
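To make the "normalize to an id" idea concrete, here is a minimal sketch of how the four location fields could be derived from a single gazetteer id. Everything in it is hypothetical: the `GazetteerEntry` shape, the `toy:` ids, the field name `geoLocCity`, and the in-memory dictionary standing in for a downloaded dump. It does not reflect the actual schema of any real gazetteer, and the coordinates are approximate and only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class GazetteerEntry:
    """Hypothetical gazetteer record; real gazetteers use their own schema."""
    entry_id: str
    name: str
    level: str  # one of "country", "admin1", "admin2", "city"
    parent_id: Optional[str]
    latitude: float
    longitude: float


# Assumed mapping from gazetteer level to our location fields.
LEVEL_TO_FIELD = {
    "country": "geoLocCountry",
    "admin1": "geoLocAdmin1",
    "admin2": "geoLocAdmin2",
    "city": "geoLocCity",
}


def derive_location_fields(
    entry_id: str, gazetteer: Dict[str, GazetteerEntry]
) -> Dict[str, Optional[str]]:
    """Walk up the parent chain and fill one field per level encountered."""
    fields: Dict[str, Optional[str]] = {name: None for name in LEVEL_TO_FIELD.values()}
    seen = set()
    current: Optional[str] = entry_id
    while current is not None and current not in seen:
        seen.add(current)  # guards against cycles in malformed data
        entry = gazetteer[current]
        fields[LEVEL_TO_FIELD[entry.level]] = entry.name
        current = entry.parent_id
    return fields


# Toy data with made-up ids and approximate coordinates.
TOY_GAZETTEER = {
    "toy:1": GazetteerEntry("toy:1", "Switzerland", "country", None, 46.8, 8.2),
    "toy:2": GazetteerEntry("toy:2", "Basel-Stadt", "admin1", "toy:1", 47.56, 7.59),
    "toy:3": GazetteerEntry("toy:3", "Basel", "city", "toy:2", 47.56, 7.59),
}

if __name__ == "__main__":
    print(derive_location_fields("toy:3", TOY_GAZETTEER))
```

With this shape a record stores only the id, and levels that a country does not have (admin level 2 in the toy data) stay empty instead of being guessed.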

Another key decision is what we mean by admin level 1/2 and city for each country.

A target gazetteer should be open, allow dump downloads (to be free of rate limits), have a track record of stability, and be maintained.

GeoNames is one option, OpenStreetMap another.

The actual how-to should be an implementation detail. That isn't to devalue it, just a caution not to let a convenient implementation drive high-level, permanent decisions.
