Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add default geolocation rules #1744

Merged
merged 26 commits into from
Feb 11, 2025
Merged

Add default geolocation rules #1744

merged 26 commits into from
Feb 11, 2025

Conversation

joverlee521
Copy link
Contributor

@joverlee521 joverlee521 commented Feb 4, 2025

Description of proposed changes

Add default geolocation rules that can be used for augur curate apply-geolocation-rules. This PR only adds the default rules, but does not incorporate them into the augur curate command. I plan to do that separately.

The rules were originally copied from ncov-ingest and then sorted and modified to match places to the geographic location rather than diplomatic semantics (as discussed on Slack).

Welcome any suggestions on organization and any sharp eyes that can flag weird rules that should be removed/modified.

Related issue(s)

Resolves #1488

Checklist

  • Automated checks pass
  • Check if you need to add a changelog message
  • Check if you need to add tests
  • Check if you need to update docs

@joverlee521 joverlee521 linked an issue Feb 4, 2025 that may be closed by this pull request
@joverlee521
Copy link
Contributor Author

Thanks @kimandrews for flagging the USA territories! Will update rules for those in new commits.

@joverlee521 joverlee521 marked this pull request as draft February 5, 2025 01:39
Copy link
Contributor

@genehack genehack left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: didn't look at augur/data/geolocation_rules.tsv because it's too big to load in Github and I'm too lazy to check out the branch locally at the moment.


MIT License

Copyright (c) 2020 Nextstrain
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should probably strip the year here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, good to know. Also added ncov-ingest to the checklist in the issue.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, actually how would this work?
The original copyright still includes the year, so should this copy still include the year until the original removes it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, license file should reflect the original. I'd leave as-is for this PR.

country New Zealand -41.500083 172.8344077
country Nicaragua 12.6090157 -85.2936911
country Niger 18.061703 10.145544
country Nigeria 10.0 8.0
country North Macedonia 41.6171214 21.7168387
country Northern Mariana Islands 14.149020499999999 145.21345248318923
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there any value to trying to truncate some of these to a standard number of significant digits? Picking this one particularly because that latitude looks a whole lot like a floating-point representation issue.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The lat/longs are included in the exported Auspice JSON if the dataset has a matching location. So if we truncate these, it could potentially decrease the size of the Auspice JSONs.

Let's say, on average we truncate by 5 digits per value, so that's 10 bytes per location. If a dataset included all of the locations of this lat_longs.tsv, the max reduction is 10 * 3213 = 32,130 bytes or 0.03213 MB.
Probs not worth it 🤷‍♀️

@joverlee521 joverlee521 force-pushed the default-geolocation-rules branch from 17b54ae to 53feb1c Compare February 5, 2025 20:32
@joverlee521 joverlee521 marked this pull request as ready for review February 5, 2025 21:54
Copy link
Member

@jameshadfield jameshadfield left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't scanned the actual TSV very much but I'm very happy to see both the inclusion of this data within augur itself and the changes to "country" to better reflect geographic location on the map

@victorlin victorlin added the breaking Makes a backwards incompatible change and should wait for major release label Feb 8, 2025
This commit copies the latest geolocation rules from ncov-ingest¹
which we've been using as our "central" geolocation rules across
pathogen ingest workflows. The subsequent commits will modify the rules
to remove GISAID specific entries and incorporate the rules as the
default rules for the `augur curate apply-geolocation-rules` command.

¹ <https://github.com/nextstrain/ncov-ingest/blob/71ff771dc83ca5c5d14ea6de70132ea2e52a2ab6/source-data/gisaid_geoLocationRules.tsv>
Sort geolocation rules by the annotated country to make manual curation
easier.

I did this using visidata and the replayable commands are saved at
<https://gist.github.com/joverlee521/6d257c5b5045dd928dd374556baada9a>
Add clear demarcation of sections of geolocation rules
Within each section, the rules are sorted by alphabetical order of the
raw geolocation values.

Subsequent commits will clean up the general geolocation rules
since they seem to be specific to the ncov-ingest data.
Remove multiple general geolocation rules that are specific to the
ncov-ingest data.

1. Removed rule matching a date string `24 de Diciembre`
2. Removed rules that use abbreviations that seem too general to
match to specific locations.
3. Removed the general rule for "Milwaukee" because there are multiple
locations named Milwaukee and not all of them are counties.
Part of effort to match places to the geographic location rather than
diplomatic semantics.¹ Similar to the geolocation rules used in
rabies,² match "French Guiana" to its geographic location:

    region = South America
    country = French Guiana
    division = French Guiana

¹ <https://bedfordlab.slack.com/archives/CFEU0GNNS/p1736375470477489>
² <nextstrain/rabies#22>
Similar to previous commit, this is part of effort to match places to
the geographic location rather than diplomatic semantics.
Match "Réunion" to its geographic location:

    region = Africa
    country = Réunion
    division = Réunion
Similar to previous commit, this is part of effort to match places to
the geographic location rather than diplomatic semantics.
Match "Guadeloupe" to its geographic location:

    region = North America
    country = Guadeloupe
    division = Guadeloupe
Similar to previous commit, this is part of effort to match places to
the geographic location rather than diplomatic semantics.
Match "Martinique" to its geographic location:

    region = North America
    country = Martinique
    division = Martinique
Similar to previous commit, this is part of effort to match places to
the geographic location rather than diplomatic semantics.
Match "Mayotte" to its geographic location:

    region = Africa
    country = Mayotte
    division = Mayotte
Similar to previous commit, this is part of effort to match places to
the geographic location rather than diplomatic semantics.
Match "New Caledonia" to its geographic location:

    region = Oceania
    country = New Caledonia
    division = New Caledonia
Similar to previous commit, this is part of effort to match places to
the geographic location rather than diplomatic semantics.
Match "Wallis and Futuna" to its geographic location:

    region = Oceania
    country = Wallis and Futuna
    division = Wallis and Futuna

Also consolidates "Wallis and Futuna" and "Wallis and Futuna Islands".
Similar to previous commit, this is part of effort to match places to
the geographic location rather than diplomatic semantics.
Match "Canary Islands" to its geographic location:

    region = Africa
    country = Canary Islands
    division = Canary Islands
Similar to previous commit, this is part of effort to match places to
the geographic location rather than diplomatic semantics.
Match "Sint Eustatius" to its geographic location:

    region = North America
    country = Sint Eustatius
    division = Sint Eustatius

Also updates location to use the fully spelled out name instead of
the abbreviated "St Eustatius" to match other locations.
Similar to previous commit, this is part of effort to match places to
the geographic location rather than diplomatic semantics.
Match "Anguilla" to its geographic location:

    region = North America
    country = Anguilla
    division = Anguilla
Similar to previous commit, this is part of effort to match places to
the geographic location rather than diplomatic semantics.
Match "British Virgin Islands" to its geographic location:

    region = North America
    country = British Virgin Islands
    division = British Virgin Islands
Similar to previous commit, this is part of effort to match places to
the geographic location rather than diplomatic semantics.
Match "Cayman Islands" to its geographic location:

    region = North America
    country = Cayman Islands
    division = Cayman Islands
Similar to previous commit, this is part of effort to match places to
the geographic location rather than diplomatic semantics.
Match "Montserrat" to its geographic location:

    region = North America
    country = Montserrat
    division = Montserrat
Similar to previous commit, this is part of effort to match places to
the geographic location rather than diplomatic semantics.
Match "Turks and Caicos Islands" to its geographic location:

    region = North America
    country = Turks and Caicos Islands
    division = Turks and Caicos Islands
Similar to previous commit, this is part of effort to match places to
the geographic location rather than diplomatic semantics.
Match "American Samoa" to its geographic location:

    region = Oceania
    country = American Samoa
    division = American Samoa
Similar to previous commit, this is part of effort to match places to
the geographic location rather than diplomatic semantics.
Match "Guam" to its geographic location:

    region = Oceania
    country = Guam
    division = Guam
Similar to previous commit, this is part of effort to match places to
the geographic location rather than diplomatic semantics.
Match "Northern Mariana Islands" to its geographic location:

    region = Oceania
    country = Northern Mariana Islands
    division = Northern Mariana Islands
Similar to previous commit, this is part of effort to match places to
the geographic location rather than diplomatic semantics.
Match "Puerto Rico" to its geographic location:

    region = North America
    country = Puerto Rico
    division = Puerto Rico
Similar to previous commit, this is part of effort to match places to
the geographic location rather than diplomatic semantics.
Match "US Virgin Islands" to its geographic location:

    region = North America
    country = US Virgin Islands
    division = US Virgin Islands
Adds unit test to check the default geolocation rules are "clean".

- flag duplicate rules that can overwrite each other
- flag rules that do not change values, i.e. raw == annotation
- flags "cyclic" rules where the annotation exists in the raw values and
the raw values exists in the annotations

This already flagged problematic rules in the file that will be fixed in
the following commit.
Removes unnecessary geolocation rules where the raw values are the
same as the annotations. These were flagged by the unit test added in
the previous commit.
@joverlee521 joverlee521 force-pushed the default-geolocation-rules branch from 96298c0 to 58eade9 Compare February 10, 2025 19:56
@joverlee521
Copy link
Contributor Author

Rebased to fix merge conflicts in the changelog.

Copy link

codecov bot commented Feb 10, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 73.06%. Comparing base (eb8c8ac) to head (58eade9).
Report is 27 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #1744   +/-   ##
=======================================
  Coverage   73.06%   73.06%           
=======================================
  Files          79       79           
  Lines        8335     8335           
  Branches     1700     1700           
=======================================
  Hits         6090     6090           
  Misses       1958     1958           
  Partials      287      287           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@joverlee521 joverlee521 merged commit fa13f29 into master Feb 11, 2025
36 checks passed
@joverlee521 joverlee521 deleted the default-geolocation-rules branch February 11, 2025 21:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking Makes a backwards incompatible change and should wait for major release
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Curate standard geolocation rules
4 participants