-
Notifications
You must be signed in to change notification settings - Fork 128
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add default geolocation rules #1744
Conversation
Thanks @kimandrews for flagging the USA territories! Will update rules for those in new commits. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: didn't look at augur/data/geolocation_rules.tsv
because it's too big to load in Github and I'm too lazy to check out the branch locally at the moment.
|
||
MIT License | ||
|
||
Copyright (c) 2020 Nextstrain |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should probably strip the year here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, good to know. Also added ncov-ingest to the checklist in the issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, actually how would this work?
The original copyright still includes the year, so should this copy still include the year until the original removes it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, license file should reflect the original. I'd leave as-is for this PR.
country New Zealand -41.500083 172.8344077 | ||
country Nicaragua 12.6090157 -85.2936911 | ||
country Niger 18.061703 10.145544 | ||
country Nigeria 10.0 8.0 | ||
country North Macedonia 41.6171214 21.7168387 | ||
country Northern Mariana Islands 14.149020499999999 145.21345248318923 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there any value to trying to truncate some of these to a standard number of significant digits? Picking this one particularly because that latitude looks a whole lot like a floating-point representation issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The lat/longs are included in the exported Auspice JSON if the dataset has a matching location. So if we truncate these, it could potentially decrease the size of the Auspice JSONs.
Let's say, on average we truncate by 5 digits per value, so that's 10 bytes per location. If a dataset included all of the locations of this lat_longs.tsv, the max reduction is 10 * 3213 = 32,130 bytes or 0.03213 MB.
Probs not worth it 🤷♀️
17b54ae
to
53feb1c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't scanned the actual TSV very much but I'm very happy to see both the inclusion of this data within augur itself and the changes to "country" to better reflect geographic location on the map
This commit copies the latest geolocation rules from ncov-ingest¹ which we've been using as our "central" geolocation rules across pathogen ingest workflows. The subsequent commits will modify the rules to remove GISAID specific entries and incorporate the rules as the default rules for the `augur curate apply-geolocation-rules` command. ¹ <https://github.com/nextstrain/ncov-ingest/blob/71ff771dc83ca5c5d14ea6de70132ea2e52a2ab6/source-data/gisaid_geoLocationRules.tsv>
Sort geolocation rules by the annotated country to make manual curation easier. I did this using visidata and the replayable commands are saved at <https://gist.github.com/joverlee521/6d257c5b5045dd928dd374556baada9a>
Add clear demarcation of sections of geolocation rules Within each section, the rules are sorted by alphabetical order of the raw geolocation values. Subsequent commits will clean up the general geolocation rules since they seem to be specific to the ncov-ingest data.
Remove multiple general geolocation rules that are specific to the ncov-ingest data. 1. Removed rule matching a date string `24 de Diciembre` 2. Removed rules that use abbreviations that seem too general to match to specific locations. 3. Removed the general rule for "Milwaukee" because there are multiple locations named Milwaukee and not all of them are counties.
Part of effort to match places to the geographic location rather than diplomatic semantics.¹ Similar to the geolocation rules used in rabies,² match "French Guiana" to its geographic location: region = South America country = French Guiana division = French Guiana ¹ <https://bedfordlab.slack.com/archives/CFEU0GNNS/p1736375470477489> ² <nextstrain/rabies#22>
Similar to previous commit, this is part of effort to match places to the geographic location rather than diplomatic semantics. Match "Réunion" to its geographic location: region = Africa country = Réunion division = Réunion
Similar to previous commit, this is part of effort to match places to the geographic location rather than diplomatic semantics. Match "Guadeloupe" to its geographic location: region = North America country = Guadeloupe division = Guadeloupe
Similar to previous commit, this is part of effort to match places to the geographic location rather than diplomatic semantics. Match "Martinique" to its geographic location: region = North America country = Martinique division = Martinique
Similar to previous commit, this is part of effort to match places to the geographic location rather than diplomatic semantics. Match "Mayotte" to its geographic location: region = Africa country = Mayotte division = Mayotte
Similar to previous commit, this is part of effort to match places to the geographic location rather than diplomatic semantics. Match "New Caledonia" to its geographic location: region = Oceania country = New Caledonia division = New Caledonia
Similar to previous commit, this is part of effort to match places to the geographic location rather than diplomatic semantics. Match "Wallis and Futuna" to its geographic location: region = Oceania country = Wallis and Futuna division = Wallis and Futuna Also consolidates "Wallis and Futuna" and "Wallis and Futuna Islands".
Similar to previous commit, this is part of effort to match places to the geographic location rather than diplomatic semantics. Match "Canary Islands" to its geographic location: region = Africa country = Canary Islands division = Canary Islands
Similar to previous commit, this is part of effort to match places to the geographic location rather than diplomatic semantics. Match "Sint Eustatius" to its geographic location: region = North America country = Sint Eustatius division = Sint Eustatius Also updates location to use the fully spelled out name instead of the abbreviated "St Eustatius" to match other locations.
Similar to previous commit, this is part of effort to match places to the geographic location rather than diplomatic semantics. Match "Anguilla" to its geographic location: region = North America country = Anguilla division = Anguilla
Similar to previous commit, this is part of effort to match places to the geographic location rather than diplomatic semantics. Match "British Virgin Islands" to its geographic location: region = North America country = British Virgin Islands division = British Virgin Islands
Similar to previous commit, this is part of effort to match places to the geographic location rather than diplomatic semantics. Match "Cayman Islands" to its geographic location: region = North America country = Cayman Islands division = Cayman Islands
Similar to previous commit, this is part of effort to match places to the geographic location rather than diplomatic semantics. Match "Montserrat" to its geographic location: region = North America country = Montserrat division = Montserrat
Similar to previous commit, this is part of effort to match places to the geographic location rather than diplomatic semantics. Match "Turks and Caicos Islands" to its geographic location: region = North America country = Turks and Caicos Islands division = Turks and Caicos Islands
Similar to previous commit, this is part of effort to match places to the geographic location rather than diplomatic semantics. Match "American Samoa" to its geographic location: region = Oceania country = American Samoa division = American Samoa
Similar to previous commit, this is part of effort to match places to the geographic location rather than diplomatic semantics. Match "Guam" to its geographic location: region = Oceania country = Guam division = Guam
Similar to previous commit, this is part of effort to match places to the geographic location rather than diplomatic semantics. Match "Northern Mariana Islands" to its geographic location: region = Oceania country = Northern Mariana Islands division = Northern Mariana Islands
Similar to previous commit, this is part of effort to match places to the geographic location rather than diplomatic semantics. Match "Puerto Rico" to its geographic location: region = North America country = Puerto Rico division = Puerto Rico
Similar to previous commit, this is part of effort to match places to the geographic location rather than diplomatic semantics. Match "US Virgin Islands" to its geographic location: region = North America country = US Virgin Islands division = US Virgin Islands
Adds unit test to check the default geolocation rules are "clean". - flag duplicate rules that can overwrite each other - flag rules that do not change values, i.e. raw == annotation - flags "cyclic" rules where the annotation exists in the raw values and the raw values exists in the annotations This already flagged problematic rules in the file that will be fixed in the following commit.
Removes unnecessary geolocation rules where the raw values are the same as the annotations. These were flagged by the unit test added in the previous commit.
96298c0
to
58eade9
Compare
Rebased to fix merge conflicts in the changelog. |
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #1744 +/- ##
=======================================
Coverage 73.06% 73.06%
=======================================
Files 79 79
Lines 8335 8335
Branches 1700 1700
=======================================
Hits 6090 6090
Misses 1958 1958
Partials 287 287 ☔ View full report in Codecov by Sentry. |
Description of proposed changes
Add default geolocation rules that can be used for
augur curate apply-geolocation-rules
. This PR only adds the default rules, but does not incorporate them into theaugur curate
command. I plan to do that separately.The rules were originally copied from ncov-ingest and then sorted and modified to match places to the geographic location rather than diplomatic semantics (as discussed on Slack).
Welcome any suggestions on organization and any sharp eyes that can flag weird rules that should be removed/modified.
Related issue(s)
Resolves #1488
Checklist