"Origin code not supported" error #256

Closed
geotheory opened this issue Dec 23, 2020 · 12 comments

Comments

@geotheory

Some fields are supported in the default internal dictionary but not others?

require(countrycode)
#> Loading required package: countrycode
x =  c("Spain","Greece","Bulgaria","Romania","Albania","Malta","Italy","France","Netherlands","United Kingdom")

guess_field(x)
#>                                    code percent_of_unique_matched
#> country.name.en         country.name.en                       100
#> cow.name                       cow.name                       100
#> cldr.name.en               cldr.name.en                       100
#> cldr.name.en_001       cldr.name.en_001                       100
#> cldr.name.en_au         cldr.name.en_au                       100
#> cldr.name.fil             cldr.name.fil                       100
#> cldr.name.luo             cldr.name.luo                       100
#> cldr.name.sn               cldr.name.sn                       100
#> cldr.variant.en         cldr.variant.en                       100
#> cldr.variant.en_001 cldr.variant.en_001                       100
#> cldr.variant.en_au   cldr.variant.en_au                       100
#> cldr.variant.fil       cldr.variant.fil                       100
#> cldr.variant.luo       cldr.variant.luo                       100
#> cldr.variant.sn         cldr.variant.sn                       100
#> iso.name.en                 iso.name.en                        90
#> p4.name                         p4.name                        90
#> un.name.en                   un.name.en                        90
#> vdem.name                     vdem.name                        90
#> cldr.short.en             cldr.short.en                        90
#> cldr.short.en_001     cldr.short.en_001                        90
#> cldr.short.en_au       cldr.short.en_au                        90
#> cldr.short.fil           cldr.short.fil                        90
#> cldr.short.luo           cldr.short.luo                        90
#> cldr.short.sn             cldr.short.sn                        90

countrycode(x, 'country.name.en', 'iso2c')
#>  [1] "ES" "GR" "BG" "RO" "AL" "MT" "IT" "FR" "NL" "GB"

countrycode(x, 'cow.name', 'iso2c')
#> Error in countrycode(x, "cow.name", "iso2c"): Origin code not supported by countrycode or present in the user-supplied custom_dict.
@cjyetman
Collaborator

see ?countrycode::codelist

'cow.name' is a "Destination only" code. That said, the output of guess_field() could make that clearer, or possibly leave such codes out entirely.

for some discussion of why it is a destination-only code, see for instance #179

@cjyetman
Collaborator

btw... nice to see guess_field() being used "in the wild" 😄

@vincentarelbundock
Owner

vincentarelbundock commented Dec 23, 2020

just to clarify, you should just use "country.name" as origin. It should work.

I'm now thinking that this confusion might be encouraged by guess_field. Users don't actually need to know what name format is used. We do all the guessing work for them using regular expressions.
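As a minimal sketch of that advice, the vector from the original report can be converted with "country.name" as the origin. The variable name `codes` is only illustrative; the expected output is the one already shown above for 'country.name.en', which is routed to the same regex matching.

```r
library(countrycode)

# Vector from the original report
x <- c("Spain", "Greece", "Bulgaria", "Romania", "Albania",
       "Malta", "Italy", "France", "Netherlands", "United Kingdom")

# "country.name" triggers regex-based matching, so the exact
# name format of the input does not need to be identified first
codes <- countrycode(x, "country.name", "iso2c")
print(codes)
#>  [1] "ES" "GR" "BG" "RO" "AL" "MT" "IT" "FR" "NL" "GB"
```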

@vincentarelbundock
Owner

maybe we should issue a warning when names are near the top of candidates.

@cjyetman
Collaborator

true... as I understood it, the original intention was to give likely candidates for an origin code that would cover most/all of a given input vector... that leads me to...

  1. it shouldn't ever suggest a destination only code
  2. "codes" that can use regex are largely irrelevant as suggestions here
  3. country name "codes" seem pretty weird here (though we do have an example of using it that way in the help file?!?)
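Until something like this lands in guess_field() itself, a user-side filter can approximate the idea. The sketch below is an assumption, not package behaviour: the regular expression that flags name-like columns is hypothetical, it only catches codes whose label contains ".name", ".variant" or ".short", and it does not detect destination-only codes in general.

```r
library(countrycode)

# ISO 3166-1 alpha-3 codes for five of the countries in the report
iso_input <- c("ESP", "GRC", "BGR", "ROU", "ALB")
candidates <- guess_field(iso_input)

# Hypothetical post-hoc filter (not part of countrycode):
# drop candidates whose label looks like a country-name column
is_name_like <- grepl("\\.(name|variant|short)", candidates$code)
usable <- candidates[!is_name_like, ]
print(usable)
```

With name input such as the vector in the original report, the same filter would be expected to leave few or no candidates, which is consistent with the suggestion to simply use "country.name" in that case.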

@vincentarelbundock
Owner

Yeah, removing them from the results would probably make sense. Would be nice if we could return "country.name" (and "country.name.de") instead.

But I'm off to cook the Joulukinkku now!

@geotheory
Author

How does providing "country.name" work then? It's not a field in countrycode::codelist.

I thought the function was a sort of universal translator from any name/code into any other. I'm unclear what is gained by adding restrictions to that - but then nor have I been through the full thought process that you guys have..

@vincentarelbundock
Owner

vincentarelbundock commented Dec 23, 2020

The problem with using a specific long form of country name is that they are not "standardized". The organizations that publish them do not usually "stand by" them in the same way they do for numeric or alpha codes (the Unicode org is an exception). So even the Correlates of War organization might spell the DRC differently in different publications. So even if a given vector of names really looks like cow.name, there is no guarantee that there won't be small variations to it in future (or longer) vectors. Moreover, slight differences in encoding can really mess things up, and so can a comma or an apostrophe.

I've been dealing with these codes for a while now, and IMO the better solution is to always merge based on shorter/standardized numeric or alpha codes.

When you call countrycode with country.name as "origin", the function will use regular expressions to detect the country name and assign it a unique identifier. Those regular expressions have been developed over the last 10 years, and they are thoroughly tested. They may not be perfect, but they are quite good and general, so they can detect variations of "North Korea", "People's republic of NK", etc.

So yeah, this is certainly an opinionated design choice, but I'm quite convinced that it is the right choice.

Of course, if you really want to use country.names as an "origin" code, it would be easy to select the two columns you want in the countrycode::codelist data.frame, and then use the merge command to insert the new column in your dataset.
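A minimal sketch of that merge approach, assuming the names in the data match the cow.name column exactly. The data frame `my_data` and its columns are made up for illustration; only `codelist` and its `cow.name` and `iso2c` columns come from the package.

```r
library(countrycode)

# Hypothetical dataset whose names are assumed to match cow.name exactly
my_data <- data.frame(
  country = c("Spain", "Greece", "Bulgaria"),
  value = c(1, 2, 3),
  stringsAsFactors = FALSE
)

# Keep only the two columns needed for the lookup and drop incomplete rows
lookup <- unique(na.omit(codelist[, c("cow.name", "iso2c")]))

# Exact-match join; all.x = TRUE keeps rows whose name has no match (iso2c is NA)
merged <- merge(my_data, lookup,
                by.x = "country", by.y = "cow.name", all.x = TRUE)
print(merged)
```

Because this is an exact join, any spelling, encoding or punctuation difference leaves `iso2c` as NA, which is the fragility described above.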

@vincentarelbundock
Owner

Example:

library(countrycode)
countrycode(
  c("Democratic republic of Congo", "Algeria", "USA"),
  "country.name",
  "iso3c")
#> [1] "COD" "DZA" "USA"

@cjyetman
Collaborator

> How does providing "country.name" work then? It's not a field in countrycode::codelist.

If I understood your question correctly, it works because of a hard-coded switch within the function that changes "country.name" to "country.name.en" if origin == 'country.name'

and further, there's a hard-coded switch within the function that sets origin_regex = TRUE if the origin code is "country.name.en" or "country.name.de", and then changes either of those to "country.name.en.regex" or "country.name.de.regex"

# Regex naming scheme
if (is.null(custom_dict)) { # only for default dictionary
    # English regex is default
    if (origin == 'country.name') {
        origin <- 'country.name.en'
    }
    if (destination == 'country.name') {
        destination <- 'country.name.en'
    }
    # .regex extension in dictionary colnames
    if (origin %in% c('country.name.en', 'country.name.de')) {
        origin <- paste0(origin, '.regex')
        origin_regex <- TRUE
    } else {
        origin_regex <- FALSE
    }
}

If I remember correctly, the reasoning behind that was...

  1. maintain backward compatibility (where "country.name" was the one and only regex column in codelist or the equivalent)
  2. enable custom dictionaries to also use regex matching, but without disrupting the more standard usage
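Both points can be checked from the outside with a short sketch. The first part only inspects column names and compares the two origin spellings. The second part uses a made-up two-row dictionary (`custom`, with hypothetical columns `pattern` and `label`) together with the `origin_regex` argument of countrycode(); the patterns are deliberately crude and are not the package's own regexes.

```r
library(countrycode)

# "country.name" itself is not a column of the built-in dictionary...
has_plain <- "country.name" %in% colnames(codelist)
# ...but the English name column and its regex companion are
has_en <- all(c("country.name.en", "country.name.en.regex") %in% colnames(codelist))

# Both spellings of the origin are routed to the same regex column
v <- c("Democratic republic of Congo", "Algeria", "USA")
same <- identical(
  countrycode(v, "country.name", "iso3c"),
  countrycode(v, "country.name.en", "iso3c")
)

# Hypothetical custom dictionary whose origin column holds regular expressions
custom <- data.frame(
  pattern = c("united states|usa", "congo"),
  label = c("US", "CD"),
  stringsAsFactors = FALSE
)
custom_out <- countrycode(
  c("united states of america", "democratic republic of congo"),
  origin = "pattern", destination = "label",
  custom_dict = custom, origin_regex = TRUE
)
print(custom_out)
```

With the default dictionary, regex matching is switched on by the origin name alone, while a custom dictionary has to ask for it explicitly through `origin_regex = TRUE`.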

@cjyetman
Collaborator

> I thought the function was a sort of universal translator from any name/code into any other. I'm unclear what is gained by adding restrictions to that - but then nor have I been through the full thought process that you guys have..

If one would like to do exact matching on one of the name columns of codelist, technically they could easily achieve that by setting custom_dict = codelist. But since the built-in regexes have been maintained and improved for 10+ years, you're more likely to get a better result from using them than from a snapshot of exact name matches taken at a certain point in that code's history.

library(countrycode)
cow_names <- na.omit(codelist$cow.name)
from_cow <- countrycode(cow_names, "cow.name", "iso2c", custom_dict = codelist)
#> Warning in countrycode(cow_names, "cow.name", "iso2c", custom_dict = codelist): Some values were not matched unambiguously: Austria-Hungary, Baden, Bavaria, Czechoslovakia, German Democratic Republic, Hanover, Hesse Electoral, Hesse Grand Ducal, Kosovo, Mecklenburg Schwerin, Modena, Parma, Republic of Vietnam, Saxony, Tuscany, Two Sicilies, Wuerttemburg, Yemen Arab Republic, Yemen People's Republic, Yugoslavia, Zanzibar
from_std <- countrycode(cow_names, "country.name", "iso2c")
#> Warning in countrycode(cow_names, "country.name", "iso2c"): Some values were not matched unambiguously: Austria-Hungary, Baden, Bavaria, Czechoslovakia, German Democratic Republic, Hanover, Hesse Electoral, Hesse Grand Ducal, Kosovo, Mecklenburg Schwerin, Modena, Parma, Republic of Vietnam, Saxony, Tuscany, Two Sicilies, Wuerttemburg, Yemen Arab Republic, Yemen People's Republic, Yugoslavia, Zanzibar
identical(from_cow, from_std)
#> [1] TRUE

@geotheory
Author

Thanks for the explainers @vincentarelbundock @cjyetman. I agree it feels sensible to exclude 'destination' schemes from the guess_field() results.

By the way you'd be surprised by how often I use guess_field(). I encounter new unspecified naming/coding systems almost every other day, and this function makes data joins much quicker. But until now I also didn't realise the complexity under the hood of the countrycode function. This is a pretty critical package :)
