Uniquely identify Russia vs. Soviet Union #180

Closed
vincentarelbundock opened this issue Jun 3, 2018 · 9 comments

Comments

@vincentarelbundock
Owner

vincentarelbundock commented Jun 3, 2018

This is an email I just got.

still a big fan of countrycode! :-) One small thing I noticed in a BIS dataset that uses both the Soviet Union (with empty values until today, don't ask me why... -.- ) and Russia:

countrycode("Soviet Union", "country.name", "iso3c")
countrycode("Russia", "country.name", "iso3c")

both return "RUS" for me. Took me about 45 minutes to find that bug as I never fathomed that my data would look like that and that countrycode wouldn't uniquely identify the two. I'm in the middle of an update and the data changed to include the SU, which is why I wasn't careful enough. Still, maybe that could be changed in your next update to uniquely identify the two?

All the best and thanks for the package!

Also related to this issue: #179

@Jakobian123

Jakobian123 commented Jun 3, 2018

thanks for posting this! In your reply, you wrote

This is a tricky one. The Soviet Union doesn't have an ISO code, obviously. Strictly speaking, it should probably return NA.

I agree with that. Especially, since mapping different countries into the same iso can create all sorts of problems (it did for me). In effect, we're mapping 15 countries into one country here ;-) An "NA" is more likely to create immediate problems which the researcher can address. "Prussia" for example is not matched so this would be consistent:

countrycode("Prussia","country.name","iso3c")

returns an error. I also would have expected this code:

countrycode(c("Russia","Soviet Union"),"country.name", "iso3c")

to produce the usual warning about no unique mapping instead of quietly creating c("RUS", "RUS"), but maybe I'm wrong. If it is of interest: for me personally (I don't know if this is feasible) the ideal behaviour of countrycode would have been something like:

countrycode(c("Prussia", "Soviet Union"), "country.name", "iso3c")

"Warning message:
In countrycode(c("Prussia", "Soviet Union"), "country.name", "iso3c") :
Some values do not have an equivalent in the target code system: Prussia, Soviet Union"

maybe such a warning could be extended to your Germany issue or all historical countries?
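
To make the requested behaviour concrete, here is a minimal base-R sketch. It does not call the countrycode package: `toy_dict` and `strict_iso3c` are hypothetical names invented for illustration, and the two-row dictionary is an assumption, not the package's real codelist. Unmatched names come back as NA and are listed together in one warning, as in the message proposed above.

```r
# Toy name-to-iso3c dictionary (illustrative only, not the package's codelist).
toy_dict <- data.frame(
  regex = c("\\brussia", "\\bfrance"),
  iso3c = c("RUS", "FRA"),
  stringsAsFactors = FALSE
)

# Hypothetical strict converter: a name with no matching row stays NA,
# and all unmatched names are reported in a single warning.
strict_iso3c <- function(x, dict = toy_dict) {
  out <- rep(NA_character_, length(x))
  for (i in seq_len(nrow(dict))) {
    hit <- grepl(dict$regex[i], x, perl = TRUE, ignore.case = TRUE)
    out[hit] <- dict$iso3c[i]
  }
  unmatched <- unique(x[is.na(out)])
  if (length(unmatched) > 0) {
    warning("Some values do not have an equivalent in the target code system: ",
            paste(unmatched, collapse = ", "))
  }
  out
}

# "Prussia" does not match "\\brussia" because of the word boundary.
res <- suppressWarnings(strict_iso3c(c("Russia", "Prussia", "Soviet Union")))
```

With this sketch, a silent collapse of "Soviet Union" into "RUS" cannot happen unless a dictionary row explicitly asks for it.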

@vincentarelbundock
Owner Author

I kind of agree that behavior should be stricter. Kind of like na.rm = FALSE by default.

@cjyetman
Collaborator

cjyetman commented Jun 4, 2018

According to Wikipedia, USSR has an "exceptionally reserved code" SUN ("From June 2008; Transitionally reserved from September 1992; Officially assigned before being deleted from ISO 3166-1") in iso3c, and a "withdrawn code" 810 in iso3n. It was decided in #155 that not including withdrawn codes is "formally correct behavior", so it seems NA should be the correct behavior in this case.

But this still leaves open the possibility that some code sets view a former country the same as a current country, as with West Germany being discussed in #179, so...

  • having separate rows in the codelist for USSR and Russia might cause some code sets to have duplicate codes on different rows, which causes problems
  • having the Russia regex not match "USSR" and/or "Soviet Union" might break code sets that would/should give the same result for Russia and USSR
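
The first problem can be seen in a small base-R sketch. The two-row `toy_codelist` and the helper `duplicated_codes` are hypothetical and only mirror the situation described in this thread (iso3c has no code for the USSR, while cowc uses RUS for both); they are not the package's actual data or API.

```r
# Toy codelist with separate rows for Russia and the USSR (assumed values).
toy_codelist <- data.frame(
  country.name.en = c("Russia", "Soviet Union"),
  iso3c = c("RUS", NA),
  cowc = c("RUS", "RUS"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: which codes in a column appear on more than one row?
duplicated_codes <- function(codelist, column) {
  codes <- codelist[[column]]
  codes <- codes[!is.na(codes)]
  unique(codes[duplicated(codes)])
}

cowc_dups  <- duplicated_codes(toy_codelist, "cowc")   # "RUS"
iso3c_dups <- duplicated_codes(toy_codelist, "iso3c")  # character(0)
```

A duplicated origin code means a reverse lookup such as cowc to country name has two candidate rows and no rule for choosing between them.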

@cjyetman
Collaborator

cjyetman commented Jun 4, 2018

maybe we could add a code set specific tie breaker column for code sets that don't distinguish between some countries that other code sets do... and if an origin code has multiple matches, the tie breaker column would determine which to use? (also relevant to #179 )

@vincentarelbundock
Owner Author

having separate rows in the codelist for USSR and Russia might cause some code sets to have duplicate codes on different rows, which causes problems

Yep. For example, CoW only uses RUS to represent both. So we'd need a single regex for CoW, but separate regexes for iso3c.

Would having different rows with tie breaks mess with our dictionary building process? I'm worried we'll add hacks on top of hacks...

@cjyetman
Collaborator

cjyetman commented Jun 4, 2018

I haven't totally thought that through, but...

I would imagine something like there being a CSV/data.frame that everything begins from, with a country.name.en.regex column plus, where necessary, columns like cowc_tiebreaker. Everything else would get merged into it based on country.name.en.regex. cowc_tiebreaker could be a logical: if you ran countrycode("RUS", "cowc", "country.name"), the function would check for a cowc_tiebreaker column, and if it exists and the result matches two rows ([cowc:"RUS", country.name.en.regex:"\brussia", cowc_tiebreaker:TRUE] and [cowc:"RUS", country.name.en.regex:"\bsoviet.?union|u\.?s\.?s\.?r|socialist.?republics", cowc_tiebreaker:FALSE]), it would return the one where cowc_tiebreaker == TRUE.

It would definitely require some adaptation of the current build script, but I'm unsure how serious or complicated it would be. It would also require a major review of all the current code sets to determine which ones would require it and for which countries, though I suppose code sets could be adapted as the need was realized.
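
A rough base-R sketch of the tie-breaker idea, under the assumption that the codelist carries a logical `<origin>_tiebreaker` column. The function `lookup_with_tiebreaker` and the toy data are hypothetical; nothing here is existing countrycode behaviour.

```r
# Toy codelist: both rows share the cowc code, one is flagged as preferred.
toy_codelist <- data.frame(
  country.name.en = c("Russia", "Soviet Union"),
  cowc = c("RUS", "RUS"),
  cowc_tiebreaker = c(TRUE, FALSE),
  stringsAsFactors = FALSE
)

# Hypothetical lookup: when several rows match the origin code, keep only
# the row whose <origin>_tiebreaker is TRUE; return NA if still ambiguous.
lookup_with_tiebreaker <- function(code, origin, destination,
                                   codelist = toy_codelist) {
  rows <- codelist[codelist[[origin]] %in% code, , drop = FALSE]
  tie_col <- paste0(origin, "_tiebreaker")
  if (nrow(rows) > 1 && tie_col %in% names(rows)) {
    rows <- rows[rows[[tie_col]] %in% TRUE, , drop = FALSE]
  }
  if (nrow(rows) != 1) return(NA_character_)
  rows[[destination]]
}

picked <- lookup_with_tiebreaker("RUS", "cowc", "country.name.en")  # "Russia"
```

The tie-breaker is only consulted when a match is ambiguous, so code sets without such a column would behave as before.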

@vincentarelbundock
Owner Author

In a way, that's not so different than before, when the regex conversion would just run iteratively and arbitrarily end up using the last regex in the CSV. (though obviously, with some more warnings and such.)
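
As I read this comment, the older behaviour amounts to "last matching regex wins". The base-R sketch below only illustrates that order dependence; `last_match_wins` and the shortened regexes are assumptions for illustration, not the package's historical code.

```r
# Two simplified name regexes, in "CSV order".
regexes <- c(Russia = "\\brussia", `Soviet Union` = "\\bsoviet.?union")

# Hypothetical iterative matcher: every matching pattern overwrites the
# previous result, so the last matching row decides the outcome.
last_match_wins <- function(x, patterns) {
  out <- NA_character_
  for (nm in names(patterns)) {
    if (grepl(patterns[[nm]], x, perl = TRUE, ignore.case = TRUE)) out <- nm
  }
  out
}

a <- last_match_wins("Russia (Soviet Union)", regexes)       # "Soviet Union"
b <- last_match_wins("Russia (Soviet Union)", rev(regexes))  # "Russia"
```

Reordering the rows changes the answer for the same input, which is the implicit tie-break that an explicit column would replace.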

@cjyetman
Collaborator

cjyetman commented Jun 4, 2018

true, but a bit more explicit than depending on the order of the rows

it could also be based on an iso3c_include type column for every origin/destination code, and then the function would always filter the codelist for rows where [origin]_include == TRUE before matching. country.name.en.regex_include would be TRUE for every row, but cowc_include would be FALSE for the USSR row, so you would only get one match for countrycode("RUS", "cowc", "country.name") ("Russia"), while both countrycode("Russia", "country.name", "cowc") and countrycode("USSR", "country.name", "cowc") would return "RUS".
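
A base-R sketch of the `_include` variant, assuming one logical `<origin>_include` column per origin code. `convert_with_include` and the toy codelist are hypothetical, and the function handles a single input value only to keep the sketch short.

```r
# Toy codelist with per-origin include flags (assumed layout).
toy_codelist <- data.frame(
  country.name.en = c("Russia", "Soviet Union"),
  country.name.en.regex = c(
    "\\brussia",
    "\\bsoviet.?union|u\\.?s\\.?s\\.?r|socialist.?republics"
  ),
  cowc = c("RUS", "RUS"),
  country.name.en.regex_include = c(TRUE, TRUE),
  cowc_include = c(TRUE, FALSE),
  stringsAsFactors = FALSE
)

# Hypothetical converter: filter rows on <origin>_include, then match.
convert_with_include <- function(x, origin, destination,
                                 codelist = toy_codelist) {
  rows <- codelist[codelist[[paste0(origin, "_include")]] %in% TRUE, ,
                   drop = FALSE]
  if (grepl("regex$", origin)) {
    # Regex origin: test the input against each row's pattern.
    hit <- unname(vapply(rows[[origin]], grepl, logical(1), x = x,
                         perl = TRUE, ignore.case = TRUE))
  } else {
    # Code origin: exact comparison.
    hit <- rows[[origin]] %in% x
  }
  out <- rows[[destination]][hit]
  if (length(out) == 1) out else NA_character_
}

from_code <- convert_with_include("RUS", "cowc", "country.name.en")
from_rus  <- convert_with_include("Russia", "country.name.en.regex", "cowc")
from_ussr <- convert_with_include("USSR", "country.name.en.regex", "cowc")
```

Here the cowc to name direction sees a single row, while both names still map to "RUS" in the other direction.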

@vincentarelbundock
Owner Author

Unfortunately, my countrycode bandwidth is exhausted for a while (need to get papers ready for summer conferences), but I'll keep thinking about all this.
