Uniquely identify Russia vs. Soviet Union #180

vincentarelbundock · 2018-06-03T14:47:02Z

This is an email I just got.

still a big fan of countrycode! :-) One small thing I noticed in a BIS dataset that uses both the Soviet Union (with empty values until today, don't ask me why... -.- ) and Russia:

countrycode("Soviet Union", "country.name", "iso3c")
countrycode("Russia", "country.name", "iso3c")

both return "RUS" for me. Took me about 45 minutes to find that bug as I never fathomed that my data would look like that and that countrycode wouldn't uniquely identify the two. I'm in the middle of an update and the data changed to include the SU which is why I wasn't careful enough. Still, maybe that could be changed in your next update to uniquely identify the two?

All the best and thanks for the package!

Also related to this issue: #179

Jakobian123 · 2018-06-03T18:49:16Z

thanks for posting this! In your reply, you wrote

This is a tricky one. The Soviet Union doesn't have an ISO code, obviously. Strictly speaking, it should probably return NA.

I agree with that. Especially, since mapping different countries into the same iso can create all sorts of problems (it did for me). In effect, we're mapping 15 countries into one country here ;-) An "NA" is more likely to create immediate problems which the researcher can address. "Prussia" for example is not matched so this would be consistent:

countrycode("Prussia","country.name","iso3c")

returns an error. I also would have expected this code:

countrycode(c("Russia","Soviet Union"),"country.name", "iso3c")

to produce the usual warning about no unique mapping instead of quietly creating c("RUS", "RUS") but maybe I'm wrong. If it is interesting: for me personally (don't know if this is feasible) the ideal behaviour of countrycode would have been something like:

countrycode(c("Prussia", "Soviet Union"), "country.name", "iso3c")

"Warning message:
In countrycode(c("Prussia", "Soviet Union"), "country.name", "iso3c") :
Some values do not have an equivalent in the target code system: Prussia, Soviet Union"

maybe such a warning could be extended to your Germany issue or all historical countries?
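Until something like that exists in the package, a user-side check along these lines can flag distinct inputs that collapse onto one code. This is only a sketch: `find_collisions` is a hypothetical helper, not part of countrycode, and the `output` vector is hard-coded from the results reported in this thread rather than produced by calling the package.

```r
# Hypothetical helper (not part of countrycode): list the input values
# whose converted code is shared with at least one other distinct input.
find_collisions <- function(input, output) {
  pairs <- unique(data.frame(input = input, output = output,
                             stringsAsFactors = FALSE))
  pairs <- pairs[!is.na(pairs$output), , drop = FALSE]
  counts <- table(pairs$output)
  shared <- names(counts)[counts > 1]
  pairs[pairs$output %in% shared, , drop = FALSE]
}

# Codes hard-coded from the behaviour described above (assumption:
# in real use, `output` would come from countrycode()).
input  <- c("Russia", "Soviet Union", "France")
output <- c("RUS", "RUS", "FRA")

collisions <- find_collisions(input, output)
if (nrow(collisions) > 0) {
  warning("Several inputs share one code: ",
          paste(collisions$input, collapse = ", "), call. = FALSE)
}
```

Running the check right after each conversion would surface the Russia/Soviet Union collision immediately instead of during later analysis.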

vincentarelbundock · 2018-06-03T18:53:25Z

I kind of agree that behavior should be stricter. Kind of like na.rm = FALSE by default.

cjyetman · 2018-06-04T09:01:59Z

According to Wikipedia, USSR has an "exceptionally reserved code" SUN ("From June 2008; Transitionally reserved from September 1992; Officially assigned before being deleted from ISO 3166-1") in iso3c, and a "withdrawn code" 810 in iso3n. It was decided that not including withdrawn codes is "formally correct behavior" in #155, so it seems NA should be the correct behavior in this case.

But this still leaves open the possibility that some code sets view a former country the same as a current country, as with West Germany being discussed in #179, so...

having separate rows in the codelist for USSR and Russia might cause some code sets to have duplicate codes on different rows, which causes problems
having the Russia regex not match "USSR" and/or "Soviet Union" might break code sets that would/should give the same result for Russia and USSR

cjyetman · 2018-06-04T09:06:25Z

maybe we could add a code set specific tie breaker column for code sets that don't distinguish between some countries that other code sets do... and if an origin code has multiple matches, the tie breaker column would determine which to use? (also relevant to #179 )

vincentarelbundock · 2018-06-04T11:39:21Z

having separate rows in the codelist for USSR and Russia might cause some code sets to have duplicate codes on different rows, which causes problems

Yep. For example, CoW only uses RUS to represent both. So we'd need a single regex for CoW, but separate regexes for iso3c.

Would having different rows with tie breaks mess with our dictionary building process? I'm worried we'll add hacks on top of hacks...

cjyetman · 2018-06-04T12:09:33Z

I haven't totally thought that through, but...

I would imagine something like there being a CSV/data.frame that everything begins from... that has a country.name.en.regex column, plus, for example, columns like cowc_tiebreaker where necessary... and everything else would get merged into it based on the country.name.en.regex. cowc_tiebreaker could be a logical, and if you did countrycode("RUS", "cowc", "country.name"), the function would check for a cowc_tiebreaker column, and if it exists and a result matches two rows ([cowc:"RUS", country.name.en.regex:"\brussia", cowc_tiebreaker:TRUE] and [cowc:"RUS", country.name.en.regex:"\bsoviet.?union|u\.?s\.?s\.?r|socialist.?republics", cowc_tiebreaker:FALSE]), it would return the one where cowc_tiebreaker == TRUE.

It would definitely require some adaptation of the current build script, but I'm unsure how serious or complicated it would be. It would also require a major review of all the current code sets to determine which ones would require it and for which countries, though I suppose code sets could be adapted as the need was realized.
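A minimal sketch of the tie-breaker lookup described above, in base R. The two-row `codelist` and the function `lookup_with_tiebreaker` are hypothetical illustrations, not the package's actual dictionary or API.

```r
# Toy dictionary (assumption): two rows share the same cowc code, and
# cowc_tiebreaker marks which row wins when cowc is the origin.
codelist <- data.frame(
  country.name.en = c("Russia", "Soviet Union"),
  cowc            = c("RUS", "RUS"),
  cowc_tiebreaker = c(TRUE, FALSE),
  stringsAsFactors = FALSE
)

# Hypothetical lookup: if the origin code hits several rows and an
# <origin>_tiebreaker column exists, keep only the rows flagged TRUE.
lookup_with_tiebreaker <- function(code, origin, destination, dict) {
  hits <- dict[!is.na(dict[[origin]]) & dict[[origin]] == code, , drop = FALSE]
  tb <- paste0(origin, "_tiebreaker")
  if (nrow(hits) > 1 && tb %in% names(dict)) {
    hits <- hits[hits[[tb]], , drop = FALSE]
  }
  # Anything other than exactly one surviving row is treated as unmatched.
  if (nrow(hits) != 1) return(NA_character_)
  hits[[destination]]
}

result <- lookup_with_tiebreaker("RUS", "cowc", "country.name.en", codelist)
```

Returning NA when the tie is not resolved is one possible choice here; it follows the stricter behaviour argued for earlier in the thread.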

vincentarelbundock · 2018-06-04T12:26:03Z

In a way, that's not so different than before, when the regex conversion would just run iteratively and arbitrarily end up using the last regex in the CSV. (though obviously, with some more warnings and such.)

cjyetman · 2018-06-04T13:01:51Z

true, but a bit more explicit than depending on the order of the rows

it could also be based on an iso3c_include-type column for every origin/destination code, and then the function would always filter the codelist for rows where [origin]_include == TRUE before matching. country.name.en.regex_include would be TRUE for every row, but cowc_include would be FALSE for the USSR row, so you would only get one match for countrycode("RUS", "cowc", "country.name") ("Russia"), while both countrycode::countrycode("Russia", "country.name", "cowc") and countrycode::countrycode("USSR", "country.name", "cowc") would return "RUS".
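The include-column variant could look roughly like this in base R. Everything here is a hypothetical sketch: the toy `codelist`, the helper names `code_to_name` and `name_to_code`, and the simplification that every row is included on the regex side (so no country.name.en.regex_include column is modelled).

```r
# Toy dictionary (assumption): cowc_include is FALSE for the USSR row,
# so that row is ignored whenever cowc is the origin code.
codelist <- data.frame(
  country.name.en       = c("Russia", "Soviet Union"),
  country.name.en.regex = c("\\brussia",
                            "\\bsoviet.?union|u\\.?s\\.?s\\.?r|socialist.?republics"),
  cowc                  = c("RUS", "RUS"),
  cowc_include          = c(TRUE, FALSE),
  stringsAsFactors = FALSE
)

# Code -> name: filter on <origin>_include before matching the code.
code_to_name <- function(code, origin, dict) {
  keep <- dict[[paste0(origin, "_include")]]
  dict$country.name.en[keep & dict[[origin]] == code]
}

# Name -> code: every regex row takes part, so both names reach "RUS".
name_to_code <- function(name, destination, dict) {
  matched <- vapply(dict$country.name.en.regex,
                    function(rx) grepl(rx, name, ignore.case = TRUE, perl = TRUE),
                    logical(1), USE.NAMES = FALSE)
  unique(dict[[destination]][matched])
}

from_code   <- code_to_name("RUS", "cowc", codelist)
from_russia <- name_to_code("Russia", "cowc", codelist)
from_ussr   <- name_to_code("USSR", "cowc", codelist)
```

Compared with the tie-breaker column, the filter is applied before matching, so the lookup never sees the excluded row at all.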

vincentarelbundock · 2018-06-04T13:13:14Z

unfortunately, my countrycode bandwidth is exhausted for a while (need to get papers ready for summer conferences), but I'll keep thinking about all this.

vincentarelbundock mentioned this issue Jul 23, 2018

Many-to-One mappings #186

Open

cjyetman added match issue many-to-one issue labels Oct 27, 2018

vincentarelbundock closed this as completed Feb 8, 2020
