Enable Chinese as origin language #316

turbanisch · 2022-08-25T19:32:56Z

So far, there seems to be no way of converting country names from Chinese, not even by left-joining any of the dataframes that come with countrycode. I assume the reason for this is the lack of a corresponding set of regexes? If there is any interest, I would like to offer to come up with one!

vincentarelbundock · 2022-08-25T19:49:28Z

If you can come up with regexes, that would be lovely!

But have you tried the countryname function? If so, what problem did you run into?

cjyetman · 2022-08-25T19:50:37Z

Since {countrycode} includes CLDR names in Chinese, one could (rather easily) build a custom dictionary from the included codelist to get exact matching, though it will not do fancy regex matching. For example:

library(countrycode)

custom_dict <-
  unique(countrycode::codelist[, c("cldr.short.zh", "country.name.en", "iso3c")])

countrycode(
  "阿拉伯联合酋长国",
  origin = "cldr.short.zh",
  destination = "country.name.en",
  custom_dict = custom_dict
)
#> [1] "United Arab Emirates"

countrycode(
  "阿拉伯联合酋长国",
  origin = "cldr.short.zh",
  destination = "iso3c",
  custom_dict = custom_dict
)
#> [1] "ARE"

But yes, fancy Chinese regexes would be very interesting!
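The custom-dictionary approach above can in principle be extended to regex matching through the `origin_regex` argument of `countrycode()`. The following is only a minimal sketch: the column name `zh.regex` and the two patterns are hypothetical, hand-written for illustration, and are not part of {countrycode}.

```r
library(countrycode)

# Hypothetical patterns for illustration only; they are NOT shipped with countrycode.
# Each pattern accepts both the full name and a common short form.
zh_regex_dict <- data.frame(
  zh.regex = c(
    "^(中华人民共和国|中国)$",
    "^(阿拉伯联合酋长国|阿联酋)$"
  ),
  iso3c = c("CHN", "ARE"),
  stringsAsFactors = FALSE
)

# origin_regex = TRUE asks countrycode to treat the origin column as regexes
# instead of matching it exactly.
result <- countrycode(
  c("中国", "阿联酋"),
  origin = "zh.regex",
  destination = "iso3c",
  custom_dict = zh_regex_dict,
  origin_regex = TRUE
)
result
```

Anchoring each pattern with `^` and `$` keeps short forms from matching inside longer, unrelated strings; whether that is too strict for real data is a separate design question.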

turbanisch · 2022-08-25T22:38:49Z

Ah, good idea, @cjyetman! I think eventually it would be nice to have more lenient regex matching, because even data from Chinese authorities is messy: for some countries it lists the full country name, for others just an abbreviated one.

I had quickly discarded countryname, but apparently I just picked the wrong examples: neither Germany nor France has a Chinese version, and the one for China matches only the full country name ("People's Republic of China"):

library(tidyverse)
library(countrycode)

# no matches
countryname("德国")
#> Warning in countrycode_convert(sourcevar = sourcevar, origin = origin, destination = dest, : Some values were not matched unambiguously: 德国
#> [1] NA
countryname("中国")
#> Warning in countrycode_convert(sourcevar = sourcevar, origin = origin, destination = dest, : Some values were not matched unambiguously: 中国
#> [1] NA

# there are Chinese country names in the data but not for all countries
# some match the official country name and may be too strict
countrycode::countryname_dict %>% 
  as_tibble() %>% 
  filter(str_detect(country.name.alt, "\\p{script=Han}")) %>% 
  filter(country.name.en %in% c("Germany", "France", "China"))
#> # A tibble: 2 × 2
#>   country.name.en country.name.alt
#>   <chr>           <chr>           
#> 1 China           中华人民共和国  
#> 2 China           中華人民共和國

Created on 2022-08-26 by the reprex package (v2.0.1)

Session info

sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.2.0 (2022-04-22)
#>  os       macOS Monterey 12.5
#>  system   aarch64, darwin20
#>  ui       X11
#>  language (EN)
#>  collate  en_US.UTF-8
#>  ctype    en_US.UTF-8
#>  tz       Europe/Berlin
#>  date     2022-08-26
#>  pandoc   2.18 @ /Applications/RStudio.app/Contents/MacOS/quarto/bin/tools/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version date (UTC) lib source
#>  assertthat    0.2.1   2019-03-21 [1] CRAN (R 4.2.0)
#>  backports     1.4.1   2021-12-13 [1] CRAN (R 4.2.0)
#>  broom         1.0.0   2022-07-01 [1] CRAN (R 4.2.0)
#>  cellranger    1.1.0   2016-07-27 [1] CRAN (R 4.2.0)
#>  cli           3.3.0   2022-04-25 [1] CRAN (R 4.2.0)
#>  colorspace    2.0-3   2022-02-21 [1] CRAN (R 4.2.0)
#>  countrycode * 1.4.0   2022-05-04 [1] CRAN (R 4.2.0)
#>  crayon        1.5.1   2022-03-26 [1] CRAN (R 4.2.0)
#>  DBI           1.1.2   2021-12-20 [1] CRAN (R 4.2.0)
#>  dbplyr        2.2.0   2022-06-05 [1] CRAN (R 4.2.0)
#>  digest        0.6.29  2021-12-01 [1] CRAN (R 4.2.0)
#>  dplyr       * 1.0.9   2022-04-28 [1] CRAN (R 4.2.0)
#>  ellipsis      0.3.2   2021-04-29 [1] CRAN (R 4.2.0)
#>  evaluate      0.15    2022-02-18 [1] CRAN (R 4.2.0)
#>  fansi         1.0.3   2022-03-24 [1] CRAN (R 4.2.0)
#>  fastmap       1.1.0   2021-01-25 [1] CRAN (R 4.2.0)
#>  forcats     * 0.5.1   2021-01-27 [1] CRAN (R 4.2.0)
#>  fs            1.5.2   2021-12-08 [1] CRAN (R 4.2.0)
#>  generics      0.1.3   2022-07-05 [1] CRAN (R 4.2.0)
#>  ggplot2     * 3.3.6   2022-05-03 [1] CRAN (R 4.2.0)
#>  glue          1.6.2   2022-02-24 [1] CRAN (R 4.2.0)
#>  gtable        0.3.0   2019-03-25 [1] CRAN (R 4.2.0)
#>  haven         2.5.0   2022-04-15 [1] CRAN (R 4.2.0)
#>  highr         0.9     2021-04-16 [1] CRAN (R 4.2.0)
#>  hms           1.1.1   2021-09-26 [1] CRAN (R 4.2.0)
#>  htmltools     0.5.3   2022-07-18 [1] CRAN (R 4.2.0)
#>  httr          1.4.3   2022-05-04 [1] CRAN (R 4.2.0)
#>  jsonlite      1.8.0   2022-02-22 [1] CRAN (R 4.2.0)
#>  knitr         1.39    2022-04-26 [1] CRAN (R 4.2.0)
#>  lifecycle     1.0.1   2021-09-24 [1] CRAN (R 4.2.0)
#>  lubridate     1.8.0   2021-10-07 [1] CRAN (R 4.2.0)
#>  magrittr      2.0.3   2022-03-30 [1] CRAN (R 4.2.0)
#>  modelr        0.1.8   2020-05-19 [1] CRAN (R 4.2.0)
#>  munsell       0.5.0   2018-06-12 [1] CRAN (R 4.2.0)
#>  pillar        1.8.0   2022-07-18 [1] CRAN (R 4.2.0)
#>  pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.2.0)
#>  purrr       * 0.3.4   2020-04-17 [1] CRAN (R 4.2.0)
#>  R6            2.5.1   2021-08-19 [1] CRAN (R 4.2.0)
#>  readr       * 2.1.2   2022-01-30 [1] CRAN (R 4.2.0)
#>  readxl        1.4.0   2022-03-28 [1] CRAN (R 4.2.0)
#>  reprex        2.0.1   2021-08-05 [1] CRAN (R 4.2.0)
#>  rlang         1.0.4   2022-07-12 [1] CRAN (R 4.2.0)
#>  rmarkdown     2.14    2022-04-25 [1] CRAN (R 4.2.0)
#>  rstudioapi    0.13    2020-11-12 [1] CRAN (R 4.2.0)
#>  rvest         1.0.2   2021-10-16 [1] CRAN (R 4.2.0)
#>  scales        1.2.0   2022-04-13 [1] CRAN (R 4.2.0)
#>  sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.2.0)
#>  stringi       1.7.8   2022-07-11 [1] CRAN (R 4.2.0)
#>  stringr     * 1.4.0   2019-02-10 [1] CRAN (R 4.2.0)
#>  tibble      * 3.1.8   2022-07-22 [1] CRAN (R 4.2.0)
#>  tidyr       * 1.2.0   2022-02-01 [1] CRAN (R 4.2.0)
#>  tidyselect    1.1.2   2022-02-21 [1] CRAN (R 4.2.0)
#>  tidyverse   * 1.3.1   2021-04-15 [1] CRAN (R 4.2.0)
#>  tzdb          0.3.0   2022-03-28 [1] CRAN (R 4.2.0)
#>  utf8          1.2.2   2021-07-24 [1] CRAN (R 4.2.0)
#>  vctrs         0.4.1   2022-04-13 [1] CRAN (R 4.2.0)
#>  withr         2.5.0   2022-03-03 [1] CRAN (R 4.2.0)
#>  xfun          0.31    2022-05-10 [1] CRAN (R 4.2.0)
#>  xml2          1.3.3   2021-11-30 [1] CRAN (R 4.2.0)
#>  yaml          2.3.5   2022-02-21 [1] CRAN (R 4.2.0)
#> 
#>  [1] /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────

turbanisch · 2022-09-25T22:33:40Z

I took a bit of a deep dive and came up with regular expressions for Chinese by scraping Wikipedia. I outlined the issue in a README (in short: Chinese has many variants, not just simplified vs. traditional scripts but also regional differences) and implemented a function as a proof of concept.

Do let me know if you would like to incorporate the regular expressions into countrycode and I will try to come up with a PR! I assume this is how I would have to prepare the codes? https://github.com/vincentarelbundock/countrycode#adding-a-new-code

There is one issue, though, that you might want to handle differently than I did: my function converts the input into simplified characters before matching via regular expressions. I discuss alternative implementations in the README mentioned above (https://github.com/turbanisch/chinese-countryname-regex). In short, for maintenance reasons I would suggest adding only the regular expressions for simplified Chinese, and perhaps adding a note to the countrycode documentation telling users to apply a traditional-to-simplified conversion themselves as needed.
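As a rough illustration of the "convert first, then match" idea, one option is the ICU transliterator exposed by {stringi}. This is an assumption chosen for the sketch, not necessarily what the proof-of-concept function uses; the input vector is made up.

```r
library(stringi)

# Made-up traditional-script input for illustration
traditional <- c("中華人民共和國", "德國")

# "Traditional-Simplified" is an ICU transform ID. The mapping works character
# by character, so it does not resolve regional vocabulary differences.
simplified <- stri_trans_general(traditional, "Traditional-Simplified")
simplified
```

The converted strings could then be passed to whatever simplified-Chinese matching is in place, which keeps the regexes themselves limited to one script.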

cjyetman · 2022-09-26T15:39:15Z

@turbanisch this is really cool, thanks for all your effort so far

just a suggestion if/when you start building a PR... I would suggest following the model of adding known variants of each country to a data file for the tests, and then testing each one of them in a {testthat} test, as done here...
https://github.com/vincentarelbundock/countrycode/blob/main/tests/testthat/data-known-name-variations.R
https://github.com/vincentarelbundock/countrycode/blob/main/tests/testthat/test-known-name-variations.R
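A known-variants test along these lines might look roughly like the sketch below. The list layout, the `zh.regex` column, and the patterns are all hypothetical; the actual data and test files linked above may be organised differently.

```r
library(testthat)
library(countrycode)

# Hypothetical test data: known Chinese name variants keyed by ISO3 code
known_zh_variants <- list(
  CHN = c("中国", "中华人民共和国"),
  ARE = c("阿联酋", "阿拉伯联合酋长国")
)

# Hypothetical patterns standing in for the regexes a PR would add
zh_regex_dict <- data.frame(
  zh.regex = c("^(中华人民共和国|中国)$", "^(阿拉伯联合酋长国|阿联酋)$"),
  iso3c = c("CHN", "ARE"),
  stringsAsFactors = FALSE
)

test_that("known Chinese name variants resolve to the expected iso3c code", {
  for (code in names(known_zh_variants)) {
    variants <- known_zh_variants[[code]]
    result <- countrycode(
      variants,
      origin = "zh.regex",
      destination = "iso3c",
      custom_dict = zh_regex_dict,
      origin_regex = TRUE
    )
    # Every variant of a country should map to that country's code
    expect_equal(as.character(result), rep(code, length(variants)))
  }
})
```

Keeping the variants in a separate data structure means new spellings found in messy data can be added without touching the test logic.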

vincentarelbundock · 2022-09-27T21:37:15Z

That all sounds great! Thanks for putting this together.

I won't be able to look at anything in the near term, but feel free to open a PR whenever you are ready, and I'll try to review when possible.

I also agree with the known-variations tests. Those would be super useful.

turbanisch · 2022-09-27T22:51:47Z

Thanks a lot to both of you! I will see how far I get and prepare a PR in the coming days.
