Enable Chinese as origin language #316

Open
turbanisch opened this issue Aug 25, 2022 · 7 comments

Comments

@turbanisch

So far, there seems to be no way of converting country names from Chinese, not even by left-joining any of the dataframes that come with countrycode. I assume the reason for this is the lack of a corresponding set of regexes? If there is any interest, I would like to offer to come up with one!

@vincentarelbundock
Owner

If you can come up with regexes, that would be lovely!

But have you tried the countryname function? If so, what problem did you run into?

@cjyetman
Collaborator

cjyetman commented Aug 25, 2022

Since {countrycode} does include CLDR names in Chinese, one could (rather easily) create a custom dictionary using the included codelist to achieve exact matching, though it will not do fancy regex matching, e.g...

library(countrycode)

custom_dict <-
  unique(countrycode::codelist[, c("cldr.short.zh", "country.name.en", "iso3c")])

countrycode(
  "阿拉伯联合酋长国",
  origin = "cldr.short.zh",
  destination = "country.name.en",
  custom_dict = custom_dict
)
#> [1] "United Arab Emirates"

countrycode(
  "阿拉伯联合酋长国",
  origin = "cldr.short.zh",
  destination = "iso3c",
  custom_dict = custom_dict
)
#> [1] "ARE"

But yes, fancy Chinese regexes would be very interesting!
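For illustration, regex matching can already be tried with a custom dictionary by passing `origin_regex = TRUE`. The sketch below is only a toy: the column name `zh.regex` and the three patterns are made up for this example and are not part of countrycode.

```r
library(countrycode)

# Hypothetical toy dictionary: the column name "zh.regex" and the patterns
# are illustrative assumptions, not something shipped with countrycode.
zh_dict <- data.frame(
  zh.regex = c("中国|中华人民共和国", "德国|德意志", "法国|法兰西"),
  iso3c    = c("CHN", "DEU", "FRA"),
  stringsAsFactors = FALSE
)

# Mix of short and long name forms, as they might appear in messy data
zh_names <- c("德国", "德意志联邦共和国", "中国", "法兰西共和国")

zh_codes <- countrycode(
  zh_names,
  origin = "zh.regex",
  destination = "iso3c",
  custom_dict = zh_dict,
  origin_regex = TRUE  # treat the origin column as regular expressions
)
zh_codes
```

Each pattern lists the short and the long form as alternatives, so either spelling of a name should resolve to the same code.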

@turbanisch
Author

Ah, good idea, @cjyetman! I think it would eventually be nice to have more lenient regex matching, because even data from Chinese authorities is messy: for some countries it lists the full country name, for others just an abbreviated one.

I had quickly discarded countryname, but apparently I just picked the wrong examples: neither Germany nor France has a Chinese version, and the one for China matches only the full country name (i.e. "People's Republic of China"):

library(tidyverse)
library(countrycode)

# no matches
countryname("德国")
#> Warning in countrycode_convert(sourcevar = sourcevar, origin = origin, destination = dest, : Some values were not matched unambiguously: 德国
#> [1] NA
countryname("中国")
#> Warning in countrycode_convert(sourcevar = sourcevar, origin = origin, destination = dest, : Some values were not matched unambiguously: 中国
#> [1] NA

# there are Chinese country names in the data but not for all countries
# some match the official country name and may be too strict
countrycode::countryname_dict %>% 
  as_tibble() %>% 
  filter(str_detect(country.name.alt, "\\p{script=Han}")) %>% 
  filter(country.name.en %in% c("Germany", "France", "China"))
#> # A tibble: 2 × 2
#>   country.name.en country.name.alt
#>   <chr>           <chr>           
#> 1 China           中华人民共和国  
#> 2 China           中華人民共和國

Created on 2022-08-26 by the reprex package (v2.0.1)

Session info
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.2.0 (2022-04-22)
#>  os       macOS Monterey 12.5
#>  system   aarch64, darwin20
#>  ui       X11
#>  language (EN)
#>  collate  en_US.UTF-8
#>  ctype    en_US.UTF-8
#>  tz       Europe/Berlin
#>  date     2022-08-26
#>  pandoc   2.18 @ /Applications/RStudio.app/Contents/MacOS/quarto/bin/tools/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version date (UTC) lib source
#>  assertthat    0.2.1   2019-03-21 [1] CRAN (R 4.2.0)
#>  backports     1.4.1   2021-12-13 [1] CRAN (R 4.2.0)
#>  broom         1.0.0   2022-07-01 [1] CRAN (R 4.2.0)
#>  cellranger    1.1.0   2016-07-27 [1] CRAN (R 4.2.0)
#>  cli           3.3.0   2022-04-25 [1] CRAN (R 4.2.0)
#>  colorspace    2.0-3   2022-02-21 [1] CRAN (R 4.2.0)
#>  countrycode * 1.4.0   2022-05-04 [1] CRAN (R 4.2.0)
#>  crayon        1.5.1   2022-03-26 [1] CRAN (R 4.2.0)
#>  DBI           1.1.2   2021-12-20 [1] CRAN (R 4.2.0)
#>  dbplyr        2.2.0   2022-06-05 [1] CRAN (R 4.2.0)
#>  digest        0.6.29  2021-12-01 [1] CRAN (R 4.2.0)
#>  dplyr       * 1.0.9   2022-04-28 [1] CRAN (R 4.2.0)
#>  ellipsis      0.3.2   2021-04-29 [1] CRAN (R 4.2.0)
#>  evaluate      0.15    2022-02-18 [1] CRAN (R 4.2.0)
#>  fansi         1.0.3   2022-03-24 [1] CRAN (R 4.2.0)
#>  fastmap       1.1.0   2021-01-25 [1] CRAN (R 4.2.0)
#>  forcats     * 0.5.1   2021-01-27 [1] CRAN (R 4.2.0)
#>  fs            1.5.2   2021-12-08 [1] CRAN (R 4.2.0)
#>  generics      0.1.3   2022-07-05 [1] CRAN (R 4.2.0)
#>  ggplot2     * 3.3.6   2022-05-03 [1] CRAN (R 4.2.0)
#>  glue          1.6.2   2022-02-24 [1] CRAN (R 4.2.0)
#>  gtable        0.3.0   2019-03-25 [1] CRAN (R 4.2.0)
#>  haven         2.5.0   2022-04-15 [1] CRAN (R 4.2.0)
#>  highr         0.9     2021-04-16 [1] CRAN (R 4.2.0)
#>  hms           1.1.1   2021-09-26 [1] CRAN (R 4.2.0)
#>  htmltools     0.5.3   2022-07-18 [1] CRAN (R 4.2.0)
#>  httr          1.4.3   2022-05-04 [1] CRAN (R 4.2.0)
#>  jsonlite      1.8.0   2022-02-22 [1] CRAN (R 4.2.0)
#>  knitr         1.39    2022-04-26 [1] CRAN (R 4.2.0)
#>  lifecycle     1.0.1   2021-09-24 [1] CRAN (R 4.2.0)
#>  lubridate     1.8.0   2021-10-07 [1] CRAN (R 4.2.0)
#>  magrittr      2.0.3   2022-03-30 [1] CRAN (R 4.2.0)
#>  modelr        0.1.8   2020-05-19 [1] CRAN (R 4.2.0)
#>  munsell       0.5.0   2018-06-12 [1] CRAN (R 4.2.0)
#>  pillar        1.8.0   2022-07-18 [1] CRAN (R 4.2.0)
#>  pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.2.0)
#>  purrr       * 0.3.4   2020-04-17 [1] CRAN (R 4.2.0)
#>  R6            2.5.1   2021-08-19 [1] CRAN (R 4.2.0)
#>  readr       * 2.1.2   2022-01-30 [1] CRAN (R 4.2.0)
#>  readxl        1.4.0   2022-03-28 [1] CRAN (R 4.2.0)
#>  reprex        2.0.1   2021-08-05 [1] CRAN (R 4.2.0)
#>  rlang         1.0.4   2022-07-12 [1] CRAN (R 4.2.0)
#>  rmarkdown     2.14    2022-04-25 [1] CRAN (R 4.2.0)
#>  rstudioapi    0.13    2020-11-12 [1] CRAN (R 4.2.0)
#>  rvest         1.0.2   2021-10-16 [1] CRAN (R 4.2.0)
#>  scales        1.2.0   2022-04-13 [1] CRAN (R 4.2.0)
#>  sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.2.0)
#>  stringi       1.7.8   2022-07-11 [1] CRAN (R 4.2.0)
#>  stringr     * 1.4.0   2019-02-10 [1] CRAN (R 4.2.0)
#>  tibble      * 3.1.8   2022-07-22 [1] CRAN (R 4.2.0)
#>  tidyr       * 1.2.0   2022-02-01 [1] CRAN (R 4.2.0)
#>  tidyselect    1.1.2   2022-02-21 [1] CRAN (R 4.2.0)
#>  tidyverse   * 1.3.1   2021-04-15 [1] CRAN (R 4.2.0)
#>  tzdb          0.3.0   2022-03-28 [1] CRAN (R 4.2.0)
#>  utf8          1.2.2   2021-07-24 [1] CRAN (R 4.2.0)
#>  vctrs         0.4.1   2022-04-13 [1] CRAN (R 4.2.0)
#>  withr         2.5.0   2022-03-03 [1] CRAN (R 4.2.0)
#>  xfun          0.31    2022-05-10 [1] CRAN (R 4.2.0)
#>  xml2          1.3.3   2021-11-30 [1] CRAN (R 4.2.0)
#>  yaml          2.3.5   2022-02-21 [1] CRAN (R 4.2.0)
#> 
#>  [1] /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────

@turbanisch
Author

I took a bit of a deep dive and came up with regular expressions for Chinese by scraping Wikipedia. I outlined the issue (in short: Chinese has many variants, differing not just between simplified and traditional scripts but also across regions) here and implemented a function as a proof of concept here.

Do let me know if you would like to incorporate the regular expressions into countrycode and I will try to come up with a PR! I assume this is how I would have to prepare the codes? https://github.com/vincentarelbundock/countrycode#adding-a-new-code

There is one issue, though, that you might want to handle differently than I did: my function converts the input into simplified characters before matching via regular expressions. I discuss alternative implementations in the README mentioned above (https://github.com/turbanisch/chinese-countryname-regex). In short, for maintenance reasons I would suggest only adding the regular expressions for simplified Chinese and perhaps adding a note in the countrycode documentation telling users to apply a traditional-to-simplified conversion themselves as needed.
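As a rough sketch of what such a user-side conversion could look like, the snippet below uses the ICU transliterator exposed by stringi. This is an assumption for illustration only, not necessarily the method used in the proof-of-concept function, and a character-level transform like this does not cover regional vocabulary differences.

```r
library(stringi)

# Traditional-script input, e.g. as found in data from Taiwan or Hong Kong
traditional <- c("中華人民共和國", "德國")

# Character-level conversion with ICU's "Traditional-Simplified" transform.
# Assumption: this transform is sufficient for country names; regional
# variants that use different words would need extra handling.
simplified <- stri_trans_general(traditional, "Traditional-Simplified")
simplified
```

The converted strings could then be passed to whatever simplified-Chinese matching is used downstream.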

@cjyetman
Collaborator

@turbanisch this is really cool, thanks for all your effort so far

just a suggestion if/when you start building a PR... I would suggest following the model of adding known variants of each country to a data file for the tests, and then testing each one of them in a {testthat} test, as done here...
https://github.com/vincentarelbundock/countrycode/blob/main/tests/testthat/data-known-name-variations.R
https://github.com/vincentarelbundock/countrycode/blob/main/tests/testthat/test-known-name-variations.R
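A minimal sketch of what such a test might look like for Chinese names is shown below. It is only loosely modeled on that idea: the linked files may be organized differently, and the dictionary, its `zh.regex` column, and the variant list are hypothetical stand-ins.

```r
library(countrycode)
library(testthat)

# Hypothetical regex dictionary standing in for the future Chinese regexes
zh_dict <- data.frame(
  zh.regex = c("中国|中华人民共和国", "德国|德意志"),
  iso3c    = c("CHN", "DEU"),
  stringsAsFactors = FALSE
)

# Known variants per country, keyed by the expected iso3c code (illustrative)
known_variants <- list(
  CHN = c("中国", "中华人民共和国"),
  DEU = c("德国", "德意志联邦共和国")
)

# Helper (hypothetical name) that converts Chinese names with the toy dictionary
zh_to_iso3c <- function(x) {
  countrycode(x, origin = "zh.regex", destination = "iso3c",
              custom_dict = zh_dict, origin_regex = TRUE)
}

test_that("known Chinese name variants map to the expected iso3c code", {
  for (iso in names(known_variants)) {
    variants <- known_variants[[iso]]
    expect_equal(unname(zh_to_iso3c(variants)), rep(iso, length(variants)))
  }
})
```

Keeping the variants in a separate data structure means a newly discovered spelling can be added without touching the test logic.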

@vincentarelbundock
Owner

That all sounds great! Thanks for putting this together.

I won't be able to look at anything in the near term, but feel free to open a PR whenever you are ready, and I'll try to review when possible.

I also agree with the known-variations tests. Those would be super useful.

@turbanisch
Author

Thanks a lot to both of you! I will see how far I get and prepare a PR in the coming days.
