-
Notifications
You must be signed in to change notification settings - Fork 85
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enable Chinese as origin language #316
Comments
If you can come up with regexes, that would lovely! But have you tried the |
Since {countrycode} does include CLDR names in Chinese, one could (rather easily) create a custom dictionary using the included library(countrycode)
custom_dict <-
unique(countrycode::codelist[, c("cldr.short.zh", "country.name.en", "iso3c")])
countrycode(
"阿拉伯联合酋长国",
origin = "cldr.short.zh",
destination = "country.name.en",
custom_dict = custom_dict
)
#> [1] "United Arab Emirates"
countrycode(
"阿拉伯联合酋长国",
origin = "cldr.short.zh",
destination = "iso3c",
custom_dict = custom_dict
)
#> [1] "ARE" But yes, fancy Chinese regexes would be very interesting! |
Ah, good idea, @cjyetman ! I think eventually it would be nice to have a more lenient regex matching because even data from Chinese authorities is messy, for some countries it may list the full country name, for others just an abbreviated one. I had quickly discarded library(tidyverse)
library(countrycode)
# no matches
countryname("德国")
#> Warning in countrycode_convert(sourcevar = sourcevar, origin = origin, destination = dest, : Some values were not matched unambiguously: 德国
#> [1] NA
countryname("中国")
#> Warning in countrycode_convert(sourcevar = sourcevar, origin = origin, destination = dest, : Some values were not matched unambiguously: 中国
#> [1] NA
# there are Chinese country names in the data but not for all countries
# some match the official country name and may be too strict
countrycode::countryname_dict %>%
as_tibble() %>%
filter(str_detect(country.name.alt, "\\p{script=Han}")) %>%
filter(country.name.en %in% c("Germany", "France", "China"))
#> # A tibble: 2 × 2
#> country.name.en country.name.alt
#> <chr> <chr>
#> 1 China 中华人民共和国
#> 2 China 中華人民共和國 Created on 2022-08-26 by the reprex package (v2.0.1) Session infosessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#> setting value
#> version R version 4.2.0 (2022-04-22)
#> os macOS Monterey 12.5
#> system aarch64, darwin20
#> ui X11
#> language (EN)
#> collate en_US.UTF-8
#> ctype en_US.UTF-8
#> tz Europe/Berlin
#> date 2022-08-26
#> pandoc 2.18 @ /Applications/RStudio.app/Contents/MacOS/quarto/bin/tools/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#> package * version date (UTC) lib source
#> assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.2.0)
#> backports 1.4.1 2021-12-13 [1] CRAN (R 4.2.0)
#> broom 1.0.0 2022-07-01 [1] CRAN (R 4.2.0)
#> cellranger 1.1.0 2016-07-27 [1] CRAN (R 4.2.0)
#> cli 3.3.0 2022-04-25 [1] CRAN (R 4.2.0)
#> colorspace 2.0-3 2022-02-21 [1] CRAN (R 4.2.0)
#> countrycode * 1.4.0 2022-05-04 [1] CRAN (R 4.2.0)
#> crayon 1.5.1 2022-03-26 [1] CRAN (R 4.2.0)
#> DBI 1.1.2 2021-12-20 [1] CRAN (R 4.2.0)
#> dbplyr 2.2.0 2022-06-05 [1] CRAN (R 4.2.0)
#> digest 0.6.29 2021-12-01 [1] CRAN (R 4.2.0)
#> dplyr * 1.0.9 2022-04-28 [1] CRAN (R 4.2.0)
#> ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.2.0)
#> evaluate 0.15 2022-02-18 [1] CRAN (R 4.2.0)
#> fansi 1.0.3 2022-03-24 [1] CRAN (R 4.2.0)
#> fastmap 1.1.0 2021-01-25 [1] CRAN (R 4.2.0)
#> forcats * 0.5.1 2021-01-27 [1] CRAN (R 4.2.0)
#> fs 1.5.2 2021-12-08 [1] CRAN (R 4.2.0)
#> generics 0.1.3 2022-07-05 [1] CRAN (R 4.2.0)
#> ggplot2 * 3.3.6 2022-05-03 [1] CRAN (R 4.2.0)
#> glue 1.6.2 2022-02-24 [1] CRAN (R 4.2.0)
#> gtable 0.3.0 2019-03-25 [1] CRAN (R 4.2.0)
#> haven 2.5.0 2022-04-15 [1] CRAN (R 4.2.0)
#> highr 0.9 2021-04-16 [1] CRAN (R 4.2.0)
#> hms 1.1.1 2021-09-26 [1] CRAN (R 4.2.0)
#> htmltools 0.5.3 2022-07-18 [1] CRAN (R 4.2.0)
#> httr 1.4.3 2022-05-04 [1] CRAN (R 4.2.0)
#> jsonlite 1.8.0 2022-02-22 [1] CRAN (R 4.2.0)
#> knitr 1.39 2022-04-26 [1] CRAN (R 4.2.0)
#> lifecycle 1.0.1 2021-09-24 [1] CRAN (R 4.2.0)
#> lubridate 1.8.0 2021-10-07 [1] CRAN (R 4.2.0)
#> magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.2.0)
#> modelr 0.1.8 2020-05-19 [1] CRAN (R 4.2.0)
#> munsell 0.5.0 2018-06-12 [1] CRAN (R 4.2.0)
#> pillar 1.8.0 2022-07-18 [1] CRAN (R 4.2.0)
#> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.2.0)
#> purrr * 0.3.4 2020-04-17 [1] CRAN (R 4.2.0)
#> R6 2.5.1 2021-08-19 [1] CRAN (R 4.2.0)
#> readr * 2.1.2 2022-01-30 [1] CRAN (R 4.2.0)
#> readxl 1.4.0 2022-03-28 [1] CRAN (R 4.2.0)
#> reprex 2.0.1 2021-08-05 [1] CRAN (R 4.2.0)
#> rlang 1.0.4 2022-07-12 [1] CRAN (R 4.2.0)
#> rmarkdown 2.14 2022-04-25 [1] CRAN (R 4.2.0)
#> rstudioapi 0.13 2020-11-12 [1] CRAN (R 4.2.0)
#> rvest 1.0.2 2021-10-16 [1] CRAN (R 4.2.0)
#> scales 1.2.0 2022-04-13 [1] CRAN (R 4.2.0)
#> sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.2.0)
#> stringi 1.7.8 2022-07-11 [1] CRAN (R 4.2.0)
#> stringr * 1.4.0 2019-02-10 [1] CRAN (R 4.2.0)
#> tibble * 3.1.8 2022-07-22 [1] CRAN (R 4.2.0)
#> tidyr * 1.2.0 2022-02-01 [1] CRAN (R 4.2.0)
#> tidyselect 1.1.2 2022-02-21 [1] CRAN (R 4.2.0)
#> tidyverse * 1.3.1 2021-04-15 [1] CRAN (R 4.2.0)
#> tzdb 0.3.0 2022-03-28 [1] CRAN (R 4.2.0)
#> utf8 1.2.2 2021-07-24 [1] CRAN (R 4.2.0)
#> vctrs 0.4.1 2022-04-13 [1] CRAN (R 4.2.0)
#> withr 2.5.0 2022-03-03 [1] CRAN (R 4.2.0)
#> xfun 0.31 2022-05-10 [1] CRAN (R 4.2.0)
#> xml2 1.3.3 2021-11-30 [1] CRAN (R 4.2.0)
#> yaml 2.3.5 2022-02-21 [1] CRAN (R 4.2.0)
#>
#> [1] /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library
#>
#> ────────────────────────────────────────────────────────────────────────────── |
I took a bit of a deep dive and came with regular expressions for Chinese by scraping Wikipedia. I outlined the issue (in short: Chinese has many variants, not just simplified vs. traditional scripts but also depending on regions) here and implemented a function as a proof of concept here. Do let me know if you would like to incorporate the regular expressions into There is one issue though that you might want to handle differently than I did: my function converts the input into simplified characters before matching via regular expressions. I discuss alternative implementations in the README mentioned above (https://github.com/turbanisch/chinese-countryname-regex). In short, for maintenance reasons I would suggest only adding the regular expresssions for simplified Chinese and perhaps add a note in the |
@turbanisch this is really cool, thanks for all your effort so far just a suggestion if/when you start building a PR... I would suggest following the model of adding known variants of each country to a data file for the tests, and then testing each one of them in a {testthat} test, as done here... |
That all sounds great! Thanks for putting this together. I won Also agree with the know variations tests. Those would be super useful. |
Thanks a lot to both of you! I will see how far I get and prepare a PR in the coming days. |
So far, there seems to be no way of converting country names from Chinese, not even by left-joining any of the dataframes that come with countrycode. I assume the reason for this is the lack of a corresponding set of regexes? If there is any interest, I would like to offer to come up with one!
The text was updated successfully, but these errors were encountered: