Read and write data as UTF-8 #8
This describes the problem quite well: http://people.fas.harvard.edu/~izahn/posts/reading-data-with-non-native-encoding-in-r/ and suggests to use @LienReyserhove, that could solve the issue we have in the checklists, but @damianooldoni it won't solve reading from the API.
@ThierryO, do you have an idea how to always use UTF-8 for reading (displaying) and writing data in RStudio that works for both Windows and Mac users?
We have instructions on how to set up RStudio as a user. Maybe those are sufficient. Note that
@peterdesmet, this could be an option, but e.g. for the alien plants belgium checklist, the imported file is an Excel file, using the read.excel statement. We cannot use Using The encoding for RStudio (as specified in the setup) is set for UTF-8. This does not solve our problem. Neither does integrating the
Consider If that doesn't work, send me a reproducible example.
Successfully tested with this code:
# packages providing read_tsv() and read_xlsx()
library(readr)
library(readxl)
# Get locale
Sys.getlocale(category = "LC_CTYPE")
# build a test data frame with UTF-8 characters
string_utf8 <- "¡I¢£¤¥¦§¨©ª«¬®¯°±²³´µ¶"
test_write <- data.frame(name = string_utf8,
                         stringsAsFactors = FALSE)
# write UTF-8 characters to file
# read UTF-8 characters from file
test_read <- read_tsv("../data/input/test_utf8.tsv")
all(test_read$name == string_utf8)
# test reading Excel files with UTF-8 characters
test_read_xl <- read_xlsx("../data/input/test_utf8.xlsx",
                          sheet = "Blad1",
                          col_names = TRUE)
all(test_read_xl$name == test_read$name)
Tested with
Sys.getlocale(category = "LC_CTYPE")
#> [1] "Dutch_Belgium.1252"
and
# Get locale
Sys.getlocale(category = "LC_CTYPE")
#> [1] "English_United States.1252"
So, as @ThierryO mentioned, RStudio has an option which enables us to encode files in UTF-8 for saving and packages
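Both locales reported above use Windows code page 1252 rather than UTF-8. As a minimal sketch (not part of the original test), base R's `l10n_info()` can report this directly; the variable names are illustrative only.

```r
# Current character-type locale, e.g. "Dutch_Belgium.1252" on Windows
ctype <- Sys.getlocale(category = "LC_CTYPE")

# l10n_info() is base R; its "UTF-8" element is TRUE only in a UTF-8 locale
is_utf8_locale <- isTRUE(l10n_info()[["UTF-8"]])

message("LC_CTYPE: ", ctype)
message("UTF-8 locale: ", is_utf8_locale)
```

If the second message reports `FALSE`, strings that are not explicitly marked as UTF-8 are interpreted in the native code page.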
Still, as @peterdesmet pointed out, the problem appears while writing a text file and it occurs with very few characters: one of them is the following
Further attempts bring some more insight into the problem.
> string_utf8 <- "ČǗ"
> string_utf8
[1] "CU"
I would like to know from people using Windows machines (@ThierryO? @LienReyserhove?) whether they get the same output. Any idea about this, @ThierryO? I checked my global options and they are as mentioned in the tutorial.
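To separate a display problem from an actual loss of data, one option is to inspect the string's code points and bytes instead of its printed form. The sketch below assumes only base R and writes the characters as Unicode escapes, so the encoding of the source file does not matter. Whether the console still prints "CU" depends on the locale and the R version.

```r
# Build the string from Unicode escapes: U+010C (Č) and U+01D7 (Ǘ)
string_utf8 <- "\u010c\u01d7"

# Declared encoding mark of the string
Encoding(string_utf8)

# Code points, independent of how the console renders the string
code_points <- utf8ToInt(string_utf8)
code_points
#> [1] 268 471

# Raw bytes of the string as stored
bytes <- charToRaw(string_utf8)
bytes
#> [1] c4 8c c7 97
```

If the code points and bytes match these values, the string itself is intact and only the printing (or a later conversion to the native code page) replaces the characters.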
Closing this issue: not part of this pipeline, mostly a test.
This branch is just there to test the UTF-8 issue. Don't merge.
Description of the problem
If no locale is set, data are not read/written as UTF-8. On Mac this can be resolved by setting
Sys.setlocale("LC_CTYPE", "en_US.UTF-8")
at the beginning of the Rmd. Nothing else needs to be done (like stating fileEncoding = "UTF-8" in write.table).
Reading a file with this setting on a Windows computer generates a problem, however (see trias-project/alien-plants-belgium#41):
One can use another setting on Windows:
Sys.setlocale("LC_CTYPE", "English_United States.1252")
but that doesn't create UTF-8 characters.
What we need:
How to test this:
This branch contains an Rmd file, where one can test the output. It reads a scientificName from GBIF using rgbif that contains a character that requires UTF-8 encoding: á.
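One locale-independent way to check the write/read round trip described above, without contacting GBIF, is to write UTF-8 bytes explicitly and read them back with a declared encoding. This is a sketch using base R only, and not necessarily what the Rmd in this branch does; the temporary file and the variable names are hypothetical.

```r
# Test strings written as Unicode escapes: "á" and "ČǗ"
names_utf8 <- c("\u00e1", "\u010c\u01d7")

# Write the UTF-8 bytes as-is (useBytes = TRUE), bypassing re-encoding
# to the native locale; binary mode avoids newline translation on Windows
path <- tempfile(fileext = ".tsv")
con <- file(path, open = "wb")
writeLines(enc2utf8(names_utf8), con, useBytes = TRUE)
close(con)

# Read back and mark the lines as UTF-8 instead of the native encoding
names_read <- readLines(path, encoding = "UTF-8")

# Compare code points, which does not depend on how the console prints them
same <- identical(lapply(names_read, utf8ToInt),
                  lapply(names_utf8, utf8ToInt))
same
```

The comparison is done on code points rather than on printed output, because a console in a 1252 locale may display substitutes even when the stored data are correct.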