Read and write data as UTF-8 #8

Closed
wants to merge 7 commits into from

peterdesmet
Member

This branch only exists to test the UTF-8 issue. Don't merge.

Description of the problem

If no locale is set, data are not read/written as UTF-8. On Mac this can be resolved by setting Sys.setlocale("LC_CTYPE", "en_US.UTF-8") at the beginning of the Rmd. Nothing else is needed (such as stating fileEncoding = "UTF-8" in write.table).

On a Windows computer, however, reading a file with this setting generates a warning (see trias-project/alien-plants-belgium#41):

Warning message: In Sys.setlocale("LC_ALL", "en_US.UTF-8") : OS reports request to set locale to "en_US.UTF-8" cannot be honored

On Windows one can set a different locale, Sys.setlocale("LC_CTYPE", "English_United States.1252"), but code page 1252 is not UTF-8, so that doesn't produce UTF-8 characters.

What we need:

  • Something that reads/writes UTF-8 characters on Mac, Linux and Windows alike
  • Ideally something that doesn't need to be set in each document (but rather a setting in RStudio).
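
One locale-independent option, sketched here under assumptions, is to handle the encoding explicitly at the file boundary instead of through Sys.setlocale(). The helper names write_utf8() and read_utf8() are hypothetical (they are not part of this branch); they rely only on base R: enc2utf8(), writeLines() with useBytes = TRUE, and readLines() with encoding = "UTF-8". This has not been verified on the Windows machines discussed here.

```r
# Hypothetical helpers (not part of this branch): write and read text as
# UTF-8 bytes without relying on the session locale.
write_utf8 <- function(x, path) {
  con <- file(path, open = "wb")
  on.exit(close(con))
  # enc2utf8() converts to UTF-8; useBytes = TRUE writes those bytes as-is
  writeLines(enc2utf8(x), con, useBytes = TRUE)
}

read_utf8 <- function(path) {
  # encoding = "UTF-8" marks the strings as UTF-8 instead of re-encoding them
  readLines(path, encoding = "UTF-8", warn = FALSE)
}

# "\u00e1" is á, written as a Unicode escape so that the encoding of the
# script file itself does not matter
name <- "\u00e1"
tmp <- tempfile(fileext = ".txt")
write_utf8(name, tmp)
round_trip <- read_utf8(tmp)
round_trip == name
```

This does not meet the second requirement (a single setting in RStudio), since the helpers have to be called wherever text is read or written.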

How to test this:

This branch contains an Rmd file, where one can test the output. It reads a scientificName from GBIF using rgbif that contains a character that requires UTF-8 encoding: á.

@peterdesmet
Member Author

This describes the problem quite well: http://people.fas.harvard.edu/~izahn/posts/reading-data-with-non-native-encoding-in-r/ and suggests using readr for reading csv files.

@LienReyserhove, that could solve the issue we have in the checklists, but @damianooldoni it won't solve reading from the API.

@peterdesmet
Member Author

@ThierryO, do you have an idea how to always use UTF-8 for reading (displaying) and writing data in RStudio that works for both Windows and Mac users?

@ThierryO

We have instructions on how to set up RStudio as a user. Maybe those are sufficient.

Note that write.table() has a fileEncoding argument. And there is enc2utf8() to convert a character vector into UTF-8.
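
A minimal sketch of those two base R tools together, assuming a one-column data frame and a temporary file (the column name `name` and the object names are made up for illustration). Whether this is sufficient on a Windows session with a 1252 locale is not confirmed here.

```r
# Illustrative value: á as a Unicode escape
x <- "\u00e1"

# Convert to UTF-8 and inspect the declared encoding
x_utf8 <- enc2utf8(x)
Encoding(x_utf8)

# Write with an explicit file encoding ...
df <- data.frame(name = x_utf8, stringsAsFactors = FALSE)
tmp <- tempfile(fileext = ".tsv")
write.table(df, tmp, sep = "\t", row.names = FALSE, quote = FALSE,
            fileEncoding = "UTF-8")

# ... and read back with the same explicit encoding
back <- read.delim(tmp, fileEncoding = "UTF-8", stringsAsFactors = FALSE)
back$name == x_utf8
```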

@LienReyserhove

LienReyserhove commented Jan 2, 2018

@peterdesmet, this could be an option, but for the alien plants belgium checklist, for example, the imported file is an Excel file, read with the read.excel statement. We cannot use readr for this. There's a similar locale statement for reading Excel files, but it has the same problem as before: the statement differs between operating systems.

Using enc2utf8() to convert scientificName to UTF-8 did not produce a UTF-8 encoded scientificName, but this is probably due to the import.

The encoding for RStudio (as specified in the setup) is set to UTF-8. This does not solve our problem. Neither does adding the fileEncoding argument to the write.table() call.

@ThierryO

ThierryO commented Jan 2, 2018

Consider read_excel() from the readxl package. That uses UTF-8 by default.

If that doesn't work, send me a reproducible example.

@damianooldoni
Contributor

Successfully tested with this code:

library(readr)   # read_tsv(), write_tsv()
library(readxl)  # read_xlsx()

# Get locale
Sys.getlocale(category = "LC_CTYPE")

# Create a data frame with non-ASCII characters
string_utf8 <- "¡I¢£¤¥¦§¨©ª«¬®¯°±²³´µ¶"
test_write <- data.frame(name = string_utf8,
                         stringsAsFactors = FALSE)
# Write UTF-8 characters to file
# (this call is missing in the original comment; write_tsv() is assumed)
write_tsv(test_write, "../data/input/test_utf8.tsv")
# Read UTF-8 characters from file
test_read <- read_tsv("../data/input/test_utf8.tsv")
all(test_read$name == string_utf8)

# Test reading Excel files with UTF-8 characters
test_read_xl <- read_xlsx("../data/input/test_utf8.xlsx",
                          sheet = "Blad1",
                          col_names = TRUE)
all(test_read_xl$name == test_read$name)

@damianooldoni
Contributor

Tested with

Sys.getlocale(category = "LC_CTYPE")
#> [1] "Dutch_Belgium.1252"

and

# Get locale
Sys.getlocale(category = "LC_CTYPE")
#> [1] "English_United States.1252"

So, as @ThierryO mentioned, RStudio has an option to save files as UTF-8, and the readr and readxl packages default to UTF-8.

@damianooldoni
Contributor

damianooldoni commented Aug 7, 2018

Still, as @peterdesmet pointed out, the problem appears when writing a text file, and only with a few characters. One of them is Č.

@damianooldoni
Contributor

Further attempts give some more insight into the problem.
The problems with UTF-8 characters are restricted to letters with "strange" accents such as ČǗ. The rest is OK. So the problem does not occur while writing or reading, but seems to be internal to RStudio.
Here is an example of the problem on my PC:

> string_utf8 <- "ČǗ"
> string_utf8
[1] "CU"

I would like to know from people using Windows machines (@ThierryO? @LienReyserhove?) whether they get the same output. Any idea about this, @ThierryO? I checked my global options and they are as described in the tutorial.
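
One way to narrow this down, sketched here as a suggestion rather than a confirmed diagnosis, is to build the string from Unicode escapes instead of typing the literal, and then inspect its code points and bytes. The assumption being tested is that the typed literal is converted to the native code page (1252) before R stores it. The object names are made up for illustration.

```r
# Č is U+010C and Ǘ is U+01D7; escapes avoid any conversion of typed input
x <- "\u010c\u01d7"

# Unicode code points of the stored string
code_points <- utf8ToInt(x)

# UTF-8 bytes of the stored string
bytes <- charToRaw(x)

# Does the string survive a round trip through the native encoding?
# The result depends on the session locale.
survives_native <- identical(utf8ToInt(enc2utf8(enc2native(x))), code_points)

code_points
bytes
survives_native
```

If code_points shows 268 and 471 while the typed literal still prints as "CU", the loss happens when the literal is entered, not when the string is stored or written.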

@peterdesmet
Member Author

Closing this issue: not part of this pipeline, mostly a test.

@peterdesmet deleted the utf8-test branch December 11, 2018 14:01

4 participants