Non-English characters in paths break the AWK call #26
Comments
A related issue just happened to me when trying to read the entire Basic Dataset for Arizona via Warning message:
In fread(file, showProgress = TRUE, skip = 5148250, nrows = 10, :
Stopped early on line 5148258. Expected 47 fields but found 46. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<URN:CornellLabOfOrnithology:EBIRD:OBS545330372 2018-08-06 16:15:54.0 19232 species Red-eyed Vireo Vireo olivaceus 1 United States US Arizona US-AZ Maricopa US-AZ-013 US-AZ_3068 33 Riparian Preserve at Gilbert Water Ranch L144858 H 33.3614502 -111.7339478 2017-10-29 12:32:00 obsr218773 S40195602 Stationary P21 EBIRD 45 1 0 0 1 1 Targeted: REVI —>>
Here's a
> read_lines(file, skip = 5148257, n_max = 1)
[1] "URN:CornellLabOfOrnithology:EBIRD:OBS545330372\t2018-08-06 16:15:54.0\t19232\tspecies\tRed-eyed Vireo\tVireo olivaceus\t\t\t1\t\t\t\tUnited States\tUS\tArizona\tUS-AZ\tMaricopa\tUS-AZ-013\tUS-AZ_3068\t33\t\t\tRiparian Preserve at Gilbert Water Ranch\tL144858\tH\t33.3614502\t-111.7339478\t2017-10-29\t12:32:00\tobsr218773\tS40195602\tStationary\tP21\tEBIRD\t45\t\t\t1\t0\t\t0\t1\t1\t\t\tTargeted: REVI —"
According to the checklist record on the eBird website, the last field is supposed to continue on: https://ebird.org/view/checklist/S40195602
Something about that long dash is cutting off the rest of the data in the field, including the last tab character. Thankfully this was the only record with an issue in 12 million+ records, so I just fixed the line manually. I was able to read the entire data file after that. I'm not sure whether these errors need to be fixed in the database itself, since the data is being cut off before it gets to the end user. Hopefully this issue is an appropriate venue for my problem since it's probably not
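One way to find lines like this before handing the file to R is to count tab-separated fields per line. The following is a minimal sketch, not part of auk: the helper name `check_fields` is hypothetical, and the expected count of 47 is taken from the fread warning above. How awk treats embedded NUL bytes differs between implementations, so the reported counts may vary on affected lines.

```shell
# Report every line whose tab-separated field count differs from the expected one.
# check_fields is a hypothetical helper; $1 is the file, $2 the expected field count.
check_fields() {
  awk -F '\t' -v n="$2" 'NF != n { print NR ": " NF " fields" }' "$1"
}
```

For the Arizona subset this could be called as `check_fields <path-to-ebd-file> 47`; any line it prints is worth inspecting by hand.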
@matthewpaulking this is valuable info, thanks! Can you let me know which version of the EBD you're using, and whether it's the full (200 GB) file or just the Arizona subset?
Sure thing! I'm using just the Arizona subset (4.4 GB), and it's from October 2018: "ebd_US-AZ_relOct-2018". Thanks for your quick reply and all your work on this package!
Looks like it's not the "–" character that's causing the problem, but a strange character that comes after it. It seems to be an "embedded nul", which is discussed in this StackOverflow question. I don't think it's something that can easily be dealt with in R, but we may be able to process the text file prior to download to avoid these problems in future. I'll need to think about this a little more. Thanks for bringing it to my attention!
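To check whether a given file contains embedded nul bytes at all, the bytes can be counted from the shell. This is a small sketch under the assumption that a POSIX `tr` and `wc` are available; `count_nul` is a hypothetical helper name.

```shell
# Count NUL bytes in a file: delete everything except NUL, then count what is left.
# count_nul is a hypothetical helper; $1 is the file to inspect.
count_nul() {
  tr -cd '\000' < "$1" | wc -c | tr -d ' '   # strip the padding some wc versions add
}
```

A result of 0 suggests that embedded nuls are not the cause of a read failure for that file.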
Wow, I had no idea that an "embedded nul" existed! This is good to know for future reference. So is this an encoding thing coming from the eBird app? I just noticed this particular record was coming from "eBird for iOS, version 1.5.149". I sometimes run into weird encodings (not in eBird in particular, but other data) that are fixed by
I am having the same problem when filtering the Dec 2018 EBD file and then reading the resulting TXT file with read_ebd(). Filtering works fine, but reading into R breaks. I get an fread error: Warning message: Definitely an encoding issue. The warning message suggests tweaking some fread arguments, so I tried:
dat1 <- read_ebd(f_out_ebd, reader="fread", unique=F, rollup=F, fill=T)
and got the error: Error in read_ebd(f_out_ebd, reader = "fread", unique = F, rollup = F, :
Is it possible to pass fread arguments through the read_ebd() function, so we can work around the issue? Thanks for the great package!
Here is reproducible example code for a Windows 10 machine with recent R and auk installs:
library(auk)
library(tidyverse)
f_ebd <- "ebd_relDec-2018.txt"
ebd_filters <- auk_ebd(f_ebd) %>%
f_out_ebd <- "ebd_test.txt"
dat1 <- read_ebd(f_out_ebd, unique=F, rollup=F)
Someone I work with found a way to do this with the Unix program tr. Would it be helpful for me to pass on that code? Doing this in advance sounds like a great service to eBird users.
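The code referred to here was not posted in the thread. As a rough illustration of the general idea only, and not necessarily what the commenter's colleague used, `tr` can delete NUL bytes while copying a file. `strip_nul` is a hypothetical helper name.

```shell
# Copy a file while dropping every NUL byte; the input file is left untouched.
# strip_nul is a hypothetical helper; $1 is the input file, $2 the cleaned output.
strip_nul() {
  tr -d '\000' < "$1" > "$2"
}
```

The cleaned copy can then be passed to the reader in place of the original. Whether this restores the full content of an affected field depends on what was actually written to the file, so the result should still be checked against the record.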
A user found a bug in auk_filter() resulting from Turkish characters (e.g. the "İ" in İbrahim) in the EBD path.