Non-English characters in paths break the AWK call #26

Open
mstrimas opened this issue Mar 5, 2019 · 8 comments

@mstrimas
Contributor

mstrimas commented Mar 5, 2019

A user found a bug in auk_filter() resulting from Turkish characters (e.g. the "İ" in İbrahim) in the EBD path.

@matthewpaulking

A related issue just happened to me when trying to read the entire Basic Dataset for Arizona via data.table. I wanted to add another example of a non-standard character breaking the fread() call:

Warning message:
In fread(file, showProgress = TRUE, skip = 5148250, nrows = 10,  :
  Stopped early on line 5148258. Expected 47 fields but found 46. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<URN:CornellLabOfOrnithology:EBIRD:OBS545330372	2018-08-06 16:15:54.0	19232	species	Red-eyed Vireo	Vireo olivaceus			1				United States	US	Arizona	US-AZ	Maricopa	US-AZ-013	US-AZ_3068	33			Riparian Preserve at Gilbert Water Ranch	L144858	H	33.3614502	-111.7339478	2017-10-29	12:32:00	obsr218773	S40195602	Stationary	P21	EBIRD	45			1	0		0	1	1			Targeted: REVI —>>

Here's a read_lines() call to show the raw string. You can see the last tab character is missing:

> read_lines(file, skip = 5148257, n_max = 1)
[1] "URN:CornellLabOfOrnithology:EBIRD:OBS545330372\t2018-08-06 16:15:54.0\t19232\tspecies\tRed-eyed Vireo\tVireo olivaceus\t\t\t1\t\t\t\tUnited States\tUS\tArizona\tUS-AZ\tMaricopa\tUS-AZ-013\tUS-AZ_3068\t33\t\t\tRiparian Preserve at Gilbert Water Ranch\tL144858\tH\t33.3614502\t-111.7339478\t2017-10-29\t12:32:00\tobsr218773\tS40195602\tStationary\tP21\tEBIRD\t45\t\t\t1\t0\t\t0\t1\t1\t\t\tTargeted: REVI —"

Checking out the checklist record on the eBird website, the last field is supposed to continue on:

https://ebird.org/view/checklist/S40195602

image

Something about that long-dash is cutting off the rest of the data in the field, including the last tab character. Thankfully this was the only record with an issue in 12 million+ records, and I just fixed the line manually. I was able to read the entire data file after this.
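
The field-count check that fread() reports can also be run from the shell before loading a file into R. The following is a minimal sketch, not part of auk: the function name `report_bad_field_counts` is hypothetical, and the expected count (47 in the warning above) is passed in by the caller. AWK implementations differ in how they treat unusual bytes, so a clean result here does not guarantee that fread() will succeed.

```shell
# Hypothetical helper: list lines whose tab-separated field count
# differs from the expected count.
# Usage: report_bad_field_counts FILE EXPECTED_FIELDS
report_bad_field_counts() {
  awk -F'\t' -v expected="$2" \
    'NF != expected { print NR ": " NF " fields" }' "$1"
}

# Example call (file name taken from the thread, adjust as needed):
# report_bad_field_counts ebd_US-AZ_relOct-2018.txt 47
```

Each output line gives the line number and the number of fields found, which makes it easier to locate a record to inspect or fix by hand.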

I'm not sure if these errors need to be fixed in the database itself since it's cutting off the data before it gets to the end user. Hopefully this issue is an appropriate venue for my problem since it's probably not auk specific.

@mstrimas
Contributor Author

@matthewpaulking this is valuable info, thanks! Can you let me know which version of the EBD you're using, and whether it's the full (200 GB) file or just the Arizona subset?

@matthewpaulking

Sure thing! I'm using just the Arizona subset (4.4 GB), and it's from October 2018: "ebd_US-AZ_relOct-2018". Thanks for your quick reply and all your work on this package!

@mstrimas
Contributor Author

Looks like it's not the "–" character, but a strange character that comes after it that's causing the problem. Seems it's an "embedded nul", which is discussed in this StackOverflow question. I don't think it's something that can easily be dealt with in R, but we may be able to process the text file prior to download to avoid these problems in future. I'll need to think about this a little more. Thanks for bringing it to my attention!
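
One way to check whether a file contains embedded NUL bytes at all, sketched here under the assumption that a POSIX shell with `tr` and `wc` is available: compare the file's byte count with and without NULs. The function name `count_nul_bytes` is hypothetical and not part of auk or eBird tooling.

```shell
# Hypothetical helper: print the number of NUL (\000) bytes in a file.
# Usage: count_nul_bytes FILE
count_nul_bytes() {
  total=$(wc -c < "$1")                      # bytes in the file as-is
  without=$(tr -d '\000' < "$1" | wc -c)     # bytes after dropping NULs
  echo $((total - without))
}

# Example call (file name taken from the thread, adjust as needed):
# count_nul_bytes ebd_US-AZ_relOct-2018.txt
```

A result of 0 means no NUL bytes were found; any other value suggests the file may trigger this kind of early stop in readers that treat NUL as end of input.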

@matthewpaulking

Wow, I had no idea that "embedded nul" existed! This is good to know for future reference. So is this an encoding thing coming from the eBird app? I just noticed this particular record was coming from "eBird for iOS, version 1.5.149".

I sometimes run into weird encodings (not in eBird in particular, but other data) that are fixed by stringi::stri_trans_general(<string>, 'latin-ascii'). But this seems a different issue.

@tmeeha

tmeeha commented May 28, 2019

I am having the same problem when filtering the Dec 2018 EBD file and then reading the resulting TXT file with read_ebd(). Filtering works fine, but reading into R breaks. I get an fread warning:

Warning message:
In data.table::fread(x, sep = sep, quote = "", na.strings = "", :
Stopped early on line 1348203. Expected 47 fields but found 46. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<URN:CornellLabOfOrnithology:EBIRD:OBS545330372 2018-08-06 16:15:54 19232 species Red-eyed Vireo Vireo olivaceus 1 United States US Arizona US-AZ Maricopa US-AZ-013 US-AZ_3068 33 Riparian Preserve at Gilbert Water Ranch L144858 H 33.3614502 -111.7339478 2017-10-29 12:32:00 obsr218773 S40195602 Stationary P21 EBIRD 45 1 0 0 1 1 Targeted: REVI —>>

Definitely an encoding issue. The warning message suggests tweaking some fread arguments. So I tried:

dat1 <- read_ebd(f_out_ebd, reader="fread", unique=F, rollup=F, fill=T)

and got the error:

Error in read_ebd(f_out_ebd, reader = "fread", unique = F, rollup = F, :
unused argument (fill = T)

Is it possible to pass fread arguments through the read_ebd() function, so we can work around the issue? Thanks for the great package!

@tmeeha

tmeeha commented May 28, 2019

Here is reproducible example code for a Windows 10 machine with recent R and auk installs:

library(tidyverse)
library(auk)

f_ebd <- "ebd_relDec-2018.txt"
target_bbox <- c(-180, -90, 0, 90)
target_date <- c("1980-01-01", "2018-12-31")
target_species <- c("Red-eyed Vireo")

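# Define the filters: bounding box, date range, and species
ebd_filters <- auk_ebd(f_ebd) %>%
  auk_bbox(bbox=target_bbox) %>%
  auk_date(date=target_date) %>%
  auk_species(species=target_species)

# Run the AWK filter; this step works fine
f_out_ebd <- "ebd_test.txt"
ebd_filtered <- auk_filter(ebd_filters, file=f_out_ebd,
                           overwrite=T)

# Reading the filtered file triggers the fread warning
dat1 <- read_ebd(f_out_ebd, unique=F, rollup=F)
# Passing fill through read_ebd() fails with "unused argument"
dat1 <- read_ebd(f_out_ebd, reader="fread", unique=F, rollup=F, fill=T)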

@tmeeha

tmeeha commented May 29, 2019

> I don't think it's something that can easily be dealt with in R, but we may be able to process the text file prior to download to avoid these problems in future.

Someone I work with found a way to do this with the Unix program tr. Would it be helpful for me to pass on that code? Doing this in advance sounds like a great service to eBird users.
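
The colleague's code is not shown in this thread. As a rough sketch of what a `tr`-based cleanup could look like, assuming the offending bytes are embedded NULs as suggested earlier, the following deletes every NUL byte and writes the result to a new file. The function name `strip_nul_bytes` is hypothetical.

```shell
# Hypothetical helper: copy a file while deleting all NUL (\000) bytes.
# Usage: strip_nul_bytes INPUT_FILE OUTPUT_FILE
strip_nul_bytes() {
  tr -d '\000' < "$1" > "$2"
}

# Example call (file names are illustrative):
# strip_nul_bytes ebd_test.txt ebd_test_clean.txt
```

Writing to a separate output file leaves the original download untouched, so the cleaned copy can be compared against it before being passed to read_ebd().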
