Non-English characters in paths break the AWK call #26

mstrimas · 2019-03-05T14:35:57Z

A user found a bug in auk_filter() resulting from Turkish characters (e.g. the "İ" in İbrahim) in the EBD path.

matthewpaulking · 2019-03-28T18:46:12Z

A related issue just happened to me when trying to read the entire Basic Dataset for Arizona via data.table. I wanted to add another example of a non-standard character breaking the fread() call:

Warning message:
In fread(file, showProgress = TRUE, skip = 5148250, nrows = 10,  :
  Stopped early on line 5148258. Expected 47 fields but found 46. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<URN:CornellLabOfOrnithology:EBIRD:OBS545330372	2018-08-06 16:15:54.0	19232	species	Red-eyed Vireo	Vireo olivaceus			1				United States	US	Arizona	US-AZ	Maricopa	US-AZ-013	US-AZ_3068	33			Riparian Preserve at Gilbert Water Ranch	L144858	H	33.3614502	-111.7339478	2017-10-29	12:32:00	obsr218773	S40195602	Stationary	P21	EBIRD	45			1	0		0	1	1			Targeted: REVI â€”>>

Here's a read_lines() call to show the raw string. You can see the last tab character is missing:

> read_lines(file, skip = 5148257, n_max = 1)
[1] "URN:CornellLabOfOrnithology:EBIRD:OBS545330372\t2018-08-06 16:15:54.0\t19232\tspecies\tRed-eyed Vireo\tVireo olivaceus\t\t\t1\t\t\t\tUnited States\tUS\tArizona\tUS-AZ\tMaricopa\tUS-AZ-013\tUS-AZ_3068\t33\t\t\tRiparian Preserve at Gilbert Water Ranch\tL144858\tH\t33.3614502\t-111.7339478\t2017-10-29\t12:32:00\tobsr218773\tS40195602\tStationary\tP21\tEBIRD\t45\t\t\t1\t0\t\t0\t1\t1\t\t\tTargeted: REVI —"
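One way to locate records like this without paging through the file is to count the tab-separated fields on each line in base R. The following is a minimal sketch, not part of auk or data.table: find_bad_lines() is a hypothetical helper name, and the expected count of 47 is taken from the fread warning above. It assumes the file has no blank lines, since count.fields() skips those by default and the reported positions would otherwise shift.

```r
# Hypothetical helper (not part of auk or data.table): report which lines of a
# tab-separated file do not have the expected number of fields.
find_bad_lines <- function(path, expected = 47) {
  # quote = "" and comment.char = "" make quotes and "#" count as ordinary
  # text, so only tabs separate fields.
  n_fields <- utils::count.fields(path, sep = "\t", quote = "",
                                  comment.char = "")
  # Positions of lines whose field count differs from the expected one.
  which(n_fields != expected)
}
```

Calling find_bad_lines(file) on the Arizona file would be expected to flag the record above, though R may also emit its own warning when it meets the problematic bytes.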

Checking out the checklist record on the eBird website, the last field is supposed to continue on:

https://ebird.org/view/checklist/S40195602

Something about that long-dash is cutting off the rest of the data in the field, including the last tab character. Thankfully this was the only record with an issue in 12 million+ records, and I just fixed the line manually. I was able to read the entire data file after this.

I'm not sure if these errors need be fixed in the database itself since it's cutting off the data before it gets to the end user. Hopefully this issue is an appropriate venue for my problem since it's probably not auk specific.

mstrimas · 2019-03-28T18:53:32Z

@matthewpaulking this is valuable info, thanks! Can you let me know which version of the EBD you're using, and whether it's the full (200 GB) file or just the Arizona subset?

matthewpaulking · 2019-03-28T19:37:14Z

Sure thing! I'm using just the Arizona subset (4.4 GB), and it's from October 2018: "ebd_US-AZ_relOct-2018". Thanks for your quick reply and all your work on this package!

mstrimas · 2019-03-28T23:16:00Z

Looks like it's not the "–" character, but a strange character that comes after it that's causing the problem. Seems it's an "embedded nul", which is discussed in this StackOverflow question. I don't think it's something that can easily be dealt with in R, but we may be able to process the text file prior to download to avoid these problems in future. I'll need to think about this a little more. Thanks for bringing it to my attention!
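For reference, base R's readLines() has a skipNul argument that drops embedded nuls instead of cutting the line off at them. Below is a minimal sketch built on that argument. read_tsv_skip_nul() is a hypothetical helper, not an auk function; it loads the whole file into memory and has not been tried on the actual EBD record above, so it may not recover the missing field.

```r
# Hypothetical helper (not part of auk): read a tab-separated file while
# skipping embedded NUL bytes. Reads everything into memory, so it is only
# suitable for files that fit in RAM (e.g. a filtered extract).
read_tsv_skip_nul <- function(path) {
  # skipNul = TRUE removes NUL bytes rather than ending the line at them.
  lines <- readLines(path, skipNul = TRUE, encoding = "UTF-8", warn = FALSE)
  # Parse the cleaned lines; quote = "" keeps quote characters as plain text,
  # and every column is read as character to avoid type guessing.
  read.delim(text = lines, sep = "\t", quote = "", na.strings = "",
             colClasses = "character", check.names = FALSE,
             stringsAsFactors = FALSE)
}
```

The result is a plain data frame of character columns, so none of the processing that read_ebd() normally applies would have been done.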

matthewpaulking · 2019-03-29T17:04:15Z

Wow, I had no idea that "embedded nul" existed! This is good to know for future reference. So is this an encoding thing coming from the eBird app? I just noticed this particular record was coming from "eBird for iOS, version 1.5.149".

I sometimes run into weird encodings (not in eBird in particular, but other data) that are fixed by stringi::stri_trans_general(<string>, 'latin-ascii'). But this seems a different issue.

tmeeha · 2019-05-28T22:37:06Z

I am having the same problem when filtering the Dec 2018 EBD file and then reading the resulting TXT file with read_ebd(). Filtering works fine, but reading the result into R breaks. I get an fread error:

Warning message:
In data.table::fread(x, sep = sep, quote = "", na.strings = "", :
Stopped early on line 1348203. Expected 47 fields but found 46. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<URN:CornellLabOfOrnithology:EBIRD:OBS545330372 2018-08-06 16:15:54 19232 species Red-eyed Vireo Vireo olivaceus 1 United States US Arizona US-AZ Maricopa US-AZ-013 US-AZ_3068 33 Riparian Preserve at Gilbert Water Ranch L144858 H 33.3614502 -111.7339478 2017-10-29 12:32:00 obsr218773 S40195602 Stationary P21 EBIRD 45 1 0 0 1 1 Targeted: REVI â€”>>

Definitely an encoding issue. The warning message suggests tweaking some fread arguments. So I tried:

dat1 <- read_ebd(f_out_ebd, reader="fread", unique=F, rollup=F, fill=T)

and got the error:

Error in read_ebd(f_out_ebd, reader = "fread", unique = F, rollup = F, :
unused argument (fill = T)

Is it possible to pass fread arguments through the read_ebd() function, so we can work around the issue? Thanks for the great package!

tmeeha · 2019-05-28T22:50:43Z

Here is the reproducible example code, run on a Windows 10 machine with recent R and auk installs:

library(tidyverse)
library(auk)

f_ebd <- "ebd_relDec-2018.txt"
target_bbox <- c(-180, -90, 0, 90)
target_date <- c("1980-01-01", "2018-12-31")
target_species <- c("Red-eyed Vireo")

ebd_filters <- auk_ebd(f_ebd) %>%
  auk_bbox(bbox=target_bbox) %>%
  auk_date(date=target_date) %>%
  auk_species(species=target_species)

f_out_ebd <- "ebd_test.txt"
ebd_filtered <- auk_filter(ebd_filters, file=f_out_ebd,
                           overwrite=T)

dat1 <- read_ebd(f_out_ebd, unique=F, rollup=F)
dat1 <- read_ebd(f_out_ebd, reader="fread", unique=F, rollup=F, fill=T)

tmeeha · 2019-05-29T15:26:22Z

mstrimas wrote: "I don't think it's something that can easily be dealt with in R, but we may be able to process the text file prior to download to avoid these problems in future."

Someone I work with found a way to do this with the Unix program tr. Would it be helpful for me to pass on that code? Doing this in advance sounds like a great service to eBird users.
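The tr code itself is not shown in this thread. As a rough R-only analogue of deleting NUL bytes from a file (which is what `tr -d '\000'` does on Unix), the sketch below copies a file in chunks and drops every 0x00 byte. strip_nul() and its arguments are hypothetical names, and the sketch assumes stray NUL bytes are the only problem in the file.

```r
# Hypothetical helper (not part of auk): copy `infile` to `outfile`, dropping
# every NUL (0x00) byte. Works in chunks so large files need not fit in memory.
strip_nul <- function(infile, outfile, chunk_size = 1e6) {
  con_in <- file(infile, open = "rb")
  on.exit(close(con_in), add = TRUE)
  con_out <- file(outfile, open = "wb")
  on.exit(close(con_out), add = TRUE)

  removed <- 0
  repeat {
    # Read up to chunk_size raw bytes; a zero-length result means end of file.
    bytes <- readBin(con_in, what = "raw", n = chunk_size)
    if (length(bytes) == 0) break
    is_nul <- bytes == as.raw(0)
    removed <- removed + sum(is_nul)
    # Write back only the non-NUL bytes, unchanged and in order.
    writeBin(bytes[!is_nul], con_out)
  }
  # Return the number of NUL bytes that were dropped.
  removed
}
```

Because it works on raw bytes, multi-byte UTF-8 characters such as the long dash pass through untouched. The cleaned copy could then be given to read_ebd() as usual.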
