Consider using fread (for larger files) #91

marks · 2016-03-25T00:38:18Z

An idea from @chursaner of the LA Mayor's Office:

I've been using RSocrata more and more lately - it is so helpful. However, some of our datasets are quite large and I've found it can take a long time to run read.socrata. When I'm working with datasets outside of Socrata, I would use the fread function from the data.table package to solve this problem. It runs faster than read.csv and then you can convert it to a dataframe from a table once it's in R. It looks just like read.csv, i.e. fread("dataset.csv").

Would it be possible to make a version of read.socrata that functioned in the same way?

marks · 2016-03-25T00:39:10Z

I dont know the intimate details here but it sounds like there is a possibility of seeing some efficiency improvements by using fread instead of read

tomschenkjr · 2016-03-25T00:43:35Z

We can use fread or read_csv from the

I should note, the current read.csv is only being used in the master branch because of a bug (see #75 and r-lib/httr#329). When httr is fixed (see r-lib/httr/pull/331), that will natively use read_csv. Nevertheless, could use either.

I've not compared fread to read_csv. Is there opinions on that?

geneorama · 2016-03-28T21:00:23Z

The bigger issue is that you need to download the data in increments of 50k rows, so you can't just use one command to (reliably) download everything one url. The API limit is 50k rows, although it's possible to do more if you use api/views/ instead of resource, e.g.

library(data.table)
system.time(dat <- fread("https://data.cityofchicago.org/api/views/r5kz-chrr/rows.csv"))
# Read 20.9% of 860552 rows
# Read 44.2% of 860552 rows
# Read 65.1% of 860552 rows
# Read 81.3% of 860552 rows
# Read 94.1% of 860552 rows
# Read 860552 rows and 31 (of 31) columns from 0.246 GB file in 00:00:06
#    user  system elapsed 
#    7.16    1.27   24.63
# Warning message:
# In fread("https://data.cityofchicago.org/api/views/r5kz-chrr/rows.csv") :
#   Bumped column 10 to type character on data row 375814, field contains 'HP...
str(dat)
# Classes ‘data.table’ and 'data.frame':  860552 obs. of  31 variables:
#  $ ID                               : chr  "103-20030516" "102-20020516" "2...
#  $ LICENSE ID                       : int  1337922 1228656 2442875 1337921 ...
#  $ ACCOUNT NUMBER                   : int  1 1 1 1 1 1 1 1 1 1 ...
#  $ SITE NUMBER                      : int  1 1 2 1 1 1 1 1 1 1 ...
#  $ LEGAL NAME                       : chr  "BERGHOFF RESTAURANT COMPANY OF ...
#  $ DOING BUSINESS AS NAME           : chr  "BERGHOFF'S RESTAURANT" "BERGHOF...
#  $ ADDRESS                          : chr  "17 W ADAMS ST # 1ST" "17 W ADAM...
#  $ CITY                             : chr  "CHICAGO" "CHICAGO" "CHICAGO" "C...
#  $ STATE                            : chr  "IL" "IL" "IL" "IL" ...
#  $ ZIP CODE                         : chr  "60603" "60603" "60603" "60603" ...
#  $ WARD                             : int  42 42 42 42 42 42 42 42 42 42 ...
#  $ PRECINCT                         : int  9 9 9 9 9 9 9 9 9 9 ...
#  $ POLICE DISTRICT                  : int  1 1 1 1 1 1 1 1 1 1 ...
#  $ LICENSE CODE                     : int  1781 1475 1475 1006 1781 1475 10...
#  $ LICENSE DESCRIPTION              : chr  "Tobacco Retail Over Counter" "C...
#  $ LICENSE NUMBER                   : int  103 102 2442875 100 103 102 100 101 101 101 ...
#  $ APPLICATION TYPE                 : chr  "RENEW" "RENEW" "ISSUE" "RENEW" ...
#  $ APPLICATION CREATED DATE         : chr  "03/25/2003" "03/27/2002" "12/30...
#  $ APPLICATION REQUIREMENTS COMPLETE: chr  "05/09/2003" "05/07/2002" "01/05...
#  $ PAYMENT DATE                     : chr  "05/09/2003" "05/07/2002" "12/30...
#  $ CONDITIONAL APPROVAL             : chr  "N" "N" "Y" "N" ...
#  $ LICENSE TERM START DATE          : chr  "05/16/2003" "05/16/2002" "03/01...
#  $ LICENSE TERM EXPIRATION DATE     : chr  "05/15/2004" "11/15/2002" "03/15...
#  $ LICENSE APPROVED FOR ISSUANCE    : chr  "05/09/2003" "05/07/2002" "03/01...
#  $ DATE ISSUED                      : chr  "06/10/2003" "05/08/2002" "03/01...
#  $ LICENSE STATUS                   : chr  "AAI" "AAI" "AAC" "AAI" ...
#  $ LICENSE STATUS CHANGE DATE       : chr  "" "" "03/08/2016" "" ...
#  $ SSA                              : int  1 1 1 1 1 1 1 1 1 1 ...
#  $ LATITUDE                         : num  41.9 41.9 41.9 41.9 41.9 ...
#  $ LONGITUDE                        : num  -87.6 -87.6 -87.6 -87.6 -87.6 ...
#  $ LOCATION                         : chr  "(41.879341938770445, -87.628411...
#  - attr(*, ".internal.selfref")=<externalptr>

Similar results can be had with read_csv (sometimes it was faster than fread in my tests)

> system.time(datr <- read_csv("https://data.cityofchicago.org/api/views/r5kz-chrr/rows.csv"))

|================================================================================| 100%  251 MB
Warning: 51 parsing failures.
   row      col   expected actual
375814 ZIP CODE an integer  H3P  
375815 ZIP CODE an integer  H3P  
449548 ZIP CODE an integer  H2T2V
449549 ZIP CODE an integer  H2T2V
449550 ZIP CODE an integer  H2T2V
...... ........ .......... ......
.See problems(...) for more details.
   user  system elapsed 
  15.61   15.24   31.22

I'm personally a fan of data.table, but it would be a lot of work for both development and documentation to use it in RSocrata.

I notice that readr returns a class of data.frame and tbl_df and tbl, whereas data.table returns a data.frame and data.table.

I don't know the ramifications of classes tbl_df and tbl, but do know that there is a learning curve when adopting data.table. I was happy to learn data.table, but I expect that from every potential contributor. If we become too tied to any package we will exclude some potenital volunteers from the other package, and it would probably be a mistake to turn away users from the ever growing Hadleysphere.

geneorama · 2016-03-28T21:01:07Z

Also note those download times are after the first download, so the data was cached on Socrata.

marks · 2016-03-28T21:06:23Z

@geneorama just a point of info.. if you use the SODA 2.1 API endpoint, there is no 50,000 record limit

tomschenkjr · 2016-03-28T21:57:27Z

@marks - To that extent, is there an easy way to determine the version of API in the API response?

shua123 · 2016-03-31T19:07:25Z

You can tell from the headers: https://dev.socrata.com/docs/endpoints.html#versioning-http-headers . The older/"2.0" has X-SODA2-Legacy-Types: true .

You could also look for newBackend=true/false in the views (https://data.cityofchicago.org/views/9pkb-4fbf)

There isn't anything though for when the API advances beyond 2.1...

tomschenkjr · 2016-03-31T23:18:48Z

So, r-lib/httr#329 has been accepted, so we can remove read.csv and can rely on read_csv which is native to httr.

Let's see how that works for speed. I'll undo the fix we put in place for #75. If the performance is still so-so, we can test fread a bit more.

tomschenkjr · 2016-03-31T23:49:54Z

Per @shua123 comments... At the very least, we could potentially have RSocrata check the headers and remove the limit= and paging?

tomschenkjr added the enhancement label Mar 25, 2016

tomschenkjr self-assigned this Mar 31, 2016

tomschenkjr added this to the v1.7.1 milestone May 26, 2016

nicklucius mentioned this issue Oct 20, 2016

Reverse fix from #75 since httr integrated changes #99

Closed

tomschenkjr added the ready label Oct 22, 2016

PriyaDoIT removed this from the v1.7.1 milestone Oct 24, 2016

PriyaDoIT removed the ready label Oct 31, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Consider using fread (for larger files) #91

Consider using fread (for larger files) #91

marks commented Mar 25, 2016 •

edited by PriyaDoIT

Loading

marks commented Mar 25, 2016

tomschenkjr commented Mar 25, 2016

geneorama commented Mar 28, 2016

geneorama commented Mar 28, 2016

marks commented Mar 28, 2016

tomschenkjr commented Mar 28, 2016

shua123 commented Mar 31, 2016

tomschenkjr commented Mar 31, 2016

tomschenkjr commented Mar 31, 2016

Consider using fread (for larger files) #91

Consider using fread (for larger files) #91

Comments

marks commented Mar 25, 2016 • edited by PriyaDoIT Loading

marks commented Mar 25, 2016

tomschenkjr commented Mar 25, 2016

geneorama commented Mar 28, 2016

geneorama commented Mar 28, 2016

marks commented Mar 28, 2016

tomschenkjr commented Mar 28, 2016

shua123 commented Mar 31, 2016

tomschenkjr commented Mar 31, 2016

tomschenkjr commented Mar 31, 2016

marks commented Mar 25, 2016 •

edited by PriyaDoIT

Loading