Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consider using fread (for larger files) #91

Open
marks opened this issue Mar 25, 2016 · 9 comments
Open

Consider using fread (for larger files) #91

marks opened this issue Mar 25, 2016 · 9 comments
Assignees

Comments

@marks
Copy link
Contributor

marks commented Mar 25, 2016

An idea from @chursaner of the LA Mayor's Office:

I've been using RSocrata more and more lately - it is so helpful. However, some of our datasets are quite large and I've found it can take a long time to run read.socrata. When I'm working with datasets outside of Socrata, I would use the fread function from the data.table package to solve this problem. It runs faster than read.csv and then you can convert it to a dataframe from a table once it's in R. It looks just like read.csv, i.e. fread("dataset.csv").

Would it be possible to make a version of read.socrata that functioned in the same way?

@marks
Copy link
Contributor Author

marks commented Mar 25, 2016

I dont know the intimate details here but it sounds like there is a possibility of seeing some efficiency improvements by using fread instead of read

@tomschenkjr
Copy link
Contributor

We can use fread or read_csv from the

I should note, the current read.csv is only being used in the master branch because of a bug (see #75 and r-lib/httr#329). When httr is fixed (see r-lib/httr/pull/331), that will natively use read_csv. Nevertheless, could use either.

I've not compared fread to read_csv. Is there opinions on that?

@geneorama
Copy link
Member

The bigger issue is that you need to download the data in increments of 50k rows, so you can't just use one command to (reliably) download everything one url. The API limit is 50k rows, although it's possible to do more if you use api/views/ instead of resource, e.g.

library(data.table)
system.time(dat <- fread("https://data.cityofchicago.org/api/views/r5kz-chrr/rows.csv"))
# Read 20.9% of 860552 rows
# Read 44.2% of 860552 rows
# Read 65.1% of 860552 rows
# Read 81.3% of 860552 rows
# Read 94.1% of 860552 rows
# Read 860552 rows and 31 (of 31) columns from 0.246 GB file in 00:00:06
#    user  system elapsed 
#    7.16    1.27   24.63
# Warning message:
# In fread("https://data.cityofchicago.org/api/views/r5kz-chrr/rows.csv") :
#   Bumped column 10 to type character on data row 375814, field contains 'HP...
str(dat)
# Classes ‘data.table’ and 'data.frame':  860552 obs. of  31 variables:
#  $ ID                               : chr  "103-20030516" "102-20020516" "2...
#  $ LICENSE ID                       : int  1337922 1228656 2442875 1337921 ...
#  $ ACCOUNT NUMBER                   : int  1 1 1 1 1 1 1 1 1 1 ...
#  $ SITE NUMBER                      : int  1 1 2 1 1 1 1 1 1 1 ...
#  $ LEGAL NAME                       : chr  "BERGHOFF RESTAURANT COMPANY OF ...
#  $ DOING BUSINESS AS NAME           : chr  "BERGHOFF'S RESTAURANT" "BERGHOF...
#  $ ADDRESS                          : chr  "17 W ADAMS ST # 1ST" "17 W ADAM...
#  $ CITY                             : chr  "CHICAGO" "CHICAGO" "CHICAGO" "C...
#  $ STATE                            : chr  "IL" "IL" "IL" "IL" ...
#  $ ZIP CODE                         : chr  "60603" "60603" "60603" "60603" ...
#  $ WARD                             : int  42 42 42 42 42 42 42 42 42 42 ...
#  $ PRECINCT                         : int  9 9 9 9 9 9 9 9 9 9 ...
#  $ POLICE DISTRICT                  : int  1 1 1 1 1 1 1 1 1 1 ...
#  $ LICENSE CODE                     : int  1781 1475 1475 1006 1781 1475 10...
#  $ LICENSE DESCRIPTION              : chr  "Tobacco Retail Over Counter" "C...
#  $ LICENSE NUMBER                   : int  103 102 2442875 100 103 102 100 101 101 101 ...
#  $ APPLICATION TYPE                 : chr  "RENEW" "RENEW" "ISSUE" "RENEW" ...
#  $ APPLICATION CREATED DATE         : chr  "03/25/2003" "03/27/2002" "12/30...
#  $ APPLICATION REQUIREMENTS COMPLETE: chr  "05/09/2003" "05/07/2002" "01/05...
#  $ PAYMENT DATE                     : chr  "05/09/2003" "05/07/2002" "12/30...
#  $ CONDITIONAL APPROVAL             : chr  "N" "N" "Y" "N" ...
#  $ LICENSE TERM START DATE          : chr  "05/16/2003" "05/16/2002" "03/01...
#  $ LICENSE TERM EXPIRATION DATE     : chr  "05/15/2004" "11/15/2002" "03/15...
#  $ LICENSE APPROVED FOR ISSUANCE    : chr  "05/09/2003" "05/07/2002" "03/01...
#  $ DATE ISSUED                      : chr  "06/10/2003" "05/08/2002" "03/01...
#  $ LICENSE STATUS                   : chr  "AAI" "AAI" "AAC" "AAI" ...
#  $ LICENSE STATUS CHANGE DATE       : chr  "" "" "03/08/2016" "" ...
#  $ SSA                              : int  1 1 1 1 1 1 1 1 1 1 ...
#  $ LATITUDE                         : num  41.9 41.9 41.9 41.9 41.9 ...
#  $ LONGITUDE                        : num  -87.6 -87.6 -87.6 -87.6 -87.6 ...
#  $ LOCATION                         : chr  "(41.879341938770445, -87.628411...
#  - attr(*, ".internal.selfref")=<externalptr>

Similar results can be had with read_csv (sometimes it was faster than fread in my tests)

> system.time(datr <- read_csv("https://data.cityofchicago.org/api/views/r5kz-chrr/rows.csv"))

|================================================================================| 100%  251 MB
Warning: 51 parsing failures.
   row      col   expected actual
375814 ZIP CODE an integer  H3P  
375815 ZIP CODE an integer  H3P  
449548 ZIP CODE an integer  H2T2V
449549 ZIP CODE an integer  H2T2V
449550 ZIP CODE an integer  H2T2V
...... ........ .......... ......
.See problems(...) for more details.
   user  system elapsed 
  15.61   15.24   31.22 

I'm personally a fan of data.table, but it would be a lot of work for both development and documentation to use it in RSocrata.

I notice that readr returns a class of data.frame and tbl_df and tbl, whereas data.table returns a data.frame and data.table.

I don't know the ramifications of classes tbl_df and tbl, but do know that there is a learning curve when adopting data.table. I was happy to learn data.table, but I expect that from every potential contributor. If we become too tied to any package we will exclude some potenital volunteers from the other package, and it would probably be a mistake to turn away users from the ever growing Hadleysphere.

@geneorama
Copy link
Member

Also note those download times are after the first download, so the data was cached on Socrata.

@marks
Copy link
Contributor Author

marks commented Mar 28, 2016

@geneorama just a point of info.. if you use the SODA 2.1 API endpoint, there is no 50,000 record limit

@tomschenkjr
Copy link
Contributor

@marks - To that extent, is there an easy way to determine the version of API in the API response?

@shua123
Copy link

shua123 commented Mar 31, 2016

You can tell from the headers: https://dev.socrata.com/docs/endpoints.html#versioning-http-headers . The older/"2.0" has X-SODA2-Legacy-Types: true .

You could also look for newBackend=true/false in the views (https://data.cityofchicago.org/views/9pkb-4fbf)

There isn't anything though for when the API advances beyond 2.1...

@tomschenkjr tomschenkjr self-assigned this Mar 31, 2016
@tomschenkjr
Copy link
Contributor

So, r-lib/httr#329 has been accepted, so we can remove read.csv and can rely on read_csv which is native to httr.

Let's see how that works for speed. I'll undo the fix we put in place for #75. If the performance is still so-so, we can test fread a bit more.

@tomschenkjr
Copy link
Contributor

Per @shua123 comments... At the very least, we could potentially have RSocrata check the headers and remove the limit= and paging?

@tomschenkjr tomschenkjr added this to the v1.7.1 milestone May 26, 2016
@PriyaDoIT PriyaDoIT removed this from the v1.7.1 milestone Oct 24, 2016
@PriyaDoIT PriyaDoIT removed the ready label Oct 31, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants