-
Notifications
You must be signed in to change notification settings - Fork 82
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Consider using fread (for larger files) #91
Comments
I dont know the intimate details here but it sounds like there is a possibility of seeing some efficiency improvements by using |
We can use I should note, the current I've not compared |
The bigger issue is that you need to download the data in increments of 50k rows, so you can't just use one command to (reliably) download everything one url. The API limit is 50k rows, although it's possible to do more if you use library(data.table)
system.time(dat <- fread("https://data.cityofchicago.org/api/views/r5kz-chrr/rows.csv"))
# Read 20.9% of 860552 rows
# Read 44.2% of 860552 rows
# Read 65.1% of 860552 rows
# Read 81.3% of 860552 rows
# Read 94.1% of 860552 rows
# Read 860552 rows and 31 (of 31) columns from 0.246 GB file in 00:00:06
# user system elapsed
# 7.16 1.27 24.63
# Warning message:
# In fread("https://data.cityofchicago.org/api/views/r5kz-chrr/rows.csv") :
# Bumped column 10 to type character on data row 375814, field contains 'HP...
str(dat)
# Classes ‘data.table’ and 'data.frame': 860552 obs. of 31 variables:
# $ ID : chr "103-20030516" "102-20020516" "2...
# $ LICENSE ID : int 1337922 1228656 2442875 1337921 ...
# $ ACCOUNT NUMBER : int 1 1 1 1 1 1 1 1 1 1 ...
# $ SITE NUMBER : int 1 1 2 1 1 1 1 1 1 1 ...
# $ LEGAL NAME : chr "BERGHOFF RESTAURANT COMPANY OF ...
# $ DOING BUSINESS AS NAME : chr "BERGHOFF'S RESTAURANT" "BERGHOF...
# $ ADDRESS : chr "17 W ADAMS ST # 1ST" "17 W ADAM...
# $ CITY : chr "CHICAGO" "CHICAGO" "CHICAGO" "C...
# $ STATE : chr "IL" "IL" "IL" "IL" ...
# $ ZIP CODE : chr "60603" "60603" "60603" "60603" ...
# $ WARD : int 42 42 42 42 42 42 42 42 42 42 ...
# $ PRECINCT : int 9 9 9 9 9 9 9 9 9 9 ...
# $ POLICE DISTRICT : int 1 1 1 1 1 1 1 1 1 1 ...
# $ LICENSE CODE : int 1781 1475 1475 1006 1781 1475 10...
# $ LICENSE DESCRIPTION : chr "Tobacco Retail Over Counter" "C...
# $ LICENSE NUMBER : int 103 102 2442875 100 103 102 100 101 101 101 ...
# $ APPLICATION TYPE : chr "RENEW" "RENEW" "ISSUE" "RENEW" ...
# $ APPLICATION CREATED DATE : chr "03/25/2003" "03/27/2002" "12/30...
# $ APPLICATION REQUIREMENTS COMPLETE: chr "05/09/2003" "05/07/2002" "01/05...
# $ PAYMENT DATE : chr "05/09/2003" "05/07/2002" "12/30...
# $ CONDITIONAL APPROVAL : chr "N" "N" "Y" "N" ...
# $ LICENSE TERM START DATE : chr "05/16/2003" "05/16/2002" "03/01...
# $ LICENSE TERM EXPIRATION DATE : chr "05/15/2004" "11/15/2002" "03/15...
# $ LICENSE APPROVED FOR ISSUANCE : chr "05/09/2003" "05/07/2002" "03/01...
# $ DATE ISSUED : chr "06/10/2003" "05/08/2002" "03/01...
# $ LICENSE STATUS : chr "AAI" "AAI" "AAC" "AAI" ...
# $ LICENSE STATUS CHANGE DATE : chr "" "" "03/08/2016" "" ...
# $ SSA : int 1 1 1 1 1 1 1 1 1 1 ...
# $ LATITUDE : num 41.9 41.9 41.9 41.9 41.9 ...
# $ LONGITUDE : num -87.6 -87.6 -87.6 -87.6 -87.6 ...
# $ LOCATION : chr "(41.879341938770445, -87.628411...
# - attr(*, ".internal.selfref")=<externalptr> Similar results can be had with > system.time(datr <- read_csv("https://data.cityofchicago.org/api/views/r5kz-chrr/rows.csv"))
|================================================================================| 100% 251 MB
Warning: 51 parsing failures.
row col expected actual
375814 ZIP CODE an integer H3P
375815 ZIP CODE an integer H3P
449548 ZIP CODE an integer H2T2V
449549 ZIP CODE an integer H2T2V
449550 ZIP CODE an integer H2T2V
...... ........ .......... ......
.See problems(...) for more details.
user system elapsed
15.61 15.24 31.22 I'm personally a fan of I notice that I don't know the ramifications of classes |
Also note those download times are after the first download, so the data was cached on Socrata. |
@geneorama just a point of info.. if you use the SODA 2.1 API endpoint, there is no 50,000 record limit |
@marks - To that extent, is there an easy way to determine the version of API in the API response? |
You can tell from the headers: https://dev.socrata.com/docs/endpoints.html#versioning-http-headers . The older/"2.0" has X-SODA2-Legacy-Types: true . You could also look for newBackend=true/false in the views (https://data.cityofchicago.org/views/9pkb-4fbf) There isn't anything though for when the API advances beyond 2.1... |
So, r-lib/httr#329 has been accepted, so we can remove Let's see how that works for speed. I'll undo the fix we put in place for #75. If the performance is still so-so, we can test |
Per @shua123 comments... At the very least, we could potentially have RSocrata check the headers and remove the |
An idea from @chursaner of the LA Mayor's Office:
The text was updated successfully, but these errors were encountered: