forked from WinVector/zmPDSwR
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathREADME.html
62 lines (62 loc) · 4.64 KB
/
README.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
<h1 id="data-from-2011-acs-1-year-pums-fields-american-community-survey">Data from 2011 ACS 1-year PUMS fields American Community Survey</h1>
<p>This is an example how to load the Census data used in the book. We have the data pre-prepared <a href="https://github.com/WinVector/zmPDSwR/blob/master/PUMS/phsample.RData">here</a> so you do not need to perform any of these steps unless you want to learn about SQLscrewdriver (not a requirement for later exercises in the book).</p>
<p>Original data is not included here for reasons of license and size.</p>
<p>To download data:</p>
<ul>
<li>Go to original Census link <a href="http://www.census.gov/acs/www/data_documentation/pums_data/">http://www.census.gov/acs/www/data_documentation/pums_data/</a></li>
<li>Navigate to <a href="http://factfinder2.census.gov/bkmk/navigation/1.0/en/d_dataset:ACS_11_1YR/d_product_type:PUMS">2011 ACS 1-year PUMS</a></li>
<li>Dowload <a href="http://factfinder2.census.gov/faces/tableservices/jsf/pages/productview.xhtml?pid=ACS_pums_csv_2011&prodType=document">2011 ACS 1-year Public Use Microdata Samples (PUMS) - CSV format</a></li>
<li>Download <a href="http://www2.census.gov/acs2011_1yr/pums/csv_pus.zip">United States Population Records</a> as csv_pus.zip and <a href="http://www2.census.gov/acs2011_1yr/pums/csv_hus.zip">United States Housing Unit Records</a> as csv_hus.zip.</li>
<li>Unzip csv_pus.zip and csv_hus.zip. The data you want is:
<ul>
<li>csv_hus/ss11husa.csv</li>
<li>csv_hus/ss11husb.csv</li>
<li>csv_pus/ss11pusa.csv</li>
<li>csv_pus/ss11pusb.csv</li>
</ul></li>
</ul>
<p>Leave data in these directories and one directory above csv_hus and csv_pus run the following at the command line:</p>
<pre><code>java -classpath SQLScrewdriver.jar:h2-1.3.170.jar \
com.winvector.db.LoadFiles \
file:dbDef.xml , hus \
file:csv_hus/ss11husa.csv file:csv_hus/ss11husb.csv
java -classpath SQLScrewdriver.jar:h2-1.3.170.jar \
com.winvector.db.LoadFiles \
file:dbDef.xml , pus \
file:csv_pus/ss11pusa.csv file:csv_pus/ss11pusb.csv</code></pre>
<p>The jars SQLScrewdriver.jar h2-1.3.170.jar are available <a href="https://github.com/WinVector/zmPDSwR/tree/master/PUMS/loadExample">here</a>. Note: Windows users should use ";" to sparate jars instead of ":".</p>
<p>Where the XML file dbDef.xml is:</p>
<pre><code><?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
<comment>testdb</comment>
<entry key="user">u</entry>
<entry key="password">u</entry>
<entry key="driver">org.h2.Driver</entry>
<entry key="url">jdbc:h2:H2DB \
;LOG=0;CACHE_SIZE=65536;LOCK_MODE=0;UNDO_LOG=0</entry>
</properties></code></pre>
<p>And the follow-up R commands are:</p>
<pre><code>options( java.parameters = "-Xmx2g" )
library(RJDBC)
drv = JDBC("org.h2.Driver","h2-1.3.170.jar",identifier.quote="'")
options=";LOG=0;CACHE_SIZE=65536;LOCK_MODE=0;UNDO_LOG=0"
conn = dbConnect(drv,paste("jdbc:h2:H2DB",options,sep=''),"u","u")
dhus = dbGetQuery(conn,"SELECT * FROM hus WHERE ORIGRANDGROUP<=1")
dpus = dbGetQuery(conn,"SELECT pus.* FROM pus WHERE pus.SERIALNO IN \
(SELECT DISTINCT hus.SERIALNO FROM hus WHERE hus.ORIGRANDGROUP<=1)")
dbDisconnect(conn)
save(dhus,dpus,file='phsample.RData')</code></pre>
<p>This produces phsample.RData (a already prepared copy of which is available <a href="https://github.com/WinVector/zmPDSwR/tree/master/PUMS">here</a>, so you can skip all of the above steps). Data can then be loaded with:</p>
<pre><code>load('phsample.RData')</code></pre>
<p>This loads dhus (2982 random households) and dpus (6279 personal records joined with the households from dhus).</p>
<h1 id="note-on-downloading-files">Note on downloading files</h1>
<p>A remote source could update files at any time (and even worse they may change links). The data we used in this book had the following cryptographic hashses (downloaded in March and February and 2013, and also matching on a June 4 2014 download):</p>
<ul>
<li>csv_hus.zip shasum: cdfdfb326956e202fdb560ee34471339ac8abd6c</li>
<li>csv_pus.zip shasum: aa0f4add21e327b96d9898b850e618aeca10f6d0</li>
<li>csv_hus/ss11husa.csv shasum: 72a0212e8313af83354460af1daeb54a7321f429</li>
<li>csv_hus/ss11husb.csv shasum: 1edaedb79f7766a9712fc0a8a7b674e427e5ba56</li>
<li>csv_pus/ss11pusa.csv shasum: 574d657d2094087722a054e3c16c277ece34f01c</li>
<li>csv_pus/ss11pusb.csv shasum: 89a6401524e2f25e02633849828462fa1204ea91</li>
</ul>