Benjamin Brooks, Avishek Kumar, Syed Ali Asad Rizvi committed Oct 6, 2016
0 parents, commit d59383c
Showing 68 changed files with 4,048 additions and 0 deletions.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,11 @@ | ||
data/ | ||
*.DS_Store | ||
sensitive.py | ||
*~ | ||
*.swp | ||
*.pyc | ||
.ipynb_checkpoints | ||
#*# | ||
secret_default_profile.yaml | ||
*.csv | ||
*.png |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,12 @@ | ||
language: python | ||
python: | ||
- "2.7" | ||
|
||
before_install: | ||
- sudo apt-get install -qq python-numpy python-scipy python-matplotlib | ||
|
||
# command to install dependencies | ||
install: "pip install -r requirements.txt" | ||
|
||
# command to run tests | ||
script: python -m pytest |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,115 @@ | ||
# Syracuse | ||
[Build Status](https://travis-ci.com/dssg/syracuse) | ||
|
||
## About | ||
Syracuse is a city in Onondaga County in Central New York. It has a | ||
population of about 150,000 and a metro area population of approximately | ||
665,000. The Syracuse Innovation Team is an office funded by Bloomberg | ||
Philanthropies and created in 2015 with a specific focus on solving | ||
infrastructure problems. The city has a rich history of innovation, but | ||
at this point the government rarely relies on data to help make | ||
decisions. Having heard about the work that the Center for Data Science | ||
and Public Policy (DSaPP) did with Cincinnati on proactive blight | ||
reduction during the 2015 Data Science for Social Good (DSSG) Fellowship, | ||
the Syracuse Innovation Team reached out to DSaPP about participating in the | ||
2016 Fellowship. Infrastructure, particularly the state of the city's water | ||
mains, is an especially important concern. Based on a review of prior DSSG | ||
projects, the city believes that a partnership could help push data-led | ||
initiatives forward, ultimately benefiting the infrastructure as a whole as | ||
well as the residents. More information can be found | ||
[here](http://dssg.uchicago.edu/project/early-warning-system-for-water-infrastructure-problems/). | ||
|
||
## Project Overview | ||
This project entails designing and implementing a data-driven process to | ||
proactively address water main breaks and leaks. The ultimate goal is to predict | ||
the areas where water mains are most at risk of breaking and to identify which | ||
features best predict a water main break (e.g., year laid, materials, soil composition). | ||
|
||
--- | ||
## Installation | ||
|
||
### Get the code | ||
``` | ||
git clone https://github.com/dssg/syracuse | ||
cd syracuse | ||
``` | ||
|
||
### Python Dependencies | ||
``` | ||
cd syracuse | ||
pip install -r requirements.txt | ||
``` | ||
|
||
### Database Configuration | ||
|
||
Database Type: *PostgreSQL 9.4* | ||
with the PostGIS extension | ||
``` | ||
syracuse=> select PostGIS_full_version(); | ||
postgis_full_version | ||
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | ||
POSTGIS="2.1.8 r13780" GEOS="3.5.0-CAPI-1.9.0 r4084" PROJ="Rel. 4.9.2, 08 September 2015" GDAL="GDAL 1.11.4, released 2016/01/25" LIBXML="2.9.1" LIBJSON="UNKNOWN" TOPOLOGY RASTER | ||
``` | ||
|
||
|
||
See the database credentials file | ||
*/model/config/secret_default_profile.yaml* | ||
Example: | ||
``` | ||
PGPORT: 5432 | ||
PGHOST: "postgres.123fake.com" | ||
PGDATABASE: "123fake" | ||
PGPASSWORD: "123fake" | ||
``` | ||
--- | ||
|
||
## Load data into Postgres | ||
See the etl directory for details. | ||
``` | ||
bash ./etl/do_etl.sh | ||
``` | ||
|
||
## Create features from the data | ||
See the model/features directory for details. | ||
``` | ||
bash ./model/features/do_features.sh | ||
``` | ||
--- | ||
|
||
## Run the modeling pipeline | ||
See model/README.md for details. | ||
|
||
--- | ||
|
||
|
||
## Directory Structure | ||
``` | ||
. | ||
├── config | ||
├── descriptive_stats | ||
│ ├── mains_streets_stats | ||
│ └── water_work_orders | ||
├── etl | ||
│ ├── bin | ||
│ ├── geology | ||
│ ├── road_ratings | ||
│ ├── soil | ||
│ ├── street_line_data | ||
│ ├── tax_data | ||
│ ├── updated_main_data | ||
│ ├── waterorders | ||
│ └── water_system | ||
├── model | ||
│ ├── config | ||
│ ├── features | ||
│ └── log | ||
├── models_evaluation | ||
└── results | ||
└── figures | ||
``` | ||
|
||
|
||
## Low-hanging fruit TODO | ||
- Implement a logger instead of print statements | ||
- Make sure the package is Python 3 compatible | ||
- Add more unit tests that exercise the whole pipeline |
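The first TODO item could start from something like the sketch below, written for Python 3. It is only an illustration: the logger name `syracuse` and the helper `get_logger` are hypothetical and do not appear in this commit.

```python
import logging
import sys


def get_logger(name="syracuse", level=logging.INFO, stream=None):
    """Return a logger that writes timestamped messages to a stream.

    `name` and this helper are hypothetical; they are not part of the repo.
    """
    logger = logging.getLogger(name)
    logger.setLevel(level)
    # Attach a handler only once so repeated calls do not duplicate output.
    if not logger.handlers:
        handler = logging.StreamHandler(stream or sys.stderr)
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    log = get_logger()
    # Stands in for a bare print statement in the pipeline.
    log.info("loaded %d rows", 42)
```

Each module can then call `get_logger()` once and replace its print statements with `log.info(...)` or `log.warning(...)`, which makes it possible to change verbosity in one place.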
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,4 @@ | ||
PGPORT: | ||
PGHOST: "" | ||
PGDATABASE: "" | ||
PGPASSWORD: "" |
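The profile is a flat list of `KEY: value` pairs whose keys match the standard libpq environment variables, so a script can read it and export the values before calling `psql`. The sketch below shows one way to do that in Python 3 without a YAML library; `parse_profile` and `export_profile` are hypothetical helpers, and the parser handles only this flat layout, not general YAML.

```python
import os


def parse_profile(text):
    """Parse flat `KEY: value` lines into a dict of strings.

    Handles only the flat layout of the profile file, not general YAML.
    """
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, _, value = line.partition(":")
        # Drop surrounding whitespace and optional quotes around the value.
        settings[key.strip()] = value.strip().strip("\"'")
    return settings


def export_profile(settings, environ=os.environ):
    """Copy non-empty settings into the environment for libpq clients."""
    for key, value in settings.items():
        if value:
            environ[key] = value


if __name__ == "__main__":
    example = 'PGPORT: 5432\nPGHOST: "postgres.123fake.com"\n'
    print(parse_profile(example))
```

Skipping empty values means an unfilled template leaves any variables already set in the shell untouched.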
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,7 @@ | ||
datafiles: | ||
excel_water_work_orders: /mnt/data/syracuse/raw/WaterWorkOrders_2004-2015.xlsx | ||
updated_water_work_orders: /mnt/data/syracuse/raw/Main_Breaks_And_Leaks_Geocoded.csv | ||
datadir: | ||
raw_data_dir : '/mnt/data/syracuse/raw' | ||
clean_data_dir : '/mnt/data/syracuse/clean_data' | ||
sensitive_data_dir: '/mnt/data/syracuse/sensitive' |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,26 @@ | ||
ETL Directory for loading data into PostgreSQL | ||
|
||
### Key files/folders | ||
|
||
- datafiles.yaml -- contains hard-coded paths for data files | ||
- do_etl.sh -- walks through each subdirectory and runs the bash script | ||
whose name begins with etl. | ||
- bin directory -- contains scripts that convert DBF files to CSV and import shapefiles into PostgreSQL. | ||
|
||
### Data | ||
|
||
| Dirname | Type | Description | | ||
| ------------- |:-------------:|:-----| | ||
| geology | GIS | Geological composition data imported into the soil.geology table | | ||
| road_ratings | CSV | Syracuse road ratings by year imported into the roads schema | | ||
| soil | GIS | Soil composition data imported into the soil schema | | ||
| street_line_data | GIS | Street lines file imported into the streets schema | | ||
| water_system | GIS | Several GIS layers describing the Syracuse water system imported into the water_system schema | | ||
| waterorders | Excel and CSV | Records of water department work orders from 2004-2016 | | ||
| create_tables | SQL | Script for creating tables in PostgreSQL database | | ||
| updated_main_data | DBF | Updated water main data provided by City of Syracuse based on extraction from logbooks | | ||
| tax_data | GIS | Tax parcel data from Onondaga County, including the age of the structure on each parcel | | ||
|
||
### Projections | ||
|
||
All layers are converted to the New York State projection (NAD83 / New York Central, ftUS) [SRID:2261](http://spatialreference.org/ref/epsg/2261/) |
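As an illustration of the conversion, `shp2pgsql` accepts `-s from:to` to reproject on import, which is how the geology layer in this commit goes from SRID 4267 to 2261. The Python 3 sketch below only assembles such a command; `build_shp2pgsql_cmd` is a hypothetical helper, not part of the repository.

```python
def build_shp2pgsql_cmd(shapefile, schema, table, source_srid=None, target_srid=2261):
    """Build a shp2pgsql argument list that loads into `schema.table`.

    If `source_srid` is given, the `-s from:to` form asks shp2pgsql to
    reproject; otherwise the data is assumed to already be in `target_srid`.
    """
    if source_srid is None:
        srid = str(target_srid)
    else:
        srid = "{}:{}".format(source_srid, target_srid)
    # -d drops and recreates the target table before loading.
    return ["shp2pgsql", "-d", "-s", srid, shapefile, "{}.{}".format(schema, table)]


if __name__ == "__main__":
    print(" ".join(build_shp2pgsql_cmd("nygeol_poly_dd.shp", "soil", "geology", 4267)))
```

The resulting list can be handed to `subprocess` with its output piped into `psql`, mirroring what the shell scripts do.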
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,50 @@ | ||
#!/usr/bin/env python | ||
""" | ||
Convert DBF table to CSV | ||
======================== | ||
Description | ||
----------- | ||
Converts a DBF table to CSV | ||
and writes it to temp.csv in the current directory. | ||
Usage | ||
----- | ||
``` | ||
./dbfToCsV.py <dbfile> <schema> <table> | ||
``` | ||
Only <dbfile> is read; <schema> and <table> are accepted but ignored. | ||
""" | ||
import sys | ||
from dbfread import DBF | ||
import pandas as pd | ||
|
||
def convert_to_df(dbf_file): | ||
    """ | ||
    Converts contents of dbf file | ||
    to a DataFrame | ||
    Input | ||
    ----- | ||
    dbf_file: str | ||
        name of dbf file | ||
    Output | ||
    ------ | ||
    df: DataFrame | ||
        Dataframe Object | ||
    """ | ||
    table = DBF(dbf_file) | ||
    # Each DBF record is dict-like, so pandas builds one column per field. | ||
    df = pd.DataFrame(iter(table)) | ||
    return df | ||
|
||
if __name__ == "__main__": | ||
    if len(sys.argv) < 2: | ||
        # No input file given: show the usage text and stop. | ||
        print(__doc__) | ||
        sys.exit(1) | ||
|
||
    dbf_file = sys.argv[1] | ||
    df = convert_to_df(dbf_file) | ||
    df.to_csv('temp.csv', index=False) |
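If writing to stdout is preferred over a fixed `temp.csv`, the standard `csv` module is enough. The Python 3 sketch below is one possible shape; `write_records_csv` is a hypothetical helper, and the sample rows and column names are made up for illustration. With dbfread, the records would come from iterating over `DBF(path)`.

```python
import csv
import sys


def write_records_csv(records, fieldnames, out=sys.stdout):
    """Write an iterable of dict-like records as CSV to `out`."""
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for record in records:
        writer.writerow(record)


if __name__ == "__main__":
    # Stand-in rows with invented column names; real rows would come from the DBF file.
    rows = [{"ID": 1, "MATERIAL": "cast iron"}, {"ID": 2, "MATERIAL": "ductile iron"}]
    write_records_csv(rows, ["ID", "MATERIAL"])
```

Streaming rows this way also avoids building a full DataFrame when the table is only being converted.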
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,84 @@ | ||
#!/bin/bash | ||
#ETL script for importing shape files. | ||
|
||
usage="./etl_road_data.sh -y yaml_dir -s schema -t table -f shapefilename" | ||
|
||
if [ ${#} -eq 0 ] | ||
then | ||
echo ${usage} | ||
exit 1; | ||
fi | ||
|
||
function die () { | ||
# die errormessage [error_number] | ||
local errmsg="$1" errcode="${2:-1}" | ||
echo "ERROR: ${errmsg}" | ||
exit ${errcode} | ||
} | ||
|
||
|
||
#------------------------------------------------ | ||
# process inputs | ||
#------------------------------------------------- | ||
projection=2261 #standard projection is NYState | ||
file="" | ||
while getopts hp:y:s:t:f: OPT; do | ||
    case "${OPT}" in | ||
        h) echo "${usage}"; | ||
           exit 0 | ||
           ;; | ||
        p) projection="${OPTARG}" | ||
           ;; | ||
        y) yaml_dir="${OPTARG}" | ||
           ;; | ||
        s) schema="${OPTARG}" | ||
           ;; | ||
        t) table="${OPTARG}" | ||
           ;; | ||
        f) file="${OPTARG}" | ||
           ;; | ||
        ?) die "unknown option or missing argument; see -h for usage" 2 | ||
           ;; | ||
    esac | ||
done | ||
echo "projection: ${projection}" | ||
echo "yaml_dir ${yaml_dir}" | ||
echo "schema ${schema}" | ||
echo "table ${table}" | ||
echo "file ${file}" | ||
|
||
|
||
dirname=$(grep "${yaml_dir}" ./../datafiles.yaml|\ | ||
awk -F: '{print $2}'| sed s/\'//g | sed 's/ //g'); | ||
|
||
|
||
if [ -z "${file}" ] | ||
then | ||
    shapefile=$(ls "${dirname}" | grep "\.shp$") | ||
else | ||
    shapefile=$(basename "${file}") | ||
fi | ||
|
||
|
||
# keep the projection parsed above; resetting it here would discard the -p option | ||
|
||
echo "shapefile: ${shapefile}" | ||
echo "dirname: ${dirname}" | ||
echo "projection: ${projection}" | ||
#check that there is only one shapefile | ||
|
||
|
||
# quote the variable so newlines survive, and count only non-empty lines | ||
num=$(printf '%s\n' "${shapefile}" | grep -c .) | ||
if [ "${num}" -ne 1 ] | ||
then | ||
    echo "Should only be one shapefile, not ${num}"; | ||
    exit 1; | ||
fi | ||
|
||
|
||
#create the schema if needed, then drop any existing table | ||
psql -c "create schema if not exists ${schema}" | ||
psql -c "drop table if exists ${schema}.${table}" | ||
|
||
#import the data | ||
shp2pgsql -s "${projection}" -d "${dirname}/${shapefile}" "${schema}.${table}" | psql |
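The single-shapefile check in this script is easy to get subtly wrong in shell, because unquoted variables collapse newlines before `wc -l` sees them. A Python 3 sketch of the same check is shown below for comparison; `find_single_shapefile` is a hypothetical helper and not part of this commit.

```python
import os


def find_single_shapefile(dirname):
    """Return the only .shp file name in `dirname`, or raise ValueError."""
    shapefiles = sorted(f for f in os.listdir(dirname) if f.endswith(".shp"))
    if len(shapefiles) != 1:
        raise ValueError(
            "Should only be one shapefile, not {}".format(len(shapefiles))
        )
    return shapefiles[0]
```

Counting list entries handles both the empty directory and the multiple-shapefile case without relying on line counting.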
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,10 @@ | ||
datafiles: | ||
excel_water_work_orders: /mnt/data/syracuse/raw/WaterWorkOrders_2004-2015.xlsx | ||
updated_water_work_orders: /mnt/data/syracuse/raw/Main_Breaks_And_Leaks_Geocoded.csv | ||
datadir: | ||
raw_data_dir : '/mnt/data/syracuse/raw' | ||
clean_data_dir : '/mnt/data/syracuse/clean_data' | ||
street_lines_dir: '/mnt/data/syracuse/raw/Street_Line_files' | ||
city_tax_dir: '/mnt/data/syracuse/raw/SyracuseTaxTables/Syracuse_City_Tax' | ||
water_services_dir: '/mnt/data/syracuse/raw/SyracuseTaxTables/water_services' | ||
... |
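The shapefile import script resolves a directory by grepping this file for a key and taking the text after the colon. A substring grep can match more than one line (for example, a key that is a prefix of another), so an exact-key lookup may be safer. The Python 3 sketch below assumes the flat `key: 'value'` layout shown here; `lookup_data_dir` is a hypothetical helper.

```python
def lookup_data_dir(yaml_text, key):
    """Return the path stored under `key` in the flat datafiles layout.

    Matches the key exactly, unlike a substring grep, and strips quotes.
    Raises KeyError when the key is absent.
    """
    for line in yaml_text.splitlines():
        name, sep, value = line.partition(":")
        if sep and name.strip() == key:
            return value.strip().strip("'\"")
    raise KeyError(key)


if __name__ == "__main__":
    sample = "datadir:\n  street_lines_dir: '/mnt/data/syracuse/raw/Street_Line_files'\n"
    print(lookup_data_dir(sample, "street_lines_dir"))
```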
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,19 @@ | ||
#!/bin/bash | ||
#do the whole etl process for the | ||
#syracuse project | ||
|
||
#this script finds all the bash scripts | ||
#that start with etl and end in .sh, | ||
#then cds into each script's directory, executes | ||
#the script, and moves on to the next directory | ||
|
||
eval $(cat model/config/secret_default_profile.yaml | sed 's/^/export /' | sed 's/: /=/') | ||
|
||
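for script in $(find ./ -name 'etl*.sh') | ||
do | ||
    echo "${script}"; | ||
    DIR=$(dirname "${script}") | ||
    cd "${DIR}" | ||
    # run only the script that was found, not every etl*.sh match in the directory | ||
    bash "$(basename "${script}")" | ||
    cd - | ||
done |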
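The discovery step of this driver can also be expressed in Python 3, which makes it straightforward to unit test. The sketch below only finds the scripts and pairs each with its working directory; `find_etl_scripts` and `build_run_plan` are hypothetical names, not part of the repository.

```python
import fnmatch
import os


def find_etl_scripts(root):
    """Return sorted paths of files named etl*.sh anywhere under `root`."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in fnmatch.filter(filenames, "etl*.sh"):
            matches.append(os.path.join(dirpath, filename))
    return sorted(matches)


def build_run_plan(scripts):
    """Pair each script with the directory it should be run from."""
    return [(os.path.dirname(path), os.path.basename(path)) for path in scripts]
```

Each `(directory, name)` pair could then be executed with `subprocess.check_call(["bash", name], cwd=directory)`, which stops at the first failing script instead of continuing silently.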
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
#!/bin/sh | ||
|
||
# Data downloaded from this website: http://mrdata.usgs.gov/geology/state/state.php?state=NY | ||
DATA_DIR="/mnt/data/syracuse/raw/NYgeol_dd/" | ||
shp2pgsql -d -s 4267:2261 ${DATA_DIR}nygeol_poly_dd.shp soil.geology | psql |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,7 @@ | ||
#!/bin/bash | ||
|
||
python wrangle_road_ratings.py | ||
psql -f road_ratings_import.sql | ||
# Calls to the Google API take time -- the geocoding step below can be commented out once results exist, and rerun if desired. | ||
python geocode_road_ratings.py | ||
psql -f generate_road_rating_geom.sql |
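Because the geocoding calls are slow, one option is to cache results on disk so that reruns only query addresses that have not been seen before. The Python 3 sketch below illustrates the idea; it is not taken from `geocode_road_ratings.py`, whose contents are not shown here. `geocode_with_cache` is a hypothetical helper, and `geocode` stands in for whatever function performs the real API call.

```python
import json
import os


def geocode_with_cache(addresses, geocode, cache_path):
    """Geocode addresses, reusing results stored in a JSON cache file.

    `geocode` is any callable mapping an address string to a result that
    can be serialised as JSON; it stands in for the real API call.
    """
    cache = {}
    if os.path.exists(cache_path):
        with open(cache_path) as handle:
            cache = json.load(handle)
    for address in addresses:
        if address not in cache:
            # Only addresses missing from the cache reach the API.
            cache[address] = geocode(address)
    with open(cache_path, "w") as handle:
        json.dump(cache, handle)
    return {address: cache[address] for address in addresses}
```

With a cache in place the geocoding step can stay enabled in the script, since a rerun with no new addresses makes no API calls.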