DSSG Summer 2016 Work
Benjamin Brooks, Avishek Kumar, Syed Ali Asad Rizvi committed Oct 6, 2016
0 parents commit d59383c
Showing 68 changed files with 4,048 additions and 0 deletions.
11 changes: 11 additions & 0 deletions .gitignore
@@ -0,0 +1,11 @@
data/
*.DS_Store
sensitive.py
*~
*.swp
*.pyc
.ipynb_checkpoints
#*#
secret_default_profile.yaml
*.csv
*.png
12 changes: 12 additions & 0 deletions .travis.yml
@@ -0,0 +1,12 @@
language: python
python:
- "2.7"

before_install:
- sudo apt-get install -qq python-numpy python-scipy python-matplotlib

# command to install dependencies
install: "pip install -r requirements.txt"

# command to run tests
script: python -m pytest
115 changes: 115 additions & 0 deletions README.md
@@ -0,0 +1,115 @@
# Syracuse
[![Build Status](https://travis-ci.com/dssg/syracuse.svg?token=qr1WKDpoEiNDipEKFzrb&branch=master)](https://travis-ci.com/dssg/syracuse)

## About
Syracuse is a city located in Onondaga County in Central New York. It has a
population of about 150,000 and a metro area population of approximately
665,000. The Syracuse Innovation Team is a Bloomberg Philanthropies funded
office created in 2015. It was set up with a specific focus on solving
infrastructure problems. The city has a rich history of innovation, but
at this point, the government does not do a lot of work that relies on
data to help make decisions. Having heard about the work that the Center
for Data Science and Public Policy did with Cincinnati on proactive blight
reduction during the 2015 Data Science for Social Good Fellowship,
the Syracuse Innovation Team reached out to DSaPP about participating in the
2016 Fellowship. Infrastructure, particularly the state of water mains in the
city, is especially important. Based on review of prior DSSG projects, the city
believes that a partnership could be beneficial in pushing data-led initiatives
forward, ultimately benefiting the infrastructure as a whole, as well as the residents.
More information is available on the
[DSSG project page](http://dssg.uchicago.edu/project/early-warning-system-for-water-infrastructure-problems/).

## Project Overview
This project entails designing and implementing a data-driven process to
proactively address water main breaks and leaks. The ultimate goal is to predict
the areas where water mains are most at risk of breaking, and to identify which
features best predict a water main break (e.g., year laid, materials, soil composition).

---
## Installation

### Get the code
```
git clone https://github.com/dssg/syracuse
cd syracuse
```

### Python Dependencies
```
cd syracuse
pip install -r requirements.txt
```

### Database Configuration

Database type: *PostgreSQL 9.4* with the PostGIS extension
```
syracuse=> select PostGIS_full_version();
postgis_full_version
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
POSTGIS="2.1.8 r13780" GEOS="3.5.0-CAPI-1.9.0 r4084" PROJ="Rel. 4.9.2, 08 September 2015" GDAL="GDAL 1.11.4, released 2016/01/25" LIBXML="2.9.1" LIBJSON="UNKNOWN" TOPOLOGY RASTER
```


Database credentials are read from
*/model/config/secret_default_profile.yaml*.
Example:
```
PGPORT: 5432
PGHOST: "postgres.123fake.com"
PGDATABASE: "123fake"
PGPASSWORD: "123fake"
```
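
The flat `KEY: value` layout of this profile can also be read from Python. The following is a minimal sketch, not code from this repository: the function names `load_profile` and `export_profile` are hypothetical, and the parser assumes one flat `KEY: value` pair per line rather than general YAML. It copies the settings into the environment variables that PostgreSQL clients such as `psql` read.

```python
import os


def load_profile(path):
    """Parse flat `KEY: value` lines into a dict (assumption: no nested YAML)."""
    settings = {}
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            # skip blanks, comments, and anything that is not a key/value pair
            if not line or line.startswith("#") or ":" not in line:
                continue
            key, _, value = line.partition(":")
            # drop surrounding whitespace and optional quotes around the value
            settings[key.strip()] = value.strip().strip("\"'")
    return settings


def export_profile(path, environ=os.environ):
    """Copy the profile's settings into `environ` (hypothetical helper)."""
    settings = load_profile(path)
    environ.update(settings)
    return settings
```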
---

## Load data into Postgres
See the etl directory for details
```
bash ./etl/do_etl.sh
```

## Create features from the data
See model/features directory for details
```
bash ./model/features/do_features.sh
```
---

## Run the modeling pipeline
See model/README.md for details

---


## Directory Structure
```
.
├── config
├── descriptive_stats
│   ├── mains_streets_stats
│   └── water_work_orders
├── etl
│   ├── bin
│   ├── geology
│   ├── road_ratings
│   ├── soil
│   ├── street_line_data
│   ├── tax_data
│   ├── updated_main_data
│   ├── waterorders
│   └── water_system
├── model
│   ├── config
│   ├── features
│   └── log
├── models_evaluation
└── results
    └── figures
```


## Low hanging fruit TODO
- Replace print statements with a logger
- Make sure the package is Python 3 compatible
- Add more unit tests that exercise the whole pipeline
4 changes: 4 additions & 0 deletions config/example_default_profile.yaml
@@ -0,0 +1,4 @@
PGPORT:
PGHOST: ""
PGDATABASE: ""
PGPASSWORD: ""
7 changes: 7 additions & 0 deletions datafiles.yaml
@@ -0,0 +1,7 @@
datafiles:
excel_water_work_orders: /mnt/data/syracuse/raw/WaterWorkOrders_2004-2015.xlsx
updated_water_work_orders: /mnt/data/syracuse/raw/Main_Breaks_And_Leaks_Geocoded.csv
datadir:
raw_data_dir : '/mnt/data/syracuse/raw'
clean_data_dir : '/mnt/data/syracuse/clean_data'
sensitive_data_dir: '/mnt/data/syracuse/sensitive'
26 changes: 26 additions & 0 deletions etl/README.md
@@ -0,0 +1,26 @@
ETL Directory for loading data into PostgreSQL

### Key files/folders

- datafiles.yaml -- contains hard-coded paths for data files
- do_etl.sh -- walks through each subdirectory and runs the bash script
beginning with etl.
- bin directory -- contains scripts to convert DBF files to CSV and to import shapefiles into PostgreSQL.
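
The directory walk that do_etl.sh performs can be sketched in Python as follows. This is an illustration only and is not part of the repository: `find_etl_scripts` and `run_etl_scripts` are hypothetical names, and the sketch assumes each script should run with its own directory as the working directory.

```python
import fnmatch
import os
import subprocess


def find_etl_scripts(root):
    """Return paths of files named etl*.sh under root, sorted for a stable order."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in fnmatch.filter(filenames, "etl*.sh"):
            matches.append(os.path.join(dirpath, name))
    return sorted(matches)


def run_etl_scripts(root):
    """Run each discovered script with bash from inside its own directory."""
    for script in find_etl_scripts(root):
        workdir, name = os.path.split(script)
        # check_call raises CalledProcessError if a script exits non-zero
        subprocess.check_call(["bash", name], cwd=workdir)
```

Passing `cwd` to `subprocess.check_call` avoids changing the caller's working directory, so relative paths inside each script resolve against that script's folder.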

### Data

| Dirname | Type | Description |
| ------------- |:-------------:|:-----|
| geology | GIS | Geological composition data imported into the soil.geology table |
| road_ratings | CSV | Syracuse road ratings by year imported into the roads schema |
| soil | GIS | Soil composition data imported into the soil schema |
| street_line_data | GIS | Street lines file imported into the streets schema |
| water_system | GIS | Several GIS layers describing the Syracuse water system imported into the water_system schema |
| waterorders | Excel and CSV | Record of work orders from the water department from 2004-2016 |
| create_tables | SQL | Script for creating tables in PostgreSQL database |
| updated_main_data | DBF | Updated water main data provided by City of Syracuse based on extraction from logbooks |
| tax_data | GIS | Tax parcel data from Onondaga County, including the age of the structure on each parcel |

### Projections

All layers are reprojected to the New York State projection [SRID:2261](http://spatialreference.org/ref/epsg/2261/)
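
As a rough illustration of how the target SRID enters an import, the sketch below assembles a `shp2pgsql` argument list using the `-d` (drop and recreate the table) and `-s [from:]to` (set or reproject the SRID) flags that the ETL scripts in this commit pass. The helper name `shp2pgsql_args` and its parameters are assumptions made for this example, not code from the repository.

```python
TARGET_SRID = 2261  # New York State projection named in this README


def shp2pgsql_args(shapefile, schema, table, source_srid=None,
                   target_srid=TARGET_SRID):
    """Build a shp2pgsql argument list (hypothetical helper).

    With source_srid set, the '-s from:to' form asks shp2pgsql to reproject;
    without it, '-s to' only labels the geometry with the target SRID.
    """
    if source_srid is None:
        srid = str(target_srid)
    else:
        srid = "{}:{}".format(source_srid, target_srid)
    return ["shp2pgsql", "-d", "-s", srid, shapefile,
            "{}.{}".format(schema, table)]
```

The resulting list can be handed to `subprocess.Popen` with its output piped into `psql`, which is what the shell scripts do with a pipe.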
50 changes: 50 additions & 0 deletions etl/bin/dbfToCsv.py
@@ -0,0 +1,50 @@
#!/usr/bin/env python
"""
Convert DBF table to CSV
========================
Description
-----------
Converts a DBF table to CSV
and writes the result to temp.csv
in the current directory.
Usage
-----
```
./dbfToCsv.py <dbf_file>
```
"""
import sys

import pandas as pd
from dbfread import DBF


def convert_to_df(dbf_file):
    """
    Converts contents of dbf file
    to a DataFrame
    Input
    -----
    dbf_file: str
        name of dbf file
    Output
    ------
    df: DataFrame
        Dataframe Object
    """
    table = DBF(dbf_file)
    # each DBF record is a dict-like row; pandas builds columns from the keys
    df = pd.DataFrame(iter(table))
    return df


if __name__ == "__main__":
    if len(sys.argv) < 2:
        # print() with a single argument works on both Python 2 and 3
        print(__doc__)
        sys.exit(1)

    dbf_file = sys.argv[1]
    df = convert_to_df(dbf_file)
    df.to_csv('temp.csv', index=False)
84 changes: 84 additions & 0 deletions etl/bin/load_shapefiles.sh
@@ -0,0 +1,84 @@
#!/bin/bash
#ETL script for importing shapefiles.

usage="./load_shapefiles.sh [-p projection] -y yaml_dir -s schema -t table [-f shapefilename]"

if [ ${#} -eq 0 ]
then
echo ${usage}
exit 1;
fi

function die () {
# die errormessage [error_number]
local errmsg="$1" errcode="${2:-1}"
echo "ERROR: ${errmsg}"
exit ${errcode}
}


#------------------------------------------------
# process inputs
#-------------------------------------------------
projection=2261 #standard projection is NYState
file=""
while getopts hp:y:s:t:f: OPT; do
case "${OPT}" in
h) echo "${usage}";
exit 0
;;
p) projection="${OPTARG}"
;;
y) yaml_dir="${OPTARG}"
;;
s) schema="${OPTARG}"
;;
t) table="${OPTARG}"
;;
f) file="${OPTARG}"
;;
?) die "unknown option or missing argument; see -h for usage" 2
;;
esac
done
echo "projection: ${projection}"
echo "yaml_dir ${yaml_dir}"
echo "schema ${schema}"
echo "table ${table}"
echo "file ${file}"


dirname=$(grep ${yaml_dir} ./../datafiles.yaml|\
awk -F: '{print $2}'| sed s/\'//g | sed 's/ //g');


if [ -z "${file}" ]
then
    #match the literal .shp extension only
    shapefile=$(ls "${dirname}" | grep '\.shp$')
else
    shapefile=$(basename "${file}")
fi


echo "shapefile: ${shapefile}"
echo "dirname: ${dirname}"
echo "projection: ${projection}"
#check that there is exactly one shapefile


#count non-empty lines; quoting preserves the newlines between file names
num=$(printf '%s\n' "${shapefile}" | grep -c .)
if [ "${num}" -ne 1 ]
then
    echo "Should be exactly one shapefile, not ${num}";
    exit 1;
fi


#create table and schema
psql -c "drop table if exists ${schema}.${table}"
psql -c "create schema if not exists ${schema}"

#import the data
shp2pgsql -s ${projection} -d ${dirname}/${shapefile} ${schema}.${table} | psql
10 changes: 10 additions & 0 deletions etl/datafiles.yaml
@@ -0,0 +1,10 @@
datafiles:
excel_water_work_orders: /mnt/data/syracuse/raw/WaterWorkOrders_2004-2015.xlsx
updated_water_work_orders: /mnt/data/syracuse/raw/Main_Breaks_And_Leaks_Geocoded.csv
datadir:
raw_data_dir : '/mnt/data/syracuse/raw'
clean_data_dir : '/mnt/data/syracuse/clean_data'
street_lines_dir: '/mnt/data/syracuse/raw/Street_Line_files'
city_tax_dir: '/mnt/data/syracuse/raw/SyracuseTaxTables/Syracuse_City_Tax'
water_services_dir: '/mnt/data/syracuse/raw/SyracuseTaxTables/water_services'
...
19 changes: 19 additions & 0 deletions etl/do_etl.sh
@@ -0,0 +1,19 @@
#!/bin/bash
#do the whole etl process for the
#syracuse project

#this script finds all the bash scripts
#that start with etl and end in .sh,
#cds into each script's directory, executes
#the script, and then moves on to the next directory

eval $(cat model/config/secret_default_profile.yaml | sed 's/^/export /' | sed 's/: /=/')

for script in $(find ./ -name 'etl*.sh')
do
    echo "${script}";
    DIR=$(dirname "${script}")
    #run in a subshell so the working directory is restored afterwards
    (cd "${DIR}" && bash "$(basename "${script}")")
done
5 changes: 5 additions & 0 deletions etl/geology/etl_geology.sh
@@ -0,0 +1,5 @@
#!/bin/sh

# Data downloaded from this website: http://mrdata.usgs.gov/geology/state/state.php?state=NY
DATA_DIR="/mnt/data/syracuse/raw/NYgeol_dd/"
shp2pgsql -d -s 4267:2261 ${DATA_DIR}nygeol_poly_dd.shp soil.geology | psql
7 changes: 7 additions & 0 deletions etl/road_ratings/etl_road_ratings.sh
@@ -0,0 +1,7 @@
#!/bin/bash

python wrangle_road_ratings.py
psql -f road_ratings_import.sql
# Calls to the Google API take time; the geocoding step below can be commented out and rerun later if desired.
python geocode_road_ratings.py
psql -f generate_road_rating_geom.sql