DSSG Summer 2016 Work

dssg · Oct 6, 2016 · d59383c · d59383c
commit d59383c
Showing 68 changed files with 4,048 additions and 0 deletions.
diff --git a/.gitignore b/.gitignore
@@ -0,0 +1,11 @@
+data/
+*.DS_Store
+sensitive.py
+*~
+*.swp
+*.pyc
+.ipynb_checkpoints
+#*#
+secret_default_profile.yaml
+*.csv
+*.png
diff --git a/.travis.yml b/.travis.yml
@@ -0,0 +1,12 @@
+language: python
+python:
+  - "2.7"
+
+before_install:
+    - sudo apt-get install -qq python-numpy python-scipy python-matplotlib
+
+# command to install dependencies
+install: "pip install -r requirements.txt"
+
+# command to run tests
+script: python -m pytest
diff --git a/README.md b/README.md
@@ -0,0 +1,115 @@
+# Syracuse
+[![Build Status](https://travis-ci.com/dssg/syracuse.svg?token=qr1WKDpoEiNDipEKFzrb&branch=master)](https://travis-ci.com/dssg/syracuse)
+
+## About
+Syracuse is a city located in Onondaga County in Central New York. It has a
+population of about 150,000 and a metro area population of approximately
+665,000. The Syracuse Innovation Team is a Bloomberg Philanthropies funded
+office created in 2015. It was set up with a specific focus on solving
+infrastructure problems. The city has a rich history of innovation, but
+the city government currently does little work that relies on data to
+inform its decisions. Having heard about the work that the Center
+for Data Science and Public Policy (DSaPP) did with Cincinnati on proactive blight
+reduction during the 2015 Data Science for Social Good (DSSG) Fellowship,
+the Syracuse Innovation Team reached out to DSaPP about participating in the
+2016 Fellowship. Infrastructure, particularly the state of water mains in the
+city, is especially important. Based on a review of prior DSSG projects, the city
+believes that a partnership could help push data-led initiatives
+forward, ultimately benefiting both the infrastructure as a whole and the residents.
+More information can be found
+[here](http://dssg.uchicago.edu/project/early-warning-system-for-water-infrastructure-problems/).
+
+## Project Overview
+This project entails designing and implementing a data-driven process to
+proactively address water main breaks and leaks. The ultimate goal is to predict
+areas where water mains are most at risk of breaking, and to identify which features
+best predict a water main break (e.g., year laid, materials, soil composition).
+
+---
+## Installation
+
+### Get the code
+```
+git clone https://github.com/dssg/syracuse
+cd syracuse
+```
+
+### Python Dependencies
+```
+cd syracuse
+pip install -r requirements.txt
+```
+
+### Database Configuration
+
+Database type: *PostgreSQL 9.4*
+with the PostGIS extension
+```
+syracuse=> select PostGIS_full_version();
+                                                                                postgis_full_version
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
+ POSTGIS="2.1.8 r13780" GEOS="3.5.0-CAPI-1.9.0 r4084" PROJ="Rel. 4.9.2, 08 September 2015" GDAL="GDAL 1.11.4, released 2016/01/25" LIBXML="2.9.1" LIBJSON="UNKNOWN" TOPOLOGY RASTER
+```
+
+
+Database credentials are read from
+*/model/config/secret_default_profile.yaml*.
+Example:
+```
+PGPORT: 5432
+PGHOST: "postgres.123fake.com"
+PGDATABASE: "123fake"
+PGPASSWORD: "123fake"
+```
+---
+
+## Load data into Postgres
+See the etl directory for details.
+```
+bash ./etl/do_etl.sh
+```
+
+## Create features from the data
+See the model/features directory for details.
+```
+bash ./model/features/do_features.sh
+```
+---
+
+## Run the modeling pipeline
+See model/README.md for details.
+
+---
+
+## Directory Structure
+```
+.
+├── config
+├── descriptive_stats
+│   ├── mains_streets_stats
+│   └── water_work_orders
+├── etl
+│   ├── bin
+│   ├── geology
+│   ├── road_ratings
+│   ├── soil
+│   ├── street_line_data
+│   ├── tax_data
+│   ├── updated_main_data
+│   ├── waterorders
+│   └── water_system
+├── model
+│   ├── config
+│   ├── features
+│   └── log
+├── models_evaluation
+└── results
+    └── figures
+
+```
+
+
+## Low-hanging fruit TODO
+- Implement a logger instead of print statements
+- Make sure the package is Python 3 compatible
+- Add more unit tests that cover the whole pipeline
diff --git a/config/example_default_profile.yaml b/config/example_default_profile.yaml
@@ -0,0 +1,4 @@
+PGPORT:
+PGHOST: ""
+PGDATABASE: ""
+PGPASSWORD: ""
diff --git a/datafiles.yaml b/datafiles.yaml
@@ -0,0 +1,7 @@
+datafiles:
+    excel_water_work_orders: /mnt/data/syracuse/raw/WaterWorkOrders_2004-2015.xlsx
+    updated_water_work_orders: /mnt/data/syracuse/raw/Main_Breaks_And_Leaks_Geocoded.csv
+datadir:
+    raw_data_dir : '/mnt/data/syracuse/raw'
+    clean_data_dir : '/mnt/data/syracuse/clean_data'
+    sensitive_data_dir: '/mnt/data/syracuse/sensitive'
diff --git a/etl/README.md b/etl/README.md
@@ -0,0 +1,26 @@
+ETL Directory for loading data into PostgreSQL
+
+### Key files/folders
+
+- datafiles.yaml -- contains hard-coded paths to the data files
+- do_etl.sh -- walks through each subdirectory and runs the bash script
+beginning with etl.
+- bin directory -- contains scripts to convert DBF files to CSV and to import shapefiles into PostgreSQL.
+
+### Data
+
+| Dirname        | Type           | Description  |
+| ------------- |:-------------:|:-----|
+| geology      | GIS | Geological composition data imported into the soil.geology table |
+| road_ratings | CSV | Syracuse road ratings by year imported into the roads schema |
+| soil         | GIS | Soil composition data imported into the soil schema |
+| street_line_data | GIS | Street lines file imported into the streets schema |
+| water_system | GIS | Several GIS layers describing the Syracuse water system imported into the water_system schema |
+| waterorders | Excel and CSV | Record of work orders from the water department from 2004-2016 |
+| create_tables | SQL | Script for creating tables in PostgreSQL database |
+| updated_main_data | DBF | Updated water main data provided by City of Syracuse based on extraction from logbooks |
+| tax_data | GIS | Tax parcel data from Onondaga County, including the age of the structure on each parcel |
+
+### Projections
+
+All layers are converted to the New York State Plane (Central) projection, [SRID:2261](http://spatialreference.org/ref/epsg/2261/).
diff --git a/etl/bin/dbfToCsv.py b/etl/bin/dbfToCsv.py
@@ -0,0 +1,50 @@
+#!/usr/bin/env python
+"""
+Convert DBF table to CSV
+========================
+
+Description
+-----------
+Converts a DBF table to CSV
+and writes it to temp.csv in the current directory.
+
+Usage
+-----
+```
+./dbfToCsv.py <dbfile>
+```
+"""
+from __future__ import print_function
+
+import sys
+
+import pandas as pd
+from dbfread import DBF
+
+def convert_to_df(dbf_file):
+    """
+    Converts contents of dbf file
+    to a DataFrame
+
+    Input
+    -----
+    dbf_file: str
+       name of dbf file
+
+    Output
+    ------
+    df: DataFrame
+       Dataframe Object
+    """
+    table = DBF(dbf_file)
+    df = pd.DataFrame(iter(table))  # one row per DBF record
+    return df
+
+if __name__ == "__main__":
+    if len(sys.argv) < 2:
+        print(__doc__)
+        sys.exit(1)
+
+    dbf_file = sys.argv[1]
+    df = convert_to_df(dbf_file)
+    df.to_csv('temp.csv', index=False)  # written to the working directory
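The script writes its result to `temp.csv`. A variant that streams CSV to stdout, so the output can be piped onward, could look like the Python 3 sketch below. It uses only the standard library and assumes records are dict-like rows that share the same keys, which is what iterating a `dbfread.DBF` table yields. The helper name `write_csv` and the sample rows are hypothetical.

```python
import csv
import io
import sys


def write_csv(records, out=sys.stdout):
    """Write an iterable of dict-like records as CSV to `out`.

    Hypothetical helper: the header is taken from the first record,
    so all records are assumed to share the same keys.
    """
    writer = None
    for record in records:
        if writer is None:
            writer = csv.DictWriter(out, fieldnames=list(record.keys()),
                                    lineterminator="\n")
            writer.writeheader()
        writer.writerow(record)


# Made-up rows standing in for DBF records.
rows = [{"MAIN_ID": 1, "YEAR_LAID": 1921},
        {"MAIN_ID": 2, "YEAR_LAID": 1954}]

buffer = io.StringIO()
write_csv(rows, buffer)
```

With `dbfread` installed, passing a `DBF` table object instead of `rows` and leaving `out` at its default would send the CSV to stdout without building a DataFrame first.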
diff --git a/etl/bin/load_shapefiles.sh b/etl/bin/load_shapefiles.sh
@@ -0,0 +1,84 @@
+#!/bin/bash
+#ETL script for importing shape files.
+
+usage="./load_shapefiles.sh [-p projection] -y yaml_dir -s schema -t table [-f shapefilename]"
+
+if [ ${#} -eq 0 ]
+then
+    echo "${usage}"
+    exit 1;
+fi
+
+function die () {
+    # die errormessage [error_number]
+    local errmsg="$1" errcode="${2:-1}"
+    echo "ERROR: ${errmsg}"
+    exit "${errcode}"
+}
+
+
+#------------------------------------------------
+# process inputs
+#-------------------------------------------------
+projection=2261 #standard projection is NYState
+file=""
+while getopts hp:y:s:t:f: OPT; do
+    case "${OPT}" in
+        h)  echo "${usage}"
+            exit 0
+            ;;
+        p)  projection="${OPTARG}"
+            ;;
+        y)  yaml_dir="${OPTARG}"
+            ;;
+        s)  schema="${OPTARG}"
+            ;;
+        t)  table="${OPTARG}"
+            ;;
+        f)  file="${OPTARG}"
+            ;;
+        ?)  die "unknown option or missing argument; see -h for usage" 2
+            ;;
+    esac
+done
+echo "projection: ${projection}"
+echo "yaml_dir ${yaml_dir}"
+echo "schema ${schema}"
+echo "table ${table}"
+echo "file ${file}"
+
+
+dirname=$(grep "${yaml_dir}" ./../datafiles.yaml |\
+ awk -F: '{print $2}' | sed "s/'//g" | sed 's/ //g');
+
+
+if [ -z "${file}" ]
+then
+    shapefile=$(ls "${dirname}" | grep '\.shp$')
+else
+    shapefile=$(basename "${file}")
+fi
+
+# NOTE: this reassignment overrides any value passed with -p
+projection=2261
+
+echo "shapefile: ${shapefile}"
+echo "dirname: ${dirname}"
+echo "projection: ${projection}"
+#check that there is exactly one shapefile
+
+# count non-empty lines; quoting keeps one line per file name
+num=$(printf '%s\n' "${shapefile}" | grep -c .)
+if [ "${num}" -ne 1 ]
+then
+    echo "Should only be one shapefile, not ${num}";
+    exit 1;
+fi
+
+
+#create table and schema
+psql -c "drop table if exists ${schema}.${table}"
+psql -c "create schema if not exists ${schema}"
+
+#import the data
+shp2pgsql -s "${projection}" -d "${dirname}/${shapefile}" "${schema}.${table}" | psql
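The `grep | awk | sed` pipeline that resolves `dirname` is compact but easy to misread. The Python 3 sketch below mirrors it step by step for illustration only. `lookup_dirs` is a hypothetical name, and a plain substring test stands in for `grep`, which actually treats the key as a regular expression.

```python
def lookup_dirs(yaml_text, key):
    """Mimic: grep KEY file | awk -F: '{print $2}' | sed s/'//g | sed 's/ //g'."""
    values = []
    for line in yaml_text.splitlines():
        if key in line:  # grep: any line containing the key
            fields = line.split(":")
            # awk -F: '{print $2}' -> text between the first and second colon
            field = fields[1] if len(fields) > 1 else ""
            # the two sed calls drop single quotes and spaces
            values.append(field.replace("'", "").replace(" ", ""))
    return values


# Entries copied from etl/datafiles.yaml.
datafiles = """datadir:
    raw_data_dir : '/mnt/data/syracuse/raw'
    street_lines_dir: '/mnt/data/syracuse/raw/Street_Line_files'
"""

one_match = lookup_dirs(datafiles, "street_lines_dir")
two_matches = lookup_dirs(datafiles, "raw")
```

The second call shows the main caveat of this approach: a key that also appears inside another line's path matches more than once, so the value passed with `-y` needs to be unique within the file.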
diff --git a/etl/datafiles.yaml b/etl/datafiles.yaml
@@ -0,0 +1,10 @@
+datafiles:
+    excel_water_work_orders: /mnt/data/syracuse/raw/WaterWorkOrders_2004-2015.xlsx
+    updated_water_work_orders: /mnt/data/syracuse/raw/Main_Breaks_And_Leaks_Geocoded.csv
+datadir:
+    raw_data_dir : '/mnt/data/syracuse/raw'
+    clean_data_dir : '/mnt/data/syracuse/clean_data'
+    street_lines_dir: '/mnt/data/syracuse/raw/Street_Line_files'
+    city_tax_dir: '/mnt/data/syracuse/raw/SyracuseTaxTables/Syracuse_City_Tax'
+    water_services_dir: '/mnt/data/syracuse/raw/SyracuseTaxTables/water_services'
+...
diff --git a/etl/do_etl.sh b/etl/do_etl.sh
@@ -0,0 +1,19 @@
+#!/bin/bash
+#do the whole etl process for the
+#syracuse project
+
+#this script finds all the bash scripts
+#that start with etl and end in .sh,
+#cds into each script's directory, executes
+#the script, and then moves on to the next directory
+
+eval $(cat model/config/secret_default_profile.yaml | sed 's/^/export /' | sed 's/: /=/')
+
+for script in $(find ./ -name 'etl*.sh')
+do
+    echo "${script}"
+    DIR=$(dirname "${script}")
+    cd "${DIR}"
+    bash "$(basename "${script}")"
+    cd -
+done
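The discovery step of this loop (find every `etl*.sh`, then run it from its own directory) can be sketched in Python 3 as follows. This is an illustrative assumption about how one might port the loop, not part of the commit. `find_etl_scripts` is a hypothetical helper, and the temporary tree only imitates the `etl/` layout.

```python
import fnmatch
import os
import tempfile


def find_etl_scripts(root):
    """Return sorted (directory, filename) pairs for files matching etl*.sh."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in fnmatch.filter(filenames, "etl*.sh"):
            found.append((dirpath, name))
    return sorted(found)


# Build a throwaway tree that resembles the etl/ layout.
root = tempfile.mkdtemp()
for sub, name in [("geology", "etl_geology.sh"),
                  ("road_ratings", "etl_road_ratings.sh"),
                  ("bin", "load_shapefiles.sh")]:
    os.makedirs(os.path.join(root, sub))
    open(os.path.join(root, sub, name), "w").close()

scripts = find_etl_scripts(root)
```

Each pair could then be run with `subprocess.check_call(["bash", name], cwd=dirpath)`, which executes the script from its own directory without changing the caller's working directory.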
diff --git a/etl/geology/etl_geology.sh b/etl/geology/etl_geology.sh
@@ -0,0 +1,5 @@
+#!/bin/sh
+
+# Data downloaded from this website: http://mrdata.usgs.gov/geology/state/state.php?state=NY
+DATA_DIR="/mnt/data/syracuse/raw/NYgeol_dd/"
+shp2pgsql -d -s 4267:2261 ${DATA_DIR}nygeol_poly_dd.shp soil.geology | psql
diff --git a/etl/road_ratings/etl_road_ratings.sh b/etl/road_ratings/etl_road_ratings.sh
@@ -0,0 +1,7 @@
+#!/bin/bash
+
+python wrangle_road_ratings.py
+psql -f road_ratings_import.sql
+# Geocoding calls the Google API and takes time; comment out the next line to skip it on reruns.
+python geocode_road_ratings.py
+psql -f generate_road_rating_geom.sql