-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathStormDataAnalysis.rmd
374 lines (299 loc) · 15.2 KB
/
StormDataAnalysis.rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
---
title: "Reproducible Research: Peer-Assessed Project 2"
author: "Christopher Jones"
date: "11/16/2017"
output:
html_document:
echo: true
code_folding: hide
keep_md: true
---
## Historical US Storm Data Analysis:
## A survey of historical storm data with respect to population health and economic consequences {.tabset}
### Synopsis
We examine historical US storm data from NOAA covering 1950-2011, to assess types of events that are most harmful to human health, and also which types of events have the greatest economic consequences. Many nominally distinct event types are similar (e.g. different types of floods). We here report the storm event types as presented in the data, and do not go beyond the data by attempting to amalgamate similar types.
Primary findings:
* Tornadoes are the largest historical contributor to weather event fatalities, at 37% of the 15,145 total.
* Tornadoes are also the largest historical cause of weather event injury, at 65% of the 140,528 total.
* Drought is the largest cause of crop damage at almost ~$14 billion of the ~$49 billion total.
* Flooding is the largest cause of property damage at ~$144.5 billion of the ~427 billion total.
* A fuller analysis would need to take into account variables like geography, date, age, gender, and currency depreciation.
### Data Loading/Processing
#### Data Loading:
Loading the data is simple; the code is more lengthy because it is general-purpose, checking for the minimal amount of work to be done to load (need download? need unzip? only need to read the file?).
First we load the packages to be used in this report.
```{r package_load}
# packages we'll be using
library(plyr)
library(ggplot2)
library(grid)
library(gridBase)
library(treemap)
library(scales)
library(formattable)
```
Next we load the storm data into memory. This code includes multiple checks to load the data with a minimum of effort.
```{r data_load}
# =======================
# Data loader
# =======================
# download data if it isn't already in place in current directory
filename <- "repdata_data_StormData"
extension.data <- ".csv"
extension.zip <- ".bz2"
zipurl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
if (!file.exists(paste(filename, extension.data, sep='')))
{
message("Data file not found, checking for data zip file
")
if (!file.exists(paste(c(filename, extension.data, extension.zip), collapse='')))
{
message("Data zip file not found. Downloading.")
download <- download.file(zipurl, destfile=paste(c(filename, extension.data, extension.zip), collapse=''))
foundzip <- TRUE
message("Data zip file downloaded.")
} else
{
foundzip <- TRUE
}
if (foundzip)
{
message("Found data zip file, unzipping.")
library(R.utils)
bunzip2(paste(c(filename, extension.data, extension.zip), collapse='')
, paste(c(filename, extension.data), collapse='')
, remove = FALSE
, overwrite = TRUE
)
unlink(paste(filename, extension.zip, sep=''))
message("Data zip file unzipped.")
}
if (file.exists(paste(filename, extension.data, sep='')))
{
message("Data file exists, ready for preprocessing")
} else
{
stop("Data file not found, reason unknown. Halting execution.")
}
}
# read data only if not already loaded
readdata <- TRUE
if (exists("data.raw"))
{
if(nrow(data.raw) == 902297) # hardcoding numbers is bad m'kay?
{
readdata <- FALSE
}
}
if (readdata)
{
message("Reading data file:")
data.raw <- read.csv(paste(filename, extension.data, sep=''), header=TRUE)
}
message("Raw data loaded.")
```
#### Data Processing:
Next we perform data processing. For the multiple output analyses required, this is somewhat lengthy, but not complicated.
The following steps are performed in the below code chunks:
1. Subset the data with only the fields needed for this project
2. For the economic damage, convert order of magnitude exponents into complete dollar amounts
3. Summarize into aggregate totals
4. There are many (~1,000) distinct storm events, to to aid the user in focussing on the highest impact events, we do the following:
+ For economic analysis, remove events which have no property or crop damage
+ For health analysis, remove events which have no deaths or injuries
+ For each analysis, order the events by descending dollar/fatality/injury value, and take events from the top until a THRESHHOLD of 85% of the total loss is reached. The remaining events are then lumped into an Other category and included in the output. This THRESHHOLD computation is implemented by the creation of running total/percentage columns to the aggregated data.
Processing common to both health and economic analyses:
```{r data_processing_common}
#===========================
# Data processing - common
#===========================
# project out the columns we're interested in
data.proc <- data.raw[,c("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG","PROPDMGEXP","CROPDMG","CROPDMGEXP")]
```
Processing for only economic analyses:
```{r data_processing_economic}
# perform order of magnitude calculation to get actual dollar amounts for prop & crop damage
data.proc$prop_dmg <- data.proc$PROPDMG * sapply(data.proc$PROPDMGEXP, function(str_exp) {switch(tolower(str_exp), "h" = 100, "k" = 1000, "m" = 1000000, "b" = 1000000000, if (is.numeric(str_exp)) 10 ** as.numeric(str_exp) else 1)})
data.proc$crop_dmg <- data.proc$CROPDMG * sapply(data.proc$CROPDMGEXP, function(str_exp) {switch(tolower(str_exp), "h" = 100, "k" = 1000, "m" = 1000000, "b" = 1000000000, if (is.numeric(str_exp)) 10 ** as.numeric(str_exp) else 1)})
# create totals by event for prop & crop damage value
econ_dmg <- ddply(data.proc, .(EVTYPE), summarize, prop_dmg = sum(prop_dmg), crop_dmg = sum(crop_dmg))
# only interested in events with damage
econ_dmg <- econ_dmg[(econ_dmg$prop_dmg > 0 | econ_dmg$crop_dmg > 0), ]
# create running total and % of total
propDmgSorted <- econ_dmg[order(econ_dmg$prop_dmg, decreasing = T), ]
propDmgSorted$cum <- cumsum(propDmgSorted$prop_dmg)
propDmgSorted$running_pct <- propDmgSorted$cum / sum(propDmgSorted$prop_dmg)
cropDmgSorted <- econ_dmg[order(econ_dmg$crop_dmg, decreasing = T), ]
cropDmgSorted$cum <- cumsum(cropDmgSorted$crop_dmg)
cropDmgSorted$running_pct <- cropDmgSorted$cum / sum(cropDmgSorted$crop_dmg)
head(propDmgSorted[, c("EVTYPE", "prop_dmg")], 5)
head(cropDmgSorted[, c("EVTYPE", "crop_dmg")], 5)
# we take 90% of total damage as threshhold to include in report,
# and create an Other event category for the rest.
threshhold <- .85
# property damage report data
propDmgReportData <- propDmgSorted[propDmgSorted$running_pct <= threshhold, ]
propDmgReportData <- rbind(propDmgReportData, c("Other"
, sum(propDmgSorted$prop_dmg) - max(propDmgReportData$cum)
, 0
, sum(propDmgSorted$prop_dmg)
, 1
)
)
propDmgReportData$prop_dmg <- as.numeric(propDmgReportData$prop_dmg)
propDmgReportData$crop_dmg <- as.numeric(propDmgReportData$crop_dmg)
propDmgReportData$cum <- as.numeric(propDmgReportData$cum)
propDmgReportData$running_pct <- as.numeric(propDmgReportData$running_pct)
other_count_prop <- as.character(nrow(propDmgSorted) - nrow(propDmgReportData) + 1)
propDmgReportData$event_pct <- paste(ifelse(propDmgReportData$EVTYPE != "Other", as.character(propDmgReportData$EVTYPE), paste("Other", paste("(", other_count_prop, ")", sep=""), sep=" "))
, currency(propDmgReportData$prop_dmg, digits=0L)
, percent(propDmgReportData$prop_dmg / sum(propDmgReportData$prop_dmg))
, sep="\n"
)
# crop damage report data
cropDmgReportData <- cropDmgSorted[cropDmgSorted$running_pct <= threshhold, ]
cropDmgReportData <- rbind(cropDmgReportData, c("Other"
, 0
, sum(cropDmgSorted$crop_dmg) - max(cropDmgReportData$cum)
, sum(cropDmgSorted$crop_dmg)
, 1
)
)
cropDmgReportData$prop_dmg <- as.numeric(cropDmgReportData$prop_dmg)
cropDmgReportData$crop_dmg <- as.numeric(cropDmgReportData$crop_dmg)
cropDmgReportData$cum <- as.numeric(cropDmgReportData$cum)
cropDmgReportData$running_pct <- as.numeric(cropDmgReportData$running_pct)
other_count_crop <- as.character(nrow(cropDmgSorted) - nrow(cropDmgReportData) + 1)
cropDmgReportData$event_pct <- paste(ifelse(cropDmgReportData$EVTYPE != "Other", as.character(cropDmgReportData$EVTYPE), paste("Other", paste("(", other_count_crop, ")", sep=""), sep=" "))
, currency(cropDmgReportData$crop_dmg, digits=0L)
, percent(cropDmgReportData$crop_dmg / sum(cropDmgReportData$crop_dmg))
, sep="\n"
)
message("Economic damage report data created.")
```
Processing for only health damage:
```{r data_processing_health}
# create totals by event for health damage
health_dmg <- ddply(data.proc, .(EVTYPE), summarize, fat = sum(FATALITIES), inj = sum(INJURIES))
# only interested in events with deaths or injuries
health_dmg <- health_dmg[(health_dmg$fat > 0 | health_dmg$inj > 0), ]
# create running total and % of total
fatSorted <- health_dmg[order(health_dmg$fat, decreasing = T), ]
injSorted <- health_dmg[order(health_dmg$inj, decreasing = T), ]
fatSorted$cum <- cumsum(fatSorted$fat)
injSorted$cum <- cumsum(injSorted$inj)
fatSorted$running_pct <- fatSorted$cum / sum(fatSorted$fat)
injSorted$running_pct <- injSorted$cum / sum(injSorted$inj)
head(fatSorted[, c("EVTYPE", "fat")], 5)
head(injSorted[, c("EVTYPE", "inj")], 5)
# we take the events totalling 85% of the fatalties/injuries
th_health <- .85
# fatality report data
fatReportData <- fatSorted[fatSorted$running_pct <= th_health, ]
fatReportData <- rbind(fatReportData, c("Other"
, sum(fatSorted$fat) - max(fatReportData$cum)
, 0
, sum(fatSorted$fat)
, 1
)
)
fatReportData$fat <- as.numeric(fatReportData$fat)
fatReportData$inj <- as.numeric(fatReportData$inj)
fatReportData$cum <- as.numeric(fatReportData$cum)
fatReportData$running_pct <- as.numeric(fatReportData$running_pct)
other_count_fat <- as.character(nrow(fatSorted) - nrow(fatReportData) + 1)
fatReportData$event_pct <- paste(ifelse(fatReportData$EVTYPE != "Other", as.character(fatReportData$EVTYPE), paste("Other", paste("(", other_count_fat, ")", sep=""), sep=" "))
, formatC(fatReportData$fat, format="d", big.mark=",")
, percent(fatReportData$fat / sum(fatReportData$fat))
, sep="\n"
)
# injury report data
injReportData <- injSorted[injSorted$running_pct <= th_health, ]
injReportData <- rbind(injReportData, c("Other"
, 0
, sum(injSorted$inj) - max(injReportData$cum)
, sum(injSorted$inj)
, 1
)
)
injReportData$fat <- as.numeric(injReportData$fat)
injReportData$inj <- as.numeric(injReportData$inj)
injReportData$cum <- as.numeric(injReportData$cum)
injReportData$running_pct <- as.numeric(injReportData$running_pct)
other_count_inj <- as.character(nrow(injSorted) - nrow(injReportData) + 1)
injReportData$event_pct <- paste(ifelse(injReportData$EVTYPE != "Other", as.character(injReportData$EVTYPE), paste("Other", paste("(", other_count_inj, ")", sep=""), sep=" "))
, formatC(injReportData$inj, format="d", big.mark=",")
, percent(injReportData$inj / sum(injReportData$inj))
, sep="\n"
)
message("Health damage report data created.")
```
### Results: Health
To easily visualize the comparative historical impacts of weather events, we use tree maps. Below are tree maps showing the fatalities and injuries from events.
As described earlier, the "Other" category is the catch-all for all the events remaining after those responsible for 85% of the fatalities/injuries we picked out.
```{r health_damage, fig.width=9, fig.height=9}
# make health impact visualizations
par(oma=c(1, 1, 1, 1))
grid.newpage()
grid.rect()
pushViewport(viewport(layout=grid.layout(2, 1)))
vp1 <- viewport(layout.pos.col=1, layout.pos.row=1)
pushViewport(vp1)
treemap(fatReportData
, index="event_pct"
, vSize="fat"
, type="index"
, fontsize.title=14
, title= paste("Fatalities By Event, 1950-2011 - Total: ", formatC(sum(fatSorted$fat), format="d", big.mark=","), sep='')
, vp=vp1
)
popViewport()
vp2 <- viewport(layout.pos.col=1, layout.pos.row=2)
pushViewport(vp2)
treemap(injReportData
, index="event_pct"
, vSize="inj"
, type="index"
, fontsize.title=14
, title= paste("Injuries By Event, 1950-2011 - Total: ", formatC(sum(injSorted$inj), format="d", big.mark=","), sep='')
, vp=vp2
)
popViewport()
```
### Results: Economic
To easily visualize the comparative historical impacts of weather events, we again use tree maps. Below are tree maps showing the (unadjusted) USD value of crop and property damage from weather events.
As described in the Processing sub-section, the "Other" category is the catch-all for all the events remaining after those responsible for 85% of the fatalities/injuries we picked out.
```{r economic_damage, fig.width=9, fig.height=9}
# make economic damage visualization
par(oma=c(1, 1, 1, 1))
grid.newpage()
grid.rect()
pushViewport(viewport(layout=grid.layout(2, 1)))
vp1 <- viewport(layout.pos.col=1, layout.pos.row=1)
pushViewport(vp1)
treemap(cropDmgReportData
, index='event_pct'
, vSize='crop_dmg'
, type='index'
, fontsize.title=14
, title= paste('Crop Damage By Event, 1950-2011 - Total: ', currency(sum(cropDmgSorted$crop_dmg), digits=0L), sep='')
, vp=vp1
)
popViewport()
vp2 <- viewport(layout.pos.col=1, layout.pos.row=2)
pushViewport(vp2)
treemap(propDmgReportData
, index='event_pct'
, vSize='prop_dmg'
, type='index'
, fontsize.title=14
, title= paste('Property Damage By Event, 1950-2011 - Total: ', currency(sum(propDmgSorted$prop_dmg), digits=0L), sep='')
, vp=vp2
)
popViewport()
```
### References
Coursera page for this project: https://www.coursera.org/learn/reproducible-research/peer/OMZ37/course-project-2
My github repo for this project: https://github.com/seedjay1/RepData_PA2
Raw data used for the analysis: https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2
RPubs link for this report: http://rpubs.com/seedjay/sda