---
title: "Peel Accident Analysis"
author: "Mausam Duggal, WSP|PB, Systems Analysis"
date: "June 25, 2016"
output: html_document
---
```{r init, echo=FALSE, message=FALSE}
library(dplyr)
library(ggplot2)
library(knitr)
library(lubridate)
library(foreign)
library(gtools)
```
### Mining of Peel's Truck Collision Data
This analysis focuses on the truck collision data that the Region of Peel provided to the team as part of the study. The intention is to mine the data and spot any obvious trends. Going into the analysis, I initially expected relatively more accidents in the winter season and fewer in the summer, when daylight saving time and longer days in general improve visibility.
In the final analysis, it is our intention to geocode these locations onto a GIS network and tag the daily volumes to them. Once that is done, we would like to explore a mathematical relationship between volume, season, hour, and daily flows on the rate of accidents.
We have not yet geocoded the data, but the current analysis is the first step in that process.
#### INPUT DATA
We start by setting the **input directory**, loading the dataset, and mining it in the hope of spotting some interesting patterns.
```{r Set Working Directory}
opts_knit$set(root.dir = 'c:/personal/r')
```
```{r I am sick and tired of NAs so a function to set them for good}
f_rep <- function(df) {
# this function is used to set all NA values to zero in a dataframe
df[is.na(df)] <- 0
return(df)
}
```
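A quick illustration of `f_rep` on a toy data frame (hypothetical values, purely to show the behaviour):
```{r f_rep example}
#' every NA in the example data frame is replaced with zero
f_rep(data.frame(a = c(1, NA, 3), b = c(NA, 2, NA)))
```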
```{r batching in the DAY and HOUR dataset}
#' master file
acc <- read.csv(file = "c:/personal/r/PeelAccidents.csv", stringsAsFactors = FALSE)
#' now extract the month and time and create two new columns for the same
acc$month <- month(as.POSIXlt(acc$Accident.Date, format="%m/%d/%Y"))
acc$hour <- sapply(strsplit(acc$Accident.Time, ":"), "[", 1)
#' drop records with an unknown time, then convert the hour column to numeric for sorting
acc <- subset(acc, hour != "Unknown")
acc$hour <- as.numeric(acc$hour)
acc <- acc[order(acc$hour, acc$month), ]
# create queries for populating the season field
mut1 <- acc$month >= 3 & acc$month <= 5
mut2 <- acc$month >= 6 & acc$month <= 8
mut3 <- acc$month >= 9 & acc$month <= 11
mut4 <- acc$month >= 1 & acc$month <= 2 | acc$month == 12
# populate the season field
acc[mut1, "season"] <- "Spring"
acc[mut2, "season"] <- "Summer"
acc[mut3, "season"] <- "Fall"
acc[mut4, "season"] <- "Winter"
```
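A quick cross-tabulation of month against the new `season` field confirms that every month maps to exactly one season:
```{r check season coding}
#' each month should fall under exactly one season
table(acc$month, acc$season)
```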
#### GO ALL IN AND SEE WHAT THE DATA SAYS
Now let's plot the variables and see what the data show. The first attempt will simply **plot the accident counts by season, year, and hour.**
```{r Now plot the data}
#' Other than in 2015, the highest count of accidents in any given hour occurs in Fall, Spring, or Summer.
#' Winter by far seems to have the fewest truck accidents, although 2015 stands out marginally.
#' Interestingly, Summer and Spring 2015 are the lowest-accident seasons from 2011 onwards.
ggplot(acc, aes(hour, fill = season)) +
geom_histogram(binwidth = 1) + facet_grid(Accident.Year~season)
```
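To back the visual impression with numbers, a simple count of records by year and season (a sketch built on the same `acc` data frame):
```{r counts by year and season}
#' tabulate accident counts by year and season to complement the plot
acc %>%
  group_by(Accident.Year, season) %>%
  summarise(count = n()) %>%
  kable()
```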
#### FOCUS ON WINTER AND SPRING
**Spring and Winter** are interesting. One would expect these two seasons to show more distinctive patterns, given the uncertain weather.
```{r Winter/Spring patterns}
#' only keep spring and winter records
acc.filter <- subset(acc, season == "Winter" | season == "Spring")
#' plot to see if there is a pattern of accidents during the peak periods;
#' the middle of each peak period is marked with a black line.
#' There does not seem to be a significant spike in accidents during the peak periods, which could be
#' attributed to truck traffic generally being lower at those times. The bulk of the accidents
#' seem to be taking place between the two lines.
ggplot(acc.filter, aes(hour, fill = season)) +
geom_histogram(binwidth = 1) + facet_grid(season~Accident.Year) +
geom_vline(xintercept = 8, color = "black") +
geom_vline(xintercept = 17, color = "black")
```
```{r fall patterns}
acc.filter1 <- subset(acc, season == "Fall")
#' as above, plot to see if there is a pattern of accidents during the peak periods;
#' the middle of each peak period is marked with a black line.
#' Again, there is no significant spike during the peaks; the bulk of the accidents
#' seem to be taking place between the two lines.
ggplot(acc.filter1, aes(hour, fill = season)) +
geom_histogram(binwidth = 1) + facet_grid(.~Accident.Year) +
geom_vline(xintercept = 8, color = "black") +
geom_vline(xintercept = 17, color = "black")
```
#### What do we know so far?
I did not know what to expect when I started this analysis, but the results so far suggest that the bulk of the accidents take place during the off-peak. This suggests the accidents are more a function of the increase in truck traffic on the road during the off-peak periods than of the regular commuter flows one sees during the morning and evening peaks.
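To put a rough number on that claim, we can compare the share of collisions inside and outside the peak periods. The peak definition below (7-9 AM and 4-6 PM) is my assumption, chosen to bracket the black lines in the plots above:
```{r peak versus off-peak share}
#' flag records in the assumed peak hours (7-9 and 16-18) and tabulate the shares
acc$peak <- ifelse((acc$hour >= 7 & acc$hour <= 9) |
                   (acc$hour >= 16 & acc$hour <= 18), "Peak", "Off-peak")
round(prop.table(table(acc$peak)), 2)
```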
#### Geographical Constraints
Now let's batch in the geocoded locations that Kitty has put together, which have also been spatially matched to the GGHV4 network.
```{r}
peel <- read.dbf("c:/personal/r/AllPeelCollisions_spatialjoin1.dbf") %>%
  subset(., Classifica != "04 - Non-reportable")
links <- read.dbf("c:/personal/r/Peel_links.dbf")
```
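A quick record count after the non-reportable filter, just to gauge the size of each table:
```{r record counts}
#' number of spatially matched collision records and network links
c(collisions = nrow(peel), links = nrow(links))
```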
Plot the results by link speed, capacity, and lanes.
```{r}
ggplot(peel, aes(DATA2, fill = factor(DATA3))) +
  geom_histogram(binwidth = 5) +
  facet_grid(. ~ LANES) +
  xlab("Speed (km/hr)") +
  ggtitle("Accident distribution by link speed, capacity, and lanes") +
  labs(fill = "Capacity")
```
### Logistic Regression
Now let's look into developing a logistic regression model, the primary aim being to assign each link a probability of witnessing an accident. If the probability is over 0.5, that link is a candidate **collision hotspot**.
```{r}
# develop the dependent variable: all accident records are coded as 1 (the remaining links will be 0). Also create some other variables.
peel <- transform(peel, Dep = 1) %>%
  transform(., Spd = ifelse(LENGTH / (TIMAU / 60) > 59, 1, 0)) %>%
  transform(., vc = ifelse(VOLAU / (DATA3 * LANES) < 1.0, 0, 1)) %>%
  transform(., lsq = LANES * Spd)
```
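A quick cross-tabulation of the new speed and v/c dummies, as a sanity check on the derived fields:
```{r check derived dummies}
#' how the speed and volume-capacity dummies distribute over the collision links
table(Spd = peel$Spd, vc = peel$vc)
```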
```{r This code block creates all the necessary variables in the spatially matched data as well as links in Peel}
#' first for the spatial data
peel$Season <- acc$season[match(peel$Accident_N, acc$Accident.No.)]
peel$PP <- acc$hour[match(peel$Accident_N, acc$Accident.No.)]
peel <- transform(peel, weather = ifelse(Season == "Winter", 1, 0))  # winter dummy
peel <- transform(peel, ff = (LENGTH / DATA2) * 60)                  # free-flow travel time (min)
peel <- transform(peel, int = VOLAU * TIMAU)                         # volume-time interaction
peel <- transform(peel, rat = TIMAU / ff)                            # congested over free-flow time
peel <- transform(peel, timau = TIMAU)                               # congested travel time
peel <- transform(peel, hr = ifelse(PP > 9 & PP < 16, 1, 0))         # midday dummy (10:00-15:00)
peel <- transform(peel, vkt = VOLAU * LENGTH)                        # vehicle-kilometres travelled
peel <- transform(peel, vol = VOLAU)                                 # loaded auto volume
# clean the data: keep only the model variables (selected by position in the spatial-join output) and drop incomplete records
peel1 <- peel[, c(31, 37, 40, 41, 80:93)]
peel2 <- na.omit(peel1)
#' next for all the links in Peel region
links$Season <- "none"   # placeholder: no collision record on these links
links$PP <- 100          # placeholder hour outside the 0-23 range
links <- transform(links, weather = ifelse(Season == "Winter", 1, 0))
links <- transform(links, ff = (LENGTH/DATA2)*60)
links <- transform(links, int = VOLAU*TIMAU)
links <- transform(links, rat = TIMAU/ff)
links <- transform(links, timau = TIMAU)
links <- transform(links, hr = 0)
links <- transform(links, vkt = VOLAU*LENGTH)
links <- transform(links, vol = VOLAU)
links <- transform(links, Dep = 0) %>%
  transform(., Spd = ifelse(LENGTH / (TIMAU / 60) > 59, 1, 0)) %>%
  transform(., vc = ifelse(VOLAU / (DATA3 * LANES) < 1.0, 0, 1)) %>%
  transform(., lsq = LANES * Spd)
#' only keep links that do not already appear in the collision records
links1 <- anti_join(links, peel2, by = "ID")
#' Now create the estimation data
peel3 <- smartbind(peel2,links1) %>% .[, 1:18]
peel4 <- f_rep(peel3)
# some additional variables
peel4$ccap <- peel4$DATA3*peel4$LANES
peel4$lsq <- peel4$LANES * peel4$DATA2
peel4 <- transform(peel4, art = ifelse(DATA3 >799 & DATA3<1200, 1, 0))
```
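Before estimating, it is worth confirming the balance of the dependent variable in the estimation file (`Dep` = 1 marks a link with a recorded collision):
```{r check estimation balance}
#' collision links versus non-collision links entering the model
table(peel4$Dep)
```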
```{r Test the logistic model}
# the variables are as follows:
# rat - ratio of Congested time over Free Flow time
# art - dummy for identifying arterial classification
# ccap - corridor capacity i.e. lanes * lane capacity
# vol - loaded auto volumes
model <- glm(Dep ~ rat + art + ccap + vol, family = binomial(link = 'logit'), data = peel4)
summary(model)
```
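Logit coefficients are easier to interpret as odds ratios. Exponentiating them (a standard transformation, not part of the original script) gives the multiplicative change in the odds of a link being a collision site for a one-unit change in each variable:
```{r odds ratios}
#' express the logit coefficients as odds ratios
exp(coef(model))
```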
```{r Look at the coefficients table}
#' produce summary results
kable(summary(model)$coef, digits=6)
```
```{r Check model application}
# get the model's predicted probabilities
p <- as.data.frame(predict(model, peel4, type = "response"))
colnames(p) <- "Prob"
# bind the predicted probabilities
peel4 <- cbind(peel4, p)
# keep only the records with Dep == 1 (an accident was recorded on that link)
# to check how well the predictions have actually performed
check <- subset(peel4, Dep == 1)
```
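Finally, applying the 0.5 cut-off mentioned earlier: a minimal sketch (the `hotspot` flag and hit-rate line are my additions) to flag candidate hotspots and see what share of the links with a recorded collision the model actually recovers:
```{r flag hotspots}
#' flag links whose predicted probability exceeds the 0.5 cut-off
peel4$hotspot <- ifelse(peel4$Prob > 0.5, 1, 0)
#' share of observed collision links (Dep == 1) that the model flags as hotspots
mean(check$Prob > 0.5)
```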