---
**The Relationship Between Miles per Gallon and Transmission Type**
---
John Slough II
13 Jan 2015
===
**Executive Summary**
From our analysis of the mtcars dataset, we have determined that manual transmissions are, in general, better than automatic transmissions in terms of miles per gallon. In a linear regression model with transmission type as the only explanatory variable, a change from automatic to manual transmission increased mpg by 7.245; however, transmission type explained only 36% of the variation in mpg. A linear regression model of all significant variables (determined by ANOVA) explained 84% of the variation in mpg. It included only the variables weight and number of cylinders. Transmission type was not a significant contributor to the model. Furthermore, when transmission type was included in the model, Bootstrap Measures of Relative Importance showed that it contributed only about 14% of the r^2^ of 87%. It is recommended that the editors of *Motor Trend* consider the variables weight, number of cylinders, and possibly horsepower as the most significant explanatory variables of miles per gallon.
**Report**
**The Data**
We are to investigate the relationship between miles per gallon (numerical class variable, mpg) and a set of explanatory variables. The explanatory variables and their classes are:
1. cyl: number of cylinders (factor, 4,6,8)
2. disp: displacement (cu.in.) (numerical)
3. hp: gross horsepower (numerical)
4. drat: rear axle ratio (numerical)
5. wt: weight (1000 pounds) (numerical)
6. qsec: 1/4 mile time (numerical)
7. vs: V/S, V-engine or Straight engine (factor, V,S)
8. am: transmission type (factor, automatic, manual)
9. gear: number of forward gears (factor, 3,4,5)
10. carb: number of carburetors (factor, 1,2,3,4,6,8)
**Data Processing**
We will remove the variable qsec (1/4 mile time), as it is not a reasonable explanatory variable for mpg and is better seen as another outcome variable. In addition, it is necessary to code the variables with their proper class (factor or numerical).
```{r,echo=FALSE}
mtcars$cyl=factor(mtcars$cyl)
mtcars$vs=factor(mtcars$vs)
mtcars$gear=factor(mtcars$gear)
mtcars$carb=factor(mtcars$carb)
mtcars$am=factor(mtcars$am,labels=c('Automatic','Manual'))
auto=subset(mtcars, am=="Automatic")
man=subset(mtcars, am=="Manual")
```
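The chunk above recodes the factors but leaves qsec in the data frame; the models later in this report simply leave it out of their formulas. A minimal sketch of dropping it explicitly, working on a copy named `cars` (a hypothetical name, not used elsewhere in this report):

```r
# Work on a copy so the built-in mtcars dataset is left untouched.
# `cars` and `factor_cols` are illustrative names, not part of the report.
cars <- subset(mtcars, select = -qsec)

# Recode the categorical columns as factors.
factor_cols <- c("cyl", "vs", "gear", "carb")
cars[factor_cols] <- lapply(cars[factor_cols], factor)
cars$am <- factor(cars$am, labels = c("Automatic", "Manual"))

# Inspect the resulting class of every column.
sapply(cars, class)
```

Running `sapply(cars, class)` is a quick way to confirm that each variable has the class listed in the table of variables above.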
**Exploratory Data Analysis**
First we must determine whether or not there is actually a difference between automatic and manual transmissions in terms of mpg. The following boxplot appears to show that there is a difference.
```{r,echo=FALSE,fig.height=9,fig.width=13}
color1="#F8766D75"
color2="#00BFC475"
col1legend="#F8766D"
col2legend="#00BFC4"
boxplot(auto$mpg,man$mpg,col=c(color1,color2),varwidth=TRUE,xlab="transmission type",ylab="mpg",main="Comparison of MPG of Automatic vs. Manual Transmission",names=c("automatic","manual"))
stripchart(mpg ~ am, data = mtcars, vertical = TRUE,method = "jitter",
jitter = 0.15, pch = 16, col = c("salmon","darkblue"),
bg = "green", add = TRUE)
```
Another way to view the difference between automatic and manual transmissions' mpg is with a density plot. The plot below shows the densities for automatic and manual transmissions. Again, it appears that the manual cars tend to have a higher mpg, but with more variation.
```{r,echo=FALSE,fig.height=9,fig.width=13}
a=density(auto$mpg)
m=density(man$mpg)
plot(m,ylim=c(0,.1),col="darkblue",lwd=1,xlab="mpg",main="Density of MPG by Transmission Type for MtCars")
lines(a,col="red",lwd=1)
polygon(a, col=color1, border="red")
polygon(m,col=color2,border="darkblue")
legend("topright",c("automatic","manual"),lty=c(1,1),lwd=c(2,2),col=c("salmon","darkblue"),bty="")
```
```{r,echo=FALSE}
library(plyr)
### mean of transmission type
Transmission=ddply(mtcars,"am",summarize, avg=mean(mpg))
test=t.test(mpg ~am, data = mtcars)
t=t.test(mpg ~am, data = mtcars)$p.value
```
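To check that the means are different, we can perform a t-test (by default, R's t.test runs Welch's two-sample test, which does not assume equal variances). The t-test results in a p-value of `r t`. Since this is below 0.05, we reject the null hypothesis that the two means are equal.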
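As a sketch of what the t-test object contains beyond the p-value, the snippet below pulls out the two group means and the confidence interval. It works on a local copy (`cars`, a hypothetical name) so that it runs in a fresh R session.

```r
# Local copy with labelled transmission types (illustrative name).
cars <- transform(mtcars, am = factor(am, labels = c("Automatic", "Manual")))

# Welch two-sample t-test (the default for t.test).
welch <- t.test(mpg ~ am, data = cars)

# Group means as estimated by the test: Automatic first, Manual second.
welch$estimate

# Manual minus Automatic difference in mean mpg.
diff_means <- unname(welch$estimate[2] - welch$estimate[1])
round(diff_means, 3)

# The interval is for Automatic minus Manual, so both limits are negative.
welch$conf.int
welch$p.value
```

The difference in means computed here is the same 7.245 quoted in the executive summary for the transmission-only model.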
**Pairs Plot**
The following chart plots the more significant variables against each other (as determined by the models below). We can see that many of the variables are correlated; therefore, multicollinearity may be an issue. The chart is color-coded by transmission type.
```{r,echo=FALSE,message=FALSE,fig.height=10,fig.width=12,cache=TRUE,warning=FALSE}
library(ggplot2)
library(GGally)
pairsGG=mtcars[,c(1,2,3,4,6,9)]
ggpairs(pairsGG,lower = list(continuous="smooth",theme_set(theme_bw())),title="mtcars",colour = "am")
```
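The correlations visible in the pairs plot can also be checked numerically. The sketch below computes the correlation matrix for the numerical variables in the plot and lists the pairs whose absolute correlation exceeds 0.8. That threshold is an arbitrary assumption chosen for illustration, not a value from this report.

```r
# Numerical variables shown in the pairs plot (cyl and am are factors there).
num_vars <- mtcars[, c("mpg", "disp", "hp", "wt")]
cor_mat <- cor(num_vars)
round(cor_mat, 2)

# List each pair once (upper triangle) where |r| exceeds the assumed cutoff.
cutoff <- 0.8
high <- which(abs(cor_mat) > cutoff & upper.tri(cor_mat), arr.ind = TRUE)
data.frame(
  var1 = rownames(cor_mat)[high[, "row"]],
  var2 = colnames(cor_mat)[high[, "col"]],
  r    = round(cor_mat[high], 2)
)
```

Strongly correlated predictors such as disp and wt carry overlapping information, which is one reason a stepwise search may keep only one of them.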
**Building the Model**
We can now turn to building the regression model for this dataset. We will first use multiple linear regression and the "stepAIC" function from the R package MASS, which performs a stepwise search for the model with the lowest AIC (Akaike Information Criterion). With mpg as the outcome variable and all other variables except 1/4 mile time (qsec) as the explanatory variables, we arrive at the model:
```{r,echo=FALSE,eval=FALSE}
library(MASS)
library(xtable)
fit= lm(mpg ~ cyl + disp + hp+drat+wt+vs+am+gear+carb, data=mtcars)
step = stepAIC(fit, direction="both")
```
```{r,echo=FALSE}
fitAIC= lm(mpg ~ cyl + hp+wt+am, data=mtcars)
sumAIC=summary(fitAIC)
round(fitAIC$coefficients,3)
```
and the coefficients' corresponding p-values:
```{r,echo=FALSE}
p.values=round(summary(fitAIC)$coefficients[,4],5)
sumAIC=summary(fitAIC)
p.values
```
The model here is: mpg ~ cyl + hp + wt + am.
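The stepwise search itself is not evaluated in this report (its chunk is set to eval=FALSE). A minimal sketch of the same idea is shown below, using the base R function `step()` as a stand-in for `MASS::stepAIC` so that no extra package is needed. The names `cars`, `full` and `selected` are illustrative.

```r
# Local copy without qsec, with categorical variables coded as factors.
cars <- subset(mtcars, select = -qsec)
factor_cols <- c("cyl", "vs", "gear", "carb")
cars[factor_cols] <- lapply(cars[factor_cols], factor)
cars$am <- factor(cars$am, labels = c("Automatic", "Manual"))

# Start from the full model and let step() add or drop terms by AIC.
full <- lm(mpg ~ ., data = cars)
selected <- step(full, direction = "both", trace = 0)

# Formula chosen by the search, and the AIC before and after.
formula(selected)
c(full = AIC(full), selected = AIC(selected))
```

Setting `trace = 0` suppresses the step-by-step log; leaving it at its default shows which term is added or dropped at each stage.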
The r^2^ for this model is `r round(sumAIC$r.squared,3)`, which means that this model explains `r round(sumAIC$r.squared,3)*100`% of the variation in mpg. The model, arrived at by AIC, includes the variable am, or transmission type. The coefficient for am is `r round(fitAIC$coefficients["amManual"],3)`, which we can interpret as follows: with the other variables held constant, a change from automatic to manual transmission is associated with an increase in mpg of `r round(fitAIC$coefficients["amManual"],3)`. However, we must note the p-value associated with the transmission type variable. It is `r round(sumAIC$coefficients["amManual","Pr(>|t|)"],3)`, which is well above our usual 0.05 significance level.
If we remove transmission type from this model and refit it, we obtain the model mpg ~ cyl + hp + wt. The coefficients and corresponding p-values are:
```{r echo=FALSE}
fitNOam=lm(mpg~ cyl+hp+wt,data=mtcars)
sumNOam=summary(fitNOam)
round(fitNOam$coefficients,3)
p.valuesNOam=round(sumNOam$coefficients[,4],5)
p.valuesNOam
pANOVA1=round(anova(fitAIC,fitNOam)$"Pr(>F)"[2],3)
fitNOhp=lm(mpg~cyl+wt,data=mtcars)
sumNOhp=summary(fitNOhp)
pANOVA2=round(anova(fitNOam,fitNOhp)$"Pr(>F)"[2],3)
pANOVA3=round(anova(fitAIC,fitNOhp)$"Pr(>F)"[2],3)
```
An ANOVA test between this model and the previous one results in a p-value of `r pANOVA1`, which is above 0.05, so we should choose the simpler model. We can simplify the model further by removing the hp (horsepower) variable, since the p-value for its coefficient is above 0.05. This results in the model mpg ~ cyl + wt. ANOVA tests of this model against the previous two models (mpg ~ cyl + hp + wt and mpg ~ cyl + hp + wt + am) resulted in p-values of `r pANOVA2` and `r pANOVA3`, respectively. Both are above 0.05, so we can choose the simpler model.
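To make the ANOVA comparison concrete, the sketch below computes the F-test for two nested models by hand from their residual sums of squares and checks it against `anova()`. The object names (`cars`, `big`, `small`) are illustrative and the data are rebuilt locally so the snippet runs on its own.

```r
cars <- transform(mtcars,
                  cyl = factor(cyl),
                  am  = factor(am, labels = c("Automatic", "Manual")))

big   <- lm(mpg ~ cyl + hp + wt + am, data = cars)  # model chosen by AIC
small <- lm(mpg ~ cyl + hp + wt, data = cars)       # same model without am

# Residual sums of squares and residual degrees of freedom.
rss_big   <- deviance(big)
rss_small <- deviance(small)
df_big    <- df.residual(big)
df_small  <- df.residual(small)

# F = (drop in RSS per extra parameter) / (residual variance of larger model)
f_stat <- ((rss_small - rss_big) / (df_small - df_big)) / (rss_big / df_big)
p_manual <- pf(f_stat, df_small - df_big, df_big, lower.tail = FALSE)

# The same p-value as reported by anova().
p_anova <- anova(small, big)[["Pr(>F)"]][2]
c(manual = p_manual, anova = p_anova)
```

A large p-value means the extra term does not reduce the residual sum of squares by more than chance would explain, which is the basis for preferring the simpler model.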
The final model is:
```{r,echo=FALSE}
summary(fitNOhp)
```
Here, all of the coefficients are highly significant. This model has an r^2^ of `r round(sumNOhp$r.squared,3)`, which means that it explains `r round(sumNOhp$r.squared,3)*100`% of the variation in mpg. This is very close to the r^2^ of our original AIC model, which has two more variables. Note that transmission type is not included in the final model.
The plots below are the diagnostic plots for the model. There do not appear to be any problems with these plots: the residuals appear randomly scattered, the standardized residuals appear normally distributed, and there are no highly influential outliers.
```{r,echo=FALSE,fig.height=9,fig.width=13}
layout(matrix(c(1,2,3,4),2,2)) # optional 4 graphs/page
plot(fitNOhp,pch=16)
```
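The visual checks can be complemented with numerical ones. The sketch below runs a Shapiro-Wilk test on the standardized residuals and lists observations whose Cook's distance exceeds 4/n. That cutoff is a common rule of thumb and an assumption here, not something used in this report; points it flags are candidates for a closer look rather than evidence of a problem.

```r
cars <- transform(mtcars, cyl = factor(cyl))
fit <- lm(mpg ~ cyl + wt, data = cars)  # same formula as the final model

# Normality of the standardized residuals (null hypothesis: normal).
shapiro.test(rstandard(fit))

# Influence of each observation on the fitted coefficients.
cd <- cooks.distance(fit)
cutoff <- 4 / nrow(cars)  # assumed rule-of-thumb threshold
sort(cd[cd > cutoff], decreasing = TRUE)
```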
**Model with only Transmission Type**
```{r echo=FALSE}
fitT=lm(mpg~am,data=mtcars) # transmission-only model; compare its R^2 with that of fitNOhp
sumfitT=summary(fitT)
pANOVA4=round(anova(fitT,fitNOhp)$"Pr(>F)"[2],10)
```
The project asks us about two specific points:
1. Is an automatic or manual transmission better for MPG?
2. Quantify the MPG difference between automatic and manual transmissions
We have seen in the exploratory data analysis section that manual transmissions are generally better for mpg. To quantify the difference, a linear model was fit using only transmission type (am) as an explanatory variable. This model produced an r^2^ of `r round(sumfitT$r.squared,3)`, which means that it explains `r round(sumfitT$r.squared,3)*100`% of the variation in mpg, much less than the final model arrived at above. The coefficient is significant, at `r round(fitT$coefficients[2],3)`, which we can interpret as follows: a change from automatic to manual transmission increases mpg by `r round(fitT$coefficients[2],3)` on average. This suggests that transmission type does have an impact on mpg; however, when the previous models are considered, other variables are much more important than transmission type in explaining the variation in mpg. An ANOVA between this model and the final model above results in a p-value of less than 0.0000001. Note that these two models are not nested (the final model does not contain am), so this F-test should be read as a rough indication only; the difference in r^2^ is the more direct comparison. We prefer the more complex model, as it explains a much higher proportion of the variation in mpg than the transmission-only model.
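With a single two-level factor as the only predictor, the regression reduces to a comparison of group means: the intercept is the mean mpg of the reference level (automatic) and the am coefficient is the difference between the two means. A short sketch, with illustrative object names:

```r
cars <- transform(mtcars, am = factor(am, labels = c("Automatic", "Manual")))
fit_am <- lm(mpg ~ am, data = cars)

# Group means computed directly from the data.
means <- tapply(cars$mpg, cars$am, mean)

# Intercept = mean of Automatic; amManual = Manual mean - Automatic mean.
coef(fit_am)
means
unname(means["Manual"] - means["Automatic"])

# Share of the variation in mpg explained by transmission type alone.
r2_am <- summary(fit_am)$r.squared
round(r2_am, 2)
```

This is why the 7.245 figure is an unadjusted difference: it does not hold weight or number of cylinders constant.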
**Relative Importance of Variables**
For our last analysis we will look at Bootstrap Measures of Relative Importance of explanatory variables. This is a nice way to interpret the importance of variables with regard to their contribution to the r^2^ of a model. We used the model fit by AIC which includes the variables cylinder, horsepower, weight, and transmission type. The results are shown in the plot below. We can clearly see that am, transmission type, is the least important of these variables contributing only about 14% to the r^2^.
```{r,echo=FALSE,cache=TRUE,warning=FALSE,fig.height=9,fig.width=13}
library(relaimpo)
boot <- boot.relimp(fitAIC, b = 500, type = c("lmg"), rank = TRUE,diff = TRUE, rela = TRUE)
boot2=booteval.relimp(boot) # print result
plot(booteval.relimp(boot,sort=TRUE)) # plot result
```
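The "lmg" measure averages each variable's contribution to r^2^ over every possible order in which the variables can enter the model. The sketch below computes that average by brute force in base R, without the bootstrap confidence intervals that relaimpo adds. It is an illustration of the idea under these assumptions, not a reimplementation of the package, and the helper names (`r2`, `perms`) are hypothetical.

```r
cars <- transform(mtcars,
                  cyl = factor(cyl),
                  am  = factor(am, labels = c("Automatic", "Manual")))
terms <- c("cyl", "hp", "wt", "am")

# r^2 of the model containing the given terms (0 for the empty model).
r2 <- function(vars) {
  if (length(vars) == 0) return(0)
  summary(lm(reformulate(vars, response = "mpg"), data = cars))$r.squared
}

# All orderings of a vector, built recursively.
perms <- function(v) {
  if (length(v) <= 1) return(list(v))
  out <- list()
  for (i in seq_along(v)) {
    for (p in perms(v[-i])) out[[length(out) + 1]] <- c(v[i], p)
  }
  out
}

# For every ordering, credit each term with the r^2 gained when it enters.
lmg <- setNames(numeric(length(terms)), terms)
all_orders <- perms(terms)
for (ord in all_orders) {
  for (k in seq_along(ord)) {
    gain <- r2(ord[seq_len(k)]) - r2(ord[seq_len(k - 1)])
    lmg[ord[k]] <- lmg[ord[k]] + gain
  }
}
lmg <- lmg / length(all_orders)

# Shares of the total r^2 (comparable to rela = TRUE in boot.relimp).
share <- lmg / sum(lmg)
round(sort(share, decreasing = TRUE), 3)
```

The averaged contributions sum to the r^2^ of the full model, so the shares can be read as each variable's portion of the explained variation.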
**Conclusion**
We have determined that there is a difference in mpg in relation to transmission type and have quantified that difference. However, transmission type does not appear to be a very good explanatory variable for mpg; weight, horsepower, and number of cylinders are all more significant variables.
**R-code**
Available at: https://github.com/SloughJE/Cousera-Regression-Models