-
Notifications
You must be signed in to change notification settings - Fork 0
/
03_Practical_Multiple_Linear_Regression.Rmd
executable file
·529 lines (330 loc) · 15 KB
/
03_Practical_Multiple_Linear_Regression.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
---
title: "03_Practical; Multiple Linear Regression"
author: "Ryan Greenup 17805315"
date: "12 March 2019"
output:
html_document:
code_folding: hide
keep_md: yes
theme: flatly
toc: yes
toc_depth: 4
toc_float: yes
pdf_document:
toc: yes
always_allow_html: yes
##Shiny can be good but {.tabset} will be more compatible with PDF
##but you can submit HTML in turnitin so it doesn't really matter.
##If a floating toc is used in the document only use {.tabset} on more or less copy/pasted
#sections with different datasets
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
load.pac <- function() {
if(require('pacman')){
library('pacman')
}else{
install.packages('pacman')
library('pacman')
}
pacman::p_load(xts, sp, gstat, ggplot2, rmarkdown, reshape2, ggmap,
parallel, dplyr, plotly, tidyverse, mise, corrplot, scales)
}
if (!exists("execFlag")) {
load.pac()
execFlag <- FALSE
}
```
# Multiple Linear Regression
Material of Tue 19 March2019, week 3
## Questoin 01 - Multiple Linear Regression
### (a) Upload the data “Advertising.csv” and explore it.
First import the data and investigate it:
```{r}
adv <- read.csv(file = "Datasets/Advertising.csv", header = TRUE, sep = ",")
```
```{r}
head(adv)
writeLines("\n")
print("***Dimensions***")
writeLines("\n")
dim(adv)
writeLines("\n")
print("***Summary***")
writeLines("\n")
summary(adv)
writeLines("\n")
print("***Structure***")
writeLines("\n")
str(adv)
writeLines("\n")
```
Fom this we can tell that there is one output, with 3 input values and 200 Observations.
### (b) Find the Covariance and Correlation Matrix of Sales, TV, Radio and Newspaper. {.tabset}
#### Base Packages
```{r}
pairs(x = adv)
```
#### corrplot
In order to use `corrplot` first create a correlation matrix using `cor(adv)` then feed that matrix to `corrplot` with the command `corrplot(cor(adv))`
```{r}
#coriris <- cor(iris[,!(names(iris) == "Species")])
#corrplot(method = 'ellipse', type = 'lower', corr = coriris)
corMat <- cor(x = adv)
corrplot(corr = corMat, method = "ellipse", type = "lower")
```
<!-- regular html comment -->
<!-- use </div> to escape {.tabset} don't use `## `; it will fuck your TOC-->
</div>
From this we can see that there is a significant amount of correlation between Radio and newspaper (moreso even that newspaper and sales), we should consider this variable interaction when decide upon our model
### (c) Construct the multiple linear regression model and find the least square estimates of the model
A multiple linear regression would give the model:
```{r}
advModMult <- lm(formula = Sales ~ TV + Radio + Newspaper, data = adv)
advModMult
```
This gives that the appropriate linear model is:
$$
Y_{Sales} = 0.0458 \times \text{TV} + 0.19 \times \text{Radio} - 0.001 \times \text{Newspaper}
$$
the fact that Newspaper has a negative coefficient despite being positively correlated with sales is indicative of the weak effect newspaper advertising has on sales as well as the interaction between newspaper and sales.
from this it can be
### (d) Test the significance of the parameters and find the resulting model to model Sales in terms of advertising modes, TV, Radio and Newspaper.
First Summarise the Model:
```{r}
advModMult %>% summary()
```
A summary of the model provides that given the hypothesis test:
$$
H_0: \enspace \beta_i = 0\\
H_a: \enspace \beta_i \neq 0 \\
\qquad \qquad \qquad \qquad \forall i \in \mathbb{N}
$$
There would be an extremely low probability of incorrectly rejecting the null hypothesis that the given a linear model the coefficients would be zero, hence it is accepted that the coefficients are non-zero except for newspaper advertising.
There is a high probability of incorrectly rejecting the null-hypothesis and hence that should not be rejected and the nespaper advertising should not be seen as significant predictor for sales in this linear model.
It is hence approprate to remove, via backwards selection, Newspaper from the model, which gives:
```{r}
## Calculate Power??
# n <- length(adv)
# sigma <- sd(adv$Newspaper)
# sem = sigma/sqrt(n)
# alpha <- 0.05
# mu0 <- 0
# q <- qnorm(p = 0.005, mean = mu0, sd = sem);q
# mu <- q #assumed actual mean value
# pnorm(q, mean = mu, sd = sem, lower.tail = FALSE)
```
```{r}
advModMultb1 <- lm(formula = Sales ~ TV + Radio, data = adv)
advModMultb1
advModMultb1.sum <- summary(advModMultb1)
```
$$
\text{Sales} = 0.045 \times \text{TV} + 0.19 \times \text{Radio} + 2.9211
$$
All parameters are highly significant and hence the model is deemed adequate, the model explains `r advModMultb1.sum$r.squared %>% round(,2) %>% percent()` of the variation, this can be found by appending `$r.squared` to the model object.
### (e) Assess the overall accuracy of the model.
In order to assess the model, consider the anova table and derive the RMSE and $R^2$ values:
```{r}
advModMultb1.anova <- anova(advModMultb1)
advModMultb1.anova
```
From this we can see that that the $F$ statistic is associated with a very low p-value, these are the exact same p-values from the `summary` call.
#### RMSE
The RMSE value is the Root mean square error, it is the standard error of the residuals (recall that the standard error is the standard deviation of a model parameter), so the RMSE is basically the standard deviation of the residuals of the model (as measured along the $y$-axis).
$$
\text{RMSE} = \sqrt{\frac{\sum{\varepsilon^2}}{n}}
$$
```{r}
rmse <- function(model){
# (sum(model$residuals**2)/length(model$residuals))**0.5
sd(advModMultb1$residuals)
}
rse <- function(model){
var(model$residuals)/var(model$residuals - model$fitted.values)
}
data.frame("RMSE" = rmse(advModMultb1) , "RSE" = rse(advModMultb1) )
```
So the expected error of the model is $\pm$ `r rmse(advModMultb1) %>% signif(2)` units of sale, we could use this to create a confidence interval by multiplying by the corresponding *Student's t-statistic* and saying that we would expect an observed value to lie within $1.96 * \text{S.E}$ of the model 95% of the time (is this correct or is it the expected mean or something because of the Central Limit Theorem?).
#### Coefficient of Determination
The coefficient of determination is `r advModMultb1.sum$r.squared %>% round(4) %>% percent()`, this can be returned by extracting the value from the object by apending `$r.squared` to the model summary (i.e. use `summary(lm( Y ~ X1 + X2 )))$r.squared`).
The $R^2$ value will be very nearly the same between the initial model and the backwards selected model, however given that the initial model had non-significant predictors it could be considered as over-parameterized, i.e. it violates [Occam's Razor](https://en.wikipedia.org/wiki/Occam%27s_razor).
The coefficient of determination is the ratio of the total variance of the data that is explained by the model, in this case it could be determined by:
$$
R^2 = \frac{TSS-SSE}{TSS} = \frac{3314.6+1546+0.1}{3315+1545+0.1+556.8} = 89\%
\ \\
$$
#### Coefficient of Determination from ANOVA
The $R^2$ value is derived from the ANOVA table thusly:
```{r}
advModMultb1.anova <- anova(advModMultb1)
TSS_Multb1 <- advModMultb1.anova$`Sum Sq` %>% sum()
RSS_Multb1 <- advModMultb1$residuals^2 %>% sum()
((TSS_Multb1- RSS_Multb1)/(TSS_Multb1)) %>% signif(3) %>% percent() #Requires `scales` package
```
![](./Images/AnovaCalc.jpg)
#### Residual analysis
When determining the accuracy or performance of amodel, always turn your mind to residual analysis (i.e. how normally are they distributed), this will be performed below.
### (f) Calculate the predicted values and residuals
The residuals and fitted values may be returned by extracting them from the model (you could always derive them from first principles but you should use the tools at your disposal, it is quicker, less error prone and makes more readable code)
```{r}
ResDF <- data.frame("Input" = advModMultb1$fitted.values - advModMultb1$residuals, "Output" = advModMultb1$fitted.values, "Error" = advModMultb1$residuals)
ResDF %>% head()
```
`Predict` will also return fitted values if no `newdata` is specified.
### (g) Plot the residuals against the predicted values {.tabset}
#### ggplot
```{r}
ggplot(data = ResDF, aes(x = Output, y = Error, col = Error )) +
geom_point() +
theme_bw() +
stat_smooth(col = "grey")+
stat_smooth(method = "lm", se = 0, ) +
scale_color_gradient(low = "indianred", high = "royalblue") +
labs(y = "Residuals", x = "Predicted Value", title = "Residuals Plotted against Output")
```
#### base
```{r}
plot(Error ~ Output, data = ResDF, pch = 19, col = "IndianRed", main = "Residuals against Predictions")
smooth <- loess(formula = Error ~ Output, data = ResDF, span = 0.8)
predict(smooth) %>% lines(col = "Blue")
abline(lm(Error ~ Output, data = ResDF), col = "Purple")
```
</div>
From this it can be seen that the residuals are centred around zero, but they are not normally distributed, the model performance appears to fail normality assumptions for values $\in \left( 10, 20 \right)$
### (h) Plot the histogram of the residuals {.tabset}
#### ggplot
##### Absolute Count
```{r}
ggplot(data = ResDF, aes(x = Error, col = Output)) +
geom_histogram() +
theme_classic() +
labs(y = "Absolute Count", x = "Residual Value", title = "Residual Distribution")
```
##### Density
```{r}
df <- data.frame(x = rnorm(1000, 2, 2))
# overlay histogram and normal density
ggplot(ResDF, aes(x=Error)) +
geom_histogram(aes(y = stat(density))) +
stat_function(
fun = dnorm,
args = list(mean = mean(df$x), sd = sd(df$x)),
lwd = 2,
col = 'IndianRed'
) +
theme_classic() +
labs(title = "Histogram of Residuals", y = "Density of Observations", x= "Residuals")
```
##### White Noise
We can visualise the residuals as white noise:
###### Our Residuals
```{r}
# Put some White Noise on the ResDF
ResDF <- cbind(ResDF, "Wnoise" = rnorm(nrow(ResDF), mean = 0, sd = sd(ResDF$Error)))
## Our Residuals
ggplot(ResDF, aes(x = Input, y = Error, col = Output)) +
geom_line() +
geom_abline(slope = 0, intercept = 0, lty = "twodash", col = "IndianRed") +
theme(legend.position = "none") +
labs(title = "Distribution of Residuals")
## White Noise
ggplot(ResDF, aes(x = Input, y = Wnoise, col = Output)) +
geom_line() +
geom_abline(slope = 0, intercept = 0, lty = "twodash", col = "IndianRed") +
theme(legend.position = "none") +
labs(title = "Distribution of Residuals")
```
This clearly shows that the Residuals are non-normal.
#### Base
```{r}
hist(ResDF$Error
, breaks = 30,
prob=TRUE,
lwd=2,
main = "Residual Distribution",
xlab = "Distance from the Model", border = "purple"
)
#Overlay the Normal Dist Curve
x <- 1:100 #Stupid base package wants some f(x), this is hacky but easier than stuffing around with lines or defining a function
curve(dnorm(x, mean(ResDF$Error), sd(ResDF$Error)), add=TRUE, col="purple", lwd=2) #Draws the actual density function
#lines(density(possibleyvals_conf), col='purple', lwd=2) #Draws the observed density function
lwr_conf <- qnorm(p = 0.05, mean(ResDF$Error), sd(ResDF$Error))
upr_conf <- qnorm(p = 1-0.05, mean(ResDF$Error), sd(ResDF$Error))
abline(v=upr_conf, col='pink', lwd=3)
abline(v=lwr_conf, col='pink', lwd=3)
abline(v=mean(ResDF$Error), lwd=2, lty='dotted')
```
```{r}
lwr_conf <- qnorm(p = 0.05, mean(ResDF$Error), sd(ResDF$Error))
upr_conf <- qnorm(p = 1-0.05, mean(ResDF$Error), sd(ResDF$Error))
```
</div>
It hence appears that the Residuals skewed left, which suggests an upper bound on the residual value, the histogram is sufficiently non-normal to reject the assumption that the residuals are normally distributed
### (i) Comment on the residual plots
The Residuals can be analised by plotting the model:
```{r}
plot(advModMultb1)
```
This is an inappropriate model because the residuals are non-normally distributed.
### (j) Use the multivariate model for predictions
```{r}
newdata = data.frame("TV" = c(3, 1, 2), "Radio" = c(4, 5, 6))
Mod_Adv_Mult_b1 <- predict(object = advModMultb1, newdata)
mypreds <- data.frame("Input" = newdata, "Output" = Mod_Adv_Mult_b1)
mypreds # Careful with the names, they workout as Input.name in this case.
```
## Question 02: Non Linear Models: Use Advertising data set
### (a) Add the Interaction Term TV*Radio and test the significance of the interaction term {.tabset}
reconsider the correlationplots:
#### Base
```{r}
pairs(x = adv)
```
### corrplot
```{r}
adv[, names(adv)!="Newspaper"] %>% cor() %>% corrplot(method = "ellipse", type = "lower")
```
</div>
From this we can determine that there is no interaction between TV and radio (however previously there was interaction between newspaper and Radio, we should consider that next)
#### Create the MLReg
```{r}
int_mod_adv <- lm(formula = Sales ~ TV * Radio + TV + Radio, data = adv)
int_mod_adv
summary(int_mod_adv)
```
### (b) Give the resulting model after considering this interaction term.
The model will be of the form:
$$
\text{Sales} = 0.0019 \times \text{TV} \times + \text{Radio} \times 0.002886 + \text{TV} \times \text{Radio} \times 0.001086
$$
Note that all the terms of the model are significant, hence we deem the model as adequate.
### (c) Construct the Polynomial Regression Model of order 3 and test the model significance
When creating polynomial models there are raw and orthogonal polynomials,
* A raw polynomial will be the the standard System of equations that you get by minimising the RSS
- The problem with this is that the values will be correlated with each other
* An orthogonal polynomial is a transformed but equivalent polynomial
- The advantage to this is that the p-value's will be moremeaningful because the coefficients aren't correlated with each other
- The disadvantage is that the values are not directly related to our data and are not hence meaningful.
```{r}
# Orthogonal (fixes the correlation between the coefficients))
polymodel_Orth <- lm(Sales ~ poly(x = TV, degree = 3, raw = FALSE), data = adv)
summary(polymodel_Orth)
# Pure/Raw (has the advantage that you can interpret the coefficients, but the coefficients will depend on each other and be hence correlated.)
polymodel <- lm(Sales ~ I(TV*TV*TV) + I(TV*TV) + (TV), data = adv)
summary(polymodel) #I is used to inhibit the interpretation of * as relating to the model,
# Instead of representing interaction it represents TV^3
polymodel_Raw<- lm(Sales ~ poly(x = TV, degree = 3, raw = TRUE), data = adv)
summary(polymodel_Raw)
```
The 3rd degree coefficient is not significant, hence we consider the 2nd degree:
```{r}
quadmod <- lm(formula = Sales ~ I(TV*TV) + TV, data = adv)
summary(quadmod)
```
The quadratic term is not sufficiently significant so the model is rejected
### (d) Give the resulting selected model
The model selected is the multiple linear regression with the interaction from before:
$$
\text{Sales} = 0.0019 \times \text{TV} \times + \text{Radio} \times 0.002886 + \text{TV} \times \text{Radio} \times 0.001086
$$