SUBAM KATHET
This course was recommended by a friend who is in the master's program in data science at the University of Helsinki. Data management, mining, data analytics and machine learning, among many other skills in the same sphere, are the next-generation skill set everyone is recommended to acquire, and here I am. I am a bit nervous, very excited and mostly curious to take a deep dive into the world of data science.
Here is the link to my GitHub webpage.
https://iamsubam.github.io/IODS-project/
And here is the link to my course diary.
I have only had some time to browse through R for Health Data Science. Coming from a background in experimental epidemiology, I was immediately drawn to linear and logistic regression, because this is something I often rely on in my work. I also looked briefly into survival analysis out of interest and found it quite interesting, although I need to practice a lot before I can get my head around the analysis. I think the book gives a great overview of the essential statistical analyses required at a fundamental level. Some knowledge of statistics is a great advantage, as R itself already has a steep learning curve.
+# This is a so-called "R chunk" where you can write R code.
+
+date()
+## [1] "Mon Nov 14 15:29:30 2022"
# Checking whether the chunk works. It is usually a struggle, especially when the R version is outdated and the packages don't work; reinstalling and updating the packages and then coming back to the work is a big effort.
This set consists of a few numbered exercises. Go to each exercise in turn and do as follows:

One or more extra packages (in addition to tidyverse) will be needed below.
# Select (with mouse or arrow keys) the install.packages("...") and
+# run it (by Ctrl+Enter / Cmd+Enter):
+
+library(dplyr)
+##
+## Attaching package: 'dplyr'
+## The following objects are masked from 'package:stats':
+##
+## filter, lag
+## The following objects are masked from 'package:base':
+##
+## intersect, setdiff, setequal, union
+library(tidyverse)
+## ── Attaching packages
+## ───────────────────────────────────────
+## tidyverse 1.3.2 ──
+## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
+## ✔ tibble 3.1.8 ✔ stringr 1.4.1
+## ✔ tidyr 1.2.0 ✔ forcats 0.5.2
+## ✔ readr 2.1.2
+## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
+## ✖ dplyr::filter() masks stats::filter()
+## ✖ dplyr::lag() masks stats::lag()
+library(GGally)
+## Registered S3 method overwritten by 'GGally':
+## method from
+## +.gg ggplot2
+library(ggplot2)
+library(gapminder)
+library(finalfit)
+library(broom)
The first step of data analysis with R is reading data into R. This is done with a function. Which function and which function arguments to use depends on the original format of the data.

Conveniently, in R the same functions for reading data can usually be used whether the data is saved locally on your computer or somewhere else behind a web URL.

After the correct function has been identified and the data read into R, the data will usually be in R data.frame format. The dimensions of a data frame are \((n, d)\), where \(n\) is the number of rows (the observations) and \(d\) is the number of columns (the variables).
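As a quick illustration (a minimal sketch; the file name and URL below are placeholders, not course material), the same reader function works for a local file and for data behind a web URL, and dim() and str() reveal the dimensions and structure of the resulting data frame:

```r
# read a hypothetical csv file, either from the working directory or from a URL
local_data <- read.csv("my_local_data.csv")
web_data   <- read.csv("https://example.com/my_remote_data.csv")

dim(local_data)  # (n, d): number of rows and number of columns
str(local_data)  # column names, types and a preview of the values
```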
The purpose of this course is to expose you to some basic and more advanced tasks of programming and data analysis with R.
Read the lrn14 data frame into memory with read.table(). There is information related to the data here.

Use dim() on the data frame to look at the dimensions of the data. How many rows and columns does the data have?

Look at the structure of the data with str().

Hint: - For both functions you can pass a data frame as the first (unnamed) argument.
+# This is a code chunk in RStudio editor.
+# Work with the exercise in this chunk, step-by-step. Fix the R code!
+
+# read the data into memory
+lrn14 <- read.table("http://www.helsinki.fi/~kvehkala/JYTmooc/JYTOPKYS3-meta.txt", sep="\t", header=TRUE)
+## Warning in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
+## EOF within quoted string
+# Look at the dimensions of the data
+
+# Look at the structure of the data
# The .txt file imported above is the meta-data file, used for a better description of the data set.
# Preliminary results available at http://www.slideshare.net/kimmovehkalahti/the-relationship-between-learning-approaches-and-students-achievements-in-an-introductory-statistics-course-in-finland
# The actual survey has 183 respondents and 60 question columns; the meta file read here has 28 rows of a single text column, as dim() below shows.
# In the survey data, each column heading is a code for a survey question, each row is a respondent, and the answers are given on a Likert scale.
+
+dim(lrn14)
+## [1] 28 1
+str(lrn14)
+## 'data.frame': 28 obs. of 1 variable:
+## $ Variable.descriptions.and.other.meta.data.of.JYTOPKYS3..syksy.2016.: chr "Kimmo Vehkalahti: ASSIST 2014 - Phase 3 (end of Part 2), N=183" "Course: Johdatus yhteiskuntatilastotieteeseen, syksy 2014" "(Introduction to Social Statistics, fall 2014 - in Finnish)," "international survey of Approaches to Learning, made possible" ...
The next step is wrangling the data into a format that is easy to analyze. We will wrangle our data for the next few exercises.

A neat thing about R is that many operations are vectorized: a single operation can affect all elements of a vector. This is often convenient.
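As a toy example of vectorization (hypothetical values, not the course data), dividing a numeric vector by a scalar divides every element at once:

```r
# vectorized arithmetic: the division is applied to every element, no loop needed
attitude_sums <- c(37, 31, 25, 35)  # hypothetical sums of 10 Likert-scale questions
attitude_sums / 10                  # returns 3.7 3.1 2.5 3.5
```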
The column Attitude in lrn14 is a sum of 10 questions related to students' attitude towards statistics, each measured on the Likert scale (1-5). Here we'll scale the combination variable back to the 1-5 scale.

Create a new column attitude in the lrn14 data frame, where each observation of Attitude is scaled back to the original scale of the questions by dividing it by the number of questions.

Hint: - Assign 'Attitude divided by 10' to the new column 'attitude'.
+lrn14$attitude <- lrn14$Attitude / 10
Our data includes many questions that can be thought to measure the same dimension. You can read more about the data and the variables here. Here we'll combine multiple questions into combination variables. Useful functions for summation with data frames in R are:

| function | description |
|---|---|
| colSums(df) | returns a sum of each column in df |
| rowSums(df) | returns a sum of each row in df |
| colMeans(df) | returns the mean of each column in df |
| rowMeans(df) | returns the mean of each row in df |

We'll combine the use of rowMeans() with the select() function from the dplyr library to average the answers of selected questions. See how it is done in the example code.
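To see what these functions do, here is a toy example (hypothetical values, not the course data) of summing columns and averaging rows of a small data frame:

```r
# three hypothetical Likert-scale answers per student
toy <- data.frame(q1 = c(3, 5, 2), q2 = c(4, 4, 1), q3 = c(2, 5, 3))

colSums(toy)   # sum of each column (each question over all students)
rowMeans(toy)  # mean of each row, i.e. one combination score per student
```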
Create the combination variable 'deep' in lrn14 by averaging the columns related to deep learning.

Create the combination variable 'surf' in lrn14 by averaging the columns related to surface learning.

Create the combination variable 'stra' in lrn14 by averaging the columns related to strategic learning.

Hints: - Columns related to strategic learning are in the object strategic_questions. Use it for selecting the correct columns. - Use the function rowMeans() identically to the examples.
# Work with the exercise in this chunk, step-by-step. Fix the R code!
+# lrn14 is available
+
+# Access the dplyr library
+library(dplyr)
+
+# questions related to deep, surface and strategic learning
+deep_questions <- c("D03", "D11", "D19", "D27", "D07", "D14", "D22", "D30","D06", "D15", "D23", "D31")
+surface_questions <- c("SU02","SU10","SU18","SU26", "SU05","SU13","SU21","SU29","SU08","SU16","SU24","SU32")
+strategic_questions <- c("ST01","ST09","ST17","ST25","ST04","ST12","ST20","ST28")
+
+
+# select the columns related to deep learning
+deep_columns <- select(lrn14, one_of(deep_questions))
+# and create column 'deep' by averaging
+lrn14$deep <- rowMeans(deep_columns)
+
+# select the columns related to surface learning
+surface_columns <- select(lrn14, one_of(surface_questions))
+# and create column 'surf' by averaging
+lrn14$surf <- rowMeans(surface_columns)
+
+# select the columns related to strategic learning
+strategic_columns <- select(lrn14, one_of(strategic_questions))
+# and create column 'stra' by averaging
+lrn14$stra <- rowMeans(strategic_columns)
Often it is convenient to work with only a certain column or a subset of columns of a bigger data frame. There are many ways to select columns of a data frame in R, and you saw one of them in the previous exercise: select() from dplyr.

dplyr is a popular library for data wrangling. There are also convenient data wrangling cheatsheets by RStudio to help you get started (dplyr, tidyr etc.).

Create an object keep_columns. Then use select() (possibly together with one_of()) to create a new data frame learning2014 with the columns named in keep_columns.

Hint: - See the previous exercise or the data wrangling cheatsheet for help on how to select columns.
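For reference, a few equivalent ways to access or select columns, sketched here on lrn14 (assuming the full survey data with the gender and Age columns has been read in):

```r
library(dplyr)

# a single column as a vector
genders <- lrn14$gender

# several columns as a data frame, with base R bracket notation
gender_age <- lrn14[, c("gender", "Age")]

# the same selection with dplyr
gender_age2 <- select(lrn14, gender, Age)
```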
+# Work with the exercise in this chunk, step-by-step. Fix the R code!
+
+# lrn14 is available
+
+# access the dplyr library
+library(dplyr)
+
+# choose a handful of columns to keep
+keep_columns <- c("gender","Age","attitude", "deep", "stra", "surf", "Points")
+
+# select the 'keep_columns' to create a new dataset
+learning2014 <- select(lrn14,all_of(keep_columns))
+
+# see the structure of the new dataset
+
+print(learning2014)
+## gender Age attitude deep stra surf Points
+## 1 F 53 3.7 3.583333 3.375 2.583333 25
+## 2 M 55 3.1 2.916667 2.750 3.166667 12
+## 3 F 49 2.5 3.500000 3.625 2.250000 24
+## 4 M 53 3.5 3.500000 3.125 2.250000 10
+## 5 M 49 3.7 3.666667 3.625 2.833333 22
+## 6 F 38 3.8 4.750000 3.625 2.416667 21
+## 7 M 50 3.5 3.833333 2.250 1.916667 21
+## 8 F 37 2.9 3.250000 4.000 2.833333 31
+## 9 M 37 3.8 4.333333 4.250 2.166667 24
+## 10 F 42 2.1 4.000000 3.500 3.000000 26
+## 11 M 37 3.9 3.583333 3.625 2.666667 31
+## 12 F 34 3.8 3.833333 4.750 2.416667 31
+## 13 F 34 2.4 4.250000 3.625 2.250000 23
+## 14 F 34 3.0 3.333333 3.500 2.750000 25
+## 15 M 35 2.6 4.166667 1.750 2.333333 21
+## 16 F 46 2.5 3.083333 3.125 2.666667 0
+## 17 F 33 4.1 3.666667 3.875 2.333333 31
+## 18 F 32 2.6 4.083333 1.375 2.916667 20
+## 19 F 44 2.6 3.500000 3.250 2.500000 22
+## 20 M 29 1.7 4.083333 3.000 3.750000 9
+## 21 F 30 2.7 4.000000 3.750 2.750000 24
+## 22 M 27 3.9 3.916667 2.625 2.333333 28
+## 23 M 29 3.4 4.000000 2.375 2.416667 30
+## 24 F 31 2.7 4.000000 3.625 3.000000 24
+## 25 F 37 2.3 3.666667 2.750 2.416667 9
+## 26 F 26 3.7 3.666667 1.750 2.833333 26
+## 27 F 26 4.4 4.416667 3.250 3.166667 32
+## 28 M 30 4.1 3.916667 4.000 3.000000 32
+## 29 F 37 2.4 3.833333 2.125 2.166667 0
+## 30 F 33 3.7 3.750000 3.625 2.000000 33
+## 31 F 33 2.5 3.250000 2.875 3.500000 29
+## 32 M 28 3.0 3.583333 3.000 3.750000 30
+## 33 M 26 3.4 4.916667 1.625 2.500000 19
+## 34 F 27 3.2 3.583333 3.250 2.083333 23
+## 35 F 25 2.0 2.916667 3.500 2.416667 19
+## 36 F 31 2.4 3.666667 3.000 2.583333 12
+## 37 M 20 4.2 4.500000 3.250 1.583333 10
+## 38 F 39 1.6 4.083333 1.875 2.833333 11
+## 39 M 38 3.1 3.833333 4.375 1.833333 20
+## 40 M 24 3.8 3.250000 3.625 2.416667 26
+## 41 M 26 3.8 2.333333 2.500 3.250000 31
+## 42 M 25 3.3 3.333333 1.250 3.416667 20
+## 43 F 30 1.7 4.083333 4.000 3.416667 23
+## 44 F 25 2.5 2.916667 3.000 3.166667 12
+## 45 M 30 3.2 3.333333 2.500 3.500000 24
+## 46 F 48 3.5 3.833333 4.875 2.666667 17
+## 47 F 24 3.2 3.666667 5.000 2.416667 29
+## 48 F 40 4.2 4.666667 4.375 3.583333 23
+## 49 M 25 3.1 3.750000 3.250 2.083333 28
+## 50 F 23 3.9 3.416667 4.000 3.750000 31
+## 51 F 25 1.9 4.166667 3.125 2.916667 23
+## 52 F 23 2.1 2.916667 2.500 2.916667 25
+## 53 M 27 2.5 4.166667 3.125 2.416667 18
+## 54 M 25 3.2 3.583333 3.250 3.000000 19
+## 55 M 23 3.2 2.833333 2.125 3.416667 22
+## 56 F 23 2.6 4.000000 2.750 2.916667 25
+## 57 F 23 2.3 2.916667 2.375 3.250000 21
+## 58 F 45 3.8 3.000000 3.125 3.250000 9
+## 59 F 22 2.8 4.083333 4.000 2.333333 28
+## 60 F 23 3.3 2.916667 4.000 3.250000 25
+## 61 M 21 4.8 3.500000 2.250 2.500000 29
+## 62 M 21 4.0 4.333333 3.250 1.750000 33
+## 63 F 21 4.0 4.250000 3.625 2.250000 33
+## 64 F 21 4.7 3.416667 3.625 2.083333 25
+## 65 F 26 2.3 3.083333 2.500 2.833333 18
+## 66 F 25 3.1 4.583333 1.875 2.833333 22
+## 67 F 26 2.7 3.416667 2.000 2.416667 17
+## 68 M 21 4.1 3.416667 1.875 2.250000 25
+## 69 F 22 3.4 3.500000 2.625 2.333333 0
+## 70 F 23 3.4 3.416667 4.000 2.833333 28
+## 71 F 22 2.5 3.583333 2.875 2.250000 22
+## 72 F 22 2.1 1.583333 3.875 1.833333 26
+## 73 F 22 1.4 3.333333 2.500 2.916667 11
+## 74 F 23 1.9 4.333333 2.750 2.916667 29
+## 75 M 22 3.7 4.416667 4.500 2.083333 22
+## 76 M 25 4.1 4.583333 3.500 2.666667 0
+## 77 M 23 3.2 4.833333 3.375 2.333333 21
+## 78 M 24 2.8 3.083333 2.625 2.416667 28
+## 79 F 22 4.1 3.000000 4.125 2.750000 33
+## 80 F 23 2.5 4.083333 2.625 3.250000 16
+## 81 M 22 2.8 4.083333 2.250 1.750000 31
+## 82 M 20 3.8 3.750000 2.750 2.583333 22
+## 83 M 22 3.1 3.083333 3.000 3.333333 31
+## 84 M 21 3.5 4.750000 1.625 2.833333 23
+## 85 F 22 3.6 4.250000 1.875 2.500000 26
+## 86 F 23 2.6 4.166667 3.375 2.416667 12
+## 87 M 21 4.4 4.416667 3.750 2.416667 26
+## 88 M 22 4.5 3.833333 2.125 2.583333 31
+## 89 M 29 3.2 3.333333 2.375 3.000000 19
+## 90 F 26 2.0 3.416667 1.750 2.333333 0
+## 91 F 29 3.9 3.166667 2.750 2.000000 30
+## 92 F 21 2.5 3.166667 3.125 3.416667 12
+## 93 M 28 3.3 3.833333 3.500 2.833333 17
+## 94 M 23 3.5 4.166667 1.625 3.416667 0
+## 95 F 21 3.3 4.250000 2.625 2.250000 18
+## 96 F 30 3.0 3.833333 3.375 2.750000 19
+## 97 F 21 2.9 3.666667 2.250 3.916667 21
+## 98 M 23 3.3 3.833333 3.000 2.333333 24
+## 99 F 21 3.3 3.833333 4.000 2.750000 28
+## 100 F 21 3.5 3.833333 3.500 2.750000 17
+## 101 F 20 3.6 3.666667 2.625 2.916667 18
+## 102 M 21 4.2 4.166667 2.875 2.666667 0
+## 103 M 22 3.7 4.333333 2.500 2.083333 17
+## 104 F 35 2.8 4.416667 3.250 3.583333 0
+## 105 M 21 4.2 3.750000 3.750 3.666667 23
+## 106 M 22 2.2 2.666667 2.000 3.416667 0
+## 107 M 21 3.2 4.166667 3.625 2.833333 26
+## 108 F 20 5.0 4.000000 4.125 3.416667 28
+## 109 M 22 4.7 4.000000 4.375 1.583333 31
+## 110 F 20 3.6 4.583333 2.625 2.916667 27
+## 111 F 20 3.6 3.666667 4.000 3.000000 25
+## 112 M 24 2.9 3.666667 2.750 2.916667 23
+## 113 F 20 3.5 3.833333 2.750 2.666667 21
+## 114 F 19 4.0 2.583333 1.375 3.000000 27
+## 115 F 21 3.5 3.500000 2.250 2.750000 28
+## 116 F 21 3.2 3.083333 3.625 3.083333 23
+## 117 F 22 2.6 4.250000 3.750 2.500000 21
+## 118 F 25 2.0 3.166667 4.000 2.333333 25
+## 119 F 21 2.7 3.083333 3.125 3.000000 11
+## 120 F 22 3.2 4.166667 3.250 3.000000 19
+## 121 F 25 3.3 2.250000 2.125 4.000000 24
+## 122 F 20 3.9 3.333333 2.875 3.250000 28
+## 123 M 24 3.3 3.083333 1.500 3.500000 21
+## 124 F 20 3.0 2.750000 2.500 3.500000 24
+## 125 M 21 3.7 3.250000 3.250 3.833333 24
+## 126 F 26 1.4 3.750000 3.250 3.083333 0
+## 127 M 28 3.0 4.750000 2.500 2.500000 0
+## 128 F 20 2.5 4.000000 3.625 2.916667 20
+## 129 F 20 2.9 3.583333 3.875 2.166667 19
+## 130 M 31 3.9 4.083333 3.875 1.666667 30
+## 131 F 20 3.6 4.250000 2.375 2.083333 22
+## 132 F 22 2.9 3.416667 3.000 2.833333 16
+## 133 F 22 2.1 3.083333 3.375 3.416667 16
+## 134 M 21 3.1 3.500000 2.750 3.333333 19
+## 135 F 20 2.4 3.916667 3.125 3.500000 0
+## 136 M 22 4.0 3.666667 4.500 2.583333 30
+## 137 F 21 3.1 4.250000 2.625 2.833333 23
+## 138 F 21 2.3 4.250000 2.750 3.333333 19
+## 139 F 21 2.8 3.833333 3.250 3.000000 18
+## 140 F 21 3.7 4.416667 4.125 2.583333 28
+## 141 F 20 2.6 3.500000 3.375 2.416667 21
+## 142 F 21 2.4 3.583333 2.750 3.583333 19
+## 143 F 25 3.0 3.666667 4.125 2.083333 27
+## 144 F 27 2.9 4.250000 3.000 2.750000 0
+## 145 F 20 3.2 3.750000 4.125 3.916667 0
+## 146 M 21 2.8 2.083333 3.250 4.333333 24
+## 147 F 24 2.9 4.250000 2.875 2.666667 21
+## 148 F 20 2.4 3.583333 2.875 3.000000 20
+## 149 M 21 3.1 4.000000 2.375 2.666667 28
+## 150 F 20 1.9 3.333333 3.875 2.166667 12
+## 151 F 20 2.0 3.500000 2.125 2.666667 21
+## 152 F 18 3.8 3.166667 4.000 2.250000 28
+## 153 F 21 3.4 3.583333 3.250 2.666667 31
+## 154 F 19 3.7 3.416667 2.625 3.333333 18
+## 155 F 21 2.9 4.250000 2.750 3.500000 25
+## 156 F 20 2.3 3.250000 4.000 2.750000 19
+## 157 M 21 4.1 4.416667 3.000 2.000000 21
+## 158 F 20 2.7 3.250000 3.375 2.833333 16
+## 159 F 21 3.5 3.916667 3.875 3.500000 7
+## 160 F 20 3.4 3.583333 3.250 2.500000 21
+## 161 F 18 3.2 4.500000 3.375 3.166667 17
+## 162 M 22 3.3 3.583333 4.125 3.083333 22
+## 163 F 22 3.3 3.666667 3.500 2.916667 18
+## 164 M 24 3.5 2.583333 2.000 3.166667 25
+## 165 F 19 3.2 4.166667 3.625 2.500000 24
+## 166 F 20 3.1 3.250000 3.375 3.833333 23
+## 167 F 21 2.4 3.583333 2.250 2.833333 0
+## 168 F 20 2.8 4.333333 2.125 2.250000 23
+## 169 F 17 1.7 3.916667 4.625 3.416667 26
+## 170 M 19 1.9 2.666667 2.500 3.750000 12
+## 171 F 23 3.2 3.583333 2.000 1.750000 0
+## 172 F 20 3.5 3.083333 2.875 3.000000 32
+## 173 F 20 2.4 3.750000 2.750 2.583333 22
+## 174 F 25 3.8 4.083333 3.375 2.750000 0
+## 175 F 20 2.1 4.166667 4.000 3.333333 20
+## 176 F 20 2.9 4.166667 2.375 2.833333 21
+## 177 F 19 1.9 3.250000 3.875 3.000000 23
+## 178 F 19 2.0 4.083333 3.375 2.833333 20
+## 179 F 22 4.2 2.916667 1.750 3.166667 28
+## 180 M 35 4.1 3.833333 3.000 2.750000 31
+## 181 F 18 3.7 3.166667 2.625 3.416667 18
+## 182 F 19 3.6 3.416667 2.625 3.000000 30
+## 183 M 21 1.8 4.083333 3.375 2.666667 19
Sometimes you want to rename your columns. You could do this by creating copies of the columns with new names, but you can also directly get and set the column names of a data frame using the function colnames(). The dplyr library also has a rename() function which can be used for this. Remember the cheatsheets.

Print out the column names of learning2014, change the name of the second column to 'age' and the name of 'Points' to 'points'.

Hint: - You can use colnames() similarly to the example. Which index matches the column 'Points'?
print(names(learning2014))
+## [1] "gender" "Age" "attitude" "deep" "stra" "surf" "Points"
+colnames(learning2014)[2] <- "age"
+learning2014 <- rename(learning2014, points = Points)
print(dim(learning2014)) # check the dimensions (183 rows and 7 columns at this point; 166 rows after removing non-attendees below)
+## [1] 183 7
Often your data includes outliers or other observations which you wish to remove before further analysis. Or perhaps you simply wish to work with some subset of your data.

In the learning2014 data the variable 'points' denotes the student's exam points in a statistics course exam. If the student did not attend the exam, the value of 'points' is zero. We will remove these observations from the data.

As an example, an object male_students could be created by selecting the male students from learning2014. Here, override learning2014 by selecting the rows where the 'points' variable is greater than zero (a sketch of this step is shown after the hint).

Hint: - The "greater than" logical operator is >
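The filtering code itself is not shown in the next chunk, only its result; a minimal sketch of the step, using dplyr::filter() (one of several equivalent ways), would be:

```r
library(dplyr)

# keep only the students who attended the exam (points greater than zero)
learning2014 <- filter(learning2014, points > 0)

# the filtered data should have 166 rows and 7 columns
dim(learning2014)
```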
dim(lrn14)
+## [1] 183 64
+dim(learning2014)
+## [1] 166 7
+#Export csv file
+setwd("~/Documents/GitHub/IODS-project")
+write_csv(learning2014, 'learning2014.csv')
ggplot2 is a popular library for creating stunning graphics with R. It has some advantages over the basic plotting system in R, mainly a consistent use of function arguments and flexible plot alteration. ggplot2 is an implementation of Leland Wilkinson's Grammar of Graphics, a general scheme for data visualization.

In ggplot2, plots may be created via the convenience function qplot(), where arguments and defaults are meant to be similar to base R's plot() function. More complex plotting capacity is available via ggplot(), which exposes the user to more explicit elements of the grammar. (from Wikipedia)

RStudio has a cheatsheet for data visualization with ggplot2.

Adjust the plot's aesthetics by defining col = gender inside aes().

Add a title with ggtitle("<insert title here>") to the plot with the regression line.

Hints: - Use + to add the title to the plot - The plot with the regression line is saved in the object p3 - You can draw the plot by typing the name of the object where the plot is saved
# Work with the exercise in this chunk, step-by-step. Fix the R code!
+# learning2014 is available
+
# Access the ggplot2 library
+library(ggplot2)
+
+# initialize plot with data and aesthetic mapping
+p1 <- ggplot(learning2014, aes(x = attitude, y = points))
+
+# define the visualization type (points)
+p2 <- p1 + geom_point()
+
+# draw the plot
+p2
+
+# add a regression line
+p3 <- p2 + geom_smooth(method = "lm")
+
+# draw the plot
+p3
+## `geom_smooth()` using formula 'y ~ x'
+
+#Lets try and overview summary
+p <- ggpairs(learning2014, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
+ # draw the plot!
+p
+
Often the most interesting feature of your data is the relationships between the variables. If there are only a handful of variables saved as columns in a data frame, it is possible to visualize all of these relationships neatly in a single plot.

Base R offers a fast plotting function pairs(), which draws all possible scatter plots from the columns of a data frame, resulting in a scatter plot matrix. The libraries GGally and ggplot2 together offer a slower but more detailed look at the variables, their distributions and relationships.

Add the argument col to the pairs() function, defining the colour with the 'gender' variable in learning2014.

Create the plot object p with ggpairs().

Adjust the argument mapping of ggpairs() by defining col = gender inside aes().

Also define alpha = 0.3 inside aes().

Hints: - You can use $ to access a column of a data frame. - Remember to separate function arguments with a comma. - You can draw the plot p by simply typing its name, just like when printing R objects.
# Work with the exercise in this chunk, step-by-step. Fix the R code!
+# learning2014 is available
+
+# draw a scatter plot matrix of the variables in learning2014.
+# [-1] excludes the first column (gender)
+pairs(learning2014[-1])
+
+# access the GGally and ggplot2 libraries
+library(GGally)
+library(ggplot2)
+
+# create a more advanced plot matrix with ggpairs()
+p <- ggpairs(learning2014, mapping = aes(), lower = list(combo = wrap("facethist", bins = 20)))
Regression analysis with R is easy once you have your data in a neat data frame. You can simply use the lm() function to fit a linear model. The first argument of lm() is a formula, which defines the target variable and the explanatory variable(s).

The formula should be y ~ x, where y is the target (or outcome) variable and x is the explanatory variable (predictor). The second argument of lm() is data, which should be a data frame where y and x are columns.

The output of lm() is a linear model object, which can be saved for later use. The generic function summary() can be used to print out a summary of the model.

Hints: - Replace 1 with the name of the explanatory variable in the formula inside lm() - Use summary() on the model object to print out a summary
# Work with the exercise in this chunk, step-by-step. Fix the R code!
+# learning2014 is available
+
+# a scatter plot of points versus attitude
+library(ggplot2)
+qplot(attitude, points, data = learning2014) + geom_smooth(method = "lm")
+## `geom_smooth()` using formula 'y ~ x'
+
# fit a linear model (per the hint above, 1 should be replaced with the explanatory variable,
# e.g. attitude; the summary printed below corresponds to the intercept-only model as written)
my_model <- lm(points ~ 1, data = learning2014)
+
+# print out a summary of the model
+summary(my_model)
+##
+## Call:
+## lm(formula = points ~ 1, data = learning2014)
+##
+## Residuals:
+## Min 1Q Median 3Q Max
+## -15.7169 -3.7169 0.2831 5.0331 10.2831
+##
+## Coefficients:
+## Estimate Std. Error t value Pr(>|t|)
+## (Intercept) 22.7169 0.4575 49.65 <2e-16 ***
+## ---
+## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
+##
+## Residual standard error: 5.895 on 165 degrees of freedom
When there is more than one explanatory variable in the linear model, it is called multiple regression. In R, it is easy to include more than one explanatory variable in your linear model. This is done by simply defining more explanatory variables in the formula argument of lm(), as below:

y ~ x1 + x2 + ..

Here y is again the target variable and x1, x2, .. are the explanatory variables.

Draw a plot matrix of the data with ggpairs().

Fit a regression model where points is the target variable and both attitude and stra are the explanatory variables.

Hint: - The variable with the third highest absolute correlation with points is surf.
# Work with the exercise in this chunk, step-by-step. Fix the R code!
+# learning2014 is available
+
+library(GGally)
+library(ggplot2)
+# create an plot matrix with ggpairs()
+ggpairs(learning2014, lower = list(combo = wrap("facethist", bins = 20)))
+
+# create a regression model with multiple explanatory variables
+my_model2 <- lm(points ~ attitude + stra, data = learning2014)
+
+# print out a summary of the model
+summary(my_model2)
+##
+## Call:
+## lm(formula = points ~ attitude + stra, data = learning2014)
+##
+## Residuals:
+## Min 1Q Median 3Q Max
+## -17.6436 -3.3113 0.5575 3.7928 10.9295
+##
+## Coefficients:
+## Estimate Std. Error t value Pr(>|t|)
+## (Intercept) 8.9729 2.3959 3.745 0.00025 ***
+## attitude 3.4658 0.5652 6.132 6.31e-09 ***
+## stra 0.9137 0.5345 1.709 0.08927 .
+## ---
+## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
+##
+## Residual standard error: 5.289 on 163 degrees of freedom
+## Multiple R-squared: 0.2048, Adjusted R-squared: 0.1951
+## F-statistic: 20.99 on 2 and 163 DF, p-value: 7.734e-09
R makes it easy to graphically explore the validity of your model assumptions. If you give a linear model object as the first argument to the plot() function, the function automatically assumes you want diagnostic plots and will produce them. You can check the help page for plotting an lm object by typing ?plot.lm or help(plot.lm) in the R console.

In the plot function you can then use the argument which to choose which plots you want. which must be an integer vector corresponding to the following list of plots:

| which | graphic |
|---|---|
| 1 | Residuals vs Fitted values |
| 2 | Normal QQ-plot |
| 3 | Standardized residuals vs Fitted values |
| 4 | Cook's distances |
| 5 | Residuals vs Leverage |
| 6 | Cook's distance vs Leverage |

We will focus on plots 1, 2 and 5: Residuals vs Fitted values, Normal QQ-plot and Residuals vs Leverage.

Create the model my_model2 and draw the diagnostic plots Residuals vs Fitted values, Normal QQ-plot and Residuals vs Leverage with the plot() function, using the argument which.

Before the call to the plot() function, add the following: par(mfrow = c(2,2)). This will place the following 4 graphics in the same plot. Execute the code again to see the effect.

Hint: - You can combine integers into an integer vector with c(). For example: c(1,2,3).
# Work with the exercise in this chunk, step-by-step. Fix the R code!
+# learning2014 is available
+
+# create a regression model with multiple explanatory variables
+my_model2 <- lm(points ~ attitude + stra, data = learning2014)
+
+# draw diagnostic plots using the plot() function. Choose the plots 1, 2 and 5
+plot(my_model2, which = 1)
+
+plot(my_model2, which = 2)
+
+plot(my_model2, which = 3)
+
+plot(my_model2, which = 4)
+
+plot(my_model2, which = 5)
+
+plot(my_model2, which = 6)
+
Okay, so let's assume that we have a linear model which seems to meet our standards. What can we do with it?

The model quantifies the relationship between the explanatory variable(s) and the dependent variable. The model can also be used for predicting the dependent variable based on new observations of the explanatory variable(s).

In R, predicting can be done using the predict() function (see ?predict). The first argument of predict() is a model object, and the argument newdata (a data.frame) can be used to make predictions based on new observations. One or more columns of newdata should have the same name as the explanatory variables in the model object.

Create the model object m and print out a summary of the model.

Create a data frame of new observations from the new_attitudes vector.

predict() the new students' exam points based on their attitudes, using the newdata argument.

Hints: - Type attitude = new_attitudes inside the data.frame() function. - Give the new_data data.frame as the newdata argument for predict().
# Work with the exercise in this chunk, step-by-step. Fix the R code!
+# learning2014 is available
+
+# Create model object m
+m <- lm(points ~ attitude, data = learning2014)
+
+# print out a summary of the model
+summary(m)
+##
+## Call:
+## lm(formula = points ~ attitude, data = learning2014)
+##
+## Residuals:
+## Min 1Q Median 3Q Max
+## -16.9763 -3.2119 0.4339 4.1534 10.6645
+##
+## Coefficients:
+## Estimate Std. Error t value Pr(>|t|)
+## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
+## attitude 3.5255 0.5674 6.214 4.12e-09 ***
+## ---
+## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
+##
+## Residual standard error: 5.32 on 164 degrees of freedom
+## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
+## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
+# New observations
+new_attitudes <- c("Mia" = 3.8, "Mike"= 4.4, "Riikka" = 2.2, "Pekka" = 2.9)
+new_data <- data.frame(attitude = new_attitudes)
+
+# Print out the new data
+summary(new_data)
+## attitude
+## Min. :2.200
+## 1st Qu.:2.725
+## Median :3.350
+## Mean :3.325
+## 3rd Qu.:3.950
+## Max. :4.400
+# Predict the new students exam points based on attitude
+predict(m, newdata = new_data)
+## Mia Mike Riikka Pekka
+## 25.03390 27.14918 19.39317 21.86099
+Awesome work!
+library("boot")
+library("readr")
Data wrangling was completed using the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets.html). The csv files were downloaded from the repository and the final data set "alc.csv" was exported into the project folder. The R code in the learning diary has been updated.

A new Rmd file has been created with the title "chapter3" and is now saved in the project folder.
+library(tidyverse)
+## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
+## ✔ ggplot2 3.3.6 ✔ dplyr 1.0.9
+## ✔ tibble 3.1.8 ✔ stringr 1.4.1
+## ✔ tidyr 1.2.0 ✔ forcats 0.5.2
+## ✔ purrr 0.3.4
+## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
+## ✖ dplyr::filter() masks stats::filter()
+## ✖ dplyr::lag() masks stats::lag()
+stu_alc2014 <- read_csv("alc.csv" , show_col_types= FALSE)
+spec(stu_alc2014)
+## cols(
+## school = col_character(),
+## sex = col_character(),
+## age = col_double(),
+## address = col_character(),
+## famsize = col_character(),
+## Pstatus = col_character(),
+## Medu = col_double(),
+## Fedu = col_double(),
+## Mjob = col_character(),
+## Fjob = col_character(),
+## reason = col_character(),
+## guardian = col_character(),
+## traveltime = col_double(),
+## studytime = col_double(),
+## schoolsup = col_character(),
+## famsup = col_character(),
+## activities = col_character(),
+## nursery = col_character(),
+## higher = col_character(),
+## internet = col_character(),
+## romantic = col_character(),
+## famrel = col_double(),
+## freetime = col_double(),
+## goout = col_double(),
+## Dalc = col_double(),
+## Walc = col_double(),
+## health = col_double(),
+## failures = col_double(),
+## paid = col_character(),
+## absences = col_double(),
+## G1 = col_double(),
+## G2 = col_double(),
+## G3 = col_double(),
+## alc_use = col_double(),
+## high_use = col_logical()
+## )
+#looking at the data
+dim(stu_alc2014)
+## [1] 370 35
+colnames(stu_alc2014)
+## [1] "school" "sex" "age" "address" "famsize"
+## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
+## [11] "reason" "guardian" "traveltime" "studytime" "schoolsup"
+## [16] "famsup" "activities" "nursery" "higher" "internet"
+## [21] "romantic" "famrel" "freetime" "goout" "Dalc"
+## [26] "Walc" "health" "failures" "paid" "absences"
+## [31] "G1" "G2" "G3" "alc_use" "high_use"
+glimpse(stu_alc2014)
+## Rows: 370
+## Columns: 35
+## $ school <chr> "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP",…
+## $ sex <chr> "F", "F", "F", "F", "F", "M", "M", "F", "M", "M", "F", "F",…
+## $ age <dbl> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15, 15, 15,…
+## $ address <chr> "U", "U", "U", "U", "U", "U", "U", "U", "U", "U", "U", "U",…
+## $ famsize <chr> "GT3", "GT3", "LE3", "GT3", "GT3", "LE3", "LE3", "GT3", "LE…
+## $ Pstatus <chr> "A", "T", "T", "T", "T", "T", "T", "A", "A", "T", "T", "T",…
+## $ Medu <dbl> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, 3, 3, 4,…
+## $ Fedu <dbl> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, 3, 2, 3,…
+## $ Mjob <chr> "at_home", "at_home", "at_home", "health", "other", "servic…
+## $ Fjob <chr> "teacher", "other", "other", "services", "other", "other", …
+## $ reason <chr> "course", "course", "other", "home", "home", "reputation", …
+## $ guardian <chr> "mother", "father", "mother", "mother", "father", "mother",…
+## $ traveltime <dbl> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, 3, 1, 1,…
+## $ studytime <dbl> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, 2, 1, 1,…
+## $ schoolsup <chr> "yes", "no", "yes", "no", "no", "no", "no", "yes", "no", "n…
+## $ famsup <chr> "no", "yes", "no", "yes", "yes", "yes", "no", "yes", "yes",…
+## $ activities <chr> "no", "no", "no", "yes", "no", "yes", "no", "no", "no", "ye…
+## $ nursery <chr> "yes", "no", "yes", "yes", "yes", "yes", "yes", "yes", "yes…
+## $ higher <chr> "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "ye…
+## $ internet <chr> "no", "yes", "yes", "yes", "no", "yes", "yes", "no", "yes",…
+## $ romantic <chr> "no", "no", "no", "yes", "no", "no", "no", "no", "no", "no"…
+## $ famrel <dbl> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, 5, 5, 3,…
+## $ freetime <dbl> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, 3, 5, 1,…
+## $ goout <dbl> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, 2, 5, 3,…
+## $ Dalc <dbl> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1,…
+## $ Walc <dbl> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, 1, 4, 3,…
+## $ health <dbl> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, 4, 5, 5,…
+## $ failures <dbl> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0,…
+## $ paid <chr> "no", "no", "yes", "yes", "yes", "yes", "no", "no", "yes", …
+## $ absences <dbl> 5, 3, 8, 1, 2, 8, 0, 4, 0, 0, 1, 2, 1, 1, 0, 5, 8, 3, 9, 5,…
+## $ G1 <dbl> 2, 7, 10, 14, 8, 14, 12, 8, 16, 13, 12, 10, 13, 11, 14, 16,…
+## $ G2 <dbl> 8, 8, 10, 14, 12, 14, 12, 9, 17, 14, 11, 12, 14, 11, 15, 16…
+## $ G3 <dbl> 8, 8, 11, 14, 12, 14, 12, 10, 18, 14, 12, 12, 13, 12, 16, 1…
+## $ alc_use <dbl> 1.0, 1.0, 2.5, 1.0, 1.5, 1.5, 1.0, 1.0, 1.0, 1.0, 1.5, 1.0,…
+## $ high_use <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
+stu_alc2014_2 <- select(stu_alc2014, absences, G3, age, freetime, high_use)
+str(stu_alc2014_2)
+## tibble [370 × 5] (S3: tbl_df/tbl/data.frame)
+## $ absences: num [1:370] 5 3 8 1 2 8 0 4 0 0 ...
+## $ G3 : num [1:370] 8 8 11 14 12 14 12 10 18 14 ...
+## $ age : num [1:370] 18 17 15 15 16 16 16 17 15 15 ...
+## $ freetime: num [1:370] 3 3 3 2 3 4 4 1 2 5 ...
+## $ high_use: logi [1:370] FALSE FALSE TRUE FALSE FALSE FALSE ...
+colnames(stu_alc2014_2)
+## [1] "absences" "G3" "age" "freetime" "high_use"
In the above data set we have 370 observations and 35 variables. The data come from a questionnaire survey in two secondary schools in Portugal, in which different demographic and socio-economic variables measure students' association with alcohol consumption. Here I choose four variables that I think have the greatest influence on the level of alcohol consumption among these school kids. My rationale for choosing them is that certain age groups have easier access to alcohol (students above 16, say, can obtain alcohol more easily than those below 16), so I wish to see the relation here. Free-time activities and the amount of free time also influence alcohol consumption, and likewise final grade and absences can correlate directly with high use. So I wish to test my model with these variables; a numerical summary of the chosen variables is sketched below, before the box plots.
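As a quick numerical check to complement the box plots below, a minimal sketch (not part of the original chunk) that summarises the chosen variables by high_use:

```r
library(dplyr)

# group means of the chosen variables for high and low alcohol use
stu_alc2014_2 %>%
  group_by(high_use) %>%
  summarise(across(c(absences, G3, age, freetime), mean), n = n())
```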
#Let's explore the chosen variables using box plots
+
+##Let's see for absences
+g1 <- ggplot(stu_alc2014, aes(x = high_use, y = absences))
+g1 + geom_boxplot() + ylab("Absences")
+
+##Let's see for G3(Final Grade)
+g1 <- ggplot(stu_alc2014, aes(x = high_use, y = G3))
+g1 + geom_boxplot() + ylab("Final Grade")
+
+##Let's see for age
+g1 <- ggplot(stu_alc2014, aes(x = high_use, y = age))
+g1 + geom_boxplot() + ylab("Age")
+
+##And for freetime
+g1 <- ggplot(stu_alc2014, aes(x = high_use, y = freetime))
+g1 + geom_boxplot() + ylab("Freetime")
A general overview of the plots suggests some association between high alcohol use and absences, and also age (in line with the hypothesis). It would be interesting to fit this kind of model to see the relationship. Final grade and alcohol consumption also show some association, as there appears to be a difference in the group means between TRUE and FALSE. Free time doesn't seem to have much effect on alcohol consumption.
##Let's call this model-1 (m1), which explores the 4 chosen variables
+m1 <- glm(high_use ~ absences + G3 + age + freetime, data = stu_alc2014_2, family = "binomial")
+summary(m1)
+##
+## Call:
+## glm(formula = high_use ~ absences + G3 + age + freetime, family = "binomial",
+## data = stu_alc2014_2)
+##
+## Deviance Residuals:
+## Min 1Q Median 3Q Max
+## -2.0110 -0.8181 -0.6584 1.1345 2.0343
+##
+## Coefficients:
+## Estimate Std. Error z value Pr(>|z|)
+## (Intercept) -3.80128 1.87360 -2.029 0.042472 *
+## absences 0.08428 0.02301 3.663 0.000249 ***
+## G3 -0.06623 0.03666 -1.807 0.070838 .
+## age 0.11958 0.10233 1.169 0.242592
+## freetime 0.39844 0.12559 3.172 0.001511 **
+## ---
+## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
+##
+## (Dispersion parameter for binomial family taken to be 1)
+##
+## Null deviance: 452.04 on 369 degrees of freedom
+## Residual deviance: 417.94 on 365 degrees of freedom
+## AIC: 427.94
+##
+## Number of Fisher Scoring iterations: 4
This seems to be an interesting outcome. The fitted model seems to match my hypothesis above, based on the box plots and distributions, for some variables: absences is the most significant and free time the second most significant predictor. Absences has the highest significance (p = 0.000249) and free time is significant with p = 0.001511. Final grade and age, however, do not reach the same level of significance as the other two. Final grade has p ≈ 0.07, which is only borderline; comparatively, the two other variables stand out the most.
+coef(m1)
+## (Intercept) absences G3 age freetime
+## -3.80128143 0.08428256 -0.06622629 0.11957546 0.39844165
+OR <- coef(m1) %>% exp
+CI <- confint(m1) %>% exp
+## Waiting for profiling to be done...
+#Print OR and CI
+cbind(OR, CI)
+## OR 2.5 % 97.5 %
+## (Intercept) 0.02234212 0.0005401794 0.8536065
+## absences 1.08793626 1.0421786444 1.1408573
+## G3 0.93591905 0.8705458921 1.0055377
+## age 1.12701829 0.9228580718 1.3797569
+## freetime 1.48950173 1.1689712226 1.9147718
Coefficients and confidence intervals: looking at the coefficients, final grade has a negative value, suggesting an inverse association with high alcohol consumption, which makes sense because a higher grade can go together with lower alcohol consumption. Absences, age and free time show positive associations, with free time having the largest odds ratio for high alcohol consumption, followed by age and absences. This is also supported by the odds ratios and confidence intervals for each tested variable.

The hypothesis above therefore holds for most variables. It is of course important to explore other variables as well, to see the effect of additional variables on the outcome through multivariable analysis. Below is a sketch of how the fitted model's predictive performance on the training data could be checked.
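A minimal sketch (not part of the original analysis) that turns the predicted probabilities from m1 into classifications and tabulates them against the observed values:

```r
library(dplyr)

# predicted probabilities of high alcohol use from model m1
probabilities <- predict(m1, type = "response")

# classify as high use when the predicted probability exceeds 0.5
stu_alc2014_2 <- mutate(stu_alc2014_2,
                        probability = probabilities,
                        prediction  = probability > 0.5)

# cross-tabulate the observed versus the predicted high use
table(high_use = stu_alc2014_2$high_use, prediction = stu_alc2014_2$prediction)

# proportion of misclassified observations (training error)
mean(stu_alc2014_2$high_use != stu_alc2014_2$prediction)
```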
The topics of this chapter - clustering and classification - are handy and visual tools for exploring statistical data. Clustering means that some points (or observations) of the data are in some sense closer to each other than some other points. In other words, the data points do not comprise a homogeneous sample but are instead somehow clustered.

In general, clustering methods try to find these clusters (or groups) in the data. One of the most typical clustering methods is k-means clustering. Hierarchical clustering methods are also quite popular, giving tree-like dendrograms as their main output.
+As such, clusters are easy to find, but what might be the “right” +number of clusters? It is not always clear. And how to give these +clusters names and interpretations?
+Based on a successful clustering, we may try to classify new +observations to these clusters and hence validate the results of +clustering. Another way is to use various forms of discriminant +analysis, which operates with the (now) known clusters, asking: “what +makes the difference(s) between these groups (clusters)?”
+In the connection of these methods, we also discuss the topic of +distance (or dissimilarity or similarity) measures. There are lots of +other measures than just the ordinary Euclidean distance, although it is +one of the most important ones. Several discrete and even binary +measures exist and are widely used for different purposes in various +disciplines.
+library(tidyverse)
+## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
+## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
+## ✔ tibble 3.1.8 ✔ dplyr 1.0.9
+## ✔ tidyr 1.2.0 ✔ stringr 1.4.1
+## ✔ readr 2.1.2 ✔ forcats 0.5.2
+## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
+## ✖ dplyr::filter() masks stats::filter()
+## ✖ dplyr::lag() masks stats::lag()
+library(GGally)
+## Registered S3 method overwritten by 'GGally':
+## method from
+## +.gg ggplot2
+library(dplyr)
+library(ggplot2)
+library(MASS)
+##
+## Attaching package: 'MASS'
+##
+## The following object is masked from 'package:dplyr':
+##
+## select
+library(corrplot)
+## corrplot 0.92 loaded
+#Loading the Boston data from the MASS package
+data("Boston")
+
+# Exploring the structure and the dimensions of the data set
+
+str(Boston)
+## 'data.frame': 506 obs. of 14 variables:
+## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
+## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
+## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
+## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
+## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
+## $ rm : num 6.58 6.42 7.18 7 7.15 ...
+## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
+## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
+## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
+## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
+## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
+## $ black : num 397 397 393 395 397 ...
+## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
+## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
+dim(Boston)
+## [1] 506 14
The data set "Boston" from the MASS library presents a geographical, demographic, structural, economic and cultural description of different suburbs in Boston and the implications for housing values within those suburbs. The data set contains 14 variables and 506 observations, and most variables are numerical.

The variables in the data set which influence the housing prices are:

- crim: per capita crime rate by town
- zn: proportion of residential land zoned for lots over 25,000 sq.ft.
- indus: proportion of non-retail business acres per town
- chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- nox: nitrogen oxides concentration (parts per 10 million)
- rm: average number of rooms per dwelling
- age: proportion of owner-occupied units built prior to 1940
- dis: weighted mean of distances to five Boston employment centres
- rad: index of accessibility to radial highways
- tax: full-value property-tax rate per $10,000
- ptratio: pupil-teacher ratio by town
- black: 1000(Bk - 0.63)^2, where Bk is the proportion of blacks by town
- lstat: lower status of the population (percent)
- medv: median value of owner-occupied homes in $1000s
+Data is sourced from Harrison, D. and Rubinfeld, D.L. (1978) Hedonic +prices and the demand for clean air. J. Environ. Economics and +Management 5, 81–102. Belsley D.A., Kuh, E. and Welsch, R.E. (1980) +Regression Diagnostics. Identifying Influential Data and Sources of +Collinearity. New York: Wiley.
+For more information about the “Boston” data set follow the link https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/Boston.html
+#overview
+summary(Boston)
+## crim zn indus chas
+## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
+## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
+## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
+## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
+## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
+## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
+## nox rm age dis
+## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
+## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
+## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
+## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
+## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
+## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
+## rad tax ptratio black
+## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
+## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
+## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
+## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
+## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
+## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
+## lstat medv
+## Min. : 1.73 Min. : 5.00
+## 1st Qu.: 6.95 1st Qu.:17.02
+## Median :11.36 Median :21.20
+## Mean :12.65 Mean :22.53
+## 3rd Qu.:16.95 3rd Qu.:25.00
+## Max. :37.97 Max. :50.00
+#Plotting
+long_Boston <- pivot_longer(Boston, cols=everything(), names_to = 'variable', values_to = 'value')
+p1 <- ggplot(data=long_Boston)
+p1 + geom_histogram(mapping= aes(x=value)) + facet_wrap(~variable, scales="free")
+## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
+
+#Relationship between the variables
+cor_matrix <- cor(Boston)
+cor_matrix
+## crim zn indus chas nox
+## crim 1.00000000 -0.20046922 0.40658341 -0.055891582 0.42097171
+## zn -0.20046922 1.00000000 -0.53382819 -0.042696719 -0.51660371
+## indus 0.40658341 -0.53382819 1.00000000 0.062938027 0.76365145
+## chas -0.05589158 -0.04269672 0.06293803 1.000000000 0.09120281
+## nox 0.42097171 -0.51660371 0.76365145 0.091202807 1.00000000
+## rm -0.21924670 0.31199059 -0.39167585 0.091251225 -0.30218819
+## age 0.35273425 -0.56953734 0.64477851 0.086517774 0.73147010
+## dis -0.37967009 0.66440822 -0.70802699 -0.099175780 -0.76923011
+## rad 0.62550515 -0.31194783 0.59512927 -0.007368241 0.61144056
+## tax 0.58276431 -0.31456332 0.72076018 -0.035586518 0.66802320
+## ptratio 0.28994558 -0.39167855 0.38324756 -0.121515174 0.18893268
+## black -0.38506394 0.17552032 -0.35697654 0.048788485 -0.38005064
+## lstat 0.45562148 -0.41299457 0.60379972 -0.053929298 0.59087892
+## medv -0.38830461 0.36044534 -0.48372516 0.175260177 -0.42732077
+## rm age dis rad tax ptratio
+## crim -0.21924670 0.35273425 -0.37967009 0.625505145 0.58276431 0.2899456
+## zn 0.31199059 -0.56953734 0.66440822 -0.311947826 -0.31456332 -0.3916785
+## indus -0.39167585 0.64477851 -0.70802699 0.595129275 0.72076018 0.3832476
+## chas 0.09125123 0.08651777 -0.09917578 -0.007368241 -0.03558652 -0.1215152
+## nox -0.30218819 0.73147010 -0.76923011 0.611440563 0.66802320 0.1889327
+## rm 1.00000000 -0.24026493 0.20524621 -0.209846668 -0.29204783 -0.3555015
+## age -0.24026493 1.00000000 -0.74788054 0.456022452 0.50645559 0.2615150
+## dis 0.20524621 -0.74788054 1.00000000 -0.494587930 -0.53443158 -0.2324705
+## rad -0.20984667 0.45602245 -0.49458793 1.000000000 0.91022819 0.4647412
+## tax -0.29204783 0.50645559 -0.53443158 0.910228189 1.00000000 0.4608530
+## ptratio -0.35550149 0.26151501 -0.23247054 0.464741179 0.46085304 1.0000000
+## black 0.12806864 -0.27353398 0.29151167 -0.444412816 -0.44180801 -0.1773833
+## lstat -0.61380827 0.60233853 -0.49699583 0.488676335 0.54399341 0.3740443
+## medv 0.69535995 -0.37695457 0.24992873 -0.381626231 -0.46853593 -0.5077867
+## black lstat medv
+## crim -0.38506394 0.4556215 -0.3883046
+## zn 0.17552032 -0.4129946 0.3604453
+## indus -0.35697654 0.6037997 -0.4837252
+## chas 0.04878848 -0.0539293 0.1752602
+## nox -0.38005064 0.5908789 -0.4273208
+## rm 0.12806864 -0.6138083 0.6953599
+## age -0.27353398 0.6023385 -0.3769546
+## dis 0.29151167 -0.4969958 0.2499287
+## rad -0.44441282 0.4886763 -0.3816262
+## tax -0.44180801 0.5439934 -0.4685359
+## ptratio -0.17738330 0.3740443 -0.5077867
+## black 1.00000000 -0.3660869 0.3334608
+## lstat -0.36608690 1.0000000 -0.7376627
+## medv 0.33346082 -0.7376627 1.0000000
+corrplot(cor_matrix, method="circle")
+
#Let's try a mixed plot with correlation values (the closer a value is to 1, the stronger the correlation)
+corrplot.mixed(cor_matrix, lower = 'number', upper = 'circle',tl.pos ="lt", tl.col='black', tl.cex=0.6, number.cex = 0.5)
In the correlation matrix plot, blue represents positive and red represents negative correlation. Values closer to +1 and -1 imply a stronger positive or negative correlation, respectively. The strength is shown by the shade and size of the circles and by the decimal values, indexed in the colour bar on the right side of the plot.
Based on the plot we can see, for example:

- Positive correlation between some variables, e.g. rad and tax (0.91), indus and nox (0.76), nox and age (0.73).
- Negative correlation between some variables, e.g. dis and nox (-0.77), dis and age (-0.75), lstat and medv (-0.74).
Let's begin the scaling exercise to standardize the data set. Since the Boston data contains only numerical values, we can use the function scale() to standardize the whole dataset, as mentioned in the exercise 4 instructions.

We also have the following equation from the exercise, which gives the idea behind scaling: subtract the column mean from each value of the corresponding column and divide by the standard deviation.
\[scaled(x) = \frac{x - mean(x)}{sd(x)}\]
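A quick sanity check (a sketch, not in the original) that scale() matches this formula for a single column:

```r
# manual standardization of one column, e.g. crim
manual <- (Boston$crim - mean(Boston$crim)) / sd(Boston$crim)

# scale() applied to the whole data set gives the same values column by column
all.equal(as.numeric(scale(Boston)[, "crim"]), manual)
```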
+#Saving the scaled data to the object
+boston_scaled <- scale(Boston)
+
+#Lets see
+summary(boston_scaled)
+## crim zn indus chas
+## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
+## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
+## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
+## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
+## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
+## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
+## nox rm age dis
+## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
+## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
+## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
+## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
+## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
+## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
+## rad tax ptratio black
+## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
+## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
+## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
+## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
+## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
+## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
+## lstat medv
+## Min. :-1.5296 Min. :-1.9063
+## 1st Qu.:-0.7986 1st Qu.:-0.5989
+## Median :-0.1811 Median :-0.1449
+## Mean : 0.0000 Mean : 0.0000
+## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
+## Max. : 3.5453 Max. : 2.9865
+# class of the boston_scaled object
+class(boston_scaled)
+## [1] "matrix" "array"
+# change the object to data frame
+boston_scaled <- as.data.frame(boston_scaled)
Scaling puts the variables on a common scale, which makes the analyses easier. scale() returns a matrix (an array), so for further analysis we change it back to a data frame.
Next, let's create a quantile vector of the scaled crim variable and print it; the quantiles will be used as break points for a categorical crime variable.
+bins <- quantile(boston_scaled$crim)
+bins
+## 0% 25% 50% 75% 100%
+## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
+# create a categorical variable 'crime'
+crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE)
+
+# look at the table of the new factor crime
+table(crime)
+## crime
+## [-0.419,-0.411] (-0.411,-0.39] (-0.39,0.00739] (0.00739,9.92]
+## 127 126 126 127
+# remove original crim from the dataset
+boston_scaled <- dplyr::select(boston_scaled, -crim)
+
+# add the new categorical value to scaled data
+boston_scaled <- data.frame(boston_scaled, crime)
+
+# let's see the new data set now !!
+summary(boston_scaled)
+## zn indus chas nox
+## Min. :-0.48724 Min. :-1.5563 Min. :-0.2723 Min. :-1.4644
+## 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723 1st Qu.:-0.9121
+## Median :-0.48724 Median :-0.2109 Median :-0.2723 Median :-0.1441
+## Mean : 0.00000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
+## 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723 3rd Qu.: 0.5981
+## Max. : 3.80047 Max. : 2.4202 Max. : 3.6648 Max. : 2.7296
+## rm age dis rad
+## Min. :-3.8764 Min. :-2.3331 Min. :-1.2658 Min. :-0.9819
+## 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049 1st Qu.:-0.6373
+## Median :-0.1084 Median : 0.3171 Median :-0.2790 Median :-0.5225
+## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
+## 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617 3rd Qu.: 1.6596
+## Max. : 3.5515 Max. : 1.1164 Max. : 3.9566 Max. : 1.6596
+## tax ptratio black lstat
+## Min. :-1.3127 Min. :-2.7047 Min. :-3.9033 Min. :-1.5296
+## 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049 1st Qu.:-0.7986
+## Median :-0.4642 Median : 0.2746 Median : 0.3808 Median :-0.1811
+## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
+## 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332 3rd Qu.: 0.6024
+## Max. : 1.7964 Max. : 1.6372 Max. : 0.4406 Max. : 3.5453
+## medv crime
+## Min. :-1.9063 [-0.419,-0.411]:127
+## 1st Qu.:-0.5989 (-0.411,-0.39] :126
+## Median :-0.1449 (-0.39,0.00739]:126
+## Mean : 0.0000 (0.00739,9.92] :127
+## 3rd Qu.: 0.2683
+## Max. : 2.9865
Now, as mentioned in the exercise set "Divide and conquer", let's divide the data into train and test sets: training -> 80% and testing -> 20%.
+# number of rows in the Boston data set
+n <- nrow(boston_scaled)
+n
+## [1] 506
+# choose randomly 80% of the rows
+ind <- sample(n, size = n * 0.8)
+
+# create train set
+train <- boston_scaled[ind,]
+
+# create test set
+test <- boston_scaled[-ind,]
+
+# save the correct classes from test data
+correct_classes <- test$crime
+
+# remove the crime variable from test data
+test <- dplyr::select(test, -crime)
Let's move on to fitting LDA on the training set.

Our target variable: crime (categorical).
+# linear discriminant analysis
# all other variables are used as predictors, denoted by the dot in the formula
+lda.fit <- lda(crime ~ ., data = train)
+
+# print the lda.fit object
+lda.fit
+## Call:
+## lda(crime ~ ., data = train)
+##
+## Prior probabilities of groups:
+## [-0.419,-0.411] (-0.411,-0.39] (-0.39,0.00739] (0.00739,9.92]
+## 0.2475248 0.2425743 0.2648515 0.2450495
+##
+## Group means:
+## zn indus chas nox rm
+## [-0.419,-0.411] 0.89447554 -0.8958692 -0.15421606 -0.8856586 0.4200376
+## (-0.411,-0.39] -0.07378209 -0.3153201 -0.03128211 -0.5402615 -0.1894835
+## (-0.39,0.00739] -0.37343545 0.2036824 0.27960087 0.3978276 0.1189162
+## (0.00739,9.92] -0.48724019 1.0171737 -0.15302300 1.0420418 -0.4325790
+## age dis rad tax ptratio
+## [-0.419,-0.411] -0.8914002 0.9357258 -0.7004968 -0.6996510 -0.45106613
+## (-0.411,-0.39] -0.2556164 0.3769261 -0.5377192 -0.4763222 -0.07796912
+## (-0.39,0.00739] 0.4301848 -0.3891695 -0.3936844 -0.2751204 -0.33430767
+## (0.00739,9.92] 0.7949744 -0.8417167 1.6375616 1.5136504 0.78011702
+## black lstat medv
+## [-0.419,-0.411] 0.36979482 -0.750037404 0.49756898
+## (-0.411,-0.39] 0.31566531 -0.065445306 -0.03607504
+## (-0.39,0.00739] 0.09089154 0.007096295 0.19478879
+## (0.00739,9.92] -0.86757833 0.978459544 -0.71986628
+##
+## Coefficients of linear discriminants:
+## LD1 LD2 LD3
+## zn 0.14031540 0.598096615 -0.92383757
+## indus -0.02930192 -0.189172614 0.22826688
+## chas -0.10117097 -0.142233787 0.03149598
+## nox 0.41573233 -0.794904909 -1.25110968
+## rm -0.11314539 -0.023087098 -0.22031401
+## age 0.33303790 -0.386946830 0.10834464
+## dis -0.08707448 -0.225670182 0.26125036
+## rad 3.05322455 0.876145452 0.07643080
+## tax -0.04818628 0.024343114 0.23153680
+## ptratio 0.13124250 -0.007118077 -0.25133969
+## black -0.14400936 -0.003522677 0.13875654
+## lstat 0.21595687 -0.113343967 0.39891878
+## medv 0.18585476 -0.356563842 -0.20147563
+##
+## Proportion of trace:
+## LD1 LD2 LD3
+## 0.9480 0.0404 0.0116
There are three linear discriminants; their proportions of trace (the share of between-group variance each explains) in the output above are:

LD1 0.948, LD2 0.040, LD3 0.012
+# the function for lda biplot arrows
+lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
+ heads <- coef(x)
+ arrows(x0 = 0, y0 = 0,
+ x1 = myscale * heads[,choices[1]],
+ y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
+ text(myscale * heads[,choices], labels = row.names(heads),
+ cex = tex, col=color, pos=3)
+}
+
+# target classes as numeric
+classes <- as.numeric(train$crime)
+
+# plot the lda results
+plot(lda.fit, dimen = 2, col = classes, pch = classes)
+lda.arrows(lda.fit, myscale = 1)
+
+library(MASS)
+
+ind <- sample(nrow(boston_scaled), size = nrow(boston_scaled) * 0.8)
+
+train <- boston_scaled[ind,]
+
+test <- boston_scaled[-ind,]
+
+correct_classes <- test$crime
+
+test <- dplyr::select(test, -crime)
+
+lda.fit <- lda(crime ~ ., data = train)
+
+# predict classes with test data
+lda.pred <- predict(lda.fit, newdata = test)
+
+# cross tabulate the results
+table(correct = correct_classes, predicted = lda.pred$class)
+## predicted
+## correct [-0.419,-0.411] (-0.411,-0.39] (-0.39,0.00739] (0.00739,9.92]
+## [-0.419,-0.411] 23 9 1 0
+## (-0.411,-0.39] 2 17 2 0
+## (-0.39,0.00739] 1 10 14 3
+## (0.00739,9.92] 0 0 0 20
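+As a quick follow-up (a small sketch of my own, not required by the exercise), the overall prediction accuracy can be read from the cross tabulation by dividing the diagonal by the total count:
+# overall accuracy from the confusion table; with the table shown above this is 74/102, about 0.73
+conf <- table(correct = correct_classes, predicted = lda.pred$class)
+sum(diag(conf)) / sum(conf)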
+Let’s work with clustering now Let’s calculate also the distances +between observations to assess the similarity between data points. As +instructed we calculated two distance matrix euclidean and +manhattan.
+#start by reloading the Boston data set
+data("Boston")
+
+boston_scaled <- scale(Boston)
+summary(boston_scaled)
+## crim zn indus chas
+## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
+## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
+## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
+## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
+## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
+## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
+## nox rm age dis
+## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
+## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
+## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
+## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
+## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
+## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
+## rad tax ptratio black
+## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
+## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
+## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
+## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
+## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
+## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
+## lstat medv
+## Min. :-1.5296 Min. :-1.9063
+## 1st Qu.:-0.7986 1st Qu.:-0.5989
+## Median :-0.1811 Median :-0.1449
+## Mean : 0.0000 Mean : 0.0000
+## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
+## Max. : 3.5453 Max. : 2.9865
+# class of the boston_scaled object
+class(boston_scaled)
+## [1] "matrix" "array"
+# change the object to a data frame for further analysis
+boston_scaled <- as.data.frame(boston_scaled)
+# Calculating distances between the observation
+# euclidean distance matrix
+dist_eu <- dist(boston_scaled, method = "euclidean")
+
+# look at the summary of the distances
+summary(dist_eu)
+## Min. 1st Qu. Median Mean 3rd Qu. Max.
+## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
+# Manhattan distance matrix
+dist_man <- dist(boston_scaled, method = "manhattan")
+
+# look at the summary of the distances
+summary(dist_man)
+## Min. 1st Qu. Median Mean 3rd Qu. Max.
+## 0.2662 8.4832 12.6090 13.5488 17.7568 48.8618
+Let's do k-means clustering, first with 4 clusters and then with 3.
+set.seed(123) # set.seed() fixes the random assignment of the initial cluster centers, so the k-means results are reproducible
+
+# k-means clustering
+km <- kmeans(boston_scaled, centers = 4)
+
+# plotting the scaled Boston data set with clusters
+pairs(boston_scaled, col = km$cluster)
++With 4 clusters, it appears that there is some overlap between the clusters. Let's try 3 clusters and check.
+# k-means clustering
+km <- kmeans(boston_scaled, centers = 3)
+
+# plotting the scaled Boston data set with clusters
+pairs(boston_scaled, col = km$cluster)
++With 3 clusters the overlap is even worse!! We can try to determine the optimal number of clusters for our model and see how it works.
+# Investigating the optimal number of clusters
+set.seed(123) # set the seed again so the k-means runs below are reproducible
+
+# maximum number of clusters to try
+k_max <- 10
+
+# calculate the total within sum of squares
+twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})
+
+# visualize the results
+qplot(x = 1:k_max, y = twcss, geom = 'line', ylab = "TWCSS", xlab = "Number of clusters")
+
+From the plot it looks like the optimal number of clusters for our model is 2 or 3, since that is where the TWCSS drops sharply. We will try 2, because the steepest drop happens when moving to two clusters and the curve flattens out after that.
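+To back up that reading of the elbow plot, here is a minimal sketch (my own addition) that quantifies the relative decrease in TWCSS when one more cluster is added; the largest relative drop points to the strongest candidate for k:
+# relative drop in TWCSS from k to k+1 clusters
+twcss_drop <- -diff(twcss)
+round(twcss_drop / twcss[-k_max], 2)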
+# k-means clustering
+km <- kmeans(boston_scaled, centers = 2)
+
+# plot the Boston data set with clusters predicted
+pairs(boston_scaled, col = km$cluster)
+
+pairs(boston_scaled[6:10], col = km$cluster)
++With 2 clusters the model seems better: there is good separation in, for example, the rad and tax panels, and rm and age look reasonable as well.
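+As a hedged check of that interpretation, the cluster centers (in scaled units) for the variables mentioned above can be inspected directly from the k-means object:
+# cluster centers for a few variables; large differences between the two rows
+# indicate variables along which the two clusters separate
+round(km$centers[, c("rad", "tax", "rm", "age")], 2)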
+Actually, a fairly large selection of statistical methods can be +listed under the title “dimensionality reduction techniques”. Most often +(nearly always, that is!) the real-world phenomena are multidimensional: +they may consist of not just two or three but 5 or 10 or 20 or 50 (or +more) dimensions. Of course, we are living only in a three-dimensional +(3D) world, so those multiple dimensions may really challenge our +imagination. It would be easier to reduce the number of dimensions in +one way or another.
+We shall now learn the basics of two data science based ways of +reducing the dimensions. The principal method here is principal +component analysis (PCA), which reduces any number of measured +(continuous) and correlated variables into a few uncorrelated components +that collect together as much variance as possible from the original +variables. The most important components can be then used for various +purposes, e.g., drawing scatterplots and other fancy graphs that would +be quite impossible to achieve with the original variables and too many +dimensions.
+Multiple correspondence analysis (MCA) and other variations of CA +bring us similar possibilities in the world of discrete variables, even +nominal scale (classified) variables, by finding a suitable +transformation into continuous scales and then reducing the dimensions +quite analogously with the PCA. The typical graphs show the original +classes of the discrete variables on the same “map”, making it possible +to reveal connections (correspondences) between different things that +would be quite impossible to see from the corresponding cross tables +(too many numbers!).
+Briefly stated, these methods help to visualize and understand +multidimensional phenomena by reducing their dimensionality that may +first feel impossible to handle at all.
+date()
+## [1] "Tue Dec 6 21:29:41 2022"
+Let's start!!
+#load the packages
+
+library(tidyverse)
+## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
+## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
+## ✔ tibble 3.1.8 ✔ dplyr 1.0.9
+## ✔ tidyr 1.2.0 ✔ stringr 1.4.1
+## ✔ readr 2.1.2 ✔ forcats 0.5.2
+## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
+## ✖ dplyr::filter() masks stats::filter()
+## ✖ dplyr::lag() masks stats::lag()
+library(GGally)
+## Registered S3 method overwritten by 'GGally':
+## method from
+## +.gg ggplot2
+library(dplyr)
+library(ggplot2)
+library(corrplot)
+## corrplot 0.92 loaded
+library(stringr)
+library(psych)
+##
+## Attaching package: 'psych'
+##
+## The following objects are masked from 'package:ggplot2':
+##
+## %+%, alpha
+library(FactoMineR)
+library(tidyr)
+#loading from project folder
+human_ <- read.csv('human_.csv', row.names = 1)
+
+# Alternatively, the URL is given on the instructions page, so we could also read the data from there:
+# human <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/human2.txt", row.names = 1)
+
+# let's check how the data looks
+str(human_);dim(human_)
+## 'data.frame': 155 obs. of 8 variables:
+## $ SeEdu_FM: num 1.007 0.997 0.983 0.989 0.969 ...
+## $ LFR_FM : num 0.891 0.819 0.825 0.884 0.829 ...
+## $ Life_Exp: num 81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
+## $ Exp_Edu : num 17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
+## $ GNI : int 64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
+## $ MMR : int 4 6 6 5 6 7 9 28 11 8 ...
+## $ ABR : num 7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
+## $ X.PR : num 39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
+## [1] 155 8
+colnames(human_)
+## [1] "SeEdu_FM" "LFR_FM" "Life_Exp" "Exp_Edu" "GNI" "MMR" "ABR"
+## [8] "X.PR"
+summary(human_)
+## SeEdu_FM LFR_FM Life_Exp Exp_Edu
+## Min. :0.1717 Min. :0.1857 Min. :49.00 Min. : 5.40
+## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:66.30 1st Qu.:11.25
+## Median :0.9375 Median :0.7535 Median :74.20 Median :13.50
+## Mean :0.8529 Mean :0.7074 Mean :71.65 Mean :13.18
+## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:77.25 3rd Qu.:15.20
+## Max. :1.4967 Max. :1.0380 Max. :83.50 Max. :20.20
+## GNI MMR ABR X.PR
+## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
+## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
+## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
+## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
+## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
+## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
+Plot1 <- ggpairs(human_, mapping = aes(alpha = 0.5), title = "summary plot", lower = list(combo = wrap("facethist", bins = 25)))
+Plot1
+
+# Let's compute the Spearman correlation matrix and visualize it with corrplot, using the same approach as in last week's exercise with some changes
+Plot2 <- cor(human_, method='spearman')
+Plot2
+## SeEdu_FM LFR_FM Life_Exp Exp_Edu GNI MMR
+## SeEdu_FM 1.00000000 -0.07525177 0.4987487 0.52316296 0.5663235 -0.4937468
+## LFR_FM -0.07525177 1.00000000 -0.1535502 -0.03956409 -0.1414489 0.1213061
+## Life_Exp 0.49874873 -0.15355015 1.0000000 0.81053640 0.8361208 -0.8753753
+## Exp_Edu 0.52316296 -0.03956409 0.8105364 1.00000000 0.8495071 -0.8472241
+## GNI 0.56632346 -0.14144887 0.8361208 0.84950709 1.0000000 -0.8647968
+## MMR -0.49374677 0.12130614 -0.8753753 -0.84722412 -0.8647968 1.0000000
+## ABR -0.35514985 0.09705612 -0.7468356 -0.72251512 -0.7562608 0.8315049
+## X.PR 0.09722070 0.20545990 0.2523837 0.22396862 0.1778011 -0.2028040
+## ABR X.PR
+## SeEdu_FM -0.35514985 0.0972207
+## LFR_FM 0.09705612 0.2054599
+## Life_Exp -0.74683557 0.2523837
+## Exp_Edu -0.72251512 0.2239686
+## GNI -0.75626078 0.1778011
+## MMR 0.83150492 -0.2028040
+## ABR 1.00000000 -0.1214131
+## X.PR -0.12141308 1.0000000
+corrplot.mixed(Plot2, lower = 'number', upper = 'ellipse',tl.pos ="lt", tl.col='black', tl.cex=0.8, number.cex = 0.7)
++In the correlation plot above I have used the ellipse method to visualize the relationships between the variables. The narrower the ellipse, the stronger the correlation, and the blue and red ends of the colour spectrum represent positive and negative correlations, respectively.
+Statistically significant, strong positive correlations can be seen in the two plots, for example between Life_Exp and Exp_Edu, Exp_Edu and GNI, Life_Exp and GNI, and MMR and ABR.
+Statistically significant, strong negative correlations can be seen between MMR and each of Life_Exp, Exp_Edu and GNI, and likewise between ABR and those same variables.
+Two variables, the female-to-male labour force participation ratio (LFR_FM) and the percentage of female representatives in parliament (X.PR), show hardly any correlation with the other variables and are therefore not strongly associated with them. Likewise, the female-to-male ratio in secondary education (SeEdu_FM) is not strongly associated with any other variable.
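+To make the statements above reproducible, here is a small sketch (my own addition) that lists the variable pairs with the strongest Spearman correlations from the matrix computed above:
+# variable pairs with |rho| > 0.7, each pair listed once
+cor_long <- as.data.frame(as.table(Plot2))                             # long format: Var1, Var2, Freq
+cor_long <- subset(cor_long, as.character(Var1) < as.character(Var2))  # drop the diagonal and duplicates
+cor_long[abs(cor_long$Freq) > 0.7, ]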
+Principal Component Analysis (PCA) can be performed by two slightly different matrix decomposition methods from linear algebra: the Eigenvalue Decomposition and the Singular Value Decomposition (SVD).
+There are two functions in the default package distribution of R that can be used to perform PCA: princomp() and prcomp(). The prcomp() function uses the SVD and is the preferred, more numerically accurate method. Both methods quite literally decompose a data matrix into a product of smaller matrices, which lets us extract the underlying principal components. This makes it possible to approximate a lower-dimensional representation of the data by choosing only a few principal components.
+Let's follow the instructions from the course material: standardize the variables first and then run PCA with prcomp().
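+As a brief aside (a hedged sketch, not part of the exercise), the SVD connection can be checked numerically: prcomp() is essentially an SVD of the centred data matrix, so the loadings from the two approaches should agree up to sign:
+# compare prcomp() loadings with the right singular vectors of the centred data
+X <- scale(human_, center = TRUE, scale = FALSE)   # centre only, as prcomp() does by default
+sv <- svd(X)
+pc <- prcomp(human_)
+all.equal(abs(unname(sv$v)), abs(unname(pc$rotation)))  # should return TRUE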
+# lets create `human_std` by standardizing the variables in `human`
+human_std <- scale(human_)
+
+# print out summaries of the standardized variables
+summary(human_std)
+## SeEdu_FM LFR_FM Life_Exp Exp_Edu
+## Min. :-2.8189 Min. :-2.6247 Min. :-2.7188 Min. :-2.7378
+## 1st Qu.:-0.5233 1st Qu.:-0.5484 1st Qu.:-0.6425 1st Qu.:-0.6782
+## Median : 0.3503 Median : 0.2316 Median : 0.3056 Median : 0.1140
+## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
+## 3rd Qu.: 0.5958 3rd Qu.: 0.7350 3rd Qu.: 0.6717 3rd Qu.: 0.7126
+## Max. : 2.6646 Max. : 1.6632 Max. : 1.4218 Max. : 2.4730
+## GNI MMR ABR X.PR
+## Min. :-0.9193 Min. :-0.6992 Min. :-1.1325 Min. :-1.8203
+## 1st Qu.:-0.7243 1st Qu.:-0.6496 1st Qu.:-0.8394 1st Qu.:-0.7409
+## Median :-0.3013 Median :-0.4726 Median :-0.3298 Median :-0.1403
+## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
+## 3rd Qu.: 0.3712 3rd Qu.: 0.1932 3rd Qu.: 0.6030 3rd Qu.: 0.6127
+## Max. : 5.6890 Max. : 4.4899 Max. : 3.8344 Max. : 3.1850
+# perform principal component analysis (with the SVD method)
+pca_human <- prcomp(human_std)
+pca_human
+## Standard deviations (1, .., p=8):
+## [1] 2.0708380 1.1397204 0.8750485 0.7788630 0.6619563 0.5363061 0.4589994
+## [8] 0.3222406
+##
+## Rotation (n x k) = (8 x 8):
+## PC1 PC2 PC3 PC4 PC5 PC6
+## SeEdu_FM -0.35664370 0.03796058 -0.24223089 0.62678110 -0.5983585 0.17713316
+## LFR_FM 0.05457785 0.72432726 -0.58428770 0.06199424 0.2625067 -0.03500707
+## Life_Exp -0.44372240 -0.02530473 0.10991305 -0.05834819 0.1628935 -0.42242796
+## Exp_Edu -0.42766720 0.13940571 -0.07340270 -0.07020294 0.1659678 -0.38606919
+## GNI -0.35048295 0.05060876 -0.20168779 -0.72727675 -0.4950306 0.11120305
+## MMR 0.43697098 0.14508727 -0.12522539 -0.25170614 -0.1800657 0.17370039
+## ABR 0.41126010 0.07708468 0.01968243 0.04986763 -0.4672068 -0.76056557
+## X.PR -0.08438558 0.65136866 0.72506309 0.01396293 -0.1523699 0.13749772
+## PC7 PC8
+## SeEdu_FM 0.05773644 0.16459453
+## LFR_FM -0.22729927 -0.07304568
+## Life_Exp -0.43406432 0.62737008
+## Exp_Edu 0.77962966 -0.05415984
+## GNI -0.13711838 -0.16961173
+## MMR 0.35380306 0.72193946
+## ABR -0.06897064 -0.14335186
+## X.PR 0.00568387 -0.02306476
+summary(pca_human)
+## Importance of components:
+## PC1 PC2 PC3 PC4 PC5 PC6 PC7
+## Standard deviation 2.0708 1.1397 0.87505 0.77886 0.66196 0.53631 0.45900
+## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595 0.02634
+## Cumulative Proportion 0.5361 0.6984 0.79413 0.86996 0.92473 0.96069 0.98702
+## PC8
+## Standard deviation 0.32224
+## Proportion of Variance 0.01298
+## Cumulative Proportion 1.00000
+# draw a biplot of the principal component representation and the original variables
+biplot(pca_human, choices = 1:2, cex = c(0.8, 1), col = c("darkred", "darkgreen"))
++From the biplot we can see the observations and the original variables (the arrows) on the plane spanned by the first two principal components, which together capture most of the variability in the data; the distribution of variability across the components looks reasonable.
+Summary of the results (proportions of variance rounded to two decimals):
+PC1 53.61 %, PC2 16.24 %, PC3 9.57 %, PC4 7.58 %, PC5 5.47 %, PC6 3.59 %, PC7 2.63 %, PC8 1.29 %.
+And the standard deviations (SD):
+PC1 2.0708, PC2 1.1397, PC3 0.87505, PC4 0.77886, PC5 0.66196, PC6 0.53631, PC7 0.45900, PC8 0.32224.
+PC1 therefore explains the most (53.61 %) and PC8 the least (1.29 %) of the variability in the data set.
+Based on the rotation table in the summary, the variables that contribute most to PC1 are Exp_Edu, GNI, Life_Exp and SeEdu_FM, which load in one direction, and MMR and ABR, which load in the opposite direction.
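+The percentages quoted above can also be pulled straight from the summary object instead of being read off the printout; a minimal sketch:
+# proportion of variance per component, as percentages rounded to two decimals
+round(100 * summary(pca_human)$importance["Proportion of Variance", ], 2)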
+The tea data come from the FactoMineR package and were collected with a questionnaire on tea: 300 individuals were asked how they drink tea (18 questions), what their perception of the product is (12 questions) and some personal details (4 questions).
+The FactoMineR package contains functions dedicated to multivariate exploratory data analysis. It includes, for example, (Multiple) Correspondence Analysis and Multiple Factor Analysis as well as PCA.
+In the next exercises we are going to use the tea dataset, which contains the answers of the questionnaire on tea consumption. Let's dwell in teas for a bit!
+tea <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv", stringsAsFactors = TRUE)
+view(tea)
+str(tea);dim(tea)
+## 'data.frame': 300 obs. of 36 variables:
+## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
+## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
+## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
+## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
+## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
+## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
+## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
+## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
+## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
+## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
+## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
+## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
+## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
+## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
+## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
+## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
+## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
+## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
+## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
+## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
+## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
+## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
+## $ age_Q : Factor w/ 5 levels "+60","15-24",..: 4 5 5 2 5 2 4 4 4 4 ...
+## $ frequency : Factor w/ 4 levels "+2/day","1 to 2/week",..: 3 3 1 3 1 3 4 2 1 1 ...
+## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
+## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
+## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
+## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
+## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
+## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
+## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
+## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
+## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
+## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
+## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
+## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
+## [1] 300 36
+colnames(tea)
+## [1] "breakfast" "tea.time" "evening" "lunch"
+## [5] "dinner" "always" "home" "work"
+## [9] "tearoom" "friends" "resto" "pub"
+## [13] "Tea" "How" "sugar" "how"
+## [17] "where" "price" "age" "sex"
+## [21] "SPC" "Sport" "age_Q" "frequency"
+## [25] "escape.exoticism" "spirituality" "healthy" "diuretic"
+## [29] "friendliness" "iron.absorption" "feminine" "sophisticated"
+## [33] "slimming" "exciting" "relaxing" "effect.on.health"
+summary(tea)
+## breakfast tea.time evening lunch
+## breakfast :144 Not.tea time:131 evening :103 lunch : 44
+## Not.breakfast:156 tea time :169 Not.evening:197 Not.lunch:256
+##
+##
+##
+##
+##
+## dinner always home work
+## dinner : 21 always :103 home :291 Not.work:213
+## Not.dinner:279 Not.always:197 Not.home: 9 work : 87
+##
+##
+##
+##
+##
+## tearoom friends resto pub
+## Not.tearoom:242 friends :196 Not.resto:221 Not.pub:237
+## tearoom : 58 Not.friends:104 resto : 79 pub : 63
+##
+##
+##
+##
+##
+## Tea How sugar how
+## black : 74 alone:195 No.sugar:155 tea bag :170
+## Earl Grey:193 lemon: 33 sugar :145 tea bag+unpackaged: 94
+## green : 33 milk : 63 unpackaged : 36
+## other: 9
+##
+##
+##
+## where price age sex
+## chain store :192 p_branded : 95 Min. :15.00 F:178
+## chain store+tea shop: 78 p_cheap : 7 1st Qu.:23.00 M:122
+## tea shop : 30 p_private label: 21 Median :32.00
+## p_unknown : 12 Mean :37.05
+## p_upscale : 53 3rd Qu.:48.00
+## p_variable :112 Max. :90.00
+##
+## SPC Sport age_Q frequency
+## employee :59 Not.sportsman:121 +60 :38 +2/day :127
+## middle :40 sportsman :179 15-24:92 1 to 2/week: 44
+## non-worker :64 25-34:69 1/day : 95
+## other worker:20 35-44:40 3 to 6/week: 34
+## senior :35 45-59:61
+## student :70
+## workman :12
+## escape.exoticism spirituality healthy
+## escape-exoticism :142 Not.spirituality:206 healthy :210
+## Not.escape-exoticism:158 spirituality : 94 Not.healthy: 90
+##
+##
+##
+##
+##
+## diuretic friendliness iron.absorption
+## diuretic :174 friendliness :242 iron absorption : 31
+## Not.diuretic:126 Not.friendliness: 58 Not.iron absorption:269
+##
+##
+##
+##
+##
+## feminine sophisticated slimming exciting
+## feminine :129 Not.sophisticated: 85 No.slimming:255 exciting :116
+## Not.feminine:171 sophisticated :215 slimming : 45 No.exciting:184
+##
+##
+##
+##
+##
+## relaxing effect.on.health
+## No.relaxing:113 effect on health : 66
+## relaxing :187 No.effect on health:234
+##
+##
+##
+##
+##
+# lets work with some variables
+keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")
+
+
+# select the 'keep_columns' to create a new data set
+tea_time <- dplyr::select(tea, all_of(keep_columns))
+
+# look at the summaries and structure of the data
+summary(tea_time)
+## Tea How how sugar
+## black : 74 alone:195 tea bag :170 No.sugar:155
+## Earl Grey:193 lemon: 33 tea bag+unpackaged: 94 sugar :145
+## green : 33 milk : 63 unpackaged : 36
+## other: 9
+## where lunch
+## chain store :192 lunch : 44
+## chain store+tea shop: 78 Not.lunch:256
+## tea shop : 30
+##
+# visualize the data set
+pivot_longer(tea_time, cols = everything()) %>%
+ ggplot(aes(value)) + facet_wrap("name", scales = "free", ncol=6) +
+ geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
+
+# multiple correspondence analysis
+# library(FactoMineR) is already loaded above, so it is not reloaded here
+
+mca <- MCA(tea_time, graph = FALSE)
+
+# summary of the model
+summary(mca)
+##
+## Call:
+## MCA(X = tea_time, graph = FALSE)
+##
+##
+## Eigenvalues
+## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
+## Variance 0.279 0.261 0.219 0.189 0.177 0.156 0.144
+## % of var. 15.238 14.232 11.964 10.333 9.667 8.519 7.841
+## Cumulative % of var. 15.238 29.471 41.435 51.768 61.434 69.953 77.794
+## Dim.8 Dim.9 Dim.10 Dim.11
+## Variance 0.141 0.117 0.087 0.062
+## % of var. 7.705 6.392 4.724 3.385
+## Cumulative % of var. 85.500 91.891 96.615 100.000
+##
+## Individuals (the 10 first)
+## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
+## 1 | -0.298 0.106 0.086 | -0.328 0.137 0.105 | -0.327
+## 2 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
+## 3 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
+## 4 | -0.530 0.335 0.460 | -0.318 0.129 0.166 | 0.211
+## 5 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
+## 6 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
+## 7 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
+## 8 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
+## 9 | 0.143 0.024 0.012 | 0.871 0.969 0.435 | -0.067
+## 10 | 0.476 0.271 0.140 | 0.687 0.604 0.291 | -0.650
+## ctr cos2
+## 1 0.163 0.104 |
+## 2 0.735 0.314 |
+## 3 0.062 0.069 |
+## 4 0.068 0.073 |
+## 5 0.062 0.069 |
+## 6 0.062 0.069 |
+## 7 0.062 0.069 |
+## 8 0.735 0.314 |
+## 9 0.007 0.003 |
+## 10 0.643 0.261 |
+##
+## Categories (the 10 first)
+## Dim.1 ctr cos2 v.test Dim.2 ctr cos2
+## black | 0.473 3.288 0.073 4.677 | 0.094 0.139 0.003
+## Earl Grey | -0.264 2.680 0.126 -6.137 | 0.123 0.626 0.027
+## green | 0.486 1.547 0.029 2.952 | -0.933 6.111 0.107
+## alone | -0.018 0.012 0.001 -0.418 | -0.262 2.841 0.127
+## lemon | 0.669 2.938 0.055 4.068 | 0.531 1.979 0.035
+## milk | -0.337 1.420 0.030 -3.002 | 0.272 0.990 0.020
+## other | 0.288 0.148 0.003 0.876 | 1.820 6.347 0.102
+## tea bag | -0.608 12.499 0.483 -12.023 | -0.351 4.459 0.161
+## tea bag+unpackaged | 0.350 2.289 0.056 4.088 | 1.024 20.968 0.478
+## unpackaged | 1.958 27.432 0.523 12.499 | -1.015 7.898 0.141
+## v.test Dim.3 ctr cos2 v.test
+## black 0.929 | -1.081 21.888 0.382 -10.692 |
+## Earl Grey 2.867 | 0.433 9.160 0.338 10.053 |
+## green -5.669 | -0.108 0.098 0.001 -0.659 |
+## alone -6.164 | -0.113 0.627 0.024 -2.655 |
+## lemon 3.226 | 1.329 14.771 0.218 8.081 |
+## milk 2.422 | 0.013 0.003 0.000 0.116 |
+## other 5.534 | -2.524 14.526 0.197 -7.676 |
+## tea bag -6.941 | -0.065 0.183 0.006 -1.287 |
+## tea bag+unpackaged 11.956 | 0.019 0.009 0.000 0.226 |
+## unpackaged -6.482 | 0.257 0.602 0.009 1.640 |
+##
+## Categorical variables (eta2)
+## Dim.1 Dim.2 Dim.3
+## Tea | 0.126 0.108 0.410 |
+## How | 0.076 0.190 0.394 |
+## how | 0.708 0.522 0.010 |
+## sugar | 0.065 0.001 0.336 |
+## where | 0.702 0.681 0.055 |
+## lunch | 0.000 0.064 0.111 |
+# visualize MCA
+plot(mca, invisible=c("ind"), graph.type = "classic", habillage = "quali")
+
+mca
+## **Results of the Multiple Correspondence Analysis (MCA)**
+## The analysis was performed on 300 individuals, described by 6 variables
+## *The results are available in the following objects:
+##
+## name description
+## 1 "$eig" "eigenvalues"
+## 2 "$var" "results for the variables"
+## 3 "$var$coord" "coord. of the categories"
+## 4 "$var$cos2" "cos2 for the categories"
+## 5 "$var$contrib" "contributions of the categories"
+## 6 "$var$v.test" "v-test for the categories"
+## 7 "$ind" "results for the individuals"
+## 8 "$ind$coord" "coord. for the individuals"
+## 9 "$ind$cos2" "cos2 for the individuals"
+## 10 "$ind$contrib" "contributions of the individuals"
+## 11 "$call" "intermediate results"
+## 12 "$call$marge.col" "weights of columns"
+## 13 "$call$marge.li" "weights of rows"
+plotellipses(mca)
++I have only included a selected set of variables here. From the factor map, the variable where separates clearly: chain store and tea shop lie far apart on the first dimension. Likewise for How, the categories milk, alone and other (undefined) are spread out, and for how, tea bag and unpackaged tea are clearly distinguished from each other.
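+To see which categories actually drive the first dimensions, rather than eyeballing the map, the contribution table stored in the MCA result can be sorted; a minimal sketch using the "$var$contrib" object listed above:
+# contributions (%) of the categories, sorted by their contribution to dimension 1
+contrib <- mca$var$contrib
+round(contrib[order(-contrib[, 1]), 1:2], 2)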
+After working hard with multivariate, mostly exploratory, even +heuristic techniques that are common in data science, the last topic of +IODS course will take us back in the task of building statistical +models.
+The new challenge is that the data will include two types of +dependencies simultaneously: In addition to the correlated variables +that we have faced with all models and methods so far, the observations +of the data will also be correlated with each other.
+Let's start this week's session.
+Usually, we can assume that the observations are not correlated - +instead, they are assumed to be independent of each other. However, in +longitudinal settings this assumption seldom holds, +because we have multiple observations or measurements of the same +individuals. The concept of repeated measures highlights this phenomenon +that is actually quite typical in many applications. Both types of +dependencies (variables and observations) must be taken into account; +otherwise the models will be biased.
+ +To analyze this kind of data set, we will focus on a single class of methods, called linear mixed effects models, which can cope quite nicely with the setting described above.
+Before we consider two examples of mixed models, namely the random +intercept model and the random intercept and slope model, we will learn +how to wrangle longitudinal data in wide form and long form, take a look +at some graphical displays of longitudinal data, and try a simple +summary measure approach that may sometimes provide a useful first step +in these challenges. In passing, we “violently” apply the usual “fixed” +models (although we know that they are not the right choice here) in +order to compare the results and see the consequences of making invalid +assumptions.
+Load the packages first !!
+#load required packages
+library(ggplot2)
+library(corrplot)
+## corrplot 0.92 loaded
+library(tidyverse)
+## ── Attaching packages
+## ───────────────────────────────────────
+## tidyverse 1.3.2 ──
+## ✔ tibble 3.1.8 ✔ dplyr 1.0.9
+## ✔ tidyr 1.2.0 ✔ stringr 1.4.1
+## ✔ readr 2.1.2 ✔ forcats 0.5.2
+## ✔ purrr 0.3.4
+## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
+## ✖ dplyr::filter() masks stats::filter()
+## ✖ dplyr::lag() masks stats::lag()
+library(GGally)
+## Registered S3 method overwritten by 'GGally':
+## method from
+## +.gg ggplot2
+library(dplyr)
+library(stringr)
+library(psych)
+##
+## Attaching package: 'psych'
+##
+## The following objects are masked from 'package:ggplot2':
+##
+## %+%, alpha
+library(FactoMineR)
+library(lme4)
+## Loading required package: Matrix
+##
+## Attaching package: 'Matrix'
+##
+## The following objects are masked from 'package:tidyr':
+##
+## expand, pack, unpack
+That’s a lot of packages !!!
+Let's implement the analyses of Chapter 8 of MABS, using the R codes of Exercise Set 6 (Meet and Repeat: PART I) but with the RATS data (from Chapter 9 and Meet and Repeat: PART II), as instructed in Moodle.
+# read long format data of rats
+RATSL <- read.csv('RATSL.csv')
+
+# Lets convert categorical data to factors first
+## WE have ID and Group (Just like wrangling exercise)
+RATSL$ID <- factor(RATSL$ID)
+RATSL$Group <- factor(RATSL$Group)
+
+# glimpse and dimensions
+head(RATSL);dim(RATSL)
+## ID Group WD Weight Time
+## 1 1 1 WD1 240 1
+## 2 2 1 WD1 225 1
+## 3 3 1 WD1 245 1
+## 4 4 1 WD1 260 1
+## 5 5 1 WD1 255 1
+## 6 6 1 WD1 260 1
+## [1] 176 5
+str(RATSL)
+## 'data.frame': 176 obs. of 5 variables:
+## $ ID : Factor w/ 16 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
+## $ Group : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 2 2 ...
+## $ WD : chr "WD1" "WD1" "WD1" "WD1" ...
+## $ Weight: int 240 225 245 260 255 260 275 245 410 405 ...
+## $ Time : int 1 1 1 1 1 1 1 1 1 1 ...
+summary(RATSL)
+## ID Group WD Weight Time
+## 1 : 11 1:88 Length:176 Min. :225.0 Min. : 1.00
+## 2 : 11 2:44 Class :character 1st Qu.:267.0 1st Qu.:15.00
+## 3 : 11 3:44 Mode :character Median :344.5 Median :36.00
+## 4 : 11 Mean :384.5 Mean :33.55
+## 5 : 11 3rd Qu.:511.2 3rd Qu.:50.00
+## 6 : 11 Max. :628.0 Max. :64.00
+## (Other):110
+The data set contains observations on 16 rats, each weighed on 11 occasions over the study period. The rats are divided into 3 groups (8, 4 and 4 animals) based on the treatment (diet). Weight is the outcome variable in this longitudinal study, and the idea is to analyse how the weight develops in the three groups over time.
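+A quick sketch (my own addition) to confirm the group sizes mentioned above:
+# number of distinct rats per treatment group (expected: 8, 4 and 4)
+RATSL %>% distinct(ID, Group) %>% count(Group)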
+ggplot(RATSL, aes(x = Time, y = Weight, group = ID)) +
+ geom_line(aes(linetype = Group))+
+ scale_x_continuous(name = "Time (days)", breaks = seq(0, 60, 10))+
+ scale_y_continuous(name = "Weight (grams)")+
+ theme(legend.position = "top")
+
+# Draw the plot ##We plot the Weights of each rat by time and groups
+# Rats are divided into three groups
+ggplot(RATSL, aes(x = Time, y = Weight, linetype = Group, color = ID)) +
+ geom_line() +
+ scale_linetype_manual(values = rep(1:10, times=4)) +
+ facet_grid(. ~ Group, labeller = label_both) +
+ theme(legend.position = "none") +
+ scale_y_continuous(limits = c(min(RATSL$Weight), max(RATSL$Weight)))
++From the plot we can see that Group 1 has the most rats, and their weights are the lowest already at the starting point (time of recruitment). Group 2 shows the largest weight gain compared to baseline but has only 4 rats. Group 3 also has 4 rats, in almost the same weight range as Group 2, but their weight does not seem to increase as much as in Group 2.
+Rats with higher baseline values tend to have higher values throughout the study. This phenomenon is generally referred to as tracking.
+The tracking phenomenon can be seen more clearly in a plot of the +standardized values of each observation, i.e.,
+\[standardised(x) = \frac{x - mean(x)}{sd(x)}\]
+# Standardize the variable Weight within each time point (grouping by Time)
+RATSL <- RATSL %>%
+ group_by(Time) %>%
+ mutate(Weight_std = (scale(Weight))) %>%
+ ungroup()
+
+# Glimpse the data
+glimpse(RATSL)
+## Rows: 176
+## Columns: 6
+## $ ID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1, 2…
+## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1, 1,…
+## $ WD <chr> "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD…
+## $ Weight <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 555,…
+## $ Time <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8, 8,…
+## $ Weight_std <dbl[,1]> <matrix[26 x 1]>
+head(RATSL)
+## # A tibble: 6 × 6
+## ID Group WD Weight Time Weight_std[,1]
+## <fct> <fct> <chr> <int> <int> <dbl>
+## 1 1 1 WD1 240 1 -1.00
+## 2 2 1 WD1 225 1 -1.12
+## 3 3 1 WD1 245 1 -0.961
+## 4 4 1 WD1 260 1 -0.842
+## 5 5 1 WD1 255 1 -0.882
+## 6 6 1 WD1 260 1 -0.842
+# Plot again with the standardized weight in RATSL
+ggplot(RATSL, aes(x = Time, y = Weight_std, linetype = Group, color = ID)) +
+  geom_line() +
+  scale_linetype_manual(values = rep(1:10, times=4)) +
+  facet_grid(. ~ Group, labeller = label_both) +
+  theme(legend.position = "none") +
+  scale_y_continuous(name = "standardized Weight")
+
+The weight profiles of the groups look more comparable now, after the standardization.
+With large numbers of observations, graphical displays of individual response profiles are of little use, and investigators then commonly produce graphs showing the average (mean) profile for each treatment group, along with some indication of the variation of the observations at each time point; here we use the standard error of the mean:
+\[se = \frac{sd(x)}{\sqrt{n}}\]
+# Summary data with the mean and standard error of RATSL Weight by group and time
+RATSS <- RATSL %>%
+ group_by(Group, Time) %>%
+ summarise(mean = mean(Weight), se = (sd(Weight)/sqrt(length(Weight))) ) %>% #using formula above;
+ ungroup()
+## `summarise()` has grouped output by 'Group'. You can override using the
+## `.groups` argument.
+# Glimpse the data
+glimpse(RATSL)
+## Rows: 176
+## Columns: 6
+## $ ID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1, 2…
+## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1, 1,…
+## $ WD <chr> "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD…
+## $ Weight <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 555,…
+## $ Time <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8, 8,…
+## $ Weight_std <dbl[,1]> <matrix[26 x 1]>
+# Plot the mean profiles
+ggplot(RATSS, aes(x = Time, y = mean, color=Group, linetype = Group, shape = Group)) +
+ geom_line() +
+ scale_linetype_manual(values = c(1,2,3)) +
+ geom_point(size=3) +
+ scale_shape_manual(values = c(1,2,3)) +
+ geom_errorbar(aes(ymin=mean-se, ymax=mean+se, linetype="1"), width=0.3) +
+ theme(legend.position = "right") +
+ scale_y_continuous(name = "mean(Weight) +/- se(Weight)")
++From the plot we can see that the group mean profiles do not overlap with each other: Group 1 clearly differs from Groups 2 and 3. It is also clear that the mean weight of the rats increases over time, with the largest increase in Groups 2 and 3.
+Using the summary measure approach, we will look at the post-treatment values of the RATSL data set. The mean weight over the measurement days will be our summary measure for each rat, and we will plot boxplots of these means for each diet group, which is our treatment measure.
+# Create summary data by group and rat, with the mean weight over all measurement days as the summary variable (the filter Time > 0 keeps every day, since the RATS data has no day 0)
+RATSL8S <- RATSL %>%
+ filter(Time > 0) %>%
+ group_by(Group, ID) %>%
+ summarise(Weight_mean = mean(Weight)) %>%
+ ungroup()
+## `summarise()` has grouped output by 'Group'. You can override using the
+## `.groups` argument.
+# Glimpse the data
+glimpse(RATSL8S)
+## Rows: 16
+## Columns: 3
+## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3
+## $ ID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
+## $ Weight_mean <dbl> 261.0909, 237.6364, 260.1818, 266.5455, 269.4545, 274.7273…
+# Draw a boxplot of the mean versus treatment
+
+ggplot(RATSL8S, aes(x = Group, y = Weight_mean, color = Group)) +
+ geom_boxplot() +
+ stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white") +
+ scale_y_continuous(name = "mean(Weight), Days 1-60")
++From the box plot we can see that all three groups have outliers; Group 2 has a large one that skews its distribution. The next step is to identify and filter out the outliers.
+# define outlier from group 3
+g3 <- RATSL8S %>% filter(Group==3)
+out3 <- min(g3$Weight_mean)
+
+# Create a new data by filtering the outliers
+RATSL8S2 <- RATSL8S %>%
+ filter(250 < Weight_mean & Weight_mean < 560 & Weight_mean != out3)
+
+
+# Draw a boxplot of the mean versus diet
+ggplot(RATSL8S2, aes(x = Group, y = Weight_mean, col=Group)) +
+ geom_boxplot() +
+ stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white") +
+ scale_y_continuous(name = "mean(Weight) by days")
+
+### 6.7 T-test and ANOVA
+Although the informal graphical material presented up to now has indicated a lack of difference between the treatment groups, most investigators would still require a formal test for a difference. Consequently we shall now apply a t-test to assess any difference between the treatment groups, and also calculate a confidence interval for this difference, using the data without the outliers created above. The t-test confirms the lack of any evidence for a group difference, and the 95% confidence interval is wide and includes zero, allowing similar conclusions to be drawn. However, a t-test only tests for a statistical difference between two groups, and in the data set above we have 3 groups to compare, so we will use a more general test, ANOVA, which compares differences among multiple groups. ANOVA assumes homogeneity of variance: the variance in groups 1-3 should be similar.
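+For completeness, here is a hedged sketch of such a t-test (an illustration only, not part of the graded analysis): since a two-sample t-test is defined for two groups, it could be applied pairwise, for example to diet groups 2 and 3 using the outlier-filtered summary data.
+# pairwise t-test on the mean weights of groups 2 and 3
+g23 <- droplevels(dplyr::filter(RATSL8S2, Group %in% c("2", "3")))
+t.test(Weight_mean ~ Group, data = g23, var.equal = TRUE)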
+# Load the original wide form RATS data into its own object (so the long-format RATSL is not overwritten)
+RATS <- as_tibble(read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/rats.txt", header = TRUE, sep = '\t'))
+RATS$ID <- factor(RATS$ID)
+
+# Add the baseline weight (day 1, column WD1) from the original data as a new variable to the summary data
+join_vars <- c("ID", "WD1")
+RATSL8S3 <- RATSL8S2 %>%
+  left_join(RATS[join_vars], by = 'ID')
+# Rename column
+RATSL8S3 <- RATSL8S3 %>%
+ rename('Weight_baseline' = 'WD1')
+
+# Fit the linear model with the mean Weight as the response
+fit2 <- lm(Weight_mean ~ Weight_baseline + Group, data = RATSL8S3)
+
+# Compute the analysis of variance table for the fitted model with anova()
+anova(fit2)
+## Analysis of Variance Table
+##
+## Response: Weight_mean
+## Df Sum Sq Mean Sq F value Pr(>F)
+## Weight_baseline 1 174396 174396 7999.137 1.384e-14 ***
+## Group 2 1707 853 39.147 3.628e-05 ***
+## Residuals 9 196 22
+## ---
+## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
+Looking at the ANOVA table, the p-values are < 0.05 at the 0.05 significance level, so there seems to be a significant difference between the groups. The table does not, however, tell us which groups differ from which, i.e. we would need multiple comparisons. Data that follow a normal distribution are usually analysed with ANOVA followed by a Tukey test for multiple comparisons, whereas data that do not follow a normal distribution are analysed with a Kruskal-Wallis test followed by Dunn's test. Assuming our data are normally distributed, as we have done throughout this exercise, we could perform a Tukey test for the pairwise comparisons.
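+A minimal sketch of such a Tukey comparison (my own addition; the baseline covariate is left out to keep the example simple, so this is an illustration rather than the definitive analysis):
+# ANOVA on the group means followed by Tukey's HSD for the pairwise comparisons
+aov_fit <- aov(Weight_mean ~ Group, data = RATSL8S3)
+TukeyHSD(aov_fit, which = "Group")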
+Let's implement the analyses of Chapter 9 of MABS, using the R codes of Exercise Set 6 (Meet and Repeat: PART II), but with the BPRS data (from Chapter 8 and Meet and Repeat: PART I), as instructed in Moodle.
+The BPRS data contain brief psychiatric rating scale (BPRS) scores measured before treatment (week 0) and then weekly over 8 weeks of treatment. The patients (n = 40) were randomly assigned to treatment arm 1 or 2, and we are interested in whether the BPRS scores differ depending on the treatment received. A lower score means fewer symptoms.
+Let's load and explore the data first.
+BPRSL <- read.table("BPRSL.csv", header = T, sep=",")
+
+# Factor treatment & subject
+BPRSL$treatment <- factor(BPRSL$treatment)
+BPRSL$subject <- factor(BPRSL$subject)
+
+
+# Glimpse the data
+
+glimpse(BPRSL)
+## Rows: 360
+## Columns: 6
+## $ X <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
+## $ treatment <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
+## $ subject <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
+## $ weeks <chr> "week0", "week0", "week0", "week0", "week0", "week0", "week0…
+## $ bprs <int> 42, 58, 54, 55, 72, 48, 71, 30, 41, 57, 30, 55, 36, 38, 66, …
+## $ week <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
+str(BPRSL)
+## 'data.frame': 360 obs. of 6 variables:
+## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
+## $ treatment: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
+## $ subject : Factor w/ 20 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
+## $ weeks : chr "week0" "week0" "week0" "week0" ...
+## $ bprs : int 42 58 54 55 72 48 71 30 41 57 ...
+## $ week : int 0 0 0 0 0 0 0 0 0 0 ...
+summary(BPRSL)
+## X treatment subject weeks bprs
+## Min. : 1.00 1:180 1 : 18 Length:360 Min. :18.00
+## 1st Qu.: 90.75 2:180 2 : 18 Class :character 1st Qu.:27.00
+## Median :180.50 3 : 18 Mode :character Median :35.00
+## Mean :180.50 4 : 18 Mean :37.66
+## 3rd Qu.:270.25 5 : 18 3rd Qu.:43.00
+## Max. :360.00 6 : 18 Max. :95.00
+## (Other):252
+## week
+## Min. :0
+## 1st Qu.:2
+## Median :4
+## Mean :4
+## 3rd Qu.:6
+## Max. :8
+##
+glimpse(BPRSL)
+## Rows: 360
+## Columns: 6
+## $ X <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
+## $ treatment <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
+## $ subject <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
+## $ weeks <chr> "week0", "week0", "week0", "week0", "week0", "week0", "week0…
+## $ bprs <int> 42, 58, 54, 55, 72, 48, 71, 30, 41, 57, 30, 55, 36, 38, 66, …
+## $ week <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
+The BPRSL data set has 360 observations and 6 variables. From the glimpse we can see the 6 columns, in which the two treatment arms are coded 1 and 2. Subjects are coded from 1 to 20 within each arm, so the same subject number appears in both treatment groups and refers to different individuals; the 40 participants were randomized to treatment 1 or 2.
+#Lets plot the data
+ggplot(BPRSL, aes(x = week, y = bprs, linetype = subject)) +
+ geom_line() +
+ scale_linetype_manual(values = rep(1:10, times=4)) +
+ facet_grid(. ~ treatment, labeller = label_both) +
+ theme(legend.position = "none") +
+ scale_y_continuous(limits = c(min(BPRSL$bprs), max(BPRSL$bprs)))
++From the plot, the BPRS scores appear to decrease during the treatment period in both treatment arms. A clear difference between the groups cannot be established from the plots, however.
+# lets create a regression model for BPRSL
+BPRS_reg <- lm(bprs ~ week + treatment, data = BPRSL)
+
+# print summary of the model
+summary(BPRS_reg)
+##
+## Call:
+## lm(formula = bprs ~ week + treatment, data = BPRSL)
+##
+## Residuals:
+## Min 1Q Median 3Q Max
+## -22.454 -8.965 -3.196 7.002 50.244
+##
+## Coefficients:
+## Estimate Std. Error t value Pr(>|t|)
+## (Intercept) 46.4539 1.3670 33.982 <2e-16 ***
+## week -2.2704 0.2524 -8.995 <2e-16 ***
+## treatment2 0.5722 1.3034 0.439 0.661
+## ---
+## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
+##
+## Residual standard error: 12.37 on 357 degrees of freedom
+## Multiple R-squared: 0.1851, Adjusted R-squared: 0.1806
+## F-statistic: 40.55 on 2 and 357 DF, p-value: < 2.2e-16
+We have the BPRS score as our target variable, and time (week) and treatment (1 or 2) as our explanatory variables. From the model summary, the week variable is statistically significantly associated with BPRS, but the treatment variable is not (p = 0.661), so no significant difference in BPRS between the treatments can be seen. However, this analysis assumes independence of the observations, i.e. that repeated measurements on the same subject do not influence each other, which is not a rational assumption here. We therefore move on to a more appropriate analysis that treats the observations as dependent, fitting mixed-effects models with both fixed and random effects.
+#lets load the package
+library(lme4)
+
+# Create a random intercept model
+BPRS_ref <- lmer(bprs ~ week + treatment + (1 | subject), data = BPRSL, REML = FALSE)
+
+# Print the summary of the model
+summary(BPRS_ref)
+## Linear mixed model fit by maximum likelihood ['lmerMod']
+## Formula: bprs ~ week + treatment + (1 | subject)
+## Data: BPRSL
+##
+## AIC BIC logLik deviance df.resid
+## 2748.7 2768.1 -1369.4 2738.7 355
+##
+## Scaled residuals:
+## Min 1Q Median 3Q Max
+## -3.0481 -0.6749 -0.1361 0.4813 3.4855
+##
+## Random effects:
+## Groups Name Variance Std.Dev.
+## subject (Intercept) 47.41 6.885
+## Residual 104.21 10.208
+## Number of obs: 360, groups: subject, 20
+##
+## Fixed effects:
+## Estimate Std. Error t value
+## (Intercept) 46.4539 1.9090 24.334
+## week -2.2704 0.2084 -10.896
+## treatment2 0.5722 1.0761 0.532
+##
+## Correlation of Fixed Effects:
+## (Intr) week
+## week -0.437
+## treatment2 -0.282 0.000
+confint(BPRS_ref)
+## Computing profile confidence intervals ...
+## 2.5 % 97.5 %
+## .sig01 4.951451 10.019188
+## .sigma 9.486731 11.026604
+## (Intercept) 42.608796 50.298982
+## week -2.679990 -1.860844
+## treatment2 -1.542804 2.687248
+Let's first reflect on our BPRS data: the maximum and minimum scores are 95 and 18, respectively. Given this large variation between individuals, a subject-specific random intercept seems well justified, and the estimated subject-level variance (47.41) confirms clear baseline differences between subjects.
+Let’s fit the random intercept and random slope model to our +data:
+Fitting a random intercept and random slope model allows the linear regression fits for each individual to differ in intercept but also in slope. This allows us to account for the individual differences in the individuals symptom (BRPS score) profiles, but also the effect of time.
# create a random intercept and random slope model
+BPRS_ref1 <- lmer(bprs ~ week * treatment + (week | subject), data = BPRSL, REML = FALSE)
+
+# print a summary of the model
+
+summary(BPRS_ref1)
+## Linear mixed model fit by maximum likelihood ['lmerMod']
+## Formula: bprs ~ week * treatment + (week | subject)
+## Data: BPRSL
+##
+## AIC BIC logLik deviance df.resid
+## 2744.3 2775.4 -1364.1 2728.3 352
+##
+## Scaled residuals:
+## Min 1Q Median 3Q Max
+## -3.0512 -0.6271 -0.0768 0.5288 3.9260
+##
+## Random effects:
+## Groups Name Variance Std.Dev. Corr
+## subject (Intercept) 64.9964 8.0620
+## week 0.9687 0.9842 -0.51
+## Residual 96.4707 9.8220
+## Number of obs: 360, groups: subject, 20
+##
+## Fixed effects:
+## Estimate Std. Error t value
+## (Intercept) 47.8856 2.2521 21.262
+## week -2.6283 0.3589 -7.323
+## treatment2 -2.2911 1.9090 -1.200
+## week:treatment2 0.7158 0.4010 1.785
+##
+## Correlation of Fixed Effects:
+## (Intr) week trtmn2
+## week -0.650
+## treatment2 -0.424 0.469
+## wek:trtmnt2 0.356 -0.559 -0.840
+confint((BPRS_ref1))
+## Computing profile confidence intervals ...
+## Warning in FUN(X[[i]], ...): non-monotonic profile for .sig02
+## Warning in confint.thpr(pp, level = level, zeta = zeta): bad spline fit
+## for .sig02: falling back to linear interpolation
+## 2.5 % 97.5 %
+## .sig01 5.39069597 12.1613571
+## .sig02 -0.82694500 0.1404228
+## .sig03 0.43747319 1.6543265
+## .sigma 9.10743668 10.6349246
+## (Intercept) 43.31885100 52.4522602
+## week -3.34923718 -1.9074295
+## treatment2 -6.04400897 1.4617868
+## week:treatment2 -0.07243289 1.5040996
+# perform an ANOVA test on the two models to assess formal differences between them
+anova(BPRS_ref1, BPRS_ref)
+## Data: BPRSL
+## Models:
+## BPRS_ref: bprs ~ week + treatment + (1 | subject)
+## BPRS_ref1: bprs ~ week * treatment + (week | subject)
+## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
+## BPRS_ref 5 2748.7 2768.1 -1369.4 2738.7
+## BPRS_ref1 8 2744.3 2775.4 -1364.1 2728.3 10.443 3 0.01515 *
+## ---
+## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
+The ANOVA comparison is significant (p < 0.05): adding the random slope clearly improves the model fit, because it captures individual differences both in baseline level and in rate of change, while the treatment effect remains small over time. Note that BPRS_ref1 was already fitted with the week * treatment interaction, so the "interaction model" below (BPRS_ref2) has exactly the same formula; let's fit it anyway and look at the outcome.
+BPRS_ref2 <- lmer(bprs ~ week * treatment + (week | subject), data = BPRSL, REML = FALSE)
+
+# print a summary of the model
+
+summary(BPRS_ref2)
+## Linear mixed model fit by maximum likelihood ['lmerMod']
+## Formula: bprs ~ week * treatment + (week | subject)
+## Data: BPRSL
+##
+## AIC BIC logLik deviance df.resid
+## 2744.3 2775.4 -1364.1 2728.3 352
+##
+## Scaled residuals:
+## Min 1Q Median 3Q Max
+## -3.0512 -0.6271 -0.0768 0.5288 3.9260
+##
+## Random effects:
+## Groups Name Variance Std.Dev. Corr
+## subject (Intercept) 64.9964 8.0620
+## week 0.9687 0.9842 -0.51
+## Residual 96.4707 9.8220
+## Number of obs: 360, groups: subject, 20
+##
+## Fixed effects:
+## Estimate Std. Error t value
+## (Intercept) 47.8856 2.2521 21.262
+## week -2.6283 0.3589 -7.323
+## treatment2 -2.2911 1.9090 -1.200
+## week:treatment2 0.7158 0.4010 1.785
+##
+## Correlation of Fixed Effects:
+## (Intr) week trtmn2
+## week -0.650
+## treatment2 -0.424 0.469
+## wek:trtmnt2 0.356 -0.559 -0.840
+# perform an ANOVA test on the two models
+anova(BPRS_ref2, BPRS_ref1)
+## Data: BPRSL
+## Models:
+## BPRS_ref2: bprs ~ week * treatment + (week | subject)
+## BPRS_ref1: bprs ~ week * treatment + (week | subject)
+## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
+## BPRS_ref2 8 2744.3 2775.4 -1364.1 2728.3
+## BPRS_ref1 8 2744.3 2775.4 -1364.1 2728.3 0 0
+Finally, the two models and their outcomes are identical, and the ANOVA comparison confirms that there is no difference between them (a chi-square of 0). This is expected, since BPRS_ref2 has exactly the same formula as BPRS_ref1: the interaction term was already included in the first random intercept and slope model, so refitting it does not change the results or the significance levels.
+# draw the plot of BPRSL with the observed BPRS values
+ggplot(BPRSL, aes(x = week, y = bprs, group = subject, col= treatment)) +
+ geom_line(aes(linetype = treatment)) +
+ scale_x_continuous(name = "Time (weeks)", breaks = seq(0, 4, 8)) +
+ scale_y_continuous(name = "Observed BPRS") +
+ theme(legend.position = "right") +
+ facet_grid(. ~ treatment, labeller=label_both)
+
+# Create a vector of the fitted values
+Fitted <- fitted(BPRS_ref)
+Fitted1 <- fitted(BPRS_ref1)
+Fitted2 <- fitted(BPRS_ref2)
+
+# Create a new column fitted to BPRSL
+BPRSL <- BPRSL %>% mutate(bprs_fitted_values_BPRSL_ref = Fitted, bprs_fitted_values_BPRSL_ref1 = Fitted1, bprs_fitted_values_BPRSL_ref2 = Fitted2)
+head(BPRSL)
+## X treatment subject weeks bprs week bprs_fitted_values_BPRSL_ref
+## 1 1 1 1 week0 42 0 53.19460
+## 2 2 1 2 week0 58 0 43.04516
+## 3 3 1 3 week0 54 0 43.98584
+## 4 4 1 4 week0 55 0 49.13483
+## 5 5 1 5 week0 72 0 58.19506
+## 6 6 1 6 week0 48 0 41.51037
+## bprs_fitted_values_BPRSL_ref1 bprs_fitted_values_BPRSL_ref2
+## 1 49.24017 49.24017
+## 2 46.97380 46.97380
+## 3 47.65582 47.65582
+## 4 49.85313 49.85313
+## 5 66.39001 66.39001
+## 6 42.59363 42.59363
+ # draw the plot of BPRSL with the Fitted values of bprs model 1
+ggplot(BPRSL, aes(x = week, y = bprs_fitted_values_BPRSL_ref, group = subject, col=treatment)) +
+ geom_line(aes(linetype = treatment)) +
+ scale_x_continuous(name = "Time (weeks)", breaks = seq(0, 4, 8)) +
+ scale_y_continuous(name = "Fitted BPRS (model 1: rnd intercept)") +
+ theme(legend.position = "right") +
+ facet_grid(. ~ treatment, labeller=label_both)
+
+# draw the plot of BPRSL with the Fitted values of bprs model 2
+ggplot(BPRSL, aes(x = week, y = bprs_fitted_values_BPRSL_ref1, group = subject, col=treatment)) +
+ geom_line(aes(linetype = treatment)) +
+ scale_x_continuous(name = "Time (weeks)", breaks = seq(0, 4, 8)) +
+ scale_y_continuous(name = "Fitted BPRS (model 2: rnd intercept + slope)") +
+ theme(legend.position = "right") +
+ facet_grid(. ~ treatment, labeller=label_both)
++From the plots we can see that the fitted values of the random intercept model differ from those of the random intercept and slope model: the random intercept model forces the same slope on every subject, whereas the random intercept and slope model lets each subject's BPRS trajectory have its own slope and therefore follows the observed profiles more closely. Adding the random slope did not change the overall conclusion about treatment, however, and the fitted values of the last two models are identical because they are the same model.
+******************************************************************* +END +******************************************************************
+DONE AND DUSTED \(\large \surd\) !!!
+ +Well, not really!! There is so much more to learn. This course was, however, a great "push": I have truly enjoyed the exercise sessions, the course material and the review exercises. I have learned so much and I am looking forward to learning more!!
+Merry Christmas and a Happy New Year Everyone !!
+ +