clustering and classification #5

sanaakadi · 2020-11-17T22:05:01Z

MASS, corrplot, tidyr and Boston dataset are available

calculate the correlation matrix and round it

cor_matrix<-cor(Boston) %>% round(digits = 2)

print the correlation matrix

cor_matrix
         crim    zn indus  chas   nox    rm   age   dis   rad   tax ptratio
crim     1.00 -0.20  0.41 -0.06  0.42 -0.22  0.35 -0.38  0.63  0.58    0.29
zn      -0.20  1.00 -0.53 -0.04 -0.52  0.31 -0.57  0.66 -0.31 -0.31   -0.39
indus    0.41 -0.53  1.00  0.06  0.76 -0.39  0.64 -0.71  0.60  0.72    0.38
chas    -0.06 -0.04  0.06  1.00  0.09  0.09  0.09 -0.10 -0.01 -0.04   -0.12
nox      0.42 -0.52  0.76  0.09  1.00 -0.30  0.73 -0.77  0.61  0.67    0.19
rm      -0.22  0.31 -0.39  0.09 -0.30  1.00 -0.24  0.21 -0.21 -0.29   -0.36
age      0.35 -0.57  0.64  0.09  0.73 -0.24  1.00 -0.75  0.46  0.51    0.26
dis     -0.38  0.66 -0.71 -0.10 -0.77  0.21 -0.75  1.00 -0.49 -0.53   -0.23
rad      0.63 -0.31  0.60 -0.01  0.61 -0.21  0.46 -0.49  1.00  0.91    0.46
tax      0.58 -0.31  0.72 -0.04  0.67 -0.29  0.51 -0.53  0.91  1.00    0.46
ptratio  0.29 -0.39  0.38 -0.12  0.19 -0.36  0.26 -0.23  0.46  0.46    1.00
black   -0.39  0.18 -0.36  0.05 -0.38  0.13 -0.27  0.29 -0.44 -0.44   -0.18
lstat    0.46 -0.41  0.60 -0.05  0.59 -0.61  0.60 -0.50  0.49  0.54    0.37
medv    -0.39  0.36 -0.48  0.18 -0.43  0.70 -0.38  0.25 -0.38 -0.47   -0.51
        black lstat  medv
crim    -0.39  0.46 -0.39
zn       0.18 -0.41  0.36
indus   -0.36  0.60 -0.48
chas     0.05 -0.05  0.18
nox     -0.38  0.59 -0.43
rm       0.13 -0.61  0.70
age     -0.27  0.60 -0.38
dis      0.29 -0.50  0.25
rad     -0.44  0.49 -0.38
tax     -0.44  0.54 -0.47
ptratio -0.18  0.37 -0.51
black    1.00 -0.37  0.33
lstat   -0.37  1.00 -0.74
medv     0.33 -0.74  1.00

visualize the correlation matrix

corrplot(cor_matrix, method="circle", type="upper", cl.pos="b", tl.pos="d", tl.cex = 0.6)
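A 14 x 14 matrix is hard to scan by eye, so it can help to list the variable pairs by the strength of their correlation. The following is a minimal sketch in base R, assuming only that the MASS package is installed; the names `upper` and `pairs_df` are illustrative and not part of the original exercise.

```r
library(MASS)  # provides the Boston data frame

cor_matrix <- round(cor(Boston), digits = 2)

# keep each pair once by blanking the diagonal and the lower triangle
upper <- cor_matrix
upper[lower.tri(upper, diag = TRUE)] <- NA

# flatten the matrix into a long table of variable pairs
pairs_df <- as.data.frame(as.table(upper), stringsAsFactors = FALSE)
names(pairs_df) <- c("var1", "var2", "r")
pairs_df <- pairs_df[!is.na(pairs_df$r), ]

# sort by absolute correlation, strongest first
pairs_df <- pairs_df[order(-abs(pairs_df$r)), ]
head(pairs_df, 5)
```

In the matrix printed above, the strongest pair is rad and tax (0.91), so that pair should appear first in the sorted table.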

# learning2014 is available

# print out the column names of the data
colnames(learning2014)
[1] "gender"   "Age"      "attitude" "deep"     "stra"     "surf"     "Points"

# change the name of the second column
colnames(learning2014)[2] <- "age"

# change the name of "Points" to "points"
colnames(learning2014)[7] <- "points"

# print out the new column names of the data
colnames(learning2014)
[1] "gender"   "age"      "attitude" "deep"     "stra"     "surf"     "points"
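Renaming by position works here, but it silently renames the wrong column if the column order ever changes. A possible alternative is to match on the old name instead. The sketch below uses a one-row toy data frame, `learning_toy`, whose values are made up; only the column names come from the output above.

```r
# toy stand-in for learning2014 with the same column names (values are made up)
learning_toy <- data.frame(
  gender = "F", Age = 53, attitude = 3.7,
  deep = 3.6, stra = 3.4, surf = 2.6, Points = 25
)

# rename by matching the old name rather than its position
names(learning_toy)[names(learning_toy) == "Age"] <- "age"
names(learning_toy)[names(learning_toy) == "Points"] <- "points"

colnames(learning_toy)
```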

# access the MASS package
library(MASS)

# load the data
data("Boston")

# explore the dataset
str(Boston)
'data.frame':	506 obs. of  14 variables:
 $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
 $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
 $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
 $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
 $ rm     : num  6.58 6.42 7.18 7 7.15 ...
 $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
 $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
 $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
 $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
 $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
 $ black  : num  397 397 393 395 397 ...
 $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
 $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

summary(Boston)
      crim                zn             indus            chas
 Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000
 1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000
 Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000
 Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917
 3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000
 Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000
      nox               rm             age              dis
 Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130
 1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100
 Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207
 Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795
 3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188
 Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127
      rad              tax           ptratio          black
 Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32
 1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38
 Median : 5.000   Median :330.0   Median :19.05   Median :391.44
 Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67
 3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23
 Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90
     lstat            medv
 Min.   : 1.73   Min.   : 5.00
 1st Qu.: 6.95   1st Qu.:17.02
 Median :11.36   Median :21.20
 Mean   :12.65   Mean   :22.53
 3rd Qu.:16.95   3rd Qu.:25.00
 Max.   :37.97   Max.   :50.00

# plot matrix of the variables
pairs(Boston)

Created a new RMarkdown file, saved it as an empty file named ‘chapter4.Rmd’, and included it as a child file in ‘index.Rmd’. Performed the following analysis in that file. Loaded the Boston data from the MASS package. Explored the structure and the dimensions of the data. Describe the dataset briefly, assuming the reader has no previous knowledge of it. Details about the Boston dataset can be seen for example here. Show a graphical overview of the data and show summaries of the variables in the data. Describe and interpret the outputs, commenting on the distributions of the variables and the relationships between them.

create_human.R

# tidyr package and human are available

# access the stringr package
library(stringr)

# look at the structure of the GNI column in 'human'
str(human$GNI)

# remove the commas from GNI and print out a numeric version of it
str_replace(human$GNI, pattern=",", replace ="") %>% as.numeric
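Note that `str_replace()` replaces only the first match in each string, which is enough when every value contains at most one comma. The base R sketch below illustrates the difference on hypothetical strings (`gni_chr` is made up and not taken from the `human` data): `sub()` behaves like `str_replace()`, and `gsub()` behaves like `str_replace_all()`.

```r
# hypothetical GNI-like strings, not taken from the 'human' data
gni_chr <- c("64,992", "1,234,567", "805")

# sub() replaces only the first comma, like stringr::str_replace()
first_only <- sub(",", "", gni_chr, fixed = TRUE)

# gsub() replaces every comma, like stringr::str_replace_all()
all_commas <- gsub(",", "", gni_chr, fixed = TRUE)

# every value converts cleanly once all commas are gone
as.numeric(all_commas)

# the value that still contains a comma cannot be parsed and becomes NA
suppressWarnings(as.numeric(first_only))
```

If the values could contain more than one thousands separator, replacing all matches is the safer choice.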

# dplyr is available

# read the RATS data
RATS <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/rats.txt", header = TRUE, sep = '\t')

# Factor variables ID and Group
RATS$ID <- factor(RATS$ID)
RATS$Group <- factor(RATS$Group)

# Glimpse the data
glimpse(RATS)
Observations: 16
Variables: 13
$ ID    <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
$ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3
$ WD1   <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 555, 4...
$ WD8   <int> 250, 230, 250, 255, 260, 265, 275, 255, 415, 420, 445, 560, 4...
$ WD15  <int> 255, 230, 250, 255, 255, 270, 260, 260, 425, 430, 450, 565, 4...
$ WD22  <int> 260, 232, 255, 265, 270, 275, 270, 268, 428, 440, 452, 580, 4...
$ WD29  <int> 262, 240, 262, 265, 270, 275, 273, 270, 438, 448, 455, 590, 4...
$ WD36  <int> 258, 240, 265, 268, 273, 277, 274, 265, 443, 460, 455, 597, 4...
$ WD43  <int> 266, 243, 267, 270, 274, 278, 276, 265, 442, 458, 451, 595, 4...
$ WD44  <int> 266, 244, 267, 272, 273, 278, 271, 267, 446, 464, 450, 595, 5...
$ WD50  <int> 265, 238, 264, 274, 276, 284, 282, 273, 456, 475, 462, 612, 5...
$ WD57  <int> 272, 247, 268, 273, 278, 279, 281, 274, 468, 484, 466, 618, 5...
$ WD64  <int> 278, 245, 269, 275, 280, 281, 284, 278, 478, 496, 472, 628, 5...

# dplyr, tidyr and RATSL are available

# Check the dimensions of the data
dim(RATSL)
[1] 176   5

# Plot the RATSL data
ggplot(RATSL, aes(x = Time, y = Weight, group = ID)) +
  geom_line(aes(linetype = Group)) +
  scale_x_continuous(name = "Time (days)", breaks = seq(0, 60, 10)) +
  scale_y_continuous(name = "Weight (grams)") +
  theme(legend.position = "top")
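The transcript jumps from the wide `RATS` table (16 rows, 13 columns) to the long `RATSL` table (176 rows, 5 columns) without showing the conversion. The sketch below shows one way such a reshape could be done in base R. It is an assumption about the step, not the code that produced `RATSL`: the column names `WD`, `Weight` and `Time` are inferred from the plot call and the reported dimensions, and `rats_wide` is a three-rat, two-timepoint subset typed in from the output above.

```r
# small subset of the wide data: rats 1, 2 and 9, timepoints WD1 and WD8
rats_wide <- data.frame(
  ID    = factor(c(1, 2, 9)),
  Group = factor(c(1, 1, 2)),
  WD1   = c(240L, 225L, 410L),
  WD8   = c(250L, 230L, 415L)
)

# names of the repeated-measure columns
wd_cols <- grep("^WD", names(rats_wide), value = TRUE)

# stack the WD columns: one row per rat per timepoint
rats_long <- data.frame(
  ID     = rep(rats_wide$ID, times = length(wd_cols)),
  Group  = rep(rats_wide$Group, times = length(wd_cols)),
  WD     = rep(wd_cols, each = nrow(rats_wide)),
  Weight = unlist(rats_wide[wd_cols], use.names = FALSE)
)

# extract the day number from the column label, e.g. "WD8" becomes 8
rats_long$Time <- as.integer(sub("^WD", "", rats_long$WD))

dim(rats_long)
```

With the full data, 16 rats times 11 timepoints gives the 176 rows reported by `dim(RATSL)`. The same reshape can also be written with tidyr, which the transcript lists as available.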

sanaakadi added 25 commits November 4, 2020 02:03

Update chapter1.Rmd

5df5afd

Update chapter1.Rmd

a63b64c

Update chapter2.Rmd

0c6f33c

Update chapter2.Rmd

04d1111

Update chapter2.Rmd

c250e0d

Update README.md

a1ca313

Update chapter1.Rmd

ea5503f

Update chapter1.Rmd

0da7fee

Update chapter2.Rmd

abbd234

Create jekyll.yml

40c201e

Create create_learning2014.R

a4ab8a3

Update README.md

c7e0827

Update create_learning2014.R

c2ff20e

Update README.md

552c485

Update README.md

bb9a961

Update README.md

4f907c8

Update README.md

036af75

create_human.R

Update README.md

6ac6bf0

Update README.md

812098e

Update README.md

4837296

jmleppal added a commit to jmleppal/IODS-project that referenced this pull request Nov 20, 2022

Assignment KimmoVehkalahti#5 done

7e5a009

jmleppal added a commit to jmleppal/IODS-project that referenced this pull request Nov 30, 2022

Assignment KimmoVehkalahti#5: Minor revision

68e64d7
