---
title: "Quantitative Research Skills 2022, Assignment 2"
subtitle: "Factor Analysis"
author: "Rong Guang"
date: "Sep 27 2022"
output:
  html_document:
    theme: journal
    highlight: haddock
    toc: true
    toc_depth: 2
    number_sections: false
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
```
# Assignment 2: Factor Analysis
Begin working with this when you have completed the **Hands-On Exercise 2**.
*******************************************************************************
**Your task:**
Practice with 2-3 sets of variables (one at a time) in the USHS data, trying different numbers of factors (2-3), comparing orthogonal and oblique rotations, and thus going through ALL the phases required for FA (as instructed and shown in the **Hands-On-Exercise 2**).
It should be possible to analyse these particular sets of variables in the USHS data with FA:
(Some of them might be easier to interpret than others).
k21_1:k21_9, # various things in life (obs: originally wrong scale! "3" -> NA!), 9 variables
k22_1:k22_10, # how often (felt various things), 10 variables
k23:k34, # General Health Questionnaire (1-4), 12 variables
k46_1:k46_11, # on how many days/week eat various food/(11)skip lunch (0,...,7=every day), 11 variables
k95_1:k95_18, # feelings of studies (1:disagree,6:agree), 18 variables
**OBS:** If you use the k21* variables, remember to wrangle them first ("3" -> NA!). See Hands-On-Exercise 1b!
**Note:** With FA, the variables can have different directions (like the items 2 & 3 in k22*), even different scales.
*PERHAPS you could also try to combine variables across different measures: some items of k21 together with some items from k95, for instance!* (I have not tried to analyse that kind of combinations, but it should be perfectly possible, at least if the items would be somehow substantially related to each other. TRY IT!!)
What is also required in this **Assignment 2** is your **interpretation** (you don't have to be a subject-matter expert; I just encourage you to practice, as those phenomena should be easy enough to understand without special expertise). Interpreting your analyses is very important, and this course is a safe place to practice that part, too.
**So TRY to interpret the factors you find and give them describing NAMES.**
Finally, show that you can work with the new **factor score variables** in the original data: visualise those scores with histograms, box plots, scatter plots, using also some other variables of the USHS. You will find a lot of good R code for producing such visualisations in the earlier exercises.
*******************************************************************************
# Analysing the USHS data with Factor Analysis (step by step)
Use as many R code chunks and text areas as you need - it's now all up to you!
*Please choose only your BEST FACTOR ANALYSIS and show it here step by step!*
Start by reading in the data and converting the factors (gender and age), just like in the exercise.
## Environment and data preparation
**1. Package loading**
```{r}
library(tidyverse)    # a collection of useful packages
library(dplyr)        # data wrangling
library(psych)        # describing data and factor analysis
library(patchwork)    # plot layout
library(finalfit)     # well-tabulated descriptive and inferential statistics
library(naniar)       # visualising missing data; install.packages("naniar") if needed
library(broom)        # tidy tables
library(ggplot2)      # plotting
library(tidyr)        # reshaping data
library(GPArotation)  # factor rotations
```
**2. Data wrangling <br />**
**2.1 Load USHS data**
```{r}
# read raw data
USHS <- read.csv("daF3224e.csv", sep = ";")
# select a reduced set of variables into a new data set
USHSpart <- USHS %>%
  select(
    fsd_id,
    k1,            # age
    k2,            # gender (1=male, 2=female, 3=not known/unspecified), NA (missing)
    bv1,           # university (from register), see label info! (40 different codes)
    bv3,           # field of study (from register), various codes
    bv6,           # higher educ institution (0=uni applied sciences, 1=university)
    k7,            # learning disability
    k8_1:k8_4,     # current well-being
    k9_1:k9_30,    # symptoms (past month), skip 31 ("other") due to missingness
    k11a, k11b,    # height (cm) a=male, b=female
    k12a, k12b,    # weight (kg) a=male, b=female
    k13,           # what do you think of your weight? (1=under, 5=over)
    k14:k20,       # various questions (mostly yes/no)
    k21_1:k21_9,   # various things in life (obs: originally wrong scale! "3" -> NA!)
    k22_1:k22_10,  # how often (felt various things) (obs: 2 & 3 should be reversed! see quest.)
    k23:k34,       # General Health Questionnaire (1-4, directions already OK)
    k95_1:k95_18)  # feelings about studies (end of select!)
```
**2.2 Variable recreation and generation<br />**
  In this section, I first generated height, weight, age and gender variables. Since height and weight are better interpreted with reference to each other, I then generated a Body Mass Index (BMI) variable that merges the pair using the well-accepted formula:
$$
BMI=\frac{Weight (kg)}{Height (m)^{2}}
$$
  Next, I generated a new categorical variable by slicing BMI into underweight (<18.5), normal weight (18.5-25), overweight (25-30) and obese (>30). Previous studies have found that BMI is a good gauge of risk for diseases that can occur with more body fat: the higher our BMI, the higher our risk for conditions such as heart disease, high blood pressure, type 2 diabetes, gallstones, breathing problems, and certain cancers. As such, it would be interesting to explore its relation to Finnish students' general health, dietary status and feelings about their studies.
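  Before applying the formula to the data, it can be sanity-checked on a few made-up values (the numbers below are mine, not from the USHS data):

```{r}
# Made-up example values, purely to sanity-check the BMI formula and cut-points
height <- c(160, 175, 182, 168)  # cm
weight <- c(50, 70, 110, 60)     # kg
bmi <- weight / (height / 100)^2
bmi_cat <- cut(bmi, breaks = c(1, 18.5, 25, 30, 100),
               labels = c("Underweight", "Normal weight", "Overweight", "Obese"),
               include.lowest = TRUE)
data.frame(BMI = round(bmi, 1), category = bmi_cat)
```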
```{r}
# Recreate height, weight, age and gender variables
USHSv2 <- USHSpart %>%
  mutate(height = if_else(is.na(k11a), k11b, k11a),  # merge the sex-specific height variables into one
         weight = if_else(is.na(k12a), k12b, k12a),  # merge the sex-specific weight variables into one
         gender = k2 %>% factor() %>% fct_recode("Male" = "1",
                                                 "Female" = "2",
                                                 "Not known" = "3"),  # label gender levels
         age = k1 %>%
           factor() %>%
           fct_recode("19-21" = "1", "22-24" = "2", "25-27" = "3",
                      "28-30" = "4", "31-33" = "5", "33+" = "6"))  # label age levels
# Generate BMI (body mass index) and BMI.factor (4 categories)
USHSv2 <- USHSv2 %>%
  mutate(BMI = weight/(height/100)^2)  # apply the BMI formula
USHSv2 <- USHSv2 %>%
  mutate(BMI.factor = BMI %>%  # cut BMI into a factor (underweight: <18.5;
           # normal weight: 18.5-25; overweight: 25-30; obese: >30)
           cut(breaks = c(1, 18.5, 25, 30, 100), include.lowest = TRUE) %>%
           fct_recode("Underweight" = "[1,18.5]",
                      "Normal weight" = "(18.5,25]",
                      "Overweight" = "(25,30]",
                      "Obese" = "(30,100]") %>%
           ff_label("BMI ranges"))
USHSv2[, c(42:45)] <- list(NULL)  # drop k11a, k11b, k12a, k12b (columns 42-45), now merged into height/weight
USHSv2 %>%
  select(height, weight, age, gender, BMI, BMI.factor) %>%
  head() %>%
  knitr::kable()  # examine the data
unique(USHSv2$k21_1)  # check the raw values of k21_1 before cleansing
```
**2.3 Data cleansing <br />**
  In item k21 (k21_1 to k21_9), the choice "difficult to say" was assigned a value of 3, which could be misleading, since these are ordinal variables in which 3 would otherwise indicate a specific strength rather than a non-answer. In this section, this was corrected by converting 3 to NA.
```{r}
# have a look at the unique values in k21_1 through k21_9
USHSv2 %>%
  select(starts_with("k21")) %>%
  apply(2, function(x) unique(x)) %>%
  knitr::kable()
```
  Finding: the value 3 (which I want to convert to NA) is still present at this point.
```{r}
# convert the value 3 in k21_1 to k21_9 to NA
USHSv2 <- USHSv2 %>%
  mutate(across(starts_with("k21"), ~replace(., . == 3, NA)))
# check the unique values in k21_1 through k21_9 in the updated data set
USHSv2 %>%
  select(starts_with("k21")) %>%
  apply(2, function(x) unique(x)) %>%
  knitr::kable()
```
  Finding: 3s were successfully replaced.
**2.4 Data inspection <br />**
  In this section, missing values of all the variables were inspected.
```{r}
# have a look at the NAs in each column
ncol(USHSv2)  # the number of columns
# visualise the NAs; split into 4 plots for a clearer view
USHSv2 %>% select(1:25) %>% vis_miss()
USHSv2 %>% select(26:50) %>% vis_miss()
USHSv2 %>% select(51:75) %>% vis_miss()
USHSv2 %>% select(76:104) %>% vis_miss()
```
  Findings were that variables k1, k7, and several of k9_1~k9_30 suffered from quite a number of NAs, calling for a quantitative inspection.
```{r}
# a closer look at the identified variables
USHSv2 %>% select(k1, k7, k9_1:k9_30) %>% vis_miss()
# NA proportion for each of them
USHSv2 %>%
  select(k1, k7, k9_1:k9_30) %>%
  apply(2, function(x) sum(is.na(x)) / length(x)) %>%
  tidy()
```
  Findings were that k1 (6.7% NAs), k7 (4.1%), k9_1 (3.1%), k9_3 (3.0%), k9_5 (4.3%), k9_8 (6.5%), k9_10 (5.0%), k9_13 (5.1%), k9_24 (6.3%), k9_25 (6.5%) and k9_26 (5.9%) were acceptable in terms of NA proportion. The other variables had NA proportions above 7%, some reaching 12%, suggesting possible bias such as a floor effect. Analyses of k9 should therefore be conducted with extreme care.
**2.5 Variable selection <br />**
  In this section, the General Health Questionnaire variables k23~k34 were selected into a new data set USHSghq for factor analysis #1, and variables k95_1~k95_18 into a new data set USHSstu for factor analysis #2.
```{r}
# k23~k34 were selected into a new data set USHSghq for factor analysis #1
USHSghq <- USHSv2 %>% select(k23:k34)
USHSghq %>% head()
USHSghq %>% apply(2, function(x) unique(x))
# k95_1~k95_18 were selected into a new data set USHSstu for factor analysis #2
USHSstu <- USHSv2 %>% select(k95_1:k95_18)
USHSstu %>% apply(2, function(x) unique(x))
```
## Analysis 1: Factor analysis of general health questionnaire
**1. Analysis <br />**
**1.1 Descriptive statistics for variables <br />**
  In this section, descriptive statistics were computed for each variable of USHSghq.
```{r}
# have a look at the unique values in k23~k34
USHSghq %>%
  apply(2, function(x) unique(x)) %>%
  knitr::kable()
```
  Findings were that all variables contain the values 1, 2, 3, 4 and NA.
```{r}
# visualise the number of responses in each category for each item
longv1 <- USHSv2 %>%  # there are two response patterns in k23:k34; select
  # the pattern-one items and convert them to long format
  select(k24, k27, k28, k31, k32, k33) %>%
  pivot_longer(everything(), names_to = "item", values_to = "score")
p1 <- longv1 %>%  # bar charts of the counts of each pattern-one choice per item
  ggplot(aes(x = factor(score), fill = score)) +
  geom_bar() +
  facet_wrap(~item) +
  theme_minimal() +
  xlab("1=not at all; 2=no more than usual;
       3=rather more than usual; 4=much more than usual") +  # pattern-1 choices
  theme(legend.position = "none")
longv2 <- USHSv2 %>%  # select the pattern-two items and convert them to long format
  select(k23, k25, k26, k29, k30, k34) %>%
  pivot_longer(everything(), names_to = "item", values_to = "score")
p2 <- longv2 %>%  # bar charts of the counts of each pattern-two choice per item
  ggplot(aes(x = factor(score), fill = score)) +
  geom_bar() +
  facet_wrap(~item) +
  theme_minimal() +
  xlab("1=more so than usual; 2=same as usual;
       3=less so than usual; 4=much less than usual") +  # pattern-2 choices
  theme(legend.position = "none")
p1/p2
```
  The findings were that "not at all" and "no more than usual" were the two most frequent choices for items k24, k27, k28, k31, k32 and k33, and "same as usual" was the most frequent choice for items k23, k25, k26, k29, k30 and k34, indicating that most respondents were in a stable and healthy state.
```{r}
# Examine whether missing values depend on demographic variables
long3 <- USHSv2 %>%  # convert health-related variables to long format, NAs only
  select(fsd_id, gender, BMI.factor, k23:k34) %>%
  pivot_longer(k23:k34, names_to = "item", values_to = "score") %>%
  filter(is.na(score))
p1 <- long3 %>%  # stacked bar chart: share of NAs by gender for each item
  ggplot(aes(x = item, fill = gender)) +
  geom_bar(position = position_fill(reverse = TRUE)) +
  theme(legend.position = "none", axis.text = element_text(size = 6),
        plot.title = element_text(vjust = 1),
        axis.title.y = element_text(size = 9)) +
  labs(title = "Percentage of NAs by gender",
       y = "Percentage of NAs", x = "", subtitle = "uncorrected")
p2 <- long3 %>%  # stacked bar chart: number of NAs by gender for each item
  ggplot(aes(x = item, fill = gender)) +
  geom_bar(position = position_stack(reverse = TRUE)) +
  theme(legend.position = "none", axis.text = element_text(size = 6),
        axis.title.y = element_text(size = 9)) +
  labs(title = "Number of NAs by gender",
       y = "Number of NAs", x = "Item", subtitle = "uncorrected")
Igender <- long3 %>% count(item, gender)  # contingency table for item & gender
Igender_cor <- Igender %>%  # correct for sample size by dividing the male NA
  # counts by the male sample size (n = 1068) and the female NA counts by the
  # female sample size (n = 2022)
  mutate(score = case_when(gender == "Male" ~ n/1068,
                           gender == "Female" ~ n/2022,
                           is.na(gender) ~ n/20)) %>%
  filter(!is.na(gender))
Igender_cor
p3 <- Igender_cor %>%
  ggplot(aes(x = item, y = score, fill = gender)) +
  geom_col(position = position_fill()) +
  theme(axis.text = element_text(size = 5),
        axis.text.x = element_text(angle = 45)) +
  labs(title = "Percent of NAs by gender",
       subtitle = "corrected for sample size",
       y = "Corrected number of NAs", x = "Item") +
  guides(fill = guide_legend(title = "Gender"))
p1/p2|p3
```
  The findings were that far more NA answers came from females than from males (see the plots on the left). However, we should take into account that there were roughly twice as many female respondents as male ones. Hence, I corrected the counts by dividing the number of NAs in each subgroup by that subgroup's sample size (male and female), obtaining the graph corrected for sample size (see the plot on the right). It demonstrates that, although the proportion of females giving NA answers is still larger than that of males, the difference becomes much less pronounced. Still, attention should be paid to items k23 and k25, for which around 80% of NAs came from females, and to item k27, for which around 70% of NAs came from males, indicating that biases such as a floor effect might exist in these items.
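  The correction above is simply a per-group rate: the NA count of each subgroup divided by that subgroup's size. A minimal sketch with made-up NA counts (only the group sizes 1068 and 2022 are taken from the analysis above):

```{r}
# Made-up NA counts; group sizes are the male/female sample sizes used above
na_counts <- c(Male = 30, Female = 66)
group_n   <- c(Male = 1068, Female = 2022)
na_rate   <- na_counts / group_n  # NAs per respondent in each group
round(na_rate, 4)
```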
```{r}
p4 <- long3 %>%  # share of NAs by BMI category for each item
  ggplot(aes(x = item, fill = BMI.factor)) +
  geom_bar(position = position_fill(reverse = TRUE)) +
  ylab("Percent of NAs") +
  theme(legend.position = "none", axis.text = element_text(size = 6),
        axis.title.y = element_text(size = 9),
        plot.title = element_text(vjust = 1)) +
  labs(title = "Percent of NAs by BMI", x = "", subtitle = "uncorrected")
p5 <- long3 %>%  # number of NAs by BMI category for each item
  ggplot(aes(x = item, fill = BMI.factor)) +
  geom_bar(position = position_stack(reverse = TRUE)) +
  ylab("Number of NAs") +
  theme(legend.position = "none", axis.text = element_text(size = 6),
        axis.title.y = element_text(size = 9)) +
  labs(title = "Number of NAs by BMI", x = "Item", subtitle = "uncorrected")
long3 %>% count(BMI.factor)
IBMI <- long3 %>% count(item, BMI.factor)  # contingency table for item & BMI
IBMI <- IBMI %>% rename(score = n)
IBMI_cor <- IBMI %>%  # correct for sample size by dividing the NA count of
  # each BMI category by that category's number of respondents
  mutate(score_corrected = case_when(BMI.factor == "Normal weight" ~ score/118,
                                     BMI.factor == "Overweight" ~ score/39,
                                     BMI.factor == "Obese" ~ score/25,
                                     BMI.factor == "Underweight" ~ score/1,
                                     is.na(BMI.factor) ~ score/31))
IBMI_cor
p6 <- IBMI_cor %>%
  ggplot(aes(x = item, y = score_corrected, fill = BMI.factor)) +
  geom_col(position = position_fill(reverse = TRUE)) +
  theme(axis.text = element_text(size = 5),
        axis.text.x = element_text(angle = 45)) +
  labs(title = "Percent of NAs by BMI",
       subtitle = "corrected for sample size",
       y = "Corrected number of NAs", x = "Item") +
  guides(fill = guide_legend(title = "BMI"))
p4/p5|p6
```
  Respondents with normal weight gave the most NA answers (see the plots on the left). After sample-size correction (see the plot on the right), the NA answers became more equally distributed across the BMI groups. However, the observation is inconclusive because a large proportion of these respondents lacked BMI information.
```{r}
#descriptive statistics for general health questions.
describe(USHSghq)
```
  Findings were that the means and medians were very close to each other. Standard deviations varied more, indicating that some items had response patterns different from the others.
```{r}
longv4 <- USHSghq %>%  # convert health-related variables to long format
  pivot_longer(everything(), names_to = "Item", values_to = "Score")
longv4 %>% ggplot(aes(x = Score, fill = Score)) +
  geom_histogram(binwidth = 0.9) +
  facet_wrap(~Item)
```
  Findings were that the assumption of multivariate normality was severely violated (at least half of the variables were non-normal; see the histograms above). This indicated that I might have to exclude maximum likelihood as a factoring method, owing to its poor validity for non-normal data.
**1.2 Factorability of the items <br />**
  Next, the factorability of the items was examined, using several well-recognized criteria for the factorability of a correlation matrix.
```{r}
# correlation matrix to examine the factorability
c_matrix <- cor.plot(USHSghq, method = "spearman")
# prepare a version of USHSghq augmented with the categorical variables
USHSghq_more <- USHSghq %>%
  mutate(fsd_id = USHSv2$fsd_id, gender = USHSv2$gender,
         BMI.factor = USHSv2$BMI.factor, age = USHSv2$age) %>%
  select(fsd_id, everything())
```
  It was observed that all of the 12 items correlated at least 0.4 with at least one other item, suggesting good factorability.
```{r}
#KMO
KMO(c_matrix)
```
  The Kaiser-Meyer-Olkin measure of sampling adequacy was .93, indicating marvelous adequacy according to Kaiser (1975).
```{r}
cortest.bartlett(c_matrix, nrow(USHSghq))
```
  Bartlett’s test of sphericity was significant (χ2(66) = 15570.3, p < .05), so the null hypothesis that our correlation matrix equals an identity matrix was rejected, again suggesting good factorability.
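  For intuition, the statistic reported by `cortest.bartlett()` can be computed by hand from the determinant of the correlation matrix. Below is a sketch on a small made-up matrix (the helper function name is my own):

```{r}
# Bartlett's sphericity statistic: chi^2 = -((n-1) - (2p+5)/6) * ln|R|,
# with df = p(p-1)/2, where R is a p x p correlation matrix and n the sample size
bartlett_sphericity <- function(R, n) {
  p <- ncol(R)
  chisq <- -((n - 1) - (2 * p + 5) / 6) * log(det(R))
  df <- p * (p - 1) / 2
  list(chisq = chisq, df = df,
       p.value = pchisq(chisq, df, lower.tail = FALSE))
}
R <- matrix(c(1, .6, .5,
              .6, 1, .4,
              .5, .4, 1), nrow = 3)  # made-up correlation matrix
bartlett_sphericity(R, n = 100)
```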
**1.3 Estimating factor structure**
  Next, the number of factors was estimated. Several well-recognized criteria for exploring factor structure solution were used.
```{r}
# parallel analysis and scree plot
fa.parallel(USHSghq, fa = "fa", fm = "ml", plot = TRUE)
# very simple structure (VSS), Velicer MAP and BIC-type indices;
# n.obs is the number of respondents, not the number of variables
nfactors(c_matrix, n = 3, rotate = "oblimin", fm = "pa", n.obs = nrow(USHSghq))
```
  The parallel analysis and scree plot (showing 4 factors with eigenvalues above those of the simulated data), based on the correlation matrix, suggested a 4-factor structure, while the VSS favoured a uni-dimensional interpretation; this was further consolidated by Velicer's MAP and the SABIC, both suggesting a 1-factor solution.
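  The idea behind the parallel analysis can be sketched in base R (Horn's original eigenvalue form, run on synthetic one-factor data rather than USHSghq): retain as many factors as have observed eigenvalues exceeding the mean eigenvalues of random data of the same shape.

```{r}
# Horn's parallel analysis, eigenvalue version, on synthetic data
set.seed(1)
n <- 300; p <- 6
f <- rnorm(n)                                    # one latent factor
X <- sapply(1:p, function(i) 0.7 * f + rnorm(n)) # six indicators of that factor
obs_eig <- eigen(cor(X))$values                  # observed eigenvalues
sim_eig <- rowMeans(replicate(50,                # mean eigenvalues of 50 random data sets
                    eigen(cor(matrix(rnorm(n * p), n, p)))$values))
n_factors <- sum(obs_eig > sim_eig)              # factors beating chance
n_factors
```

With one latent factor driving all six indicators, only the first observed eigenvalue should beat its random counterpart.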
**1.4 Presetting parameters**
  I loaded the variables on 4 factors based on the result of the parallel analysis. I also tested a 3-factor solution for exercise purposes, albeit without mathematical support for doing so. Oblimin rotation was adopted because dimensions of health are usually highly correlated, and the hands-on exercise examining a 2-factor solution saw a correlation of 0.75 between the factors. I assumed that some latent construct, "general health", influences how people answer these questions, and hence principal axis factoring (PAF) was adopted instead of principal component analysis (PCA; PCA is, moreover, orthogonal by nature). Since the assumption of multivariate normality was severely violated, I also did not adopt maximum likelihood factoring.
**1.5 Factor analysis**
```{r}
# examine the 4-factor solution
fa4 <- fa(USHSghq, nfactors = 4, rotate = "oblimin", fm = "pa", scores = "regression")
# draw a diagram of the factor model (measurement model)
fa.diagram(fa4, cut = 0, digits = 2, main = "Factor Analysis, Oblimin rotation")
# print the results of the analysis
print(fa4, digits = 2, sort = TRUE)
print(fa4$loadings, cutoff = 0.3, sort = TRUE)
summary(fa4)
prop.table(fa4$values)  # proportion of variance explained
```
```{r}
# examine the 3-factor solution
fa3 <- fa(USHSghq, nfactors = 3, rotate = "oblimin", fm = "pa", scores = "regression")
# draw a diagram of the factor model (measurement model)
fa.diagram(fa3, cut = 0, digits = 2, main = "Factor Analysis, Oblimin rotation")
# print the results of the analysis
print(fa3, digits = 2, sort = TRUE)
print(fa3$loadings, cutoff = 0.3, sort = TRUE)
prop.table(fa3$values)  # proportion of variance explained
```
  Solutions with three and four factors were each examined using oblimin rotations of the factor loading matrix. Correlation coefficients among the factors in both solutions ranged from 0.48 to 0.68, supporting the appropriateness of oblimin rotation. Although the numbers of primary loadings were sufficient and roughly equal for both solutions, the four-factor solution, which explained 81.40% of the variance, was preferred because: (a) the hands-on exercise found that a 2-factor solution did not provide sufficient granularity for interpreting the structure; (b) the eigenvalues on the scree plot "level off" after four factors; and (c) the three-factor structure had an insufficient number of primary loadings and was difficult to interpret from a medical perspective.
**1.6 Conclusion of the factor analysis**
  In the four-factor structure I finally adopted, items k31, k32, k33 and k28 comprised the first factor, capturing "social well-being"; items k25, k26, k23 and k30 comprised the second factor, capturing "social engagement"; items k34 and k29 comprised the third factor, capturing "mental well-being"; and items k27 and k24 comprised the fourth factor, capturing "emotional well-being".
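  The assignment also asks for work with the factor score variables; with `scores = "regression"`, `fa4$scores` holds one column per factor, and those columns can be bound back onto the data and visualised. Below is a self-contained sketch on synthetic two-factor data, using base R's `factanal()` as a stand-in so that it runs without the USHS file:

```{r}
# Synthetic data with two latent factors, six indicators
set.seed(2)
n <- 500
g1 <- rnorm(n); g2 <- rnorm(n)                   # two latent factors
d <- data.frame(a1 = g1 + rnorm(n), a2 = g1 + rnorm(n), a3 = g1 + rnorm(n),
                b1 = g2 + rnorm(n), b2 = g2 + rnorm(n), b3 = g2 + rnorm(n))
fit <- factanal(d, factors = 2, scores = "regression")
d$f1 <- fit$scores[, 1]                          # attach factor-1 scores to the data
hist(d$f1, main = "Factor 1 scores", xlab = "Score")
boxplot(f1 ~ cut(a1, 3), data = d,               # scores by a grouping variable
        xlab = "a1 (binned)", ylab = "Factor 1 score")
```

The same pattern applies to the real analysis: bind `fa4$scores` onto USHSv2 and plot the scores against, say, gender or BMI.factor.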
## Analysis 2: Factor analysis of study burnout questionnaire
**1. Analysis <br />**
**1.1 Descriptive statistics for variables <br />**
  In this section, descriptive statistics were computed for each variable of USHSstu.
```{r}
# have a look at the unique values in k95_1~k95_18
USHSstu %>%
  apply(2, function(x) unique(x)) %>%
  knitr::kable()
```
```{r}
# zero-pad the variable names so they sort and display cleanly
USHSstu <- USHSstu %>%
  rename(k95_01 = k95_1, k95_02 = k95_2, k95_03 = k95_3,
         k95_04 = k95_4, k95_05 = k95_5, k95_06 = k95_6,
         k95_07 = k95_7, k95_08 = k95_8, k95_09 = k95_9)
# generate two long-format subsets: negative-wording items (01~09) and
# positive-wording items (10~18)
slongv1 <- USHSstu %>%
  select(k95_01:k95_09) %>%
  pivot_longer(everything(), names_to = "item", values_to = "score")
slongv2 <- USHSstu %>%
  select(k95_10:k95_18) %>%
  pivot_longer(everything(), names_to = "item", values_to = "score")
```
```{r}
# visualise the distribution of choices for the negative-wording items
slongv1 %>%
  ggplot(aes(x = factor(score), fill = score)) +
  geom_bar() +
  facet_wrap(~item) +
  theme_get() +
  xlab("1=Totally disagree; 2=Disagree; 3=Partly disagree; 4=Partly agree; 5=Agree; 6=Totally agree") +
  ggtitle("Distribution of choices for negative-wording items under k95") +
  theme(legend.position = "none",
        axis.title.x = element_text(size = 9, face = "bold"),
        axis.text.x = element_text(size = 6, vjust = 2),
        strip.text = element_text(size = 8, face = "bold", vjust = 0.5),
        axis.ticks = element_blank(),
        plot.title = element_text(face = "bold"))
```
  The findings were that "totally disagree" was the most frequent choice, followed by "disagree", indicating that most students were not experiencing study burnout in the dimensions these items cover. Besides, the distributions were strongly skewed, with most of the mass at the low end of the scale, which again indicated that care should be taken before choosing maximum likelihood factoring.
```{r}
# visualise the distribution of choices for the positive-wording items
slongv2 %>%
  ggplot(aes(x = factor(score), fill = score)) +
  geom_bar() +
  facet_wrap(~item) +
  theme_get() +
  xlab("1=Totally disagree; 2=Disagree; 3=Partly disagree; 4=Partly agree; 5=Agree; 6=Totally agree") +
  ggtitle("Distribution of choices for positive-wording items under k95") +
  theme(legend.position = "none",
        axis.title.x = element_text(size = 9, face = "bold"),
        axis.text.x = element_text(size = 6, vjust = 2),
        strip.text = element_text(size = 8, face = "bold", vjust = 0.5),
        axis.ticks = element_blank(),
        plot.title = element_text(face = "bold"))
```
  The findings were that "partly agree" was the most frequent choice, followed by "agree" and "partly disagree", indicating that most students were experiencing little to no study burnout in the dimensions these items cover. The possibility of a modesty bias towards weak agreement when facing positive descriptions should also be considered (this could explain why the negative-wording items 1~9 did not show such a pattern). Besides, the data were mostly normally distributed, indicating that positive wording might be better at eliciting a gradient of state changes.
```{r}
# Examine whether missing values depend on demographic variables
slongv3 <- USHSv2 %>%  # convert study-related variables to long format, NAs only
  select(fsd_id, gender, BMI.factor, k95_1:k95_18) %>%
  pivot_longer(k95_1:k95_18, names_to = "item", values_to = "score") %>%
  filter(is.na(score))
sp1 <- slongv3 %>%  # stacked bar chart: share of NAs by gender for each item
  ggplot(aes(x = item, fill = gender)) +
  geom_bar(position = position_fill(reverse = TRUE)) +
  theme(legend.position = "none", axis.text = element_text(size = 6),
        plot.title = element_text(vjust = 1),
        axis.title.y = element_text(size = 9),
        axis.text.x = element_text(angle = 45, vjust = 0.7)) +
  labs(title = "Percentage of NAs by gender",
       y = "Percentage of NAs", x = "", subtitle = "uncorrected")
sp2 <- slongv3 %>%  # stacked bar chart: number of NAs by gender for each item
  ggplot(aes(x = item, fill = gender)) +
  geom_bar(position = position_stack(reverse = TRUE)) +
  theme(legend.position = "none", axis.text = element_text(size = 6),
        axis.title.y = element_text(size = 9),
        axis.text.x = element_text(angle = 45, vjust = 0.7)) +
  labs(title = "Number of NAs by gender",
       y = "Number of NAs", x = "Item", subtitle = "uncorrected")
# sIgender: study-burnout item/gender contingency table
# sIgender_cor: the same, corrected for sample size
sIgender <- slongv3 %>% count(item, gender)
sIgender_cor <- sIgender %>%  # correct for sample size by dividing the male NA
  # counts by the male sample size (n = 1068) and the female NA counts by the
  # female sample size (n = 2022)
  mutate(score = case_when(gender == "Male" ~ n/1068,
                           gender == "Female" ~ n/2022,
                           is.na(gender) ~ n/20)) %>%
  filter(!is.na(gender))
sp3 <- sIgender_cor %>%
  ggplot(aes(x = item, y = score, fill = gender)) +
  geom_col(position = position_fill()) +
  theme(axis.text = element_text(size = 5),
        axis.text.x = element_text(angle = 45)) +
  labs(title = "Percent of NAs by gender",
       subtitle = "corrected for sample size",
       y = "Corrected number of NAs", x = "Item") +
  guides(fill = guide_legend(title = "Gender"))
sp1/sp2|sp3
```
  The findings were that far more NA answers came from females than from males (see the plots on the left). However, after dividing the number of NAs in each subgroup (male and female) by its sample size, the distribution of NAs became almost consistent across genders (see the plot on the right).
```{r}
sp4 <- slongv3 %>% ggplot(aes(x = item, fill = BMI.factor))+
geom_bar(position = position_fill(reverse = T))+
ylab("Percent of NAs")+
theme(legend.position = "none", axis.text = element_text(size = 6),
axis.title.y = element_text (size = 9),
plot.title = element_text (vjust = 1),
axis.text.x = element_text(angle =45, vjust = 0.7))+
labs(title = "Percent of NAs by BMI", x= "", subtitle = "uncorrected")
sp5 <- slongv3 %>% ggplot(aes(x = item, fill = BMI.factor))+
geom_bar(position = position_stack(reverse = T))+
ylab("Number of NAs")+
theme(legend.position = "none", axis.text = element_text(size = 6),
axis.title.y = element_text (size = 9),
axis.text.x = element_text (angle = 45, vjust = 0.7))+
labs(title = "Number of NAs by BMI", x = "Item", subtitle = "uncorrected")
slongv3 %>% count(BMI.factor)
sIBMI <- slongv3 %>% count(item, BMI.factor) #contingency table for item & BMI
sIBMI <- sIBMI %>% rename(score = n)
sIBMI_cor <- sIBMI %>% #correct for the effect of sample size by dividing the
#NA count of normal-weight respondents by that group's
#sample size (n = 118); the same idea applies to other levels.
mutate (score_corrected = case_when(BMI.factor == "Normal weight" ~ score/118,
BMI.factor == "Overweight" ~ score/39,
BMI.factor == "Obese" ~ score/25,
BMI.factor == "Underweight" ~ score/1,
is.na(BMI.factor) ~ score/31))
sp6 <- sIBMI_cor %>%
ggplot(aes(x = item, y = score_corrected, fill = BMI.factor))+
geom_col(position = position_fill(reverse = T))+
theme(axis.text = element_text(size = 5),
axis.text.x = element_text(angle = 45, vjust= 0.5))+
labs(title = "Percent of NAs by BMI",
subtitle ="corrected for sample size",
y = "Proportion of NAs", x = "Item") +
guides(fill = guide_legend(title = "BMI"))
sp4/sp5|sp6
```
  Respondents with normal weight gave the most NA answers (see the plots on the left). After the sample-size correction (see the plot on the right), the NA answers became more equally distributed across the BMI subcategories. However, this observation is inconclusive because a large proportion of respondents lacked BMI information. Also note that in the bar graph on the right, underweight respondents seemed to make up the majority of NAs for a couple of items, but I decided to ignore this since the number of underweight respondents was quite small (n = 20), so the correction overestimates their share.
```{r}
#descriptive statistics for general health questions.
describe(USHSstu, IQR = T)
```
  The medians of the positively and negatively worded items were closer to those of other items in the same wording group, and the IQRs were consistent within some subgroups of items but varied across subgroups, indicating that some items were filled out differently. The means varied less across items, but since half of the items departed from normality, the mean and standard deviation are less informative here than the median and IQR.
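As a quick illustration of why the median and IQR are preferred under non-normality, appending a small right tail to made-up Likert-style responses shifts the mean noticeably while the median stays put:

```{r}
set.seed(7)
x      <- sample(1:3, 200, replace = TRUE)  # made-up typical responses
x_tail <- c(x, rep(6, 20))                  # a few extreme responses appended
c(mean(x), median(x))
c(mean(x_tail), median(x_tail))  # the mean moves; the median barely does
```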
**1.2 Factorability of the items <br />**
  Next, the factorability of the items was examined. Several well-recognized criteria for the factorability of a correlation were used.
```{r}
#correlation to examine the factorability
sc_matrix <- cor.plot(USHSstu) # sc_matrix means study-burnout correlation matrix
#preparing HSHSstu data with categorical variables
USHSstu_more <- USHSstu %>% mutate (fsd_id = USHSv2$fsd_id, gender = USHSv2$gender, BMI.factor = USHSv2$BMI.factor, age = USHSv2$age)
```
  It was observed that all of the 12 items correlated at least 0.5 with at least one other item, suggesting good factorability.
```{r}
#KMO
KMO(sc_matrix)
```
  The Kaiser-Meyer-Olkin measure of sampling adequacy was .94, indicating marvelous adequacy according to Kaiser (1975).
```{r}
cortest.bartlett(sc_matrix, n = nrow(USHSstu))
```
  Bartlett’s test of sphericity was significant (χ2(66) = 36358.71, p < .05), so the null hypothesis that the correlation matrix equals an identity matrix was rejected, again indicating good factorability.
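Bartlett's statistic can also be reproduced from first principles: χ2 = -((n - 1) - (2p + 5)/6)·ln|R| with p(p - 1)/2 degrees of freedom, where R is the correlation matrix of p items from n respondents. A base-R sketch on simulated correlated items (the sizes are illustrative):

```{r}
set.seed(42)
n <- 300; p <- 4
f <- rnorm(n)                                       # a shared latent factor
X <- sapply(1:p, function(i) 0.7 * f + rnorm(n))    # p correlated items
R <- cor(X)
chi2 <- -((n - 1) - (2 * p + 5) / 6) * log(det(R))  # Bartlett's chi-square
df   <- p * (p - 1) / 2
pchisq(chi2, df, lower.tail = FALSE)  # p-value near 0: reject sphericity
```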
**1.3 Estimating factor structure **
  Next, the number of factors was estimated. Several well-recognized criteria for exploring factor structure solution were used.
```{r}
# Parallel analysis and scree plot
fa.parallel(USHSstu, fa = "fa", fm = "ml", plot = T)
# very simple structure and BIC
nfactors(sc_matrix, n = 5, rotate = "oblimin", fm = "pa", n.obs = nrow(USHSstu))
#vss(c_matrix,n=5, rotate = "oblimin", fm = "pa", n.obs=12)
```
  The parallel analysis and scree plot (four factors with eigenvalues above those of the simulated data) suggested a four-factor structure, while the VSS with complexity 1 favoured a two-factor interpretation, peaking at the second factor; this was echoed by the BIC, which reached its minimum at two factors. The VSS with complexity 2, however, recommended a three-factor solution, inconsistent with the others.
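The logic behind the parallel analysis can be sketched in base R: keep a factor only while its observed eigenvalue exceeds the corresponding mean eigenvalue from random data of the same dimensions (the one-factor toy data and sizes below are illustrative):

```{r}
set.seed(1)
n <- 500; p <- 12
f <- rnorm(n)
X <- sapply(1:p, function(i) 0.6 * f + rnorm(n))    # toy one-factor data
obs_eig  <- eigen(cor(X))$values
rand_eig <- rowMeans(replicate(100,
              eigen(cor(matrix(rnorm(n * p), n, p)))$values))
sum(obs_eig > rand_eig)   # number of factors suggested by the comparison
```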
**1.4 Presetting parameters **
  Based on the factor-structure exploration, solutions for two, three and four factors were each examined using varimax and oblimin rotations of the factor loading matrix. Both rotations were used since I could not presume the relationship between the dimensions of study burnout. I assumed that "study burnout" is not a latent construct influencing how people answer these questions; rather, it is the outcome of the states reflected by the items. Hence principal component analysis was adopted. Since the assumption of multivariate normality is essentially violated, I also did not adopt maximum likelihood factoring.
**1.4 Factor analysis **
*1.4.1 Examining 2-factor structure*
```{r}
# a) orthogonal rotation
sfa2_var <- fa(USHSstu, nfactors = 2, #sfa2_var means study factor analysis 2 factors by varimax
rotate = "varimax", fm = "ml", scores = "regression")
sfa2_var
fa.diagram(sfa2_var, cut = 0, digits =2,
main = "Factor Analysis, 2-factor structure, Orthogonal rotation")
#b) oblimin rotation
sfa2_obl <- fa(USHSstu, nfactors = 2, #sfa2_obl means study factor analysis 2 factors by oblimin
rotate = "oblimin", fm = "ml", scores = "regression")
sfa2_obl
fa.diagram(sfa2_obl, cut = 0, digits =2,
main = "Factor Analysis, 2-factor structure, Oblimin rotation")
```
  The correlation between the two factors was weak (r = -0.35), suggesting that oblimin rotation might be the theoretically proper choice. However, examining the items under each factor, the most plausible interpretation of this two-factor solution is the different directions of wording: factor 1 contained items 1-9, which were negatively worded, while items 10-18 under factor 2 were all positively worded. On face value these should of course correlate negatively, and the solution was not sufficiently convincing because it did not capture meaningful dimensions of study burnout.
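A quick way to test the wording-direction interpretation is to reverse-score the positively worded items and check that their correlations with the negatively worded ones flip sign; refitting the EFA on reverse-scored data should then dissolve a wording-only factor. A toy sketch (it assumes a 1-6 response scale, so reversing is 7 minus the score; the item values are made up):

```{r}
neg1 <- c(5, 4, 6, 2)   # made-up negatively worded item (high = more burnout)
pos1 <- c(2, 3, 1, 5)   # made-up positively worded item (high = less burnout)
cor(neg1, pos1)         # -> -1 before reversing
cor(neg1, 7 - pos1)     # -> 1 after reversing: items now point the same way
```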
```{r}
#results by varimax rotation
print(sfa2_var, digits = 2, sort = T)
print(sfa2_var$loadings,cutoff = 0.3, sort=TRUE)
summary(sfa2_var)
prop.table(sfa2_var$values) #how much variance explained
#results by oblimin rotation
print(sfa2_obl, digits = 2, sort = T)
print(sfa2_obl$loadings,cutoff = 0.3, sort=TRUE)
summary(sfa2_obl)
prop.table(sfa2_obl$values) #how much variance explained
```
  Cross-loading was very pronounced under both rotation methods (especially varimax). This suggests either removing some items or seeking another factor solution, and it echoes the fact that the two-factor structure did not provide a meaningful interpretation of the dimensions of study burnout.
*1.4.2 Examining 3-factor structure*
```{r}
# a) orthogonal rotation
sfa3_var <- fa(USHSstu, nfactors = 3, #sfa3_var means study factor analysis 3 factors by varimax
rotate = "varimax", fm = "ml", scores = "regression")
sfa3_var
fa.diagram(sfa3_var, cut = 0, digits =2,
main = "Factor Analysis, 3-factor structure, Orthogonal rotation")
#b) oblimin rotation
sfa3_obl <- fa(USHSstu, nfactors = 3, #sfa3_obl means study factor analysis 3 factors by oblimin
rotate = "oblimin", fm = "ml", scores = "regression")
sfa3_obl
fa.diagram(sfa3_obl, cut = 0, digits =2,
main = "Factor Analysis, 3-factor structure, Oblimin rotation")
```
  The correlations among the three factors were weak to moderate (0.18~0.55), not pointing to either rotation method mathematically, and the two methods produced very similar factorings. The interpretation of this three-factor structure was easier and more enlightening than the two-factor solution. On qualitative analysis, items 3, 5, 2, 6 and 8 captured the dimension of exhaustion (example: "I feel a lack of motivation in my studies and often think of giving up"); items 7, 1, 9 and 4 captured sense of pressure (example: "I often have feelings of inadequacy in my studies."); and the remaining items together captured sense of accomplishment (example: "I find my studies full of meaning and purpose.").
```{r}
#results by varimax rotation
print(sfa3_var, digits = 2, sort = T)
print(sfa3_var$loadings,cutoff = 0.3, sort=TRUE)
summary(sfa3_var)
prop.table(sfa3_var$values) #how much variance explained
#results by oblimin rotation
print(sfa3_obl, digits = 2, sort = T)
print(sfa3_obl$loadings,cutoff = 0.3, sort=TRUE)
summary(sfa3_obl)
prop.table(sfa3_obl$values) #how much variance explained
```
  Cross-loading was very pronounced under orthogonal rotation (seven cross-loading items, two of them loading on all three factors simultaneously), but under oblimin rotation it dropped to an acceptable level (two cross-loading items, with loadings of 0.31 vs 0.61 and 0.37 vs 0.41). This indicates that the oblimin solution should be preferred.
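The cross-loading tally can be made mechanical: count the items whose absolute loadings reach .30 on more than one factor. A sketch on a made-up loading matrix:

```{r}
L <- matrix(c(0.61, 0.31,    # cross-loading item
              0.75, 0.10,
              0.05, 0.80,
              0.41, 0.37),   # cross-loading item
            ncol = 2, byrow = TRUE,
            dimnames = list(paste0("item", 1:4), c("F1", "F2")))
sum(rowSums(abs(L) >= 0.30) > 1)  # -> 2 cross-loading items
```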
*1.4.3 Examining 4-factor structure*
```{r}
# a) orthogonal rotation
sfa4_var <- fa(USHSstu, nfactors = 4, #sfa4_var means study factor analysis 4 factors by varimax
rotate = "varimax", fm = "ml", scores = "regression")
sfa4_var
fa.diagram(sfa4_var, cut = 0, digits =2,
main = "Factor Analysis, 4-factor structure, Orthogonal rotation")
#b) oblimin rotation
sfa4_obl <- fa(USHSstu, nfactors = 4, #sfa4_obl means study factor analysis 4 factors by oblimin
rotate = "oblimin", fm = "ml", scores = "regression")
sfa4_obl
fa.diagram(sfa4_obl, cut = 0, digits =2,
main = "Factor Analysis, 4-factor structure, Oblimin rotation")
```
  The correlations among the four factors were weak to moderate (0.12~0.55), not pointing to either rotation method mathematically, and the two methods produced very similar factorings. However, under both rotations the variables still loaded on only three factors rather than the four specified, and the factoring was identical to that of the three-factor solution.
```{r}
#results by varimax rotation
print(sfa4_var, digits = 2, sort = T)
print(sfa4_var$loadings,cutoff = 0.3, sort=TRUE)
summary(sfa4_var)
prop.table(sfa4_var$values) #how much variance explained
#results by oblimin rotation
print(sfa4_obl, digits = 2, sort = T)
print(sfa4_obl$loadings,cutoff = 0.3, sort=TRUE)
summary(sfa4_obl)
prop.table(sfa4_obl$values) #how much variance explained
```
  Cross-loadings in this four-factor solution were very pronounced (worse than in the three-factor solution). Since it produced the same factoring as the three-factor solution, I did not examine it further.
Solutions for two, three and four factors were each examined using varimax and oblimin rotations of the factor loading matrix. Correlation coefficients among the factors in all solutions ranged from 0.12 to 0.55, not strongly favouring either rotation method. The three-factor solution with oblimin rotation, which explained 68.0% of the variance, was preferred because: (a) the two-factor solution was hardly interpretable from the perspective of study burnout; (b) although eigenvalues levelled off on the scree plot after four factors, the items in the four-factor analysis still loaded on only three factors; and (c) the two-factor and four-factor solutions were plagued by cross-loading.
**1.5 Conclusion of factor analysis**
In the 3-factor structure I finally adopted, items 3, 5, 2, 6 and 8 comprised the first factor, capturing exhaustion; items 7, 1, 9 and 4 comprised the second factor, capturing sense of pressure; and the remaining items comprised the third factor, capturing sense of accomplishment.
*******************************************************************************
**GOOOOOD JOB!!**
In the end, you should be ready to KNIT this document and SUBMIT the result (.html file) as your report of the **Assignment 2** in Moodle. I am completely sure that you learned again a lot! **GRRRRREAT!!!**
Discarded exploratory code (kept for reference only):
USHSghqv2 <- USHSv2 %>% select(k24:k24,k26,k28:k34)
c_matrix_v2 <- cor.plot(USHSghq)
fa.parallel(c_matrix_v2, fa = "fa", fm = "ml", plot = T)
nfactors(c_matrix_v2, n=5, rotate = "oblimin", fm = "pa", n.obs=12)
USHSv2
fa <- USHSv2 %>%
select (fsd_id, gender, BMI.factor, k23:k34)
fa_male <- fa %>% filter (gender == "Male") %>% select(k23:k34)
fa_female <- fa %>% filter (gender == "Female") %>% select(k23:k34)
fa_male_matrix <- cor.plot(fa_male)
fa_female_matrix <- cor.plot(fa_female)
fa.parallel(fa_male, fa = "fa", fm = "ml", plot =T)
fa.parallel(fa_female, fa = "fa", fm = "ml", plot =T)
nfactors(fa_male_matrix, n=4, rotate = "oblimin", fm ="pa", n.obs=12)
nfactors(fa_female_matrix, n=4, rotate = "oblimin", fm ="pa", n.obs=12)
####
USHSghq_no30 <- USHSghq %>% select(k23:k29,k31:k34)
c_matrix_v2 <- cor.plot(USHSghq_no30)
fa4_no30 <- fa(USHSghq_no30, nfactors = 3, rotate = "oblimin", fm = "pa", scores = "regression")
fa.diagram(fa4_no30, cut = 0, digits =2, main = "Factor Analysis, Oblimin rotation")
print(fa4_no30, digits = 2, sort = T)
print(fa4_no30$loadings, cutoff = 0.3, sort = TRUE)
prop.table(fa4_no30$values)