---
title: "R Notebook"
output: html_notebook
---
```{r}
library(tidyverse)
library(GGally)
library(tidytext)
library(wordcloud2)
library(topicmodels)
```
| Field | Question |
|:-----------------------------------|:-----------------------------------|
| timestamp | Timestamp |
| q1 | How many hours did you sleep last night? |
| q2 | How many hours did you work (e.g. do homework, or a job) after you left school? |
| q3 | How many hours did you spend relaxing after you left school? |
| q4 | What did you do to relax last night? |
| q5 | How do you feel today? |
| q6 | Which best describes you (your answer should be the same every day) |
## Start by Labeling Your Data
- Labeling helps you understand your data
- Folks often overestimate how long it takes
- If outsourcing...
    - Start by labeling the data yourself
    - Have an odd number of people labeling
    - Come up with consistent rules (heuristics)
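
One reason to have an odd number of labelers is that a majority vote on a yes/no tag can never tie. The sketch below illustrates this with three hypothetical labelers and a single `sleep` tag; the `labels` data frame and its column names are made up for illustration and are not part of the survey data.

```{r}
library(tidyverse)

# Hypothetical labels from three labelers for the "sleep" tag (1 = tag applies)
labels <- tibble(
  response_id = c(1, 2, 3),
  labeler_a   = c(1, 0, 1),
  labeler_b   = c(1, 0, 0),
  labeler_c   = c(0, 0, 1)
)

# Count the votes per response and keep the tag when more than half agree
majority <- labels %>%
  pivot_longer(starts_with("labeler_"), names_to = "labeler", values_to = "label") %>%
  group_by(response_id) %>%
  summarise(votes = sum(label), n_labelers = n(), .groups = "drop") %>%
  mutate(sleep = as.integer(votes > n_labelers / 2))
majority
```

With an even number of labelers, `votes` could equal exactly half of `n_labelers`, and a separate tie-breaking rule would be needed.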
```{r}
# Use set_names to name columns
survey_df <- read_csv("survey_results.csv") %>%
  set_names(c(
    "timestamp",
    "q1",
    "q2",
    "q3",
    "q4",
    "q5",
    "q6",
    "watch_shows_movies",
    "sleep",
    "chat",
    "food",
    "friends",
    "video_games",
    "listen_music",
    "book"
  ))
survey_df
```
```{r}
survey_df %>% select(timestamp)
```
## Plot Bar Chart of Common Tags
```{r}
# Select label columns
# Use colSums to get sum of each column
# Use stack to get "long" format of column sums
survey_df %>%
  select(
    "watch_shows_movies",
    "sleep",
    "chat",
    "food",
    "friends",
    "video_games",
    "listen_music",
    "book"
  ) %>%
  colSums() %>%
  stack() %>%
  ggplot() +
  geom_bar(aes(y = ind, x = values), stat = "identity")
```
## Plot Correlation Between Tags
```{r}
# Use select and ggcorr to plot correlations
survey_df %>%
  select(
    "watch_shows_movies",
    "sleep",
    "chat",
    "food",
    "friends",
    "video_games",
    "listen_music",
    "book"
  ) %>%
  ggcorr()
```
## Plot Word Counts
Each **document** is a response, and each **token** is a word.
```{r}
# Select the q4 column
# unnest_tokens separates each document into tokens
# count() the occurrences of each token
# ggsave() saves the last plot; set its width and height (inches by default)
survey_df %>%
  select(q4) %>%
  unnest_tokens(word, q4) %>%
  count(word) %>%
  ggplot() +
  geom_bar(aes(y = reorder(word, n), x = n), stat = "identity")
ggsave(
  "word_counts.png",
  width = 5,
  height = 10
)
```
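
To see what tokenization does to individual responses, here is a minimal sketch on two made-up answers. The `toy_df` data frame is hypothetical and only stands in for the `q4` column. By default `unnest_tokens()` lowercases the text and drops punctuation.

```{r}
library(tidyverse)
library(tidytext)

# Hypothetical responses, not taken from the survey data
toy_df <- tibble(q4 = c("Watched TV, then slept.", "Played video games"))

# Keep a document id so each token can be traced back to its response
toy_tokens <- toy_df %>%
  mutate(document = row_number()) %>%
  unnest_tokens(word, q4)
toy_tokens
```

Each row of `toy_tokens` is one token, and the `document` column records which response it came from.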
## Word Clouds (and why they are not as good)
How to make one: <https://r-graph-gallery.com/196-the-wordcloud2-library.html>

Warning about their drawbacks: <https://www.data-to-viz.com/graph/wordcloud.html>
```{r}
survey_df %>%
  select(q4) %>%
  unnest_tokens(word, q4) %>%
  count(word) %>%
  wordcloud2()
```
## Topic Modelling
<https://www.tidytextmining.com/topicmodeling.html>
```{r}
# Give each response a document id, then count words per document
survey_lda <- survey_df %>%
  select(q4) %>%
  mutate(document = row_number()) %>%
  unnest_tokens(word, q4) %>%
  count(document, word) %>%
  cast_dtm(term = word, document = document, value = n) %>%
  LDA(k = 2, control = list(seed = 1234))
# beta is the per-topic, per-word probability
survey_topics <- tidy(survey_lda, matrix = "beta")
survey_topics
```
```{r}
# Keep the 10 most probable terms in each topic
survey_top_terms <- survey_topics %>%
  group_by(topic) %>%
  slice_max(beta, n = 10) %>%
  ungroup() %>%
  arrange(topic, -beta)
survey_top_terms %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  ggplot(aes(beta, term, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  scale_y_reordered()
```
```{r}
# A log2 ratio above 0 means the term is more probable in topic 2
beta_wide <- survey_topics %>%
  mutate(topic = paste0("topic", topic)) %>%
  pivot_wider(names_from = topic, values_from = beta) %>%
  filter(topic1 > .001 | topic2 > .001) %>%
  mutate(log_ratio = log2(topic2 / topic1))
beta_wide %>%
  ggplot() +
  geom_bar(aes(y = reorder(term, log_ratio), x = log_ratio), stat = "identity")
ggsave(
  filename = "two_topic_model.png",
  width = 5,
  height = 10
)
```
- How do you determine the right number of topics?
- How do you determine what the topics are?
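
For the first question, one possible starting point (not the only one) is to fit a model for several values of `k` and compare their perplexity, where lower is better. The sketch below uses made-up responses in `toy_df` as a stand-in for the `q4` column, and the range `2:4` is an arbitrary assumption. It scores each model on its own training data, which tends to favour larger `k`, so held-out responses would give a fairer comparison.

```{r}
library(tidyverse)
library(tidytext)
library(topicmodels)

# Hypothetical toy responses standing in for the q4 column
toy_df <- tibble(
  document = 1:6,
  q4 = c(
    "watched a movie and a show",
    "watched a show with friends",
    "played video games with friends",
    "played video games and listened to music",
    "read a book and slept",
    "slept and listened to music"
  )
)

# Build a document-term matrix of word counts per response
toy_dtm <- toy_df %>%
  unnest_tokens(word, q4) %>%
  count(document, word) %>%
  cast_dtm(document = document, term = word, value = n)

# Fit one model per candidate k and record its perplexity (lower is better)
k_values <- 2:4
perplexities <- map_dbl(k_values, function(k) {
  fit <- LDA(toy_dtm, k = k, control = list(seed = 1234))
  perplexity(fit, newdata = toy_dtm)
})
tibble(k = k_values, perplexity = perplexities)
```

Perplexity only measures fit. Whether the resulting topics make sense still has to be judged by reading the top terms for each topic, as in the plots above.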