-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathindex.Rmd
252 lines (182 loc) · 8.14 KB
/
index.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
---
title: "<br><small> Creating [LDAvis](https://cran.r-project.org/web/packages/LDAvis/index.html) - A Tutorial </small>"
subtitle: "<small> <br>[Marcin Kosiński](http://r-addict.com/About.html) </small>"
author: "<small><a href='https://r-addict.com'><i class='fa fa-comment'></i></a> <a href='https://stackoverflow.com/users/3857701'><i class='fa fa-stack-overflow'></i></a> <a href='https://github.com/MarcinKosinski'><i class='fa fa-github'></i></a> <a href='mailto:[email protected]'><i class='fa fa-envelope-o'></i></a></small><br>"
date: October 24, 2016
output:
revealjs::revealjs_presentation:
theme: black
highlight: pygments
self_contained: false
center: true
reveal_options:
slideNumber: true
---
```{r, include=FALSE}
# edit main page with file.edit(system.file("rmarkdown/templates/revealjs_presentation/resources/default.html",package="revealjs"))
# line 126
htmltools::tagList(rmarkdown::html_dependency_font_awesome())
library(knitr)
opts_chunk$set(
comment = "",
fig.width = 12,
message = FALSE,
warning = FALSE,
tidy.opts = list(
keep.blank.line = TRUE,
width.cutoff = 80
),
options(width = 80),
eval = FALSE
)
```
# Motivation
## Motivation
<small>
- Text mining is a new challenge for machine wandering practitioners.
- The increased interest in the text mining is caused by an augmentation of internet users and by rapid growth of the internet data which is said that *in 80% is a text data*.
- Extracting information from articles, news, posts and comments have became a desirable skill but what is even more needful are tools for text mining models diagnostics and visualizations.
- Such visualizations enable to better understand the insight from a model and provides an easy interface for presenting your research results to greater audience.
</small>
# LDA overview
## LDA overview
From [Wikipedia](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation)
<small>
> In natural language processing, latent Dirichlet allocation (LDA) is a generative statistical model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar. For example, if observations are words collected into documents, it posits that each document is a mixture of a small number of topics and that each word's creation is attributable to one of the document's topics.
</small>
# Data
## Show Case on R-Bloggers
<small>
For this presentation I have used articles from [R-Bloggers](http://r-bloggers.com/).
They can be downloaded from [this repository](https://github.com/MarcinKosinski/r-bloggers-harvesting).
The data harvesting process is explained in this post: [LDAvis Show Case on R-Bloggers](http://r-addict.com/2016/06/21/LDAvis-RBloggers.html)
```{r}
library(RSQLite)
db.conn <-
dbConnect(
dbDriver("SQLite"),
"r-bloggers.db"
)
posts <- dbGetQuery(db.conn,
"SELECT text from posts")
dbDisconnect(db.conn)
```
</small>
## Data preparation 1
<small>
Normally I would use `LDA()` function from [topicmodels](https://cran.r-project.org/web/packages/topicmodels/topicmodels.pdf) package to fit LDA model because the input can be of class `DocumentTermMatrix` which is an object from [tm](https://cran.r-project.org/web/packages/tm/vignettes/tm.pdf) package.
`DocumentTermMatrix` object is very convinient for working with text data ([check this Norbert Ryciak's post](http://www.rexamine.com/2014/06/text-mining-in-r-automatic-categorization-of-wikipedia-articles/)) because there exists `tm_map` function which can be applied to all documents for stop words removal, lowering capital letters and removal of words that did not occur in x % of documents.
I haven't seen `LDAvis` examples for models generated with topicsmodel package so we will use traditional approach to text processing.
The [stemming](https://en.wikipedia.org/wiki/Lemmatisation) and stopwords removal was performed during the data collection.
```{r}
## the following fragment of code in this section
## is motivated by
## http://cpsievert.github.io/LDAvis/reviews/reviews.html
# tokenize on space and output as a list:
doc.list <- strsplit(posts[, 1], "[[:space:]]+")
# compute the table of terms:
term.table <- table(unlist(doc.list))
term.table <- sort(term.table, decreasing = TRUE)
# remove terms that occur fewer than 5 times:
term.table <- term.table[term.table > 5]
vocab <- names(term.table)
```
</small>
## Data preparation 2
<small>
The `lda.collapsed.gibbs.sampler()` function from `lda` package has uncomfortable input format (regarding to `LDA()` from `topicmodels` package) so I basically used [cpsievert](https://github.com/cpsievert) snippets
```{r}
# now put the documents into the format required by the lda package:
get.terms <- function(x) {
index <- match(x, vocab)
index <- index[!is.na(index)]
rbind(as.integer(index - 1), as.integer(rep(1, length(index))))
}
documents <- lapply(doc.list, get.terms)
# Compute some statistics related to the data set:
D <- length(documents) # number of documents (3740)
W <- length(vocab) # number of terms in the vocab (18,536)
doc.length <- sapply(documents,
function(x) sum(x[2, ]))
# number of tokens per document [312, 288, 170, 436, 291, ...]
N <- sum(doc.length) # total number of tokens in the data (546,827)
term.frequency <- as.integer(term.table)
# frequencies of terms in the corpus [8939, 5544, 2411, 2410, 2143, ...]
```
</small>
# Fitting the Model
## Fitting the Model
<small>
From `lda` package documentation
> ... [this function] takes sparsely represented input documents, perform inference, and return point estimates of the latent parameters using the state at the last iteration of Gibbs sampling.
```{r}
# MCMC and model tuning parameters:
K <- 20
G <- 5000
alpha <- 0.02
eta <- 0.02
# Fit the model:
library(lda)
set.seed(456)
fit <- lda.collapsed.gibbs.sampler(
documents = documents, K = K,
vocab = vocab, num.iterations = G,
alpha = alpha, eta = eta,
initial = NULL, burnin = 0,
compute.log.likelihood = TRUE
)
```
</small>
## Getting the Model
<small>
```{r}
library(archivist)
saveToRepo(fit, repoDir = "../Museum")
```
The computations took very long, so in case you would like to
get model faster, I have archived my model on GitHub with the help of
[archivist](http://r-bloggers.com/r-hero-saves-backup-city-with-archivist-and-github/) package.
You can easily load this model to R with
```{r}
archivist::aread('MarcinKosinski/Museum/fa93abf0ff93a7f6f3f0c42b7935ad4d') -> fit
```
</small>
# LDAvis use case
## LDAvis use case
<small>
If you google out properly you'll wind out that LDAvis description is
> Tools to create an interactive web-based visualization of a topic model that has been fit to a corpus of text data using Latent Dirichlet Allocation (LDA). Given the estimated parameters of the topic model, it computes various summary statistics as input to an interactive visualization built with D3.js that is accessed via a browser. The goal is to help users interpret the topics in their LDA topic model.
[Detailed vignette about LDAvis input and output can be found here](https://cran.r-project.org/web/packages/LDAvis/vignettes/details.pdf).
</small>
## LDAvis Preparations
<small>
To visualize the result using LDAvis, we'll need estimates of the document-topic distributions, which we denote by the DxK matrix theta, and the set of topic-term distributions, which we denote by the K×W matrix phi.
```{r}
theta <- t(apply(fit$document_sums + alpha,
2,
function(x) x/sum(x)))
phi <- t(apply(t(fit$topics) + eta,
2,
function(x) x/sum(x)))
```
```{r}
library(LDAvis)
# create the JSON object to feed the visualization:
json <- createJSON(
phi = phi,
theta = theta,
doc.length = doc.length,
vocab = vocab,
term.frequency = term.frequency
)
serVis(json, out.dir = 'vis',
open.browser = FALSE)
```
</small>
## Result
<small>
The result is published under this link
[http://r-addict.com/r-bloggers-harvesting/](http://r-addict.com/r-bloggers-harvesting/)
where you can check Intertopic Distance Map (via multidimensional scaling)
and top N relevant terms for a topic.
</small>