---
title: 'Corpus analysis: the document-term matrix'
output: pdf_document
author: Wouter van Atteveldt
---
```{r, echo=F}
head = function(...) knitr::kable(utils::head(...))
```
The most important object in frequency-based text analysis is the *document term matrix*.
This matrix contains the documents in the rows and terms (words) in the columns,
and each cell is the frequency of that term in that document.
In R, these matrices are provided by the `tm` (text mining) package.
Although this package provides many functions for loading and manipulating these matrices,
using them directly is relatively complicated.
Fortunately, the `RTextTools` package provides an easy function to create a document-term matrix from a vector or data frame of texts.
To create a document-term matrix from a simple data frame with a 'text' column, use the `create_matrix` function (with `removeStopwords=F` to make sure all words are kept):
```{r,message=F}
library(RTextTools)
input = data.frame(text=c("Chickens are birds", "The bird eats"))
m = create_matrix(input$text, removeStopwords=F)
```
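For comparison, the same matrix could also be built directly with the `tm` package. The following is only a minimal sketch (using tm's standard `Corpus` and `DocumentTermMatrix` functions; the control options shown are illustrative), which also shows why the `create_matrix` shortcut is convenient:
```{r, message=F, eval=F}
library(tm)
# Wrap the raw texts in a corpus object, then cast it to a document-term matrix
corpus = Corpus(VectorSource(as.character(input$text)))
m_tm = DocumentTermMatrix(corpus, control = list(stopwords = FALSE))
```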
We can inspect the resulting matrix `m` using regular R functions, for example to get the type of object and its dimensions:
```{r}
class(m)
dim(m)
m
```
So, `m` is a `DocumentTermMatrix`, which is derived from a `simple_triplet_matrix` as provided by the `slam` package.
Internally, document-term matrices are stored as a _sparse matrix_:
with real data, we can easily have hundreds of thousands of rows and columns, while the vast majority of cells will be zero (most words don't occur in most documents).
Storing this as a regular matrix would waste a lot of memory.
In a sparse matrix, only the non-zero entries are stored, as 'simple triplets' of (document, term, frequency).
As seen in the output of `dim`, our matrix has only 2 rows (documents) and 6 columns (unique words).
Since this is a fairly small matrix, we can visualize it using `as.matrix`, which converts the 'sparse' matrix into a regular matrix:
```{r}
as.matrix(m)
```
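To see the triplet representation itself, we can peek at the components of the underlying `simple_triplet_matrix` (a quick sketch; `i`, `j`, and `v` are the slots of the `slam` triplet structure):
```{r}
# Only the non-zero cells are stored: i = document index, j = term index, v = frequency
data.frame(doc = m$i, term = m$j, freq = m$v)
colnames(m)  # the terms, in the order referred to by the term index
```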
Stemming and stop word removal
-----
So, we can see that each word is kept as is.
We can reduce the size of the matrix by removing stop words and by stemming (reducing a word like 'chickens' to its base form or stem 'chicken');
see the `create_matrix` documentation for the full range of options:
```{r}
m = create_matrix(input$text, removeStopwords=T, stemWords=T, language='english')
dim(m)
as.matrix(m)
```
As you can see, the stop words (_the_ and _are_) are removed, while the two forms of _bird_ (_bird_ and _birds_) are merged into a single stem.
In RTextTools, the language for stemming and stop words can be given as a parameter, and the default is English.
Note that stemming works relatively well for English, but is less useful for more highly inflected languages such as Dutch or German.
An easy way to see the effects of the preprocessing is by looking at the colSums of a matrix,
which gives the total frequency of each term:
```{r}
colSums(as.matrix(m))
```
For more richly inflected languages like Dutch, the result is less promising:
```{r}
text = c("De kip eet", "De kippen hebben gegeten")
m = create_matrix(text, removeStopwords=T, stemWords=T, language="dutch")
colSums(as.matrix(m))
```
As you can see, _de_ and _hebben_ are correctly recognized as stop words, but _gegeten_ (eaten) and _kippen_ (chickens) get different stems than _eet_ (eat) and _kip_ (chicken). German gives similarly poor results.
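To check this yourself, here is a quick sketch of an analogous German example (the sentences are made up for illustration, and the exact stems depend on the Snowball stemmer used by `create_matrix`):
```{r}
text = c("Das Huhn isst", "Die Hühner haben gegessen")
m = create_matrix(text, removeStopwords=T, stemWords=T, language="german")
colSums(as.matrix(m))
```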
Loading and analysing a larger dataset from AmCAT
-----
If we want to move beyond stemming, one option is to use AmCAT to parse articles.
Before we can proceed, we need to save our AmCAT password (this is only needed once; please don't save the password in your file!)
and log on to AmCAT:
```{r, message=FALSE, eval=FALSE}
library(amcatr)
amcat.save.password("https://amcat.nl", "username", "password")
conn = amcat.connect("https://amcat.nl")
```
```{r, echo=FALSE, message=F}
library(amcatr)
conn = amcat.connect("https://amcat.nl")
```
Now we can upload articles from R using the `amcat.upload.articles` function, which we demonstrate here with a single article but which can also be used to upload many articles at once:
```{r}
articles = data.frame(text = "John is a great fan of chickens, and so is Mary", date="2001-01-01T00:00", headline="test")
aset = amcat.upload.articles(conn, project = 1, articleset="Test", medium="test",
text=articles$text, date=articles$date, headline=articles$headline)
```
And we can then lemmatize this article and download the results directly to R
using `amcat.gettokens`:
```{r}
tokens = amcat.gettokens(conn, project=1, articleset = aset, module = "corenlp_lemmatize")
head(tokens)
```
Here we can see that, for example, the word "is" gets the lemma "be".
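To check this more systematically, we could list all tokens whose lemma differs from the literal word; a sketch assuming the `word` and `lemma` column names shown in the output above:
```{r, eval=F}
# Tokens where lemmatization changed the form (column names assumed from the output above)
subset(tokens, as.character(word) != as.character(lemma), select = c(word, lemma))
```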
For a more serious application, we will use an existing article set: [set 16017](https://amcat.nl/navigator/projects/559/articlesets/16017/) in project 559, which contains the State of the Union speeches by Bush and Obama (each document is a single paragraph).
This data is available directly from the corpustools package:
```{r, message=F}
library(corpustools)
data(sotu)
nrow(sotu.tokens)
head(sotu.tokens, n=20)
```
As you can see, the result is similar to the ad-hoc lemmatized tokens, but now we have around 100 thousand tokens rather than just a handful.
We can create a document-term matrix using the dtm.create command from `corpustools`:
```{r, warning=FALSE}
dtm = dtm.create(documents=sotu.tokens$aid, terms=sotu.tokens$lemma, filter=sotu.tokens$pos1 %in% c("M","N"))
dtm
```
So, we now have a "sparse" matrix of almost 7,000 documents by more than 70,000 terms.
Sparse here means that only the non-zero entries are kept in memory,
because otherwise we would have to keep all of the roughly 400 million cells in memory (and this is a relatively small data set).
Thus, it might not be a good idea to use functions like `as.matrix` or `colSums` on such a matrix,
since these functions convert the sparse matrix into a regular matrix.
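To get a feeling for the difference, we can compare the size of the sparse object with a rough estimate of its dense equivalent (a sketch: a dense numeric matrix in R uses 8 bytes per cell):
```{r}
format(object.size(dtm), units = "Mb")  # memory used by the sparse representation
prod(dim(dtm)) * 8 / 1024^3             # rough size in GB if stored as a dense matrix
```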
The next section investigates a number of useful functions to deal with (sparse) document-term matrices.
Corpus analysis: word frequency
-----
What are the most frequent words in the corpus?
As shown above, we could use the built-in `colSums` function,
but this requires first casting the sparse matrix to a regular matrix,
which we want to avoid (even our relatively small dataset would have 400 million entries!).
However, we can use the `col_sums` function from the `slam` package, which provides the same functionality for sparse matrices:
```{r}
library(slam)
freq = col_sums(dtm)
# sort the list by reverse frequency using built-in order function:
freq = freq[order(-freq)]
head(freq, n=10)
```
As can be seen, the most frequent terms are America and recurring issues like jobs and taxes.
It can be useful to compute different metrics per term, such as term frequency, document frequency (in how many documents a term occurs), and tf-idf (term frequency * inverse document frequency, which gives high weight to terms that occur frequently, but only in a subset of the documents).
The function `term.statistics` from the `corpustools` package provides this functionality:
```{r}
terms = term.statistics(dtm)
terms = terms[order(-terms$termfreq), ]
head(terms, 10)
```
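For reference, a tf-idf score can also be approximated by hand from the sparse matrix. The following sketch uses one common formulation (corpustools may use a slightly different variant), computing the document frequency from the triplet slots:
```{r}
tf = col_sums(dtm)                       # total frequency of each term
df = tabulate(dtm$j, nbins = ncol(dtm))  # document frequency: number of documents containing each term
names(df) = colnames(dtm)
tfidf = tf * log2(nrow(dtm) / df)        # one common tf-idf weighting
head(sort(tfidf, decreasing = TRUE), n = 10)
```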
As you can see, for each word the total frequency and the relative document frequency are listed,
as well as some basic information on the number of characters and the occurrence of numerals or non-alphanumeric characters.
This allows us to create a 'common sense' filter to reduce the number of terms, for example removing all words containing a number or non-alphabetic character, and all short (`characters<=2`), infrequent (`termfreq<25`), and overly frequent (`reldocfreq>=.25`) words:
```{r}
subset = terms[!terms$number & !terms$nonalpha & terms$characters>2 & terms$termfreq>=25 & terms$reldocfreq<.25, ]
nrow(subset)
head(subset, n=10)
```
This seems to be a more useful set of words.
We now have about 8 thousand terms left of the original 72 thousand.
To create a new document-term matrix containing only these terms,
we can use the `dtm.filter` function from corpustools (normal matrix indexing on the columns would also work, as shown below):
```{r}
dtm_filtered = dtm.filter(dtm, terms=subset$term)
dim(dtm_filtered)
```
This yields a much more manageable dtm.
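As mentioned above, normal matrix indexing on the columns gives an equivalent result; a quick sketch:
```{r}
dtm_indexed = dtm[, colnames(dtm) %in% subset$term]
dim(dtm_indexed)
```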
As a bonus, we can use the `dtm.wordcloud` function in corpustools (which is a thin wrapper around the `wordcloud` package)
to visualize the top words as a word cloud:
```{r, warning=F, eval=F}
dtm.wordcloud(dtm_filtered)
```