comparing.Rmd

---
title: "Comparing corpora"
author: "Wouter van Atteveldt"
date: "June 1, 2016"
output: pdf_document
---

```{r, echo=F}
head = function(...) knitr::kable(utils::head(...))
```

Comparing corpora
----

Another useful thing we can do is comparing two corpora: 
Which words or names are mentioned more in e.g. Bush' speeches than Obama's.

This uses functions from the corpustoools package, which you can install directly from github:
(you only need to do this once per computer)

```{r, eval=F}
install.packages("devtools")
devtools::install_github("kasperwelbers/corpus-tools")
```

For this handout, we will use the State of the Union speeches contained in the `corpustools` package,
and create a document term matrix (DTM) from all names and nouns in the speeches by Bush and Obama:

```{r, message=F}
library(corpustools)
data(sotu)
dtm = with(subset(sotu.tokens, pos1 %in% c("M", "N")),
           dtm.create(documents=aid, terms=lemma))
```

Now, we can create separate DTMs for Bush and Obama,
relying on the headline column in the metadata:

To do this, we split the dtm in separate dtm's for Bush and Obama.
For this, we select docment ids using the `headline` column in the metadata from `sotu.meta`, and then use the `dtm.filter` function:


```{r}
head(sotu.meta)
obama.docs = sotu.meta$id[sotu.meta$headline == "Barack Obama"]
dtm.obama = dtm.filter(dtm, documents=obama.docs)
bush.docs = sotu.meta$id[sotu.meta$headline == "George W. Bush"]
dtm.bush = dtm.filter(dtm, documents=bush.docs)
```

So how can we check which words are more frequent in Bush' speeches than in Obama's speeches?
The function `corpora.compare` provides this functionality, given two document-term matrices:

```{r}
cmp = corpora.compare(dtm.obama, dtm.bush)
cmp = cmp[order(cmp$over), ]
head(cmp)
```

For each term, this data frame contains the frequency in the 'x' and 'y' corpora (here, Obama and Bush).
Also, it gives the relative frequency in these corpora (normalizing for total corpus size)
and the overrepresentation in the 'x' corpus and the chi-squared value for that overrepresentation.
So, Bush used the word terrorist 105 times, while Obama used it only 13 times, and in relative terms Bush used it about four times as often, which is highly significant. 

Which words did Obama use most compared to Bush?

```{r}
cmp = cmp[order(cmp$over, decreasing=T), ]
head(cmp)
```

So, while Bush talks about freedom, war, and terror, Obama talks more about industry, banks and education. 

Let's make a word cloud of Obama' words, with size indicating chi-square overrepresentation:

```{r, warning=F}
obama = cmp[cmp$over > 1,]
dtm.wordcloud(terms = obama$term, freqs = obama$chi)
```

And the same for bush: (output not shown)

```{r, warning=F, eval=F}
bush = cmp[cmp$over < 1,]
dtm.wordcloud(terms = bush$term, freqs = bush$chi)
```

Note that the warnings given by these commands are relatively harmless: it means that some words are skipped because it couldn't find a good place for them in 
the word cloud. 

\newpage

Contrast plots
====

Finally, we can use `plotWords` in the `corpustools` package to make a wordcloud-style plot with Obama's word on the right
and Bush' words on the left:


```{r}
with(arrange(cmp, -chi)[1:100, ],
plotWords(x=log(over), words = term, wordfreq = chi, random.y = T))
```