---
title: "TM01_trump_tweets"
output:
  html_notebook:
    code_folding: hide
    number_sections: true
    fig_caption: yes
    highlight: zenburn
    theme: simplex
    toc: yes
editor_options:
  chunk_output_type: inline
---
# Libraries
* `tidyverse` is a collection of several useful packages.
* `tidytext` implements tidy data principles to make many text mining tasks easier and more effective (use `browseVignettes(package = "tidytext")` for more information).
* The `tidytext` package provides functions including:
    * `get_sentiments()` to get a tidy data frame of a single sentiment lexicon.
    * `cast_tdm()` to cast a data frame to a DocumentTermMatrix.
    * `unnest_tokens()` to tokenize sentences or text into words.
    * `bind_tf_idf()` to create tf and idf variables.
* A minimal sketch of this workflow appears right after the loading chunk below.
```{r loading libraries}
library(tidyverse)  # dplyr, ggplot2, tidyr, and friends
library(stringr)    # string matching and manipulation
library(tidytext)   # tidy text mining
```
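As referenced above, a minimal sketch of the tidytext workflow on a toy data frame (the `toy`, `doc`, and `text` names here are illustrative, not part of the original data):
```{r tidytext sketch}
# Two tiny "documents": tokenize, count, and weight by tf-idf.
toy <- tibble(doc = c("d1", "d2"),
              text = c("Make America great again",
                       "The failing media, so sad"))
toy %>%
  unnest_tokens(word, text) %>%  # one row per (doc, word), lowercased
  count(doc, word) %>%
  bind_tf_idf(word, doc, n)      # adds tf, idf, and tf_idf columns
```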
* Set global options first to ensure that character variables won't be converted to factors.
```{r global options}
options(stringsAsFactors = FALSE)
```
# Loading data
* 2015-06-16: announces candidacy
* 2016-05-03: effectively wins the in-party (primary) race, becoming the presumptive nominee
* 2016-09-16: 1st debate
* 2016-11-08: 2016 election day
* These four dates recur as red vertical lines in the plots below; see the sketch after this list.
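Since the milestone dates above are drawn as `geom_vline()` marks in several plots, it can help to compute them once (the `milestones` name is mine; the original chunks inline the dates and are left as written):
```{r election milestones}
# Candidacy announcement, presumptive nominee, 1st debate, election day
milestones <- as.numeric(as.POSIXct(c("2015-06-16", "2016-05-03",
                                      "2016-09-16", "2016-11-08")))
# A later plot could then use geom_vline(xintercept = milestones, ...)
```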
## varianceexplained's data
```{r}
# Archived Trump tweets accompanying David Robinson's varianceexplained post
load(url("http://varianceexplained.org/files/trump_tweets_df.rda"))
tweets <- trump_tweets_df
names(tweets)
```
## Trump's tweets over the 2016 election
```{r}
# load("data/alltweets.RData")
raw.df <- readRDS("data/alltweets.rds")
raw.df %>% summary()
filtered.df <- raw.df %>%
  filter(!str_detect(text, '^"')) %>%  # drop manual retweets, which start with a quote mark
  filter(timestamp > as.POSIXct("2014-12-01") &
           timestamp < as.POSIXct("2017-05-08"))
```
# Doc-Level analysis
* **Doc properties**
    * Number of documents over time.
    * nchar distribution, to check how document length compares with other social media.
* **User study** to identify who is involved most in the data.
    * Posting frequency
    * Posting activity over time
## Number of tweets over time
* Add four vertical lines to mark the different periods of the election; e.g., the first vline denotes the day the candidacy was announced.
* No notable difference in tweet volume was found across the periods of the election process.
```{r}
filtered.df %>%
  ggplot(aes(timestamp)) +
  geom_histogram(bins = 120) +
  geom_vline(xintercept = as.numeric(as.POSIXct(c("2015-06-16", "2016-05-03",
                                                  "2016-09-16", "2016-11-08"))),
             color = "red", alpha = 0.5)
# The same series as weekly counts
filtered.df %>%
  mutate(weeks = cut(timestamp, breaks = "week")) %>%
  count(weeks) %>%
  ggplot(aes(as.Date(weeks), n)) +
  geom_col()
```
## by hour
```{r}
library(lubridate)
filtered.df %>%
  mutate(hm = hour(timestamp) + minute(timestamp) / 60) %>%  # time of day in hours
  ggplot(aes(timestamp, hm)) +
  geom_point(color = "royalblue", alpha = 0.1, shape = 15) +
  geom_vline(xintercept = as.numeric(as.POSIXct(c("2015-06-16", "2016-05-03",
                                                  "2016-09-16", "2016-11-08"))),
             color = "red", alpha = 0.5)
```
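The hour-of-day pattern depends on the timezone the timestamps are stored in. A sketch converting to US Eastern time before plotting, assuming the timestamps are UTC (an assumption, not something the data documents here):
```{r hours eastern}
# Convert to US Eastern time before extracting the hour of day
filtered.df %>%
  mutate(est = with_tz(timestamp, "America/New_York"),
         hm = hour(est) + minute(est) / 60) %>%
  ggplot(aes(est, hm)) +
  geom_point(color = "royalblue", alpha = 0.1, shape = 15)
```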
## nchar distribution
> ... We want every person around the world to easily express themselves on Twitter, so we're doing something new: we're going to try out a longer limit, 280 characters, in languages impacted by cramming (which is all except Japanese, Chinese, and Korean). (Twitter blog announcement of the 280-character limit, September 2017)
```{r}
filtered.df %>%
  mutate(nchar = nchar(text)) %>%
  # select(text, nchar, everything()) %>%
  count(nchar) %>%
  mutate(highfreq = ifelse(n > quantile(n, 0.9),
                           "high", "other")) %>%  # highlight the top decile of lengths
  ggplot(aes(nchar, n, fill = highfreq)) +
  geom_col() +
  xlab("number of characters") +
  scale_fill_manual(values = c("high" = "tomato",
                               "other" = "gray"), guide = FALSE)
```
# Word level analysis
## unnest text to words
```{r}
data(stop_words)
unnested.df <- filtered.df %>%
  # strip t.co links and the "&amp;" HTML entity before tokenizing
  mutate(text = str_replace_all(text,
                                "https://t.co/[A-Za-z\\d]+|&amp;", "")) %>%
  # unnest_tokens(word, text, drop = FALSE) %>%
  # the custom pattern keeps #hashtags, @mentions, and apostrophes intact
  unnest_tokens(word, text,
                token = "regex", pattern = "[^A-Za-z\\d#@']",
                drop = FALSE)
# %>%
#   anti_join(stop_words)
```
## hot words
```{r}
unnested.df %>%
  count(word) %>%
  mutate(word = reorder(word, n)) %>%
  top_n(50, wt = n) %>%
  # flag @mentions and #hashtags
  mutate(ismedia = ifelse(str_detect(word, "@.*|#.*"),
                          "tag",
                          "other")) %>%
  ggplot(aes(word, n, fill = ismedia)) +
  geom_col() +
  coord_flip() +
  scale_fill_manual(values = c("tag" = "tomato",
                               "other" = "gray"),
                    guide = FALSE)
```
## frequency of words
* The distribution follows a power law (a few words occur very often; the great majority occur only rarely). Zipf's law says that the frequency of a word is inversely proportional to its rank, f ∝ 1/r. To make the plot more readable, both axes can be put on a log scale.
```{r}
unnested.df %>%
  count(word) %>%  # frequency of each word
  count(n) %>%     # number of words at each frequency
  ggplot(aes(n, nn)) +
  geom_point(color = "royalblue", alpha = 0.5) +
  ggtitle("Word frequency distribution") +
  xlab("word frequency") + ylab("number of words") +
  # theme(plot.title = element_text(hjust = 0.5)) +
  scale_x_log10() +
  scale_y_log10()
```
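To check Zipf's law directly, a quick sketch regressing log frequency on log rank; if f ∝ 1/r holds, the fitted slope should be close to -1 (the `zipf` name is mine, not from the original code):
```{r zipf check}
# Rank words by frequency and fit a line on the log-log scale
zipf <- unnested.df %>%
  count(word, sort = TRUE) %>%
  mutate(rank = row_number())
lm(log10(n) ~ log10(rank), data = zipf)
zipf %>%
  ggplot(aes(rank, n)) +
  geom_line(color = "royalblue") +
  scale_x_log10() +
  scale_y_log10()
```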
## n words per tweet
* Does Trump tend to use fewer words per tweet than other politicians?
```{r}
unnested.df %>%
  count(id_str) %>%
  ggplot(aes(n)) +
  geom_histogram()
```
## Time series of selected words
```{r}
# unnest_tokens() lowercases by default, so watch "i" rather than "I"
watched <- c("my", "our", "great", "you", "your", "i", "me")
unnested.df %>%
  filter(word %in% watched) %>%
  mutate(weeks = cut(timestamp, breaks = "month")) %>%  # monthly bins
  count(weeks, word) %>%
  group_by(weeks) %>%
  mutate(perc = n / sum(n)) %>%
  ungroup() %>%
  ggplot(aes(as.POSIXct(weeks), perc, color = word)) +
  geom_line() +
  geom_vline(xintercept = as.numeric(as.POSIXct(c("2015-06-16", "2016-05-03",
                                                  "2016-09-16", "2016-11-08"))),
             color = "red", alpha = 0.5)
```
## wordcloud
* As noted in 陳世榮 (2015): Bock criticizes the currently popular word cloud, pointing out that word clouds carry no scientific or inferential meaning; resting solely on the assumption that word frequency stands for some meaning, they offer readers an unconstrained interpretation (Bock, 2009).
```
# Not run: `y2016` (words with a log-ratio column) is not built in this
# notebook; a sketch of one way to construct it follows this block.
pal_r <- brewer.pal(9, "PuRd")[-(1:2)]
wordcloud(words = y2016$word, freq = y2016$logratio, min.freq = 1,
          random.order = F, colors = pal_r, max.words = 100, rot.per = 0)
```
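The `y2016` object referenced above is never constructed in this notebook. A minimal sketch of what it could be, assuming `logratio` measures how much more often a word appears in 2016 than in earlier years (the `period` labels and the `fill = 1` smoothing are my assumptions, not the author's code); `wordcloud()` needs positive weights, so only overrepresented words are kept:
```{r y2016 sketch}
# Hypothetical construction of `y2016`: 2016 usage versus earlier years
y2016 <- unnested.df %>%
  mutate(period = ifelse(year(timestamp) == 2016, "in2016", "earlier")) %>%
  count(word, period) %>%
  spread(period, n, fill = 1) %>%                 # fill = 1 avoids log(0)
  mutate(logratio = log2(in2016 / earlier)) %>%
  filter(logratio > 0)
```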
```{r}
library(wordcloud)  # also attaches RColorBrewer for brewer.pal()
pal_g <- brewer.pal(9, "BuGn")[-(1:2)]  # drop the two lightest shades
unnested.df %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100,
                 random.order = F,
                 colors = pal_g,
                 rot.per = 0))
```
## tf-idf
```{r}
tf_idf <- unnested.df %>%
  anti_join(stop_words) %>%             # drop stop words
  filter(!str_detect(word, "\\d")) %>%  # drop tokens containing digits
  count(text, word, sort = T) %>%       # each tweet's text serves as the document
  bind_tf_idf(word, text, n)
tf_idf %>%
  ggplot(aes(tf_idf)) +
  geom_histogram(bins = 100) +
  scale_x_log10()
```
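To see which words sit in the right tail of that histogram, a quick look at the highest-scoring terms (nothing beyond what is already loaded):
```{r top tf-idf}
# Words most characteristic of the tweets they appear in
tf_idf %>%
  arrange(desc(tf_idf)) %>%
  select(word, n, tf, idf, tf_idf) %>%
  head(20)
```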
* For comparison, the same tf-idf histogram for Jane Austen's novels, where each book is a document.
```{r}
library(janeaustenr)
book_words <- austen_books() %>%
  unnest_tokens(word, text) %>%
  count(book, word, sort = TRUE) %>%
  ungroup() %>%
  bind_tf_idf(word, book, n)
book_words %>%
  ggplot(aes(tf_idf)) +
  geom_histogram(bins = 100) +
  scale_x_log10()
```