Commit ac3fb38 (1 parent: c84f1a7)
Showing 7 changed files with 1,205 additions and 0 deletions.
@@ -0,0 +1,39 @@
---
title: "Using R Markdown"
author: "Martyn Egan"
date: "2023-02-01"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## R markdown

R markdown is a very convenient and useful tool for communicating R output. It allows us to write text directly into a file, which is then rendered into a presentable format, and to embed R code in the document; the code is processed when we "knit" the document, and the results appear in the "knitted" output.

We have used R markdown files extensively in the course so far, so hopefully the format is now familiar to you. If you need a refresher on the basics of R markdown, this [cheat sheet](https://www.rstudio.com/wp-content/uploads/2015/02/rmarkdown-cheatsheet.pdf) from RStudio is very helpful. I suggest you download it and keep it open on your desktop while completing your homeworks.

## R markdown and project reporting

I want to emphasise to all of you that the purpose of R markdown is to *communicate* results. It is not intended for experimenting with code or for running very long and complex models; that is what R scripts are for. You should limit the code in your R markdown file to what is necessary for *communication*. In the case of your homeworks, you need to communicate:

1. That you know the correct code.
2. That you have produced the correct results.

To do this, *you do not necessarily need to produce the results from the code inside the R markdown file*. Instead:

- You can run the code in an R script and save the output using the `saveRDS()` function.
- You can then read that output into the R markdown file using the `readRDS()` function inside a code chunk.
- You can include the original code that you used to create the output in a separate chunk **which does not evaluate during knitting** by adding the `eval = FALSE` option to the chunk header (see the sketch below).
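
For example, the pattern might look something like this (the chunk names, object names and file paths here are only placeholders):

````
```{r fit-model, eval=FALSE}
# shown in the knitted output, but not run when knitting
model <- lm(mpg ~ wt, data = mtcars)
saveRDS(model, "output/model.rds")
```

```{r read-model}
# this chunk *is* run when knitting: it just reads in the object saved earlier
model <- readRDS("output/model.rds")
summary(model)
```
````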

You do not need to do this for every line of code, and most code is fine to run inside R markdown. Use your own judgement to determine which code is foolish to run inside R markdown: very intensive machine learning algorithms will certainly fall into this category.

In general, I recommend you:

1. Experiment with your code first in a separate R script.
2. Once you have the correct, minimal code necessary, create a single R script which contains the complete workflow of the analysis.
3. If running this script produces the correct results, save any objects which result from computationally intensive algorithms using `saveRDS()`.
4. Finally, copy-paste the relevant sections of code from your final R script into the R markdown file, remembering to include the `eval = FALSE` option for the chunks which contain computationally intensive algorithms, and instead using `readRDS()` in a separate chunk to read in the object you already created (a sketch of the script side of this workflow follows below).
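
For reference, the end of such a workflow script might look like this (everything here, from the file paths to the model, is purely illustrative):

```r
## analysis.R -- the complete workflow (illustrative sketch only)
df  <- readRDS("data/my_data.rds")                 # read in the prepared data
fit <- glm(y ~ ., data = df, family = binomial())  # stand-in for the slow, expensive step
saveRDS(fit, "output/fit.rds")                     # save the result for the .Rmd to read back in
```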
@@ -0,0 +1,50 @@
---
title: "Tutorial Guide, QTA Wk 2"
author: "Martyn Egan"
date: "2023-02-03"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## Getting Started with Quanteda

Today we will be looking at the `quanteda` package for the initial steps of the QTA workflow, from corpus acquisition through to the document-feature matrix. `quanteda` contains a powerful set of functions for performing QTA, but we'll also need a couple of other packages to import and wrangle our data.

## Learning Outcomes

By the end of today's tutorial we should be able to:

1. Acquire a corpus using a web API
2. Pre-process the corpus
3. Create the document-feature matrix (dfm)

## Case study: the Ukraine war

No cricket this week, sadly. Instead, we're going to analyse the Guardian's coverage of the war since the start of the year. Has coverage changed over that period? If so, how? As ever, we'll find that before we can answer the questions we have of our data, we'll spend most of our effort getting it into the right shape to do so.

## Using a web API

Last week we looked at how to scrape text from static web pages using html and xpaths. Today we're going to take a step up and try acquiring text through a web API (Application Programming Interface). The API we will use is for [The Guardian](https://www.theguardian.com/) newspaper. Fun fact: The Guardian was originally called the Manchester Guardian, until it sold out and moved to London.

### Step 1: Getting an API key

The first step in acquiring our data is to get an API key. We'll need this to gain access to The Guardian's data. Click [here](https://bonobo.capi.gutools.co.uk/register/developer) and fill out the form.

### Step 2: Getting the data

The Guardian's API works by submitting a request to a web address. We can do this in a browser or with the help of a package. For today's class we'll use the helpful `guardianapi` package, which automates some of the process for us. Open your R script for today's class, `tutorial02.R`, in the `code` repository.
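
In broad strokes, the request looks something like this (the query term and date below are placeholders; the fill-in-the-blanks version is in the script):

```r
library(guardianapi)

gu_api_key() # interactive: paste in the key you registered in step 1

# e.g. articles mentioning Ukraine published since the start of the year
dat <- gu_content(query = "ukraine", from_date = "2023-01-01")
```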

## Next time...

Today's class took us through the heavy lifting of acquiring and pre-processing our corpus. You *must* properly pre-process your corpus if you want your machine learning models to run in a reasonable time frame and produce meaningful results. As a rule of thumb, if the model for your problem set is taking longer than an hour or so, you didn't properly pre-process your corpus, and so your dfm is too large.

Next week we'll revise this process and move on to looking at a few statistics regarding the corpus and the dfm.

## Problem sets

A final note: before completing the homeworks for this module, please read the guide to using R markdown in the top level of the repository. Your homework *must* be compiled as an html file; if you submit an uncompiled `.Rmd` file your work will not be graded, as it will not contain any output.

Not being able to compile your `.Rmd` into an html file is not an excuse for submitting an `.Rmd`: if it won't compile for you, it won't compile for me either.
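
If knitting from within RStudio fails, you can also try compiling from the console; a minimal example (the file name is a placeholder):

```r
# render an R markdown file to html from the console
rmarkdown::render("homework01.Rmd", output_format = "html_document")
```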
@@ -0,0 +1,169 @@
#######################################
# Tutorial 2: APIs and pre-processing #
#######################################

## Load packages
pkgTest <- function(pkg){
  new.pkg <- pkg[!(pkg %in% installed.packages()[, "Package"])]
  if (length(new.pkg))
    install.packages(new.pkg, dependencies = TRUE)
  sapply(pkg, require, character.only = TRUE)
}

lapply(c("tidyverse",
         "guardianapi",         # for working with the Guardian's API
         "quanteda",            # for QTA
         "quanteda.textstats",  # more Quanteda!
         "quanteda.textplots",  # even more Quanteda!
         "readtext",            # for reading in text data
         "stringi",             # for working with character strings
         "textstem"             # an alternative method for lemmatizing
         ), pkgTest)

### A. Using the Guardian API with R
gu_api_key() # run this interactive function

# We want to query the API on articles featuring Ukraine since Jan 1 2023
dat <- gu_content(query = "", from_date = "") # making a tibble
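# (If you get stuck: one possibility, using placeholder values, would be
#  dat <- gu_content(query = "ukraine", from_date = "2023-01-01").)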

# We'll save this data
saveRDS(dat, "data/df2023")
# And make a duplicate to work on
df <- dat

# Take a look at the kind of object which gu_content creates.
# Try to find the column we need for our text analyses
head(df) # checking our tibble

df <- df[] # see if you can subset the object to focus on the articles we want
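# (Hint: the tibble mixes content types; if it has a column called "type", one
#  possible subset is df <- df[df$type == "article", ] -- check names(df) first.)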

which(duplicated(df$web_title) == TRUE) # sometimes there are duplicates...
df <- df[!duplicated(df$web_title),] # which we can remove

### B. Making a corpus
# We can use the corpus() function to convert our df to a quanteda corpus
corpus_ukr <- corpus(df,
                     docid_field = "",
                     text_field = "") # select the correct column here
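# (For example, docid_field = "web_title" would give each document a readable id;
#  the text column is probably named something like "body_text" -- check names(df).)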

# Checking our corpus
summary(corpus_ukr, 5)

### C. Pre-processing
## 1. Cleaning the text with regexes and stringi

# Let's take a look at the first article and see if we can spot any big problems
as.character(corpus_ukr)[1]

# It looks like each text includes the headline, with the body inside "". We might
# decide we only want the body text, in which case we'd need to get rid of everything
# before the first ". We can use the stringi package to help with this.
test <- as.character(corpus_ukr)[1] # make a test object

stri_replace_first(test,
                   replacement = "", # nothing here (i.e. we're removing)
                   regex = "") # try to write the correct regex - this may help: https://www.rexegg.com/regex-quickstart.html
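# (One regex that would work here is '^.+?"' -- i.e. everything up to and including
#  the first double quote, non-greedy. Other patterns are possible.)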

# Sometimes there's also boilerplate at the end of an article after a big centre dot.
as.character(corpus_ukr)[which(grepl("\u2022.+$", corpus_ukr))[1]]

# We could get rid of all that too with a different function
test <- as.character(corpus_ukr)[which(grepl("\u2022.+$", corpus_ukr))[1]]
stri_replace_last(test,
                  replacement = "",
                  regex = "\u2022.+$")

# These might be useful to our analysis though, so for now we'll keep them in.

## 2. Tokenize the text
# The next step is to turn all the words into tokens.
# The tokens() function can also remove punctuation and symbols for us.
toks <- quanteda::tokens(corpus_ukr,
                         remove_punct = TRUE,
                         remove_symbols = TRUE)

## 3. Lowercase the text
toks <- tokens_tolower(toks) # lowercase tokens
print(toks[10]) # print lowercase tokens from the 10th article in the corpus.

## 4. Remove stop words
# Now we can use a dictionary to remove stop words, such as articles, etc.
# We do this using quanteda's built-in stopwords() function and the "english"
# dictionary.

# Let's have a quick look at these.
stop_list <- stopwords("english") # load English stopwords from quanteda
head(stop_list) # show first 6 stopwords from the stopword list.

# Notice how these stopwords are also lowercased.

# The tokens_remove() function allows us to apply the stop_list to our toks object
toks <- tokens_remove(toks, stop_list)

toks[10] # print list of tokens from 10th article without stop words.

# Notice how much shorter the list is now. Can you imagine how much longer it
# might take to run your model if you don't do this bit properly...

## 5.a. Normalising (or stemming) the tokens
# Now we'll stem the words using the tokens_wordstem() function
stem_toks <- tokens_wordstem(toks)

stem_toks[10] # print stemmed tokens from 10th document - notice any differences?

## 5.b. Lemmatizing - an alternative
# An alternative normalising technique is to collapse different inflections of
# a word to a root form. We'll use the textstem package to do this.

# i. Convert quanteda tokens object to list of tokens
toks_list <- as.list(toks)

# ii. Apply the lemmatize_words function from textstem to the list of tokens
lemma_toks <- lapply(toks_list, lemmatize_words)

# iii. Convert the list of lemmatized tokens back to a quanteda tokens object
lemma_toks <- as.tokens(lemma_toks)

# Compare article 10 in toks, stem_toks and lemma_toks: what do you notice?
# Which is smallest?

## 6. Detect collocations
# Collocations are groups of words (grams) that are meaningful in combination.
# To identify collocations we use the quanteda.textstats package.

# i. Identify collocations
collocations <- textstat_collocations(lemma_toks, size = 2)

# ii. Choose which to keep
keep_coll_list <- collocations$collocation[1:20]
keep_coll_list

# iii. Apply to tokens object
comp_tok <- tokens_compound(lemma_toks, keep_coll_list)

### D. Creating the document-feature matrix (dfm)
# Now that we've finished pre-processing our tokens object, we can convert it
# into a dfm using quanteda's dfm() function

# Convert to dfm...
dfm_ukr <- dfm(comp_tok)

# ...and save
saveRDS(dfm_ukr, "data/dfm")

# We'll leave operations on the dfm until next time, but to give a preview, here are
# some functions we can use to analyse the dfm.
topfeatures(dfm_ukr)

# We can also visualise the dfm using the quanteda.textplots package
dfm_ukr %>%
  dfm_trim(min_termfreq = 3) %>%
  textplot_wordcloud(min_size = 1, max_size = 10, max_words = 100)

### Activity
# In this week's data folder, you'll find a file called df2022. This is
# an extract of articles from January 2022, shortly before the war began.
# See if you can repeat the pre-processing on this data, and compare the features
# and wordcloud that result.

df2022 <- readRDS("data/df2022")
@@ -0,0 +1,3 @@
## README

Just a placeholder.