---
title: "Web Scraping Workshop"
author: "Sarah King"
date: "`r Sys.Date()`"
output: html_document
---
# Introduction
- Web scraping is a technique for efficiently collecting and organizing information from websites. Although these data can be collected manually, automation saves time and is less error-prone.
- In this workshop, participants will use R to automate the web scraping process. Participants will learn about the general structure of a typical web page and how to use the `rvest` package to select elements, such as text fields and tables, and iteratively extract relevant data.
- All of the materials for the workshop (slides & R Script) can be found in this GitHub repository (https://github.com/sarahashleyking/ConnectedPolitics-Scraping-Workshop.git)
# Outline of Content
- Introduction of necessary packages
- Important functions
- FOR loop
- HTML: The front-end syntax
- Selector Gadget
- Scraping multiple pages/tables/data that is not on the specified page
- Brief QTA example
- Caveats/Conclusion
# Packages
- [tidyverse](https://www.tidyverse.org/) - The tidyverse is an opinionated collection of R packages designed for data science. Necessary for data cleaning/wrangling.
- [rvest](https://rvest.tidyverse.org/) - (a part of the tidyverse) necessary for the actual web-scraping/crawling
```{r echo=TRUE}
library(rvest)
library(tidyverse)
```
# Important Functions
- `data.frame()` creates data frames, tightly coupled collections of variables which share many of the properties of matrices and of lists, used as the fundamental data structure by most of R's modeling software.
- `rbind()` / `cbind()` take a sequence of vector, matrix or data-frame arguments and combine them by rows or columns, respectively.
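As a minimal sketch of how these functions fit together, the chunk below builds two small data frames and combines them. The object names (`first_batch`, `second_batch`, `all_rows`, `with_flag`) and all values are made up for illustration.

```{r, echo = TRUE}
# Two small data frames with the same columns (illustrative values only)
first_batch = data.frame(name = c("Ann", "Ben"), score = c(5, 3), stringsAsFactors = FALSE)
second_batch = data.frame(name = "Cat", score = 4, stringsAsFactors = FALSE)

# rbind() stacks rows, so the column names must match
all_rows = rbind(first_batch, second_batch)
dim(all_rows)

# cbind() adds columns, so the number of rows must match
with_flag = cbind(all_rows, verified = c(TRUE, FALSE, TRUE))
dim(with_flag)
```

This is the same pattern used later in the workshop, where each scraped page becomes a small data frame that is appended to a running one with `rbind()`.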
# Important Functions II
- `paste()` / `paste0()` concatenate vectors after converting to character.
- `str_sub()` takes a portion of a string and `str_remove()` removes a portion of a string.\
```{r, echo = TRUE}
str_sub("Sarah", 1, 2)
```
```{r, echo = TRUE}
str_remove("Sarah", "S")
```
```{r, echo = TRUE}
paste("Sarah", "King")
```
```{r, echo = TRUE}
paste("Sarah", "King", sep = "_")
```
```{r, echo = TRUE}
paste0("Sarah", "King")
```
```{r, echo = TRUE}
month = "Nov"
year = "2021"
paste0(month, "-", year)
```
# FOR Loop
- A for-loop is a control flow statement for specifying iteration, which allows code to be executed repeatedly. The basic structure of a for-loop is:
`for (variable in sequence) {`
`expression`
`}`
```{r, echo = TRUE}
print("Monday")
print("Tuesday")
print("Wednesday")
print("Thursday")
print("Friday")
print("Saturday")
print("Sunday")
```
```{r, echo = TRUE}
days = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")
# seq_along(days) yields 1 to 7 here and adapts if the vector changes length
for (i in seq_along(days)) {
  print(days[i])
}
```
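A for-loop becomes more useful once it is combined with `paste0()`. The sketch below builds one URL per page number; the base URL is a placeholder (`example.com`), not a site used in this workshop, and `base_url` and `page_urls` are illustrative names.

```{r, echo = TRUE}
# Hypothetical base URL, used only to illustrate the pattern
base_url = "https://example.com/reviews?page="

# Pre-allocate a character vector to hold one URL per page
page_urls = character(3)
for (p in 1:3) {
  page_urls[p] = paste0(base_url, p)
}
page_urls
```

Because `paste0()` is vectorised, `paste0(base_url, 1:3)` gives the same result without a loop. The loop form is shown because scraping usually needs extra steps per page.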
# HTML: The front-end syntax
Most, if not all, websites use some form of HTML. It is the standard markup language for building webpages, used alongside CSS for layout and styling and JavaScript for dynamic behaviour. Webpages are built from HTML elements.
These are nodes written using a tag in the HTML document. `html`, `head`, `title`, `body`, `h2` and `p` are all elements because they are represented by tags. We can see these elements by viewing the source code of the webpage. Tags (or elements) are used to select which part of the webpage to scrape.
\
Example 1: **https://www.sarahasking.com/**\
To start scraping, you first need to store the HTML code of the webpage in a variable. We do that by using `read_html()`
```{r, echo = TRUE}
read_html("https://www.sarahasking.com/")
```
Then we look for the HTML tag or element that we want to select. We use `html_nodes()` to select the part of the webpage that we want to scrape (from rvest 1.0.0 onwards, `html_elements()` is the preferred name for the same operation). Let's try `h3`:
```{r, echo = TRUE}
link = "https://www.sarahasking.com/"
link %>% read_html() %>% html_nodes("h3")
```
Different types of elements can be scraped off a webpage. They can be text elements, tables, links, etc. To scrape text, the function `html_text()` is used.
```{r, echo = TRUE}
link = "https://www.sarahasking.com/"
output = link %>% read_html() %>% html_nodes("h3") %>% html_text()
output
```
`stringr` functions can be useful here to clean the output data: `str_remove()`, `str_sub()`, etc.
```{r, echo=TRUE}
str_remove(output, "-")
```
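To see how tags map to what `html_nodes()` and `html_text()` return, it can help to run the same steps on a tiny page whose source is fully visible. The HTML below is made up for illustration and is passed to `read_html()` as a string, so no internet connection is needed.

```{r, echo = TRUE}
library(rvest)

# A tiny, made-up HTML document passed to read_html() as a string
demo_html = '<html>
  <head><title>Demo page</title></head>
  <body>
    <h2>Publications</h2>
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
  </body>
</html>'

demo_page = read_html(demo_html)

# Select elements by tag name and extract their text
demo_headings = demo_page %>% html_nodes("h2") %>% html_text()
demo_paragraphs = demo_page %>% html_nodes("p") %>% html_text()

demo_headings
demo_paragraphs
```

Each `p` tag in the source becomes one element of the character vector, which is why the output usually needs some cleaning on real pages.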
\
What if we would like to scrape the links? They can be very useful, especially if you're scraping multiple pages.\
The "`a`" tag and the "`href`" attribute are used to insert hyperlinks in HTML.\
```{r, echo = TRUE}
link = "https://www.sarahasking.com/"
output = link %>% read_html() %>% html_nodes("a") %>% html_attr("href")
output
```
Notice how this retrieves ALL the links on the webpage. What if we need only certain ones? HTML IDs and classes are used to identify different sections and elements on a webpage. In a selector, IDs are preceded by `#` and classes by `.`. Thankfully, there is a tool to help us determine the exact selector for the specific element we want to scrape.
Let's take a look at the webpage code using a different method.
```{r, echo=TRUE}
link = "https://www.sarahasking.com/"
output = link %>% read_html() %>% html_nodes("#comp-kolsjn90 a") %>% html_attr("href")
output
```
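The difference between selecting by tag, by ID and by class is easier to see on a small page. This is a sketch on made-up HTML: the ID `menu`, the class `footer` and the link targets are invented for the example.

```{r, echo = TRUE}
library(rvest)

# Made-up HTML: one section identified by an id, one by a class
selector_html = '<html><body>
  <div id="menu">
    <a href="/about">About</a>
    <a href="/contact">Contact</a>
  </div>
  <div class="footer">
    <a href="/privacy">Privacy</a>
  </div>
</body></html>'

selector_page = read_html(selector_html)

# All links on the page
all_links = selector_page %>% html_nodes("a") %>% html_attr("href")

# Only links inside the element with id "menu" (note the #)
menu_links = selector_page %>% html_nodes("#menu a") %>% html_attr("href")

# Only links inside elements with class "footer" (note the .)
footer_links = selector_page %>% html_nodes(".footer a") %>% html_attr("href")

all_links
menu_links
footer_links
```

A selector such as `#menu a` reads as "every `a` element inside the element whose ID is `menu`", which is the same form as the `#comp-kolsjn90 a` selector used above.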
# Selector Gadget
It is very useful to have a broad understanding of how HTML tags and elements work. But there's a tool that we can use to select different elements of a webpage without having to go through all the code: **https://chrome.google.com/webstore/detail/selectorgadget/mhjhnkcfbdhnjickkkdbjoemdmbfginb**
\
Let's see how this works.\
Example 2: [UCD SPIRe PhD Candidates](https://www.ucd.ie/spire/about/phdcandidates/)
```{r, echo = TRUE}
link = "https://www.ucd.ie/spire/about/phdcandidates/"
link %>% read_html() %>% html_nodes("p") %>% html_text()
```
```{r, echo = TRUE}
link = "https://www.ucd.ie/spire/about/phdcandidates/"
link %>% read_html() %>% html_nodes("p:nth-child(1) strong") %>% html_text()
```
# Scraping multiple pages
`for` loops come into play when we need to scrape multiple pages in one go.\
First, examine the structure of the URL and look for a pattern (there usually is one). The first page is scraped separately because its URL is usually different and because you need an initial data frame to which the rest of the data can be appended.\
Example 3: [Trustpilot-Amazon](https://www.trustpilot.com/review/www.amazon.com)\
In this example, we'll also see how to scrape multiple fields on a single page and use the `seq()` function.\
Scraping the first page:
```{r, echo=TRUE}
link = "https://www.trustpilot.com/review/www.amazon.com"
# Download and parse the page once, then reuse it for both fields
page = link %>% read_html()
name = page %>% html_nodes(".styles_consumerDetails__ZFieb .typography_appearance-default__AAY17") %>% html_text()
reviews = page %>% html_nodes(".typography_body-l__KUYFJ.typography_color-black__5LYEn") %>% html_text()
reviews_complete = data.frame(name, reviews, stringsAsFactors = FALSE)
```
Scraping pages 2 to 9:
```{r, echo=TRUE}
for (i in seq(from = 2, to = 9, by = 1)) {
  # Build the URL for page i and parse it once
  link = paste0("https://www.trustpilot.com/review/www.amazon.com?page=", i)
  page = link %>% read_html()
  name = page %>% html_nodes(".styles_consumerDetails__ZFieb .typography_appearance-default__AAY17") %>% html_text()
  reviews = page %>% html_nodes(".typography_body-l__KUYFJ.typography_color-black__5LYEn") %>% html_text()
  # Append this page's rows to the running data frame
  temp = data.frame(name, reviews, stringsAsFactors = FALSE)
  reviews_complete = rbind(reviews_complete, temp)
  rm(temp)
}
```
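The accumulate-per-page pattern can be tried without touching a live site. In the sketch below, three made-up HTML snippets stand in for three downloaded pages; the class names `who` and `txt` and the helper `parse_toy_page()` are hypothetical and exist only for this example. It also shows an alternative to growing a data frame inside the loop: collect one data frame per page in a list and bind them once at the end.

```{r, echo = TRUE}
library(rvest)

# Stand-ins for three pages of results; in practice each would come from
# read_html() on a URL built with paste0()
toy_pages = c(
  '<div class="review"><span class="who">Ann</span><p class="txt">Great</p></div>',
  '<div class="review"><span class="who">Ben</span><p class="txt">Slow delivery</p></div>',
  '<div class="review"><span class="who">Cat</span><p class="txt">Fine</p></div>'
)

# Hypothetical helper that turns one page of HTML into a data frame
parse_toy_page = function(html) {
  parsed = read_html(html)
  data.frame(
    name = parsed %>% html_nodes(".who") %>% html_text(),
    reviews = parsed %>% html_nodes(".txt") %>% html_text(),
    stringsAsFactors = FALSE
  )
}

# Collect one data frame per page in a list, then bind them once at the end
toy_list = vector("list", length(toy_pages))
for (k in seq_along(toy_pages)) {
  toy_list[[k]] = parse_toy_page(toy_pages[k])
}
toy_reviews = do.call(rbind, toy_list)
toy_reviews
```

Note that `data.frame()` stops with an error if the two selectors return different numbers of elements, which is a common sign that a selector is matching more (or less) than intended.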
# What can we do with the data?
Since we have scraped the full text of the Amazon reviews from Trustpilot, we can apply some basic natural language processing techniques. Using the `quanteda` package, let's run a very simple sentiment analysis on the reviews we have scraped from the first 9 pages.
```{r, echo=TRUE}
library(quanteda)
```
```{r, echo=TRUE}
reviews_dfm <- corpus(reviews_complete, text_field = "reviews") %>%
  tokens(remove_punct = TRUE) %>%
  tokens_select(pattern = stopwords("en"), selection = "remove") %>%
  dfm()
topfeatures(reviews_dfm, n = 50)
```
```{r}
reviews_corpus <- corpus(reviews_complete, text_field = "reviews")
reviews_sentiment <- tokens(reviews_corpus) %>%
  tokens_lookup(data_dictionary_LSD2015) %>%
  dfm() %>%
  convert(to = "data.frame")
reviews_sentiment <- reviews_sentiment %>%
  mutate(sentiment = log((positive + neg_negative + 0.5) /
                           (negative + neg_positive + 0.5)))
summary(reviews_sentiment$sentiment)
```
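The sentiment score above is a log ratio of positive to negative dictionary matches, with 0.5 added to each side so that neither can be zero. A score above 0 means more positive than negative matches, and a score below 0 means the opposite. The sketch below applies the same formula to made-up counts (not real review data) so that the arithmetic can be checked by hand.

```{r, echo = TRUE}
# Made-up dictionary counts for three reviews (not real data)
toy_counts = data.frame(
  positive = c(4, 0, 2),
  neg_negative = c(0, 0, 0),
  negative = c(1, 3, 2),
  neg_positive = c(0, 1, 0)
)

# Same formula as above: log of smoothed positive over smoothed negative counts
toy_counts$sentiment = log((toy_counts$positive + toy_counts$neg_negative + 0.5) /
                             (toy_counts$negative + toy_counts$neg_positive + 0.5))
round(toy_counts$sentiment, 2)
```

The first review gives log(4.5 / 1.5) = log(3), which is positive. The second gives log(0.5 / 4.5), which is negative. The third has equal counts on both sides, so its score is exactly 0.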
# Scraping a table
`html_table()` is used to retrieve complete tables.\
Something that looks like a table on the page is not necessarily an HTML table. A table is a type of element in HTML, so you need to find the `table` tag in the page source. If the table has an ID or a class, we use it to select the table. Otherwise, we don't specify any HTML nodes and `html_table()` returns every table on the page.\
Let's examine this page.
Table Scraping Example 4: [U.S. Polling Presidential Election 2020](https://en.wikipedia.org/wiki/Polling_for_United_States_presidential_elections#2020)
```{r, echo = TRUE}
link = "https://en.wikipedia.org/wiki/Polling_for_United_States_presidential_elections#2020"
table = link %>%
read_html() %>%
html_node(".wikitable:nth-child(74)") %>% html_table() %>%
as.data.frame()
```
Table Scraping Example 5: [German Election Polling 2021-2023](https://en.wikipedia.org/wiki/Opinion_polling_for_the_next_German_federal_election)
```{r, echo = TRUE}
url = read_html("https://en.wikipedia.org/wiki/Opinion_polling_for_the_next_German_federal_election")
depolls = url %>% html_table()
```
```{r, echo=TRUE}
depolls23 = depolls[[1]]
depolls22 = depolls[[2]]
depolls21 = depolls[[3]]
```
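Both ways of calling `html_table()` can be compared on a small page. This is a sketch on made-up HTML: the ID `polls`, the class `other` and the table contents are invented for the example.

```{r, echo = TRUE}
library(rvest)

# Made-up page containing two small HTML tables
table_html = '<html><body>
  <table id="polls">
    <tr><th>Party</th><th>Share</th></tr>
    <tr><td>A</td><td>40</td></tr>
    <tr><td>B</td><td>35</td></tr>
  </table>
  <table class="other">
    <tr><th>Year</th></tr>
    <tr><td>2021</td></tr>
  </table>
</body></html>'

table_page = read_html(table_html)

# On a whole page, html_table() returns a list with one entry per table
all_tables = table_page %>% html_table()
length(all_tables)

# On a single node, it returns just that table
polls = table_page %>% html_node("#polls") %>% html_table()
polls
```

When a whole page is passed in, individual tables are picked out with `[[ ]]`, as in Example 5. Selecting by position is fragile on pages that change often, so an ID or class is the safer choice when one is available.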
# Scraping data that is not on the specified webpage
Sometimes, you need to scrape data on a parent page and a child page simultaneously. This is when `html_attr()` comes into play.
Example 6: [Houses of the Oireachtas Parliamentary Questions](https://www.oireachtas.ie/en/debates/questions/)\
For this workshop, I will filter the results to only see questions asked during the first week of [July 2020](https://www.oireachtas.ie/en/debates/questions/?questionType=all&datePeriod=dates&fromDate=01%2F07%2F2020&toDate=07%2F07%2F2020&term=%2Fie%2Foireachtas%2Fhouse%2Fdail%2F33&departmentToggle=member&member=&department=&depFrom=&depTo=&viewBy=question)\
(Always examine the URLs when scraping multiple pages!) The first page is done separately:
```{r, echo = TRUE}
link = "https://www.oireachtas.ie/en/debates/questions/?questionType=all&datePeriod=dates&fromDate=01%2F07%2F2020&toDate=07%2F07%2F2020&term=%2Fie%2Foireachtas%2Fhouse%2Fdail%2F33&departmentToggle=member&member=&department=&depFrom=&depTo=&viewBy=question"
questlinks = link %>% read_html() %>% html_nodes(".u-btn-secondary") %>% html_attr("href")
questlinks
```
Notice how the URLs don't contain the main domain? Let's try again.
```{r, echo = TRUE}
link = "https://www.oireachtas.ie/en/debates/questions/?questionType=all&datePeriod=dates&fromDate=01%2F07%2F2020&toDate=07%2F07%2F2020&term=%2Fie%2Foireachtas%2Fhouse%2Fdail%2F33&departmentToggle=member&member=&department=&depFrom=&depTo=&viewBy=question"
questlinks = link %>% read_html() %>% html_nodes(".u-btn-secondary") %>% html_attr("href") %>% paste0("https://www.oireachtas.ie", .)
questlinks
```
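Prepending the domain with `paste0()` works when every link starts with `/`. As a hedged alternative, `url_absolute()` from the `xml2` package (installed alongside `rvest`) resolves relative links against a base URL. The sketch below compares the two on a made-up snippet; the class `btn` and the paths are illustrative and are not real Oireachtas URLs.

```{r, echo = TRUE}
library(rvest)

# Made-up snippet with relative links, similar in shape to the ones above
rel_html = '<a class="btn" href="/en/debates/question/1/">Q1</a>
<a class="btn" href="/en/debates/question/2/">Q2</a>'

rel_links = read_html(rel_html) %>% html_nodes(".btn") %>% html_attr("href")

# Option 1: prepend the domain with paste0(), as in the chunk above
abs_paste = paste0("https://www.oireachtas.ie", rel_links)

# Option 2: let xml2 resolve each link against the base URL
abs_xml2 = xml2::url_absolute(rel_links, "https://www.oireachtas.ie/")

abs_paste
abs_xml2
```

For links like these the two options should agree. The `xml2` route is mainly useful when a page mixes relative and already-absolute links, where blind pasting would produce broken URLs.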
Now that we have the secondary links, let's retrieve the data for the first question on the first page:
```{r, echo = TRUE}
# Parse the first question page once and reuse it for all three fields
qpage = questlinks[1] %>% read_html()
td = qpage %>% html_nodes("#pq_1 .c-avatar__name-link") %>% html_text()
question = qpage %>% html_nodes("#pq_1 p") %>% html_text()
answer = qpage %>% html_nodes(".speech .text") %>% html_text()
questions = data.frame(td, question, answer, stringsAsFactors = FALSE)
```
# Caveats
- DDoS attacks
    - A distributed denial-of-service (DDoS) attack is a malicious attempt to disrupt the normal traffic of a targeted server, service or network by overwhelming the target or its surrounding infrastructure with a flood of Internet traffic. A scraper that sends many requests in quick succession can put a similar strain on a server, so pace your requests.
- `Sys.sleep()` pauses R for a given number of seconds. Calling it inside a loop spaces out your requests.
- `robots.txt` is a file in which a site states which of its pages automated crawlers may access. Examples:
    - [Washington Post](https://www.washingtonpost.com/robots.txt)
    - [Twitter](https://twitter.com/robots.txt)
    - [TripAdvisor](https://www.tripadvisor.com/robots.txt)
- `rvest` in concert with [`polite`](https://dmi3kno.github.io/polite/). The `polite` package helps ensure that you respect `robots.txt` and don't hammer the site with too many requests.
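
The pacing idea behind `Sys.sleep()` can be sketched as follows. Nothing is downloaded here: the URLs are placeholders on `example.com`, and the 0.2 second pause is an assumed value chosen only to keep the example fast. A suitable delay for a real site depends on that site.

```{r, echo = TRUE}
# Hypothetical URLs; nothing is downloaded in this sketch
polite_urls = paste0("https://example.com/page/", 1:3)
pause_seconds = 0.2  # assumed delay, chosen only to keep the example fast

visited = character(0)
start_time = Sys.time()
for (u in polite_urls) {
  # A real scraper would call read_html(u) here
  visited = c(visited, u)
  Sys.sleep(pause_seconds)  # wait before the next request
}
elapsed = as.numeric(difftime(Sys.time(), start_time, units = "secs"))
elapsed
```

The loop takes roughly three times the pause to finish. That is the trade-off: a slower scrape in exchange for a lighter load on the server.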