---
title: "Web Scraping with `rvest`"
author: "brunj7"
knit: (function(input_file, encoding) {
  out_dir <- 'docs';
  rmarkdown::render(input_file,
    encoding = encoding,
    output_file = file.path(dirname(input_file), out_dir, 'index.html'))})
output:
  html_document:
    theme: "spacelab"
    # df_print: "paged"
    toc: true
    toc_depth: 2
    toc_float: true
    number_sections: true
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(DT)  # for rendering interactive HTML tables
```
# Goal

The goal of this session is to learn how to get data from the World Wide Web using R. Although we will go over a few concepts first, the core of this session will be spent on getting data from websites that do not offer any interface to automate information retrieval, such as Web services (e.g., REST, SOAP) or application programming interfaces (APIs). In that case, it is necessary to *scrape* the information embedded in the website itself.
When you want to extract information or download data from a website that is too large to download manually or that needs to be updated frequently, you should first:

1. Check if the website offers any Web services or APIs developed for this purpose
2. Check if an R package (or a package in another language you know) has been developed by others as a wrapper around the API to facilitate the use of these Web services
3. Nothing found? Well, let's code it ourselves then!
# Which R packages are available?

As usual, a good start is to look at the CRAN Task View to get an idea of the R packages available: <https://CRAN.R-project.org/view=WebTechnologies>

Here are some of the key packages:
- `RCurl`: a low-level wrapper for `libcurl` that provides convenient functions to fetch URIs and to get and post forms; [quick guide](http://www.omegahat.net/RCurl/philosophy.html)
- `httr`: similar to `RCurl`; provides a user-friendly interface for executing HTTP methods and supports modern web authentication protocols (OAuth 1.0, OAuth 2.0). It is a wrapper around the [`curl` package](https://cran.r-project.org/web/packages/curl/index.html)
- `rvest`: a higher-level package mostly based on `httr`. It is simpler to use for basic scraping tasks
- `RSelenium`: can be used to automate interactions and extract page content from dynamically generated webpages (i.e., those requiring user interaction to display results, such as clicking a button)

There are also functions in the `utils` package, such as `download.file()`. Note that these functions do not handle https (the motivation behind the `curl` R package); see the sketch below.
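As a minimal sketch of an https-capable download with the `curl` package (the URL here is a placeholder, not a real dataset):

```{r, eval=FALSE}
# install.packages("curl")
library(curl)

# curl_download() handles https, which utils::download.file() historically did not
# NOTE: placeholder URL, for illustration only
curl_download("https://example.com/data.csv", destfile = "data.csv")
```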
**In this session we are going to use `rvest`: first for a simple tutorial, followed by a challenge in groups**
# Some background
## HTTP: Hypertext Transfer Protocol
### URL
At the heart of web communications is the request message, which is sent to a *U*niform *R*esource *L*ocator (URL). Basic `URL` structure:

The protocol is typically http, or https for secure communications. The default port is 80, but a different one can be set explicitly, as illustrated in the image above. The resource path is the local path to the resource on the server.
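You can inspect these components from R by parsing a URL with `httr` (a toy URL, just for illustration):

```{r}
library(httr)

# Decompose a URL into its components (toy URL for illustration)
parse_url("http://www.example.com:80/path/to/resource?key=value")
# returns a list-like object with $scheme, $hostname, $port, $path, and $query
```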
### Request

The actions that should be performed on the host are specified via HTTP verbs. Today we are going to focus on two verbs that are often used in web forms (a short `httr` sketch follows the list):

- `GET`: fetch an existing resource. The URL contains all the necessary information the server needs to locate and return the resource.
- `POST`: create a new resource. POST requests usually carry a payload that specifies the data for the new resource.
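A minimal sketch of both verbs using `httr` and the public httpbin.org echo service (any test endpoint would do):

```{r, eval=FALSE}
library(httr)

# GET: fetch an existing resource
resp_get <- GET("https://httpbin.org/get")

# POST: send a payload along with the request
resp_post <- POST("https://httpbin.org/post", body = list(name = "value"))
```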
### Response
Status codes indicate how the request fared (a short sketch for checking them from R follows the list):

- `1xx`: informational messages
- `2xx`: success; the best known is `200 OK`, meaning the request was successfully processed
- `3xx`: redirection
- `4xx`: client error; including the famous `404 Not Found`
- `5xx`: server error
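Checking a response status with `httr` (again using httpbin.org as a stand-in server):

```{r, eval=FALSE}
library(httr)

resp <- GET("https://httpbin.org/status/404")
status_code(resp)   # 404
http_status(resp)   # the category, reason, and full message
```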
## Document Types
### HTML
The *H*yper*T*ext *M*arkup *L*anguage (`HTML`) describes and defines the content of a webpage. Other technologies besides HTML are generally used to describe a webpage's appearance/presentation (CSS) or functionality (JavaScript).
"Hyper Text" in HTML refers to links that connect webpages to one another, either within a single website or between websites. Links are a fundamental aspect of the Web.
HTML uses "markup" to annotate text, images, and other content for display in a Web browser. HTML markup includes special "elements" such as `<head>`, `<title>`, `<body>`, `<header>`, `<footer>`, `<article>`, `<section>`, `<p>`, `<div>`, `<span>`, `<img>`, and many others.
Using your web browser, you can inspect the HTML content of any webpage on the World Wide Web.
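Here is a minimal, self-contained sketch of what such markup looks like to R; `read_html()` from `rvest` accepts a literal HTML string (a made-up snippet here):

```{r}
library(rvest)

# Parse a small, made-up HTML snippet
page <- read_html("<html><body><p>Hello <b>Web</b>!</p></body></html>")

# Extract the text contained in the <p> element
html_text(html_nodes(page, "p"))
```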
### XML
The e*X*tensible *M*arkup *L*anguage (XML) provides a general approach for representing all types of information, such as data sets containing numerical and categorical variables. XML provides the basic, common, and quite simple structure and syntax on which all of its "dialects" or vocabularies are built. For example, `HTML`, `SVG`, and `EML` are specific vocabularies of XML.
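The `xml2` package (which `rvest` builds on) reads XML the same way `rvest` reads HTML; here is a sketch with a made-up document:

```{r}
library(xml2)

# A made-up XML snippet
doc <- read_xml("<dataset><site>NTL</site><site>SBC</site></dataset>")

# Names and text content of the child nodes
xml_name(xml_children(doc))
xml_text(xml_children(doc))
```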
## How to access information
### XPath
`XPath` is simple yet very powerful. With a syntax similar to a file system path, it allows you to identify nodes of interest by specifying paths through the tree, based on node names, node content, and a node's relationship to other nodes in the hierarchy. We typically use XPath to locate nodes in a tree and then use R functions to extract data from those nodes and bring the data into R.
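A minimal sketch of an XPath query with `rvest` (a toy HTML snippet again):

```{r}
library(rvest)

page <- read_html("<div><p class='intro'>Welcome</p><p>More text</p></div>")

# Select only the <p> elements whose class attribute is 'intro'
html_text(html_nodes(page, xpath = "//p[@class='intro']"))
```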
### CSS
*C*ascading *S*tyle *S*heets (`CSS`) is a stylesheet language used to describe the presentation of a document written in HTML or XML. CSS describes how elements should be rendered on screen, on paper, in speech, or on other media. In CSS, **selectors** are used to target the HTML elements on a web page that we want to style. A wide variety of CSS selectors is available, allowing for fine-grained precision when selecting elements to style, or, in our case, to scrape.
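The same selection as in the XPath sketch above, this time expressed with a CSS selector:

```{r}
library(rvest)

page <- read_html("<div><p class='intro'>Welcome</p><p>More text</p></div>")

# 'p.intro' selects <p> elements with class 'intro'
html_text(html_nodes(page, "p.intro"))
```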
**Want to practice selecting objects with CSS? Here is an interactive site for you: http://flukeout.github.io/**
# Web scraping workflow

1. Check for an existing API and existing R packages
2. Information identification: use your web browser's inspector and/or http://selectorgadget.com/ to inspect the content and structure of the webpages you want to extract information from
3. Choice of strategy: e.g., XPath, CSS selectors, ...
4. Information extraction: choose the relevant R package(s) to accomplish your data extraction and write the code
<br>
# `rvest`
`rvest` is a set of wrapper functions around the `xml2` and `httr` packages.
Main functions:

- `read_html()`: read a webpage into R as XML (document and nodes)
- `html_nodes()`: extract pieces out of HTML documents using XPath and/or CSS selectors
- `html_attr()`: extract attributes from HTML, such as `href`
- `html_text()`: extract text content

For more information, see the package repository: [tidyverse/rvest](https://github.com/tidyverse/rvest)
## Quick example
In this example we are going to extract information about Santa Barbara's wine shops from https://santabarbaraca.com/. This website gathers a lot of information on how best to visit our town; for example, it covers the [*Funk Zone*](http://santabarbaraca.com/explore-and-discover-santa-barbara/neighborhoods-towns/santa-barbara/the-funk-zone/) neighborhood. Here is their listing of the local wine shops: <https://santabarbaraca.com/plan-your-trip/wine/wine-shops/>

We are going to scrape the names of the wine shops and their websites out of this web page and compile this information into a csv file to share with our friends!
### Look at the website structure using our web browser inspector:

### Use this information to extract the wine shops' names from the webpage:
```{r, message=FALSE, warning=FALSE, error=FALSE}
# install.packages("rvest")
library("rvest")

URL <- "https://santabarbaraca.com/plan-your-trip/wine/wine-shops/"

# Read the webpage into R
webpage <- read_html(URL)

# Parse the webpage for the nodes containing the listing titles
wine_listing <- html_nodes(webpage, ".listing-title")

# Extract the names of the wine shops
wine_shops <- html_text(wine_listing)
wine_shops
```
### Extract the URLs of their websites:

```{r}
# Parse the page for the nodes containing the website URLs
websites <- html_nodes(webpage, ".website-button.button")

# Extract the href attribute of each node
websites_urls <- html_attr(websites, "href")
websites_urls
```
### Create a data frame and save it as a csv file:

```{r}
# Create the data frame
df_wine_shops <- data.frame(wine_tasting_room = wine_shops,
                            website = websites_urls)

# Make sure the output directory exists, then write the csv
if (!dir.exists("data")) dir.create("data")
write.csv(df_wine_shops, "./data/places_you_will_go.csv", row.names = FALSE)

datatable(df_wine_shops)
```
<br>
# Final Thoughts
Please always check that **the data you are scraping are publicly available** and do not contain any personal or confidential information. Also, please **do not overload the web server** you are scraping: when retrieving a large amount of data, it is recommended to insert pauses between the requests sent to the server so it can keep handling other requests.
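A minimal sketch of such a pause (the URLs below are placeholders):

```{r, eval=FALSE}
library(rvest)

# NOTE: placeholder URLs, for illustration only
urls <- c("https://example.com/page1", "https://example.com/page2")

pages <- vector("list", length(urls))
for (i in seq_along(urls)) {
  pages[[i]] <- read_html(urls[i])
  Sys.sleep(2)  # wait 2 seconds before sending the next request
}
```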
<br>
# References and sources
- `rvest` website: https://rvest.tidyverse.org
- CRAN Task View on web technologies: https://CRAN.R-project.org/view=WebTechnologies
- HTTP intro: https://code.tutsplus.com/tutorials/http-the-protocol-every-web-developer-must-know-part-1--net-31177
- Good definitions and introduction to web basics: https://developer.mozilla.org/en-US/docs/Web
- Scraping via APIs; example with `rnoaa`: http://bradleyboehmke.github.io/2016/01/scraping-via-apis.html
- Munzert, Simon. Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining. Chichester, West Sussex, United Kingdom: John Wiley & Sons Inc, 2015.
- Nolan, Deborah, and Duncan Temple Lang. XML and Web Technologies for Data Sciences with R. Use R! New York, NY: Springer New York, 2014. doi:10.1007/978-1-4614-7900-0.