---
output:
html_document:
toc: true
toc_depth: 4
---
![](https://i.ytimg.com/vi/8_5TCqHVEW8/maxresdefault.jpg)
We start this project in this lecture, and finish it in the next one.
The goal is to build a little collection of songs by our favourite artist. Let's say it's _Straight Line Stitch_ (they are great!). A little kicker for the [morning](https://www.youtube.com/watch?v=4_5VAKdHMek).
The **highly suggested** browser (or, at least, the one that I'll be using) is [Firefox](https://www.mozilla.org/en-US/firefox/developer/), the developer edition.
## Packages
> Don't be afraid of the dark you're still held up by the stars
We are going to use a bunch of the usual packages:
```{r message=FALSE, warning=FALSE}
library(tidyverse)
library(magrittr)
library(purrr)
library(glue)
library(stringr)
```
and introduce a couple of new ones:
```{r message=FALSE, warning=FALSE}
library(rvest)
library(xml2)
```
which are meant explicitly for scraping stuff out of web pages. We are going to use a couple more in the bonus section, if we get there.
## The lyrics
We are going to extract the lyrics from https://www.musixmatch.com/. I chose it because it's rather consistent, and because it's from Bologna, Italy (yeah!).
The website offers the first 15 lyrics up front. That will do for the moment (fixing that is not that easy). Let's take a look [here](https://www.musixmatch.com/artist/Straight-Line-Stitch#).
## Titles
First things first: we would like to get a list of those titles. Let's see how.
```{r}
url_titles <- "https://www.musixmatch.com/artist/Straight-Line-Stitch#"
page_title <- read_html(url_titles)
```
Now, what is this `page_title` object? Let's see:
```{r}
page_title
```
OK. It's a document. Thanks. And it's an XML document. That's more or less HTML. We'll handle it with `xml2` and `rvest`. Let's see a bit more of that page.
```{r}
page_title %>% html_structure()
```
Wait, whaaaaaat?
![](https://media.giphy.com/media/ZkEXisGbMawMg/giphy.gif)
To the browser! Look at those "class" tags: they are _CSS selectors_, and we will use them as handles to navigate the extremely complex list that we get from a web page.
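Before touching the real page, here's a toy sketch of mine (not part of the Musixmatch page) showing how a CSS selector picks out nodes:
```{r}
# read_html() also accepts a literal HTML string, handy for quick experiments
toy <- read_html('<div>
                    <h2 class="title">Song A</h2>
                    <h2 class="title">Song B</h2>
                    <p>some advertisement</p>
                  </div>')
# ".title" selects every node tagged with class "title", and nothing else
toy %>% html_nodes(".title")
```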
Sometimes we get lucky. For example, the titles on our page are all tagged with the class ".title". Let's see.
```{r}
page_title %>%
html_nodes(".title")
```
That's still quite a mess: we get too much stuff, such as the links (in the `href` attributes) and more text than we need. Let's clean it up with `html_text()`:
```{r}
page_title %>%
html_nodes(".title") %>%
html_text()
```
Wunderbar! Now we have 15 song titles. But we want the lyrics! Let's do better.
```{r}
SLS_df <- data_frame(Band = "Straight Line Stitch",
Title = page_title %>%
html_nodes(".title") %>%
html_text())
```
Now we are going to use a bit of string magic to build a link for each song:
```{r}
SLS_lyrics <- SLS_df %>% mutate(Link = glue('https://www.musixmatch.com/lyrics/{Band}/{Title}') %>%
str_replace_all(" ","-"))
```
It seems to work.
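We can eyeball a few of the links we just built (just a quick sanity peek):
```{r}
SLS_lyrics$Link %>% head(3)
```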
There is a better trick to do this job, though. If we look again at what we get when we select `.title`, you may see that the _actual_ link is already there, coded in the `href` attribute. Can we extract that? Yes we can!
```{r}
page_title %>%
html_nodes(".title") %>%
html_attrs() %>%
glimpse()
```
In particular, we want the element called `href`. Hey, we can get that with `map`!
```{r}
page_title %>%
html_nodes(".title") %>%
html_attrs() %>%
map_chr("href")
```
Or, even better, by letting `rvest` do the job for us:
```{r}
page_title %>%
html_nodes(".title") %>%
html_attr("href")
```
Let's store those links in our dataframe:
```{r}
SLS_df %<>%
mutate(Link = page_title %>%
html_nodes(".title") %>%
html_attr("href"))
```
Cool: we don't gain much in terms of lines of code, but it will be useful later!
## And `purrr`!
Cool, now we want to grab all the lyrics. Let's start with one song at a time. What is the URL we want?
```{r}
url_song <- glue("https://www.musixmatch.com{SLS_df$Link[1]}")
url_song
```
And let's grab the lyrics for that song. The content is marked by the CSS selector "p.mxm-lyrics__content": that stands for "p", a paragraph element, plus "mxm-lyrics__content", the specific class for the lyrics.
```{r}
url_song %>%
read_html() %>%
html_nodes(".mxm-lyrics__content") %>%
html_text()
```
Ach, notice that the text comes in different blocks: one for each section of the song, broken by the advertisements. Well, we can just collapse them back together with `glue::glue_collapse()`. As we are doing this, let's turn the whole flow into a function:
```{r}
get_lyrics <- function(link){
lyrics_chunks <- glue("https://www.musixmatch.com{link}#") %>%
read_html() %>%
html_nodes(".mxm-lyrics__content")
# we do a sanity check to see that there's something inside the lyrics!
stopifnot(length(lyrics_chunks) > 0)
lyrics <- lyrics_chunks %>%
html_text() %>%
glue_collapse(sep = "\n")
return(lyrics)
}
```
Let's test it!
```{r}
SLS_df$Link[3] %>%
get_lyrics() %>%
glue() # we pipe into glue() to get the nice formatting
```
Now we can use `purrr` to map that function over our dataframe!
```{r}
SLS_df %<>%
mutate(Lyrics = map_chr(Link, get_lyrics))
```
OK, here we were quite lucky, as all the links were right. In general we may want to play it safe and use a `possibly` wrapper, so we don't have to stop everything in case something bad happens.
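If `possibly` is new to you, here is a minimal sketch of mine showing what it does on a toy function (not part of the scraping flow):
```{r}
# possibly() wraps a function so that errors return a fallback value
# instead of stopping everything
safe_log <- purrr::possibly(log, otherwise = NA_real_)
safe_log(exp(1)) # works as usual
safe_log("oops") # would normally throw an error; now returns NA
```
We will use exactly this trick at the end of the lecture.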
## The flow
**Explore, try, test, automate, test.**
Scraping data from the web requires a lot of trial and error. In general, I like this flow: I explore the pages that I want to scrape, trying to identify patterns that I can exploit. Then I try on a smaller subset, and I test whether it worked. Then I automate it, using `purrr` or something similar. And finally, some more testing.
## Another Artist
Let's do this for Angel Haze. Notice that here we **have** to use the attributes from the web page, because the artist name coded in the link is not always the same (songs with featured artists, for example, get a longer name), so the `glue` approach would fail.
```{r}
AH_url <- "https://www.musixmatch.com/artist/Angel-Haze"
AH_lyrics <- data_frame(Band = "Angel Haze",
Title = AH_url %>%
read_html() %>%
html_nodes(css = ".title") %>%
html_text(),
Link = AH_url %>%
read_html() %>%
html_nodes(css = ".title") %>%
html_attr("href"),
Lyrics = map_chr(Link,get_lyrics))
```
### Bonus: sentiment analysis
The idea is to attribute a score to each word, expressing whether it's more negative or positive, and then to sum up. To do this, we are going to use Julia Silge and David Robinson's great [_tidytext_](https://github.com/juliasilge/tidytext) package and a _vocabulary_ of words for which we have scores (there are different options; we are using "afinn").
```{r}
library(tidytext)
afinn <- get_sentiments("afinn")
```
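To get a feel for the vocabulary, we can peek at the scores of a handful of emotionally loaded words (any words would do):
```{r}
# each word in the lexicon comes with a negative-to-positive score
afinn %>%
  filter(word %in% c("love", "hate", "happy", "horrible"))
```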
Now, a bit of massaging: we break the lyrics into their words, remove the words that are considered not interesting (they are called "stop words"), stitch the dataframe to the scores from afinn, and do the math for each song. The final `Score` is the average afinn score of the scored words in each song.
```{r}
SLS_df %>%
unnest_tokens(word, Lyrics) %>% #split words
anti_join(stop_words, by = "word") %>% #remove dull words
inner_join(afinn, by = "word") %>% #stitch scores
group_by(Title) %>% #and for each song
summarise(Length = n(), #do the math
Score = sum(score)/Length) %>%
arrange(-Score)
```
So, what was the most positive song?
```{r}
SLS_df %>%
filter(Title == "Promise Me") %$%
Lyrics %>%
glue()
```
And we can easily do the same with Angel Haze:
```{r}
AH_lyrics %>%
unnest_tokens(word, Lyrics) %>% #split words
anti_join(stop_words, by = "word") %>% #remove dull words
inner_join(afinn, by = "word") %>% #stitch scores
group_by(Title) %>% #and for each song
summarise(Length = n(), #do the math
Score = sum(score)/Length) %>%
arrange(-Score)
```
More resources about Sentiment Analysis (with Tidytext) are available [here](http://varianceexplained.org/r/yelp-sentiment/) and [here](http://www.jakubglinka.com/2017-03-01-text_mining_part1/).
## What about the rest?
We want to do this for other artists too. The best thing is to turn some of those scripts into functions. Let's try with _Billie Holiday_ and _A Tribe Called Red_ (I picked them 'cause they are great, and also because they will expose some limitations of the code that I'm interested in tackling).
When we are about to do something over and over, it's better to write functions. So, let's do it!
```{r}
get_words <- function(band_name){
# remove white space from band name
collapsed_name <- str_replace_all(band_name, " ", "-")
# define url to get the title and links
url <- glue("https://www.musixmatch.com/artist/{collapsed_name}")
# read title page and extract the title chunks
title_page <- url %>%
read_html() %>%
html_nodes(css = ".title")
# and build the dataframe
lyrics <- data_frame(Band = band_name,
# extract text title
Title = title_page %>%
html_text(),
# extract title link
Link = title_page %>%
html_attr("href"),
# map to get lyrics
Lyrics = map_chr(Link,get_lyrics))
return(lyrics)
}
```
And the sentiment analysis:
```{r}
get_soul <- function(Lyrics_df) {
Lyrics_df %>%
unnest_tokens(word, Lyrics) %>% #split words
#anti_join(stop_words, by = "word") %>% # optional: inner_join with afinn already drops most stop words
inner_join(afinn, by = "word") %>% #stitch scores
group_by(Title) %>% #and for each song
summarise(Length = n(), #do the math
Score = sum(score)/Length) %>%
arrange(-Score) %>%
return()
}
```
Let's see if it works:
```{r}
Billie_words <- "Billie Holiday" %>% get_words()
Billie_sentiment <- Billie_words %>% get_soul()
```
Most positive song:
<iframe width="560" height="315" src="https://www.youtube.com/embed/HxG6K59FUGQ" frameborder="0" allowfullscreen></iframe>
and the most negative:
<iframe width="560" height="315" src="https://www.youtube.com/embed/EIgVCU19pjg" frameborder="0" allowfullscreen></iframe>
It works for me! Let's finish with _A Tribe Called Red_.
```{r, eval=FALSE}
ATCR_words <- "A Tribe Called Red" %>% get_words()
```
Uuuh, we get an error now! What's the problem? Well, let's try to see. Some of the songs do not have lyrics yet! So, when we try to scrape this [page](https://www.musixmatch.com/lyrics/A-Tribe-Called-Red-ft-Black-Bear/Stadium-Pow-Wow) we get an error, as the text is simply not there.
This is a rather common situation when scraping: often what we are looking for is not there. Thus, we need a safer approach. We can either write ad hoc `if ... else ...` statements to control for the presence/absence of things, or (and it is better to do it anyhow) wrap our function in a `purrr::possibly()` construct. We can do it by modifying our workflow just slightly. Notice that now we use `get_lyrics_safe()` inside the mapping instead of `get_lyrics()`.
```{r}
get_lyrics_safe <- purrr::possibly(get_lyrics, NA_character_)
get_words <- function(band_name){
# remove white space from band name
collapsed_name <- str_replace_all(band_name, " ", "-")
# define url to get the title and links
url <- glue("https://www.musixmatch.com/artist/{collapsed_name}")
# read title page and extract the title chunks
title_page <- url %>%
read_html() %>%
html_nodes(css = ".title")
# and build the dataframe
lyrics <- data_frame(Band = band_name,
# extract text title
Title = title_page %>%
html_text(),
# extract title link
Link = title_page %>%
html_attr("href"),
# map to get lyrics
Lyrics = map_chr(Link,get_lyrics_safe))
return(lyrics)
}
```
Let's try again:
```{r}
ATCR_words <- "A Tribe Called Red" %>% get_words()
```
Much better.
```{r}
ATCR_sentiment <- ATCR_words %>% get_soul()
```
### Challenge
Another singer you should, should, should listen to is _Militia Vox_. Try to replicate our work with her lyrics. What's the problem? (If you think you've got the answer, please come and discuss it with me :-) )
**Note**: this workthrough is loosely inspired by Max Humber's [post](https://www.r-bloggers.com/fantasy-hockey-with-rvest-and-purrr/) and David Laing's post [here](https://laingdk.github.io/kendrick-lamar-data-science/). The great things come from them; the errors are mine.