092_smoothing.Rmd

<style>@import url(style.css);</style>
[Introduction to Data Analysis](index.html "Course index")

# 9.2. Smoothing

```{r packages, message = FALSE, warning = FALSE}
# Load packages.
packages <- c("changepoint", "downloader", "ggplot2", "MASS", "reshape", "splines", "XML")
packages <- lapply(packages, FUN = function(x) {
  if(!require(x, character.only = TRUE)) {
    install.packages(x)
    library(x, character.only = TRUE)
  }
})
```

This example is based on online ratings of TV shows. It was first coded at [Diffuse Prior][dp-geoscpt], and Andy at [Premier Soccer Stats][pss-geoscpt] have even coded it as an [interactive graph][pss-shiny] with [Shiny][shiny]. We'll start by scraping the data [from GEOS][tww-geos]. The series under scrutiny is Aaron Sorkin's _The West Wing_.

[dp-geoscpt]: http://diffuseprior.wordpress.com/2013/04/30/kalkalash-pinpointing-the-moments-the-simpsons-became-less-cromulent/
[pss-geoscpt]: http://www.premiersoccerstats.com/wordpress/?p=1380
[pss-shiny]: http://glimmer.rstudio.com/pssguy/TVShowRatings/
[shiny]: http://www.rstudio.com/shiny/
[tww-geos]: http://www.geos.tv/index.php/list?sid=179&collection=all

```{r tww-data}
file = "data/geos.tww.csv"
if(!file.exists(file)) {
  # Parse HTML content.
  html <- htmlParse("http://www.geos.tv/index.php/list?sid=179&collection=all")
  # Select table on id.
  html <- xpathApply(html, "//table[@id='collectionTable']")[[1]]
  # Convert to dataset.
  data <- readHTMLTable(html)
  # First data rows.
  head(data)
  # Save local copy.
  write.csv(data[, -3], file, row.names = FALSE)
}
# Read local copy.
data <- read.csv(file, stringsAsFactors = FALSE)
# Check result.
str(data)
```

`Mean` is the average rating of each episode, so we have a parameter, and `Count` is the number of votes on each episode, so we have a sample size. Using the equation for the standard error, $SE = \frac{SD}{\sqrt{N}}$, we will calculate the "margin of error" of each rating. Note that the distribution of the ratings is not normal due to a few episodes having received very high ratings.

```{r tww-se}
# Convert means from text.
data$mu <- as.numeric(substr(data$Mean, 0, 4))
# Compute standard errors.
data$se <- with(data, sd(mu) / sqrt(as.numeric(Count)))
```

The last step that we take with the data is to add the season number to be able to discriminate them visually later on. Each season of _The West Wing_ has 22 episodes, except for two special cases (the final season has 23 episodes and one show is a film special). We use the remainder of a division by 22 to compute seasons, fix the special cases, and factor the variable for plotting purposes.

```{r tww-seasons}
# Compute season.
data$season <- 1 + (data$X - 1) %/% 22
# Fix special cases.
data$season[which(data$season > 7)] <- c(7, NA)
# Factor variable.
data$season <- factor(data$season)
```

The final plot uses 95% and 99% confidence intervals to visualize (some of the) uncertainty.

```{r tww-plot-auto, fig.width = 12, fig.height = 9, tidy = FALSE}
g = qplot(data = data, x = X, y = mu, colour = season, geom = "point") + 
  geom_linerange(aes(ymin = mu - 1.96*se, ymax = mu + 1.96*se), alpha = .5) +
  geom_linerange(aes(ymin = mu - 2.58*se, ymax = mu + 2.58*se), alpha = .5) +
  scale_colour_brewer("Season", palette = "Set1") +
  scale_x_continuous(breaks = seq(1, 156, 22)) +
  theme(legend.position = "top") +
  labs(y = "Mean rating", x = "Episode")
g
```

The plot would be more useful with average ratings per season, which are easy to retrieve with the `ddply()` function. The minimum and maximum episode numbers are also computed to be able to plot horizontal segments for each season. Since we saved the previous plot as a `ggplot2` object, adding the lines only requires to code the extra graphical element and add it on top of the saved elements.

```{r tww-means-auto, fig.width = 12, fig.height = 9, tidy = FALSE}
# Compute season means.
means <- ddply(data, .(season), summarise, 
      mean = mean(mu), 
      xmin = min(X), 
      xmax = max(X))
# Add means to plot.
g + geom_segment(data = means, 
                 aes(x = xmin, xend = xmax, y = mean, yend = mean))
```

## Changepoints

If your aim is to find abrupt variations in the series, use a changepoint algorithm, like the one below from the [`changepoint][cpt] library.

[cpt]: http://cran.r-project.org/web/packages/changepoint/index.html

```{r tww-cpt-auto, fig.width = 12, fig.height = 9, tidy = FALSE}
# Compute changepoints with PELT algorithm.
cpt <- cpt.mean(data$mu, method = 'PELT')
# Extract results.
seg <- data.frame(cpt = attr(cpt, "param.est"))
seg$xmax <- attr(cpt, "cpts")
seg$xmin <- c(0, seg$xmax[-length(seg$xmax)])
# Plot.
g + geom_segment(data = seg, 
                 aes(x = xmin, xend = xmax, y = mean, yend = mean), 
                 color = "black")
```

By the way, if you find a way to predict this kind of data, [let Netflix know][nfp] and [submit your work early][nfp-nyt].

[nfp]: http://www.netflixprize.com/
[nfp-nyt]: bits.blogs.nytimes.com/2009/09/21/netflix-awards-1-million-prize-and-starts-a-new-contest/

## Smoothing: LOESS and cubic splines

Let's now smooth the series. The base plots will all use the same data, to which we are going to overlay a trendline computed through different methods.

```{r smoothers-base, tidy = FALSE}
# General plot function.
g = qplot(data = data,
          x = X,
          y = mu,
          alpha = I(0.5),
          geom = "line") + 
  scale_x_continuous(breaks = seq(1, 156, 22)) +
  labs(y = "Mean rating", x = "Episode")
```

The simplest ways to smooth the data is to use local polynomials, which we apply to the scores of each season, and then to their detrended values. The more complex smoothing method uses [splines][jm-splines] and is inspired by [Kieran Healy's code][kjh-guns] for state-by-state homicide trends in the USA and other OECD countries. It is used only in the last plot.

[kjh-guns]: https://github.com/kjhealy/assault-deaths
[jm-splines]: http://www.johnmarquess.com/?p=111

```{r smoothers-auto, fig.width = 12, fig.height = 9, message = FALSE, tidy = FALSE}
# LOESS smoother for each season.
g + 
  geom_smooth(se = FALSE) + 
  geom_hline(y = mean(data$mu), linetype = "dashed") +
  aes(colour = season)
# LOESS smoother for the detrended values.
g + 
  geom_smooth(se = FALSE) + 
  geom_hline(aes(y = mean(diff(mu))), linetype = "dashed") +
  aes(colour = season, y = c(0, diff(mu)))
# Cubic splines for the full series.
g + 
  geom_smooth(method = "rlm", se = FALSE, formula = y ~ ns(x, 8)) +
  geom_hline(y = mean(data$mu), linetype = "dashed")
```

The information provided by the cubic splines is almost identical to what we learnt from the changepoints: the series has three periods, due to one drop in audience ratings.

```{r splines-cpt-auto, fig.width = 12, fig.height = 9, message = FALSE, tidy = FALSE}
# Cubic splines and changepoints.
g + 
  geom_smooth(method = "rlm", se = FALSE, formula = y ~ ns(x, 8)) +
  geom_hline(y = mean(data$mu), linetype = "dashed") + 
  geom_segment(data = seg, 
               aes(x = xmin, xend = xmax, y = mean, yend = mean), 
               color = "black")
```

> __Next__: [Practice](093_practice.html).