---
output:
  bookdown::html_document2:
    fig_caption: yes
editor_options:
  chunk_output_type: console
---

```{r echo = FALSE, cache = FALSE}
# This block needs cache=FALSE to set fig.width and fig.height, and have those
# persist across cached builds.

source("utils.R", local = TRUE)
knitr::opts_chunk$set(
  fig.width = 3.5,
  fig.height = 3.5,
  # Print less for the examples in this chapter
  print_df_rows = c(2, 2)
)
```

Getting Your Data into Shape {#CHAPTER-DATAPREP}
============================

When it comes to making data graphics, half the battle occurs before you call any plotting commands. Before you pass your data to the plotting functions, it must first be read in and given the correct structure. The data sets provided with R are ready to use, but when dealing with real-world data, this usually isn't the case: you'll have to clean up and restructure the data before you can visualize it.

The recipes in this chapter will often use packages from the *tidyverse*. For a little background about the tidyverse, see the introduction section of Chapter \@ref(CHAPTER-R-BASICS). I will also show how to do many of the same tasks using base R, because in some situations it is important to minimize the number of packages you use, and because it is useful to be able to understand code written for base R.

> **Note**
>
> The `%>%` symbol, also known as the pipe operator, is used extensively in this chapter. If you are not familiar with it, see Recipe \@ref(RECIPE-R-BASICS-PIPE).

Most of the tidyverse functions used in this chapter are from the dplyr package, and in this chapter, I'll assume that dplyr is already loaded. You can load it with `library(tidyverse)`, or, if you want to keep things more streamlined, you can load dplyr directly:

```{r eval=FALSE}
library(dplyr)
```

Data sets in R are most often stored in data frames. They're typically used as two-dimensional data structures, with each row representing one case and each column representing one variable. Data frames are essentially lists of vectors and factors, all of the same length, where each vector or factor represents one column.

Here's the `heightweight` data set:

```{r}
library(gcookbook) # Load gcookbook for the heightweight data set
heightweight
```

It consists of five columns, with each row representing one case: a set of information about a single person. We can get a clearer idea of how it's structured by using the `str()` function:

```{r}
str(heightweight)
```

The first column, `sex`, is a factor with two levels, `"f"` and `"m"`, and the other four columns are vectors of numbers (one of them, `ageMonth`, is specifically a vector of integers, but for the purposes here, it behaves the same as any other numeric vector).

Factors and character vectors behave similarly in ggplot -- the main difference is that with character vectors, items will be displayed in lexicographical order, but with factors, items will be displayed in the same order as the factor levels, which you can control.
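
As a small sketch of that difference (the values and the names `size_chr` and `size_fct` are made up for illustration), compare the alphabetical order of a character vector with the level order of a factor built from it:

```{r}
# Hypothetical example values, not taken from any data set in this chapter
size_chr <- c("small", "large", "medium")
size_fct <- factor(size_chr, levels = c("small", "medium", "large"))

sort(unique(size_chr))  # Alphabetical order: large, medium, small
levels(size_fct)        # The order we chose: small, medium, large
```

The first result is the order you would get on an axis or legend with the character vector; the second is what you would get with the factor.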


Creating a Data Frame {#RECIPE-DATAPREP-CREATE-DATAFRAME}
---------------------

### Problem

You want to create a data frame from vectors.

### Solution

You can put vectors together in a data frame with `data.frame()`:

```{r}
# Two starting vectors
g <- c("A", "B", "C")
x <- 1:3
dat <- data.frame(g, x)
dat
```

### Discussion

A data frame is essentially a list of vectors and factors. Each vector or factor can be thought of as a column in the data frame.
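
One way to see this for yourself is to treat a data frame as a list. The following is a minimal sketch with made-up values; `df_demo` is just an illustrative name, and `stringsAsFactors = TRUE` is set explicitly so that the result does not depend on your version of R:

```{r}
# A small, hypothetical data frame with one factor column and one integer column
df_demo <- data.frame(
  grp = c("A", "B", "C"),
  val = 1:3,
  stringsAsFactors = TRUE
)

is.list(df_demo)    # TRUE: a data frame is a list underneath
length(df_demo)     # 2: one list element per column
class(df_demo$grp)  # "factor"
df_demo[["val"]]    # Pull out one column as a plain vector
```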

If your vectors are in a list, you can convert the list to a data frame with the `as.data.frame()` function:

```{r}
lst <- list(group = g, value = x)    # A list of vectors

dat <- as.data.frame(lst)
```

The tidyverse way of creating a data frame is to use `tibble()` or `as_tibble()`. (Older code often uses `data_frame()` and `as_data_frame()`, which are now deprecated.) This returns a special kind of data frame, a *tibble*, which behaves like a regular data frame in most contexts, but prints out more nicely and is specifically designed to play well with the tidyverse functions.

```{r}
tibble(g, x)
```

```{r eval=FALSE}
# Convert the list of vectors to a tibble
as_tibble(lst)
```

A regular data frame can be converted to a tibble using `as_tibble()`:

```{r}
as_tibble(dat)
```


Getting Information About a Data Structure {#RECIPE-DATAPREP-INFO-DATA}
------------------------------------------

### Problem

You want to find out information about an object or data structure.

### Solution

Use the `str()` function:

```{r}
str(ToothGrowth)
```

This tells us that `ToothGrowth` is a data frame with three columns, `len`, `supp`, and `dose`. `len` and `dose` contain numeric values, while `supp` is a factor with two levels.

Another useful function is the `summary()` function:

```{r}
summary(ToothGrowth)
```

Instead of showing you the first few values of each column as `str()` does, `summary()` provides basic descriptive statistics (the minimum, maximum, median, mean, and first and third quartiles) for numeric variables, and the number of values at each level for factor variables. For character variables, it reports only the length, class, and mode.

### Discussion

The `str()` function is very useful for finding out more about data structures. One common source of problems is a data frame where one of the columns is a character vector instead of a factor, or vice versa. This can cause puzzling issues with analyses or graphs.

When you print out a data frame the normal way, by just typing the name at the prompt and pressing Enter, factor and character columns appear exactly the same. The difference will be revealed only when you run `str()` on the data frame, or print out the column by itself:

```{r}
tg <- ToothGrowth
tg$supp <- as.character(tg$supp)
str(tg)
```

```{r}
# Print out the columns by themselves
# From old data frame (factor)
ToothGrowth$supp
# From new data frame (character)
tg$supp
```
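
If you only want the type of each column, without the rest of the `str()` output, one option is to apply `class()` to every column. This is just a sketch; `tg_chr` is a throwaway copy made for this example:

```{r}
# Make a copy in which supp is a character vector instead of a factor
tg_chr <- ToothGrowth
tg_chr$supp <- as.character(tg_chr$supp)

sapply(ToothGrowth, class)  # supp is a factor here
sapply(tg_chr, class)       # supp is a character vector here
```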


Adding a Column to a Data Frame {#RECIPE-DATAPREP-ADD-COL}
-------------------------------

### Problem

You want to add a column to a data frame.

### Solution

Use `mutate()` from dplyr to add a new column and assign values to it. This returns a new data frame, which you'll typically want to save over the original.

If you assign a single value to the new column, the entire column will be filled with that value. This adds a column named `newcol`, filled with `NA`:

```{r}
library(dplyr)

ToothGrowth %>%
  mutate(newcol = NA)
```

You can also assign a vector to the new column:

```{r}
# Since ToothGrowth has 60 rows, we must create a new vector that has 60 elements
vec <- rep(c(1, 2), 30)

ToothGrowth %>%
  mutate(newcol = vec)
```

Note that the vector being added to the data frame must either have one element, or the same number of elements as the data frame has rows. In the example above we created a new vector with 60 elements by repeating the values `c(1, 2)` thirty times.
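
To illustrate the length rule, here is a sketch of what happens when the vector has the wrong length. The error is caught with `tryCatch()` so the code keeps running; `bad_len` and the replacement message are only for this example:

```{r}
library(dplyr)

# A vector of length 3 is neither length 1 nor length 60, so mutate() fails
bad_len <- tryCatch(
  ToothGrowth %>% mutate(newcol = c(1, 2, 3)),
  error = function(e) "error: length is neither 1 nor 60"
)
bad_len
```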

### Discussion

Each column of a data frame is a vector. R handles columns in data frames slightly differently from standalone vectors because all the columns in a data frame must have the same length.

To add a column using base R, you can simply assign values into the new column like so:

```{r}
# Make a copy of ToothGrowth for this example
ToothGrowth2 <- ToothGrowth

# Assign NA's for the whole column
ToothGrowth2$newcol <- NA

# Assign 1 and 2, automatically repeating to fill
ToothGrowth2$newcol <- c(1, 2)
```

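With base R, the vector being assigned into the data frame will automatically be repeated to fill the number of rows in the data frame, as long as the number of rows is a whole multiple of the vector's length. Otherwise, R raises an error.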
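
The following sketch checks that recycling behavior on a throwaway copy, `tg_recycle`. A vector of length 3 divides evenly into 60 rows and is repeated; a vector of length 7 does not, and the assignment fails (the error is caught with `tryCatch()` so the example keeps running):

```{r}
tg_recycle <- ToothGrowth

# 60 is a multiple of 3, so the values are repeated 20 times
tg_recycle$newcol <- c(1, 2, 3)
table(tg_recycle$newcol)

# 60 is not a multiple of 7, so this assignment raises an error
res <- tryCatch(
  {
    tg_recycle$bad <- 1:7
    "ok"
  },
  error = function(e) "error"
)
res
```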


Deleting a Column from a Data Frame {#RECIPE-DATAPREP-DELETE-COL}
-----------------------------------

### Problem

You want to delete a column from a data frame.

### Solution

Use `select()` from dplyr and specify the columns you want to drop by using `-` (a minus sign). This returns a new data frame, which you'll typically want to save over the original.

```{r eval=FALSE}
# Remove the len column
ToothGrowth %>%
  select(-len)
```

### Discussion

You can list multiple columns that you want to drop at the same time, or conversely specify only the columns that you want to keep. The following two pieces of code are thus equivalent:

```{r eval=FALSE}
# Remove both len and supp from ToothGrowth
ToothGrowth %>%
  select(-len, -supp)

# This keeps just dose, which has the same effect for this data set
ToothGrowth %>%
  select(dose)
```

To remove a column using base R, you can simply assign `NULL` to that column.

```{r eval=FALSE}
ToothGrowth$len <- NULL
```

### See Also

Recipe \@ref(RECIPE-DATAPREP-SUBSET) for more on getting a subset of a data frame.

See `?select` for more ways to drop and keep columns.


Renaming Columns in a Data Frame {#RECIPE-DATAPREP-RENAME-COL}
--------------------------------

### Problem

You want to rename the columns in a data frame.

### Solution

Use `rename()` from dplyr. This returns a new data frame:

```{r eval=FALSE}
ToothGrowth %>%
  rename(length = len)
```

### Discussion

You can rename multiple columns within the same call to `rename()`:

```{r}
ToothGrowth %>%
  rename(
    length = len,
    supplement_type = supp
  )
```

Renaming a column using base R is a bit more verbose. It uses the `names()` function on the left side of the `<-` operator.


```{r}
# Make a copy of ToothGrowth for this example
ToothGrowth2 <- ToothGrowth

names(ToothGrowth2)  # Print the names of the columns

# Rename "len" to "length"
names(ToothGrowth2)[names(ToothGrowth2) == "len"] <- "length"

names(ToothGrowth2)  # Confirm that the column was renamed
```

### See Also

See `?select` for more ways to rename columns within a data frame.


Reordering Columns in a Data Frame {#RECIPE-DATAPREP-REORDER-COL}
----------------------------------

### Problem

You want to change the order of columns in a data frame.

### Solution

Use `select()` from dplyr:

```{r}
ToothGrowth %>%
  select(dose, len, supp)
```

The new data frame will contain the columns you specified in `select()`, in the order you specified. Note that `select()` returns a new data frame, so if you want to change the original variable, you'll need to save the new result over it.

### Discussion

If you are only reordering a few variables and want to keep the rest of the variables in order, you can use `everything()` as a placeholder:

```{r}
ToothGrowth %>%
  select(dose, everything())
```

See `?select_helpers` for other ways to select columns. You can, for example, select columns by matching parts of the name.
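
As a brief sketch of matching by name, the helpers `starts_with()` and `contains()` can be used inside `select()`. The names `tg_d` and `tg_p` are only for this example:

```{r}
library(dplyr)

# Keep columns whose names start with "d"
tg_d <- ToothGrowth %>%
  select(starts_with("d"))
names(tg_d)

# Keep columns whose names contain "p"
tg_p <- ToothGrowth %>%
  select(contains("p"))
names(tg_p)
```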

Using base R, you can also reorder columns by their name or numeric position. This returns a new data frame, which can be saved over the original.

```{r eval=FALSE}
ToothGrowth[c("dose", "len", "supp")]

ToothGrowth[c(3, 1, 2)]
```

In these examples, I used list-style indexing. A data frame is essentially a list of vectors, and indexing into it as a list will return another data frame. You can get the same effect with matrix-style indexing:


```{r eval=FALSE}
ToothGrowth[c("dose", "len", "supp")]   # List-style indexing

ToothGrowth[, c("dose", "len", "supp")] # Matrix-style indexing
```

In this case, both methods return the same result, a data frame. However, when retrieving a single column, list-style indexing will return a data frame, while matrix-style indexing will return a vector:

```{r}
ToothGrowth["dose"]
ToothGrowth[, "dose"]
```

You can use `drop=FALSE` to ensure that it returns a data frame:

```{r}
ToothGrowth[, "dose", drop=FALSE]
```
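
A quick way to confirm what each form returns is to check the type of the result. This is a small sketch; the names `col_list`, `col_matrix`, and `col_keep` are only for illustration:

```{r}
col_list   <- ToothGrowth["dose"]                  # List-style
col_matrix <- ToothGrowth[, "dose"]                # Matrix-style
col_keep   <- ToothGrowth[, "dose", drop = FALSE]  # Matrix-style, keeping the data frame

is.data.frame(col_list)    # TRUE
is.data.frame(col_matrix)  # FALSE: this one is a numeric vector
is.data.frame(col_keep)    # TRUE
```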


Getting a Subset of a Data Frame {#RECIPE-DATAPREP-SUBSET}
--------------------------------

### Problem

You want to get a subset of a data frame.

### Solution

Use `filter()` to get the rows, and `select()` to get the columns you want. These operations can be chained together using the `%>%` operator. These functions return a new data frame, so if you want to change the original variable, you'll need to save the new result over it.

We'll use the `climate` data set for the examples here:

```{r}
library(gcookbook) # Load gcookbook for the climate data set
climate
```

Let's say that you want to keep only the rows where `Source` is `"Berkeley"` and where the year is between 1900 and 2000, inclusive. You can do so with the `filter()` function:

```{r eval=FALSE}
climate %>%
  filter(Source == "Berkeley" & Year >= 1900 & Year <= 2000)
```

If you want only the `Year` and `Anomaly10y` columns, use `select()`, as we did in Recipe \@ref(RECIPE-DATAPREP-DELETE-COL):

```{r}
climate %>%
  select(Year, Anomaly10y)
```

These operations can be chained together using the `%>%` operator:

```{r}
climate %>%
  filter(Source == "Berkeley" & Year >= 1900 & Year <= 2000) %>%
  select(Year, Anomaly10y)
```
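
The same row filter can be written in other ways. In the sketch below, conditions separated by commas in `filter()` are combined as if with `&`, and dplyr's `between()` tests an inclusive range. To keep the example self-contained, it uses a tiny data frame, `clim_demo`, whose values are made up and are not from the `climate` data set:

```{r}
library(dplyr)

# Hypothetical values, for illustration only
clim_demo <- data.frame(
  Source     = c("Berkeley", "Berkeley", "CRUTEM3", "Berkeley"),
  Year       = c(1899, 1900, 1950, 2000),
  Anomaly10y = c(-0.5, -0.4, 0.0, 0.3)
)

with_and <- clim_demo %>%
  filter(Source == "Berkeley" & Year >= 1900 & Year <= 2000)

with_between <- clim_demo %>%
  filter(Source == "Berkeley", between(Year, 1900, 2000))

with_and
all.equal(with_and, with_between)
```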

### Discussion

The `filter()` function picks out rows based on a condition. If you want to pick out rows based on their numeric position, use the `slice()` function:

```{r eval=FALSE}
slice(climate, 1:100)
```

I generally recommend indexing using names rather than numbers when possible. It makes the code easier to understand when you're collaborating with others or when you come back to it months or years after writing it, and it makes the code less likely to break when there are changes to the data, such as when columns are added or removed.

With base R, you can get a subset of rows like this:

```{r}
climate[climate$Source == "Berkeley" & climate$Year >= 1900 & climate$Year <= 2000, ]
```

Notice that we needed to prefix each column name with `climate$`, and that there's a comma after the selection criteria. This indicates that we're getting rows, not columns.

This row filtering can also be combined with the column selection from Recipe \@ref(RECIPE-DATAPREP-REORDER-COL):

```{r}
climate[climate$Source == "Berkeley" & climate$Year >= 1900 & climate$Year <= 2000,
        c("Year", "Anomaly10y")]
```


Changing the Order of Factor Levels {#RECIPE-DATAPREP-FACTOR-REORDER}
-----------------------------------

### Problem

You want to change the order of levels in a factor.

### Solution

Pass the factor to `factor()`, and give it the levels in the order you want. This returns a new factor, so if you want to change the original variable, you'll need to save the new result over it.


```{r}
# By default, levels are ordered alphabetically
sizes <- factor(c("small", "large", "large", "small", "medium"))
sizes

factor(sizes, levels = c("small", "medium", "large"))
```

The order can also be specified with `levels` when the factor is first created:

```{r eval=FALSE}
factor(c("small", "large", "large", "small", "medium"),
       levels = c("small", "medium", "large"))
```


### Discussion

There are two kinds of factors in R: ordered factors and regular factors. (In practice, ordered factors are not commonly used.) In both types, the levels are arranged in *some* order; the difference is that the order is meaningful for an ordered factor, but it is arbitrary for a regular factor -- it simply reflects how the data is stored. For plotting data, the distinction between ordered and regular factors is generally unimportant, and they can be treated the same.
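
As a minimal sketch of the difference, an ordered factor is created with `ordered = TRUE`, and its levels can then be compared with `<` and `>`. The names `sz_reg` and `sz_ord` are only for this example:

```{r}
sz_values <- c("small", "large", "medium")
sz_levels <- c("small", "medium", "large")

sz_reg <- factor(sz_values, levels = sz_levels)
sz_ord <- factor(sz_values, levels = sz_levels, ordered = TRUE)

is.ordered(sz_reg)  # FALSE
is.ordered(sz_ord)  # TRUE

# Comparisons are meaningful only for the ordered factor
sz_ord < "large"    # TRUE FALSE TRUE
```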

The order of factor levels affects graphical output. When a factor variable is mapped to an aesthetic property in ggplot, the aesthetic adopts the ordering of the factor levels. If a factor is mapped to the x-axis, the ticks on the axis will be in the order of the factor levels, and if a factor is mapped to color, the items in the legend will be in the order of the factor levels.

To reverse the level order, you can use `rev(levels())`:

```{r eval=FALSE}
factor(sizes, levels = rev(levels(sizes)))
```

The tidyverse function for reordering factors is `fct_relevel()` from the forcats package. It has a syntax similar to the `factor()` function from base R.

```{r}
# Change the order of levels
library(forcats)
fct_relevel(sizes, "small", "medium", "large")
```


### See Also

To reorder a factor based on the value of another variable, see Recipe \@ref(RECIPE-DATAPREP-FACTOR-REORDER-VALUE).

Reordering factor levels is useful for controlling the order of axes and legends. See Recipes \@ref(RECIPE-AXIS-ORDER) and \@ref(RECIPE-LEGEND-ORDER) for more information.


Changing the Order of Factor Levels Based on Data Values {#RECIPE-DATAPREP-FACTOR-REORDER-VALUE}
--------------------------------------------------------

### Problem

You want to change the order of levels in a factor based on values in the data.

### Solution

Use `reorder()` with the factor that has levels to reorder, the values to base the reordering on, and a function that aggregates the values:

```{r}
# Make a copy of the InsectSprays data set since we're modifying it
iss <- InsectSprays
iss$spray

iss$spray <- reorder(iss$spray, iss$count, FUN = mean)
iss$spray
```

Notice that the original levels were `ABCDEF`, while the reordered levels are `CEDABF`. What we've done is reorder the levels of `spray` based on the mean value of `count` for each level of `spray`.
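
To check the new order by hand, you can compute the mean of `count` for each spray and sort the results. This sketch uses `tapply()` from base R; `spray_means` is just an illustrative name:

```{r}
# Mean count for each level of spray
spray_means <- tapply(InsectSprays$count, InsectSprays$spray, mean)

# Sorted from smallest to largest mean, the names give the new level order
sort(spray_means)
names(sort(spray_means))
```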

### Discussion

The usefulness of `reorder()` might not be obvious from just looking at the raw output. Figure \@ref(fig:FIG-DATAPREP-FACTOR-REORDER-VALUE) shows three plots: one of the original data and two made with `reorder()`. In the latter two, the order in which the items appear is determined by their values.

```{r FIG-DATAPREP-FACTOR-REORDER-VALUE, echo=FALSE, fig.show="hold", fig.cap="Original data (left); Reordered by the mean of each group (middle); Reordered by the median of each group (right)", fig.height=2.5, fig.width=3}
ggplot(InsectSprays, aes(spray, count)) +
  geom_boxplot()

ggplot(InsectSprays, aes(reorder(spray, count, FUN = mean), count)) +
  geom_boxplot()

ggplot(InsectSprays, aes(reorder(spray, count, FUN = median), count)) +
  geom_boxplot()
```

In the middle plot in Figure \@ref(fig:FIG-DATAPREP-FACTOR-REORDER-VALUE), the boxes are sorted by the mean. The horizontal line that runs across each box represents the *median* of the data. Notice that these values do not increase strictly from left to right. That's because with this particular data set, sorting by the mean gives a different order than sorting by the median. To make the median lines increase from left to right, as in the plot on the right in Figure \@ref(fig:FIG-DATAPREP-FACTOR-REORDER-VALUE), we used the `median()` function in `reorder()`.

The tidyverse function for reordering factors is `fct_reorder()` from the forcats package, and it is used the same way as `reorder()`. These do the same thing:

```{r eval=FALSE}
reorder(iss$spray, iss$count, FUN = mean)
fct_reorder(iss$spray, iss$count, .fun = mean)
```
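
One thing to watch for is that the two functions have different defaults: `reorder()` uses the mean when `FUN` is not given, while `fct_reorder()` uses the median when `.fun` is not given. The following sketch compares the resulting level orders; the variable names are only for this example:

```{r}
library(forcats)

# With the summary function given explicitly, the level orders agree
by_mean_base    <- reorder(InsectSprays$spray, InsectSprays$count, FUN = mean)
by_mean_forcats <- fct_reorder(InsectSprays$spray, InsectSprays$count, .fun = mean)
identical(levels(by_mean_base), levels(by_mean_forcats))

# With the defaults, reorder() sorts by mean and fct_reorder() by median
levels(reorder(InsectSprays$spray, InsectSprays$count))
levels(fct_reorder(InsectSprays$spray, InsectSprays$count))
```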

### See Also

Reordering factor levels is also useful for controlling the order of axes and legends. See Recipes \@ref(RECIPE-AXIS-ORDER) and \@ref(RECIPE-LEGEND-ORDER) for more information.


Changing the Names of Factor Levels {#RECIPE-DATAPREP-FACTOR-RENAME}
-----------------------------------

### Problem

You want to change the names of levels in a factor.

### Solution

Use `fct_recode()` from the forcats package:

```{r}
sizes <- factor(c( "small", "large", "large", "small", "medium"))
sizes

# Pass it the mappings as named arguments, in the form new = "old"
fct_recode(sizes, S = "small", M = "medium", L = "large")
```

### Discussion

If you want to use two vectors, one with the original levels and one with the new ones, use `do.call()` with `fct_recode()`.

```{r}
old <- c("small", "medium", "large")
new <- c("S", "M", "L")

# Create a named vector that has the mappings between old and new
mappings <- setNames(old, new)
mappings

# Create a list of the arguments to pass to fct_recode
args <- c(list(sizes), mappings)

# Look at the structure of the list
str(args)

# Use do.call to call fct_recode with the arguments
do.call(fct_recode, args)
```

Or, more concisely, we can do all of that in one go:

```{r}
do.call(
  fct_recode,
  c(list(sizes), setNames(c("small", "medium", "large"), c("S", "M", "L")))
)
```

For a more traditional (and clunky) base R method for renaming factor levels, use the `levels()<-` function:

```{r}
sizes <- factor(c( "small", "large", "large", "small", "medium"))

# Index into the levels and rename each one
levels(sizes)[levels(sizes) == "large"]  <- "L"
levels(sizes)[levels(sizes) == "medium"] <- "M"
levels(sizes)[levels(sizes) == "small"]  <- "S"
sizes
```

If you are renaming *all* your factor levels, there is a simpler method. You can pass a list to `levels()<-`:

```{r}
sizes <- factor(c("small", "large", "large", "small", "medium"))
levels(sizes) <- list(S = "small", M = "medium", L = "large")
sizes
```

With this method, all factor levels must be specified in the list; if any are missing, they will be replaced with `NA`.
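
Here is a small sketch of what happens when a level is left out of the list. The factor `sz_partial` is a throwaway copy for this example:

```{r}
sz_partial <- factor(c("small", "large", "large", "small", "medium"))

# "medium" is not mentioned in the list
levels(sz_partial) <- list(S = "small", L = "large")

sz_partial          # The "medium" item is now NA
levels(sz_partial)  # Only S and L remain as levels
```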

It's also possible to rename factor levels by position, but this is somewhat inelegant:

```{r}
sizes <- factor(c("small", "large", "large", "small", "medium"))
levels(sizes)[1] <- "L"
sizes

# Rename all levels at once
levels(sizes) <- c("L", "M", "S")
sizes
```

It's safer to rename factor levels by name rather than by position, since you will be less likely to make a mistake (and mistakes here may be hard to detect). Also, if your input data set changes to have more or fewer levels, the numeric positions of the existing levels could change, which could cause serious but nonobvious problems for your analysis.

### See Also

If, instead of a factor, you have a character vector with items to rename, see Recipe \@ref(RECIPE-DATAPREP-CHARACTER-RENAME).


Removing Unused Levels from a Factor {#RECIPE-DATAPREP-FACTOR-DROPLEVELS}
------------------------------------

### Problem

You want to remove unused levels from a factor.

### Solution

Sometimes, after processing your data you will have a factor that contains levels that are no longer used. Here's an example:

```{r}
sizes <- factor(c("small", "large", "large", "small", "medium"))
sizes <- sizes[1:3]
sizes
```

To remove them, use `droplevels()`:

```{r}
droplevels(sizes)
```

### Discussion

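The `droplevels()` function preserves the order of factor levels. It also works on a data frame, in which case it drops the unused levels from every factor column; you can use the `except` parameter to specify columns that should be left as they are.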
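
Unused levels often appear after taking a subset of a data frame. The following sketch removes one group from `PlantGrowth` and then calls `droplevels()` on the whole data frame; `pg_sub` is just an illustrative name:

```{r}
# Keep only the rows that are not in group trt2
pg_sub <- PlantGrowth[PlantGrowth$group != "trt2", ]

levels(pg_sub$group)              # Still includes "trt2"
levels(droplevels(pg_sub)$group)  # "ctrl" "trt1"

# group is column 2, so with except = 2 its levels are left alone
levels(droplevels(pg_sub, except = 2)$group)
```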

The tidyverse way: Use `fct_drop()` from the forcats package:

```{r}
fct_drop(sizes)
```


Changing the Names of Items in a Character Vector {#RECIPE-DATAPREP-CHARACTER-RENAME}
-------------------------------------------------

### Problem

You want to change the names of items in a character vector.

### Solution

Use `recode()` from the dplyr package:

```{r}
library(dplyr)

sizes <- c("small", "large", "large", "small", "medium")
sizes

# With recode(), pass it a named vector with the mappings
recode(sizes, small = "S", medium = "M", large = "L")

# Can also use quotes -- useful if there are spaces or other strange characters
recode(sizes, "small" = "S", "medium" = "M", "large" = "L")
```

### Discussion

If you want to use two vectors, one with the original values and one with the new ones, use `do.call()` with `recode()`.

```{r}
old <- c("small", "medium", "large")
new <- c("S", "M", "L")
# Create a named vector that has the mappings between old and new
mappings <- setNames(new, old)
mappings

# Create a list of the arguments to pass to recode
args <- c(list(sizes), mappings)
# Look at the structure of the list
str(args)
# Use do.call to call recode with the arguments
do.call(recode, args)
```

Or, more concisely, we can do all of that in one go:

```{r}
do.call(
  recode,
  c(list(sizes), setNames(c("S", "M", "L"), c("small", "medium", "large")))
)
```

Note that for `recode()`, the names and values of the arguments are reversed compared to the `fct_recode()` function from the forcats package. With `recode()`, you would use `small="S"`, whereas for `fct_recode()`, you would use `S="small"`.

A more traditional R method is to use square-bracket indexing to select the items and rename them:

```{r}
sizes <- c("small", "large", "large", "small", "medium")
sizes[sizes == "small"]  <- "S"
sizes[sizes == "medium"] <- "M"
sizes[sizes == "large"]  <- "L"
sizes
```

### See Also

If, instead of a character vector, you have a factor with levels to rename, see Recipe \@ref(RECIPE-DATAPREP-FACTOR-RENAME).


Recoding a Categorical Variable to Another Categorical Variable {#RECIPE-DATAPREP-RECODE-CATEGORICAL}
---------------------------------------------------------------

### Problem

You want to recode a categorical variable to another categorical variable.

### Solution

For the examples here, we'll use a subset of the `PlantGrowth` data set:

```{r}
# Work on a subset of the PlantGrowth data set
pg <- PlantGrowth[c(1,2,11,21,22), ]
pg
```

In this example, we'll recode the categorical variable `group` into another categorical variable, `treatment`. If the old value was `"ctrl"`, the new value will be `"No"`, and if the old value was `"trt1"` or `"trt2"`, the new value will be `"Yes"`.

This can be done with the `recode()` function from the dplyr package:

```{r}
library(dplyr)

recode(pg$group, ctrl = "No", trt1 = "Yes", trt2 = "Yes")
```

You can assign it as a new column in the data frame:

```{r eval=FALSE}
pg$treatment <- recode(pg$group, ctrl = "No", trt1 = "Yes", trt2 = "Yes")
```

Note that since the input was a factor, it returns a factor. If you want to get a character vector instead, use `as.character()`:

```{r}
recode(as.character(pg$group), ctrl = "No", trt1 = "Yes", trt2 = "Yes")
```


### Discussion

You can also use the `fct_recode()` function from the forcats package. It works the same, except the names and values are swapped, which may be a little more intuitive:

```{r}
library(forcats)
fct_recode(pg$group, No = "ctrl", Yes = "trt1", Yes = "trt2")
```

Another difference is that `fct_recode()` will always return a factor, whereas `recode()` will return a character vector if it is given a character vector, and will return a factor if it is given a factor. (Although dplyr does have a `recode_factor()` function which also always returns a factor.)


Using base R, recoding can be done with the `match()` function:

```{r}
oldvals <- c("ctrl", "trt1", "trt2")
newvals <- factor(c("No", "Yes", "Yes"))

newvals[ match(pg$group, oldvals) ]
```
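
To see how this works, it can help to look at the intermediate result of `match()`: for each item in the first vector, it returns the position of that item in the second vector, and those positions are then used to index into the new values. A small sketch with made-up input follows; the `_demo` names are only for this example:

```{r}
grp_demo <- c("ctrl", "trt2", "trt1", "ctrl")
old_demo <- c("ctrl", "trt1", "trt2")
new_demo <- c("No", "Yes", "Yes")

# Position of each item of grp_demo within old_demo
idx <- match(grp_demo, old_demo)
idx            # 1 3 2 1

# Use those positions to look up the new values
new_demo[idx]  # "No" "Yes" "Yes" "No"
```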

It can also be done by indexing in the vectors:

```{r echo=FALSE}
# Reset the data
pg <- PlantGrowth[c(1,2,11,21,22), ]
```

```{r}
pg$treatment[pg$group == "ctrl"] <- "No"
pg$treatment[pg$group == "trt1"] <- "Yes"
pg$treatment[pg$group == "trt2"] <- "Yes"

# Convert to a factor
pg$treatment <- factor(pg$treatment)
pg
```

Here, we combined two of the factor levels and put the result into a new column. If you simply want to rename the levels of a factor, see Recipe \@ref(RECIPE-DATAPREP-FACTOR-RENAME).

The coding criteria can also be based on values in multiple columns, by using the `&` and `|` operators:

```{r echo=FALSE}
# Reset the data
pg <- PlantGrowth[c(1,2,11,21,22), ]
```

```{r}
pg$newcol[pg$group == "ctrl" & pg$weight < 5]  <- "no_small"
pg$newcol[pg$group == "ctrl" & pg$weight >= 5] <- "no_large"
pg$newcol[pg$group == "trt1"] <- "yes"
pg$newcol[pg$group == "trt2"] <- "yes"
pg$newcol <- factor(pg$newcol)
pg
```

It's also possible to combine two columns into one using the `interaction()` function, which appends the values with a `.` in between. This combines the `weight` and `group` columns into a new column, `weightgroup`:

```{r echo=FALSE}
# Reset the data
pg <- PlantGrowth[c(1,2,11,21,22), ]
```

```{r}
pg$weightgroup <- interaction(pg$weight, pg$group)
pg
```

### See Also

For more on renaming factor levels, see Recipe \@ref(RECIPE-DATAPREP-FACTOR-RENAME).

See Recipe \@ref(RECIPE-DATAPREP-RECODE-CONTINUOUS) for recoding continuous values to categorical values.


Recoding a Continuous Variable to a Categorical Variable {#RECIPE-DATAPREP-RECODE-CONTINUOUS}
--------------------------------------------------------

### Problem

You want to recode a continuous variable to a categorical variable.

### Solution

Use the `cut()` function. In this example, we'll use the `PlantGrowth` data set and recode the continuous variable `weight` into a categorical variable, `wtclass`, using the `cut()` function:

```{r}
pg <- PlantGrowth
pg$wtclass <- cut(pg$weight, breaks = c(0, 5, 6, Inf))
pg
```

### Discussion

For three categories we specify four bounds, which can include `Inf` and `-Inf`. If a data value falls outside of the specified bounds, it's categorized as `NA`. The result of `cut()` is a factor, and you can see from the example that the factor levels are named after the bounds.

To change the names of the levels, set the labels:

```{r}
pg$wtclass <- cut(pg$weight, breaks = c(0, 5, 6, Inf),
                  labels = c("small", "medium", "large"))
pg
```

As indicated by the factor levels, the bounds are by default *open* on the left and *closed* on the right. In other words, they don't include the lowest value, but they do include the highest value. For the smallest category, you can have it include both the lower and upper values by setting `include.lowest=TRUE`. In this example, this would result in 0 values going into the small category; otherwise, 0 would be coded as `NA`.
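
The effect of `include.lowest` is easiest to see with values that fall exactly on the bounds. In this sketch, `edge_vals` is a made-up vector, not part of `PlantGrowth`:

```{r}
# Hypothetical values that sit exactly on the break points
edge_vals <- c(0, 5, 6)

# Default: 0 is outside (0,5], so it becomes NA
cut_default <- cut(edge_vals, breaks = c(0, 5, 6, Inf))
cut_default

# With include.lowest = TRUE, the first interval becomes [0,5]
cut_lowest <- cut(edge_vals, breaks = c(0, 5, 6, Inf), include.lowest = TRUE)
cut_lowest
```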

If you want the categories to be closed on the left and open on the right, set `right = FALSE`:

```{r}
cut(pg$weight, breaks = c(0, 5, 6, Inf), right = FALSE)
```

### See Also

To recode a categorical variable to another categorical variable, see Recipe \@ref(RECIPE-DATAPREP-RECODE-CATEGORICAL).


Calculating New Columns From Existing Columns {#RECIPE-DATAPREP-CALCULATE}
---------------------------------------------

### Problem

You want to calculate a new column of values in a data frame.

### Solution

Use `mutate()` from the dplyr package.

```{r}
library(gcookbook) # Load gcookbook for the heightweight data set
heightweight
```

This will convert `heightIn` to centimeters and store it in a new column, `heightCm`:

```{r}
library(dplyr)
heightweight %>%
  mutate(heightCm = heightIn * 2.54)
```

This returns a new data frame, so if you want to replace the original variable, you will need to save the result over it.

### Discussion

You can use `mutate()` to transform multiple columns at once:

```{r}
heightweight %>%
  mutate(
    heightCm = heightIn * 2.54,
    weightKg = weightLb / 2.204
  )
```

It is also possible to calculate a new column based on multiple columns:

```{r eval=FALSE}
heightweight %>%
  mutate(bmi = weightKg / (heightCm / 100)^2)
```

With `mutate()`, the columns are added sequentially. That means that we can reference a newly-created column when calculating a new column:

```{r}
heightweight %>%
  mutate(
    heightCm = heightIn * 2.54,
    weightKg = weightLb / 2.204,
    bmi = weightKg / (heightCm / 100)^2
  )
```

With base R, calculating a new column can be done by referencing the new column with the `$` operator and assigning some values to it:

```{r, eval=FALSE}
heightweight$heightCm <- heightweight$heightIn * 2.54
```

### See Also

See Recipe \@ref(RECIPE-DATAPREP-CALCULATE-GROUP) for how to perform group-wise transformations on data.


Calculating New Columns by Groups {#RECIPE-DATAPREP-CALCULATE-GROUP}
---------------------------------

### Problem

You want to create new columns that are the result of calculations performed on groups of data, as specified by a grouping column.

### Solution

Use `group_by()` from the dplyr package to specify the grouping variable, and then specify the operations in `mutate()`:

```{r}
library(MASS)  # Load MASS for the cabbages data set
library(dplyr)

cabbages %>%
  group_by(Cult) %>%
  mutate(DevWt = HeadWt - mean(HeadWt))
```

This returns a new data frame, so if you want to replace the original variable, you will need to save the result over it.

### Discussion

Let's take a closer look at the `cabbages` data set. It has two grouping variables (factors): `Cult`, which has levels `c39` and `c52`, and `Date`, which has levels `d16`, `d20`, and `d21`. It also has two measured numeric variables, `HeadWt` and `VitC`:

```{r}
cabbages
```

Suppose we want to find, for each case, the deviation of `HeadWt` from the overall mean. All we have to do is take the overall mean and subtract it from the observed value for each case:

```{r}
mutate(cabbages, DevWt = HeadWt - mean(HeadWt))
```

You'll often want to do separate operations like this for each group, where the groups are specified by one or more grouping variables. Suppose, for example, we want to normalize the data within each group by finding the deviation of each case from the mean *within the group*, where the groups are specified by `Cult`. In these cases, we can use `group_by()` and `mutate()` together:

```{r}
cb <- cabbages %>%
  group_by(Cult) %>%
  mutate(DevWt = HeadWt - mean(HeadWt))
```

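This first groups `cabbages` based on the value of `Cult`. There are two levels of `Cult`, `c39` and `c52`, so there are two groups. It then applies the `mutate()` operation within each group, so `mean(HeadWt)` is the mean for that group rather than the overall mean.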

The before and after results are shown in Figure \@ref(fig:FIG-DATAPREP-CALCULATE-GROUP):

```{r FIG-DATAPREP-CALCULATE-GROUP, fig.show="hold", fig.cap="Before normalizing (left); After normalizing (right)"}
# The data before normalizing
ggplot(cb, aes(x = Cult, y = HeadWt)) +
  geom_boxplot()

# After normalizing
ggplot(cb, aes(x = Cult, y = DevWt)) +
  geom_boxplot()
```

You can also group the data frame on multiple variables and perform operations on multiple variables. The following code groups the data by `Cult` and `Date`, forming a group for each distinct combination of the two variables. After forming these groups, the code will calculate the deviation of `HeadWt` and `VitC` from the mean of each group:

```{r}
cabbages %>%
  group_by(Cult, Date) %>%
  mutate(
    DevWt = HeadWt - mean(HeadWt),
    DevVitC = VitC - mean(VitC)
  )
```
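
For comparison, the same group-wise deviations can be computed in base R with `ave()`. This is a sketch rather than part of the recipe's solution; the names `cb2`, `DevWt`, and `DevWt2` are chosen here for illustration:

```r
library(MASS)  # For the cabbages data set

cb2 <- cabbages  # Work on a copy

# ave() returns a vector the same length as its input, with each element
# replaced by the mean of its group (the default FUN is mean)
cb2$DevWt <- cb2$HeadWt - ave(cb2$HeadWt, cb2$Cult)

# Additional grouping variables are passed as further arguments
cb2$DevWt2 <- cb2$HeadWt - ave(cb2$HeadWt, cb2$Cult, cb2$Date)

head(cb2)
```

Because the deviations are taken from each group's own mean, they should average to zero within every group.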

### See Also

To summarize data by groups, see Recipe \@ref(RECIPE-DATAPREP-SUMMARIZE).


Summarizing Data by Groups {#RECIPE-DATAPREP-SUMMARIZE}
--------------------------

### Problem

You want to summarize your data, based on one or more grouping variables.

### Solution

Use `group_by()` and `summarise()` from the dplyr package, and specify the operations to do:

```{r}
library(MASS)  # Load MASS for the cabbages data set
library(dplyr)

cabbages %>%
  group_by(Cult, Date) %>%
  summarise(
    Weight = mean(HeadWt),
    VitC = mean(VitC)
  )
```

### Discussion

There are a few things going on here that may be unfamiliar if you're new to dplyr and the tidyverse in general.

First, let's take a closer look at the `cabbages` data set. It has two factors that can be used as grouping variables: `Cult`, which has levels `c39` and `c52`, and `Date`, which has levels `d16`, `d20`, and `d21`. It also has two numeric variables, `HeadWt` and `VitC`:

```{r}
cabbages
```

Finding the overall mean of `HeadWt` is simple. We could just use the `mean()` function on that column, but for reasons that will soon become clear, we'll use the `summarise()` function instead:

```{r}
library(dplyr)
summarise(cabbages, Weight = mean(HeadWt))
```

The result is a data frame with one row and one column, named `Weight`.

Often we want to find information about each subset of the data, as specified by a grouping variable. For example, suppose we want to find the mean of each `Cult` group. To do this, we can use `summarise()` with `group_by()`.

```{r}
tmp <- group_by(cabbages, Cult)
summarise(tmp, Weight = mean(HeadWt))
```

The command first groups the data frame `cabbages` based on the value of `Cult`. There are two levels of `Cult`, `c39` and `c52`, so there are two groups. It then applies the `summarise()` function to each of these data frames; it calculates `Weight` by taking the `mean()` of the `HeadWt` column in each of the sub-data frames. The resulting summaries for each group are assembled into a data frame, which is returned.

You can imagine that the `cabbages` data is split up into two separate data frames, then `summarise()` is called on each data frame (returning a one-row data frame for each), and then those results are combined together into a final data frame. This is actually how things worked in dplyr's predecessor, plyr, with the `ddply()` function.
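
The split, apply, and combine steps described above can be written out by hand in base R. This sketch only illustrates the mental model; it is not how dplyr is implemented internally, and the names `pieces`, `summaries`, and `by_cult` are made up for the example:

```r
library(MASS)  # For the cabbages data set

# Split: one data frame per level of Cult
pieces <- split(cabbages, cabbages$Cult)

# Apply: compute a one-row summary data frame for each piece
summaries <- lapply(names(pieces), function(level) {
  data.frame(Cult = level, Weight = mean(pieces[[level]]$HeadWt))
})

# Combine: stack the one-row data frames into a single result
by_cult <- do.call(rbind, summaries)
by_cult
```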

The syntax of the previous code used a temporary variable to store results. That's a little verbose, so instead, we can use `%>%`, also known as the pipe operator, to chain the function calls together. The pipe operator simply takes what's on its left and substitutes it as the first argument of the function call on the right. The following two lines of code are equivalent:

```{r eval=FALSE}
group_by(cabbages, Cult)
# The pipe operator moves `cabbages` to the first argument position of group_by()
cabbages %>% group_by(Cult)
```

The reason it's called a pipe operator is that it lets you connect function calls together in sequence to form a pipeline of operations. Another common term for this is a different metaphor: *chaining*.

So the first argument of the function call is in a different place. So what? The advantages become apparent when chaining is involved. Here's what it would look like if you wanted to call `group_by()` and then `summarise()` without making use of a temporary variable. Instead of proceeding left to right, the computation occurs from the inside out:

```{r eval=FALSE}
summarise(group_by(cabbages, Cult), Weight = mean(HeadWt))
```

Using a temporary variable, as we did earlier, makes it more readable, but a more elegant solution is to use the pipe operator:

```{r eval=FALSE}
cabbages %>%
  group_by(Cult) %>%
  summarise(Weight = mean(HeadWt))
```

Back to summarizing data. Summarizing the data frame by grouping using more variables (or columns) is simple: just give it the names of the additional variables. It's also possible to get more than one summary value by specifying more calculated columns. Here we'll summarize each `Cult` and `Date` group, getting the average of `HeadWt` and `VitC`:

```{r}
cabbages %>%
  group_by(Cult, Date) %>%
  summarise(
    Weight = mean(HeadWt),
    VitC = mean(VitC)
  )
```

> **Note**
>
> You might have noticed that it says that the result is grouped by `Cult`, but not `Date`. This is because the `summarise()` function removes one level of grouping. This is typically what you want when the input has one grouping variable. When there are multiple grouping variables, this may or may not be what you want. To remove all grouping, use `ungroup()`, and to add back the original grouping, use `group_by()` again.
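
One way to check which grouping remains is dplyr's `group_vars()` function, which is not otherwise used in this recipe. A small sketch, with `by_cult_date` as a name chosen for the example:

```r
library(MASS)  # For the cabbages data set
library(dplyr)

by_cult_date <- cabbages %>%
  group_by(Cult, Date) %>%
  summarise(Weight = mean(HeadWt))

# The last grouping variable (Date) has been dropped; Cult remains
group_vars(by_cult_date)

# ungroup() removes the remaining grouping
group_vars(ungroup(by_cult_date))
```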

It's possible to do more than take the mean. You may, for example, want to compute the standard deviation and count of each group. To get the standard deviation, use `sd()`, and to get a count of rows in each group, use `n()`:

```{r}
cabbages %>%
  group_by(Cult, Date) %>%
  summarise(
    Weight = mean(HeadWt),
    sd = sd(HeadWt),
    n = n()
  )
```

Other useful functions for generating summary statistics include `min()`, `max()`, and `median()`. The `n()` function is a special function that works only inside of the dplyr functions `summarise()`, `mutate()` and `filter()`. See `?summarise` for more useful functions.

The `n()` function gets a count of rows, but if you want it *not* to count `NA` values from a column, you need to use a different technique. For example, if you want it to ignore any `NA`s in the `HeadWt` column, use `sum(!is.na(HeadWt))`.
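
As a sketch of the difference, the following computes both counts side by side. The copy `cab_na` and the column names `n` and `n_valid` are chosen here for illustration:

```r
library(MASS)  # For the cabbages data set
library(dplyr)

cab_na <- cabbages                      # Make a copy
cab_na$HeadWt[c(1, 20, 45)] <- NA       # Set some values to NA

counts <- cab_na %>%
  group_by(Cult) %>%
  summarise(
    n       = n(),                      # All rows in the group
    n_valid = sum(!is.na(HeadWt))       # Only rows where HeadWt is not NA
  )
counts
```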


#### Dealing with NAs

One potential pitfall is that `NA`s in the data will lead to `NA`s in the output. Let's see what happens if we sprinkle a few `NA`s into `HeadWt`:

```{r}
c1 <- cabbages # Make a copy
c1$HeadWt[c(1, 20, 45)] <- NA # Set some values to NA

c1 %>%
  group_by(Cult) %>%
  summarise(
    Weight = mean(HeadWt),
    sd = sd(HeadWt),
    n = n()
  )
```

The problem is that `mean()` and `sd()` simply return `NA` if any of the input values are `NA`. Fortunately, these functions have an option to deal with this very issue: setting `na.rm = TRUE` will tell them to ignore the `NA`s.

```{r}
c1 %>%
  group_by(Cult) %>%
  summarise(
    Weight = mean(HeadWt, na.rm = TRUE),
    sd = sd(HeadWt, na.rm = TRUE),
    n = n()
  )
```


#### Missing combinations {#_missing_combinations}

If there are any empty combinations of the grouping variables, they will not appear in the summarized data frame. These missing combinations can cause problems when making graphs. To illustrate, we'll remove all entries that have levels `c52` and `d21`. The graph on the left in Figure \@ref(fig:FIG-DATAPREP-SUMMARIZE-MISSING-COMBO) shows what happens when there's a missing combination in a bar graph:

```{r FIG-DATAPREP-SUMMARIZE-MISSING-COMBO-1, eval=FALSE}
# Copy cabbages and remove all rows with both c52 and d21
c2 <- filter(cabbages, !( Cult == "c52" & Date == "d21" ))
c2a <- c2 %>%
  group_by(Cult, Date) %>%
  summarise(Weight = mean(HeadWt))

ggplot(c2a, aes(x = Date, fill = Cult, y = Weight)) +
  geom_col(position = "dodge")
```


To fill in the missing combination (Figure \@ref(fig:FIG-DATAPREP-SUMMARIZE-MISSING-COMBO), right), use the `complete()` function from the tidyr package, which is also part of the tidyverse. The grouping for `c2a` must also be removed with `ungroup()`; otherwise `complete()` will return too many rows.

```{r FIG-DATAPREP-SUMMARIZE-MISSING-COMBO-2, eval=FALSE}
library(tidyr)
c2b <- c2a %>%
  ungroup() %>%
  complete(Cult, Date)

ggplot(c2b, aes(x = Date, fill = Cult, y = Weight)) +
  geom_col(position = "dodge")
```

```{r, FIG-DATAPREP-SUMMARIZE-MISSING-COMBO, ref.label=c("FIG-DATAPREP-SUMMARIZE-MISSING-COMBO-1", "FIG-DATAPREP-SUMMARIZE-MISSING-COMBO-2"), fig.show="hold", fig.cap="Bar graph with a missing combination (left); With missing combination filled (right)", fig.width=4, fig.height=3, warning=FALSE}
```

When we used `complete()`, it filled in the missing combinations with `NA`. It's possible to fill with a different value, with the `fill` parameter. See `?complete` for more information.
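
For instance, the `fill` argument takes a named list giving the replacement value for each column. The following sketch fills the missing `Weight` with 0; the name `c2_zero` is chosen here for illustration, and whether 0 is an appropriate stand-in depends on your data, since a zero-height bar can be read as a measured value of zero:

```r
library(MASS)  # For the cabbages data set
library(dplyr)
library(tidyr)

c2_zero <- cabbages %>%
  filter(!(Cult == "c52" & Date == "d21")) %>%
  group_by(Cult, Date) %>%
  summarise(Weight = mean(HeadWt)) %>%
  ungroup() %>%
  # Fill the missing combination's Weight with 0 instead of NA
  complete(Cult, Date, fill = list(Weight = 0))

c2_zero
```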

### See Also

If you want to calculate standard errors and confidence intervals, see Recipe \@ref(RECIPE-DATAPREP-SUMMARIZE-SE).

See Recipe \@ref(RECIPE-DISTRIBUTION-BOXPLOT-MEAN) for an example of using `stat_summary()` to calculate means and overlay them on a graph.

To perform transformations on data by groups, see Recipe \@ref(RECIPE-DATAPREP-CALCULATE-GROUP).


Summarizing Data with Standard Errors and Confidence Intervals {#RECIPE-DATAPREP-SUMMARIZE-SE}
--------------------------------------------------------------

### Problem

You want to summarize your data with the standard error of the mean and/or confidence intervals.

### Solution

Getting the standard error of the mean involves two steps: first get the standard deviation and count for each group, then use those values to calculate the standard error. The standard error for each group is just the standard deviation divided by the square root of the sample size:

```{r}
library(MASS)  # Load MASS for the cabbages data set
library(dplyr)

ca <- cabbages %>%
  group_by(Cult, Date) %>%
  summarise(
    Weight = mean(HeadWt),
    sd = sd(HeadWt),
    n = n(),
    se = sd / sqrt(n)
  )

ca
```

### Discussion

The `summarise()` function computes the columns in order, so you can refer to previous newly-created columns. That's why `se` can use the `sd` and `n` columns.

The `n()` function gets a count of rows, but if you want it *not* to count `NA` values from a column, you need to use a different technique. For example, if you want it to ignore any `NA`s in the `HeadWt` column, use `sum(!is.na(HeadWt))`.


#### Confidence Intervals {#_confidence_intervals}

Confidence intervals are calculated using the standard error of the mean and the degrees of freedom. To calculate a confidence interval, use the `qt()` function to get the quantile, then multiply that by the standard error. The `qt()` function will give quantiles of the *t*-distribution when given a probability level and degrees of freedom. For a 95% confidence interval, use a probability level of .975; for the bell-shaped *t*-distribution, this will in essence cut off 2.5% of the area under the curve at either end. The degrees of freedom equal the sample size minus one.

This will calculate the multiplier for each group. There are six groups and each has the same number of observations (10), so they will all have the same multiplier:

```{r}
ciMult <- qt(.975, ca$n - 1)
ciMult
```

Now we can multiply that vector by the standard error to get the 95% confidence interval:

```{r}
ca$ci95 <- ca$se * ciMult
ca
```

This could be done in one line, like this:

```{r eval=FALSE}
ca$ci95 <- ca$se * qt(.975, ca$n - 1)
```

For a 99% confidence interval, use .995.
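
Note that the `ci95` value is the half-width of the interval: the interval itself runs from the mean minus `ci95` to the mean plus `ci95`. As a sanity check, the following sketch computes the interval by hand for a single group and compares it with the one reported by base R's `t.test()`. The variable names are chosen here for illustration:

```r
library(MASS)  # For the cabbages data set

# HeadWt values for one group: cultivar c39 on date d16
x <- cabbages$HeadWt[cabbages$Cult == "c39" & cabbages$Date == "d16"]

n    <- length(x)
se   <- sd(x) / sqrt(n)
ci95 <- se * qt(.975, n - 1)

# Interval computed by hand: mean minus and plus the half-width
by_hand <- c(mean(x) - ci95, mean(x) + ci95)
by_hand

# Interval reported by t.test(), which should agree
from_ttest <- as.numeric(t.test(x)$conf.int)
from_ttest
```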

Error bars that represent the standard error of the mean and confidence intervals serve the same general purpose: to give the viewer an idea of how good the estimate of the population mean is. The standard error is the standard deviation of the sampling distribution. Confidence intervals are a little easier to interpret. Very roughly, a 95% confidence interval means that there's a 95% chance that the true population mean is within the interval (actually, it doesn't mean this at all, but this seemingly simple topic is way too complicated to cover here; if you want to know more, read up on Bayesian statistics).

The following function will perform all the steps of calculating the standard deviation, count, standard error, and confidence intervals. It can also handle `NA`s and missing combinations, with the `na.rm` and `.drop` options. By default, it provides a 95% confidence interval, but this can be set with the `conf.interval` argument:

```{r}
summarySE <- function(data = NULL, measurevar, groupvars = NULL, na.rm = FALSE,
                      conf.interval = .95, .drop = TRUE) {

  # New version of length which can handle NA's: if na.rm==T, don't count them
  length2 <- function(x, na.rm = FALSE) {
    if (na.rm) sum(!is.na(x))
    else       length(x)
  }

  groupvars  <- rlang::syms(groupvars)
  measurevar <- rlang::sym(measurevar)

  datac <- data %>%
    dplyr::group_by(!!!groupvars, .drop = .drop) %>%
    dplyr::summarise(
      N             = length2(!!measurevar, na.rm = na.rm),
      sd            = sd     (!!measurevar, na.rm = na.rm),
      !!measurevar := mean   (!!measurevar, na.rm = na.rm),
      se            = sd / sqrt(N),
      # Confidence interval multiplier for standard error
      # Calculate t-statistic for confidence interval:
      # e.g., if conf.interval is .95, use .975 (above/below), and use df=N-1
      ci            = se * qt(conf.interval/2 + .5, N - 1)
    ) %>%
    dplyr::ungroup() %>%
    # Rearrange the columns so that sd, se, ci are last
    dplyr::select(seq_len(ncol(.) - 4), ncol(.) - 2, sd, se, ci)

  datac
}
```

The following usage example has a 99% confidence interval and handles `NA`s and missing combinations:

```{r}
# Remove all rows with both c52 and d21
c2 <- filter(cabbages, !(Cult == "c52" & Date == "d21" ))
# Set some values to NA
c2$HeadWt[c(1, 20, 45)] <- NA
summarySE(c2, "HeadWt", c("Cult", "Date"),
          conf.interval = .99, na.rm = TRUE, .drop = FALSE)
```

### See Also

See Recipe \@ref(RECIPE-ANNOTATE-ERROR-BAR) to use the values calculated here to add error bars to a graph.


Converting Data from Wide to Long {#RECIPE-DATAPREP-WIDE-TO-LONG}
---------------------------------

### Problem

You want to convert a data frame from "wide" format to "long" format.

### Solution

Use `gather()` from the tidyr package. In the `anthoming` data set, for each `angle`, there are two measurements: one column contains measurements in the experimental condition and the other contains measurements in the control condition:


```{r}
library(gcookbook) # For the data set
anthoming
```

We can reshape the data so that all the measurements are in one column. This will put the values from `expt` and `ctrl` into one column, and put the names into a different column:

```{r}
library(tidyr)
gather(anthoming, condition, count, expt, ctrl)
```

This data frame represents the same information as the original one, but it is structured in a way that is more conducive to some analyses.

### Discussion

In the source data, there are *ID* variables and *value* variables. The ID variables are those that specify which values go together. In the source data, the first row holds measurements for when `angle` is –20. In the output data frame, the two measurements, for `expt` and `ctrl`, are no longer in the same row, but we can still tell that they belong together because they have the same value of `angle`.

The value variables are by default all the non-ID variables. The names of these variables are put into a new *key* column, which we called `condition`, and the values are put into a new *value* column which we called `count`.

You can designate the *value* columns from the source data by naming them individually, as we did above with `expt` and `ctrl`. `gather()` automatically inferred that the ID variable was the remaining column, `angle`. Another way to tell it which columns are values is to do the reverse: if you exclude the `angle` column, then `gather()` will infer that the value columns are the remaining ones, `expt` and `ctrl`.

```{r eval=FALSE}
gather(anthoming, condition, count, expt, ctrl)
# Prepending the column name with a '-' means it is not a value column
gather(anthoming, condition, count, -angle)
```

There are other convenient shortcuts to specify which columns are values. For example `expt:ctrl` means to select all columns between `expt` and `ctrl` (in this particular case, there are no other columns in between, but for a larger data set you can imagine how this would save typing).
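
As a quick check that the range shortcut selects the same columns, this sketch compares the two forms. The data frame `ant` is made up for the example and only mimics the column layout of `anthoming`:

```r
library(tidyr)

# A made-up data frame with the same column layout as anthoming
ant <- data.frame(
  angle = c(-20, -10, 0),
  expt  = c(1, 7, 2),
  ctrl  = c(0, 3, 3)
)

by_name  <- gather(ant, condition, count, expt, ctrl)   # Name each column
by_range <- gather(ant, condition, count, expt:ctrl)    # Select a range

# Both calls should produce the same long-format data frame
identical(by_name, by_range)
```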

By default, `gather()` will use all of the columns from the source data as either ID columns or value columns. That means that if you want to ignore some columns, you'll need to filter them out first using the `select()` function.

For example, in the `drunk` data set, suppose we want to convert it to long format, keeping `sex` in one column and putting the numeric values in another column. This time, we want the values for only the `0-29` and `30-39` columns, and we want to discard the values for the other age ranges:

```{r}
# Our source data
drunk

# Try gather() with just 0-29 and 30-39
drunk %>%
  gather(age, count, "0-29", "30-39")
```

That doesn't look right! We told `gather()` that `0-29` and `30-39` were the value columns we wanted, and it automatically inferred that we wanted to use all of the other columns as ID columns, when we wanted to just keep `sex` and discard the others. The solution is to use `select()` to remove the unwanted columns first, and then `gather()`.

```{r}
library(dplyr)  # For the select() function

drunk %>%
  select(sex, "0-29", "30-39") %>%
  gather(age, count, "0-29", "30-39")
```


There are times when you may want to use more than one column as the ID variables:

```{r}
plum_wide
# Use length and time as the ID variables (by not naming them as value variables)
gather(plum_wide, "survival", "count", dead, alive)
```

Some data sets don't come with a column with an ID variable. For example, in the `corneas` data set, each row represents one pair of measurements, but there is no ID variable. Without an ID variable, you won't be able to tell how the values are meant to be paired together. In these cases, you can add an ID variable before using `gather()`:

```{r}
# Make a copy of the data
co <- corneas
# Add an ID column
co$id <- 1:nrow(co)

gather(co, "eye", "thickness", affected, notaffected)
```

Having numeric values for the ID variable may be problematic for subsequent analyses, so you may want to convert `id` to a character vector with `as.character()`, or a factor with `factor()`.
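
The `gather()` calls in this recipe can also be written with `pivot_longer()`, a newer tidyr function that this recipe does not otherwise use. The following sketch assumes tidyr 1.0.0 or later, and uses a made-up data frame, `ant`, that mimics the column layout of `anthoming`:

```r
library(tidyr)  # pivot_longer() assumes tidyr 1.0.0 or later

# A made-up data frame with the same column layout as anthoming
ant <- data.frame(
  angle = c(-20, -10, 0),
  expt  = c(1, 7, 2),
  ctrl  = c(0, 3, 3)
)

ant_long <- pivot_longer(
  ant,
  cols      = c(expt, ctrl),   # The value columns
  names_to  = "condition",     # New column holding the old column names
  values_to = "count"          # New column holding the values
)
ant_long
```

The rows come out in a different order than with `gather()`: `pivot_longer()` keeps the measurements from each source row together, whereas `gather()` stacks one value column after another.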

### See Also

See Recipe \@ref(RECIPE-DATAPREP-LONG-TO-WIDE) to do conversions in the other direction, from long to wide.

See the `stack()` function for another way of converting from wide to long.
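
As a brief sketch of the `stack()` approach, using a made-up data frame (`ant`) with the same column layout as `anthoming`:

```r
# A made-up data frame with the same column layout as anthoming
ant <- data.frame(
  angle = c(-20, -10, 0),
  expt  = c(1, 7, 2),
  ctrl  = c(0, 3, 3)
)

# stack() concatenates the selected columns into `values` and records
# the source column name in `ind`
ant_stacked <- stack(ant, select = c(expt, ctrl))

# stack() does not carry ID columns along, so add angle back by hand,
# repeated once for each stacked column
ant_stacked$angle <- rep(ant$angle, times = 2)
ant_stacked
```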


Converting Data from Long to Wide {#RECIPE-DATAPREP-LONG-TO-WIDE}
---------------------------------

### Problem

You want to convert a data frame from "long" format to "wide" format.

### Solution

Use the `spread()` function from the tidyr package. In this example, we'll use the `plum` data set, which is in a long format:

```{r}
library(gcookbook) # For the data set
plum
```

The conversion to wide format takes each unique value in one column and uses those values as headers for new columns, then uses another column for source values. For example, we can "move" values in the `survival` column to the top and fill them with values from `count`:

```{r}
library(tidyr)
spread(plum, survival, count)
```

### Discussion

The `spread()` function requires you to specify a *key* column which is used for header names, and a *value* column which is used to fill the values in the output data frame. It's assumed that you want to use all the other columns as ID variables.

In the preceding example, there are two ID columns, `length` and `time`, one key column, `survival`, and one value column, `count`. What if we want to use two of the columns as keys? Suppose, for example, that we want to use `length` and `survival` as keys. This would leave us with `time` as the ID column.

The way to do this is to combine the `length` and `survival` columns, put the result in a new column, and then use that new column as the key.

```{r}
# Create a new column, length_survival, from length and survival.
plum %>%
  unite(length_survival, length, survival)

# Now pass it to spread() and use length_survival as a key
plum %>%
  unite(length_survival, length, survival) %>%
  spread(length_survival, count)
```
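
An alternative that avoids the separate `unite()` step is `pivot_wider()`, a newer tidyr function that this recipe does not otherwise use; it accepts more than one key column directly. This sketch assumes tidyr 1.0.0 or later. The data frame `plum_toy` is made up: it has the same columns as `plum`, but the counts are invented:

```r
library(tidyr)  # pivot_wider() assumes tidyr 1.0.0 or later

# A made-up data frame with the same columns as plum; the counts are invented
plum_toy <- data.frame(
  length   = rep(c("long", "short"), each = 4),
  time     = rep(rep(c("at_once", "in_spring"), each = 2), times = 2),
  survival = rep(c("dead", "alive"), times = 4),
  count    = c(10, 20, 30, 40, 50, 60, 70, 80)
)

# Use both length and survival as keys; their values are joined with "_"
# to form the new column names
plum_toy_wide <- pivot_wider(
  plum_toy,
  names_from  = c(length, survival),
  values_from = count
)
plum_toy_wide
```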

### See Also

See Recipe \@ref(RECIPE-DATAPREP-WIDE-TO-LONG) to do conversions in the other direction, from wide to long.

See the `unstack()` function for another way of converting from long to wide.
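
As a brief sketch of the `unstack()` approach, using made-up data (the data frame `long_toy` and its counts are invented for the example):

```r
# A made-up long-format data frame with one key column and one value column
long_toy <- data.frame(
  survival = rep(c("dead", "alive"), each = 3),
  count    = c(10, 30, 50, 20, 40, 60)
)

# The formula reads "values ~ grouping": one output column per survival level
wide_toy <- unstack(long_toy, count ~ survival)
wide_toy
```

Unlike `spread()`, `unstack()` does not keep ID columns, and values are matched up purely by their order within each group, so it is best suited to simple cases.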


Converting a Time Series Object to Times and Values {#RECIPE-DATAPREP-TIMESERIES}
---------------------------------------------------

### Problem

You have a time series object that you wish to convert to numeric vectors representing the time and values at each time.

### Solution

Use the `time()` function to get the time for each observation, then convert the times and values to numeric vectors with `as.numeric()`:

```{r}
# Look at nhtemp Time Series object
nhtemp

# Get times for each observation
as.numeric(time(nhtemp))

# Get value of each observation
as.numeric(nhtemp)
# Put them in a data frame
nht <- data.frame(year = as.numeric(time(nhtemp)), temp = as.numeric(nhtemp))
nht
```

### Discussion

Time series objects efficiently store information when there are observations at regular time intervals, but for use with ggplot2, they need to be converted to a format that separately represents times and values for each observation.
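
If you do this conversion often, the steps from the solution can be wrapped in a small helper function. The function name `ts_to_df()` and its column names are hypothetical, chosen here for illustration:

```r
# Hypothetical helper: convert a univariate time series object to a
# two-column data frame of times and values
ts_to_df <- function(x) {
  data.frame(
    time  = as.numeric(time(x)),
    value = as.numeric(x)
  )
}

nht2 <- ts_to_df(nhtemp)
head(nht2)
```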

Some time series objects are cyclical. The `presidents` data set, for example, contains four observations per year, one for each quarter:

```{r}
presidents
```

To convert it to a two-column data frame with one column representing the year with fractional values, we can do the same as before:

```{r}
pres_rating <- data.frame(
  year = as.numeric(time(presidents)),
  rating = as.numeric(presidents)
)
pres_rating
```

It is also possible to store the year and quarter in separate columns, which may be useful in some visualizations:

```{r}
pres_rating2 <- data.frame(
  year = as.numeric(floor(time(presidents))),
  quarter = as.numeric(cycle(presidents)),
  rating = as.numeric(presidents)
)
pres_rating2
```

### See Also

The zoo package is also useful for working with time series objects.


```{r echo=FALSE}
# Restore to original
options(knit_print_df_rows = NULL)
```