index.qmd

---
title: "Keeping it Tidy"
subtitle: "*Using the Tidyverse to Organize, Transform, and Visualize Data*<br>Workshop hosted by Meghan Hall, through R-Ladies Paris<br>September 8, 2022 | 1pm EST / 7pm CEST"
from: markdown+emoji
header-includes: |
  <link href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/6.1.1/css/all.min.css"
        rel="stylesheet"/>
execute:
  warning: false
---

<a href="https://www.meetup.com/rladies-paris/events/287856868/" class="btn btn-primary">**Register now**</a> <a href="https://meghan.rbind.io/slides/rladiesparis/rladiesparis.html#/title-slide" class="btn btn-primary">**Workshop slides**</a> <a href="https://github.com/meghall06/rladiesparis/blob/master/keeping-it-tidy.R" class="btn btn-primary">**Code file**</a>

## About the workshop

This 90-minute workshop is targeted toward beginners/advanced beginners and will serve as an introduction to the [tidyverse](https://www.tidyverse.org/), a collection of packages designed to aid in the cleaning and wrangling of data. In order to give you the tools to embark upon your own analysis on your own data, we will work through a sample data set and discuss methods for modifying, aggregating, reshaping, and visualizing data.

The event is free and will be held virtually over Zoom, though registration is required!

![](images/image.png)

### Technical details

For those attending: you are free to "code along" throughout the workshop, using the details in the slides and on this page, but you do not have to! All the materials are publicly available, so feel free to just watch and go through the code at your own pace at a later date. Unfortunately, due to time and other logistical constraints of this workshop, there won't be time for individual troubleshooting.

If you would like to be prepared to code along, please have R and RStudio installed on your computer and make sure that the tidyverse is installed with `install.packages("tidyverse")`.

### Website materials

The materials on this website are designed as a companion for the live workshop, not as a comprehensive guide to the entire tidyverse. All the code is here, along with some background on the concepts discussed. The workshop slides are linked above.

The code is interspersed throughout this page, but the full `keeping-it-tidy.R` file is available at the link above.

## R and the `tidyverse`

R is an open-source (that means free!) scripting language for working with data. It helps make your analysis wonderfully efficient and :sparkles: reproducible :sparkles:, and I would highly recommend it over something like Excel for anything beyond very basic data analysis, especially if you expect to repeat that analysis ever again (which you probably will). 

Having a) your data separate from your analysis and b) that analysis codified in a script will help tremendously with reproducibility efforts. That makes it easier for anyone else (including yourself in three months once you have forgotten all the details you swore you'd remember) to rerun your code and make edits, deal with updated data, etc.

### Getting started

You need the R [language](https://cloud.r-project.org/), as well as a place to actually run that R code. I highly recommend [RStudio](https://www.rstudio.com/products/rstudio/download/#download). Both are free to download. There are some basic installation instructions [here](https://stat545.com/install.html).

You use R via *packages*, which hold *functions*, which are just *verbs*. `filter` is an example of a function. The functions we'll see today take the syntax `function(argument)`. Many functions can take multiple arguments, which are separated by a comma.

(You can also write your own functions, but that is beyond the scope of this workshop. See more [here](https://r4ds.had.co.nz/functions.html).)

### Why the tidyverse?

The tidyverse is an opinionated set of packages that work together and are designed for common data science tasks. (By *opinionated*, I mean that the packages have some thoughts on how your data should be structured. We'll discuss the concept of **tidy data** later on in this workshop.) The packages are well-maintained and beginner-friendly, plus they cover almost all of what any beginner and/or intermediate user needs in order to analyze data.

You install packages with the command `install.packages("tidyverse")` (need to do once per R installation on your computer) and then load them with the command `library(tidyverse)` (need to do every session; it is convention to put all of the necessary packages at the top of your coding script).

Since the tidyverse is actually a set of packages, the `library(tidyverse)` command loads nine of them---the eight listed below, in addition to `lubridate`, which is useful for working with dates (I clearly need to update my packages before I take screenshots!).

![](images/packages2.png){fig-alt="The packages loaded with the tidyverse: ggplot2, tibble, tidyr, readr, purrr, dplyr, stringr, forcats"}

We'll be focusing on the three packages highlighted above, though there are several more packages within the tidyverse family (shown below). You would need to load any of those separately.

![](images/packages3.png){fig-alt="List of other optional tidyverse packages, including broom, rvest"}

#### Common `dplyr` verbs

`dplyr` is perhaps the most useful tidyverse package, just due to how often you'll use its common functions for data manipulation.

- `filter()` keeps or discards rows (aka observations)
- `select()` keeps or discards columns (aka variables)
- `arrange()` sorts data set by certain variable(s)
- `count()` tallies data set by certain variable(s)
- `mutate()` creates new variables
- `summarize()` aggregates data

Both `mutate()` and `summarize()` can be further modified by the addition of the `group_by()` function to specify grouping variables. We'll see examples of all of these during this workshop.

#### Common operators

`<-` is the assignment operator, used for assigning objects like data frames (how data sets are commonly stored in R). I like to think of `<-`, which can be added with the shortcut `option -` in RStudio, as "save as."

`|>` is the *pipe*, used to chain together multiple lines of code. It automatically sends the output of one line into the input of the next line and makes it much easier to read and write code. The keyboard shortcut is `cmd shift m`.

You might also be familiar with `%>%`, which is the original pipe that was native to the tidyverse. The new pipe `|>` works in base R and is therefore a bit more versatile, but for the purposes of this workshop feel free to use either! I am trying to get familiar with using `|>` instead of `%>%`, so that's what you'll see here.

## Workshop data

The data we're using for today's workshop comes from [#TidyTuesday](https://github.com/rfordatascience/tidytuesday), a weekly social data project based on the tidyverse ecosystem. The GitHub repo hosts many interesting data sets to practice with, and [this particular data set](https://github.com/rfordatascience/tidytuesday/blob/master/data/2022/2022-02-01/readme.md) comes from the American Kennel Club.

`read_csv()` from the `readr` package (loaded as part of the tidyverse!) imports data of multiple file formats, and the two data sets for this workshop can be read in directly from this website's GitHub repo.

```{r readr}
#| output: false
library(tidyverse)
breed_rank <- read_csv("https://raw.githubusercontent.com/meghall06/rladiesparis/master/breed_rank.csv")
breed_traits <- read_csv("https://raw.githubusercontent.com/meghall06/rladiesparis/master/breed_traits.csv")
```

The first data set, `breed_rank`, lists the popularity rankings over time for 195 different dog breeds (many of the snippets shown throughout are truncated for the purposes of display).

::: {style="text-align: center"}
`breed_rank`
:::

::: {style="font-size: 0.75em"}
```{r breed-rank-kable}
#| echo: false
knitr::kable(breed_rank |> head(4)) |> kableExtra::kable_styling(full_width = FALSE)
```
:::

The second data set, `breed_traits`, has information on 16 different traits, classified from 1 to 5, for those 195 dog breeds.

::: {style="text-align: center"}
`breed_traits`
:::

::: {style="font-size: 0.75em"}
```{r breed-traits-kable}
#| echo: false
knitr::kable(breed_traits |> head(4) |> select(1:5)) |> kableExtra::kable_styling(full_width = FALSE)
```
:::

For today's workshop, since we'll be learning about tidy data, we'll investigate some very **un**tidy dogs. To do so, we'll be specifically focusing on three of these trait variables: shedding level, coat grooming frequency, and drooling level.

## Basic data cleaning

Per the [tidyverse style guide](https://style.tidyverse.org/syntax.html), variable names should use snake case---lower case with underscores between words. This helps with consistency and readability, but it's also technically easier, as any variable names that start with numbers and/or have spaces need to be referred to within `` `back ticks` ``. It's easier to refer to a variable with `shedding_level` instead of `` `Shedding Level` ``, and thankfully we have a function to easily rename all of those variables instead of doing it by hand.

Unfortunately...that function does not live within the tidyverse! It's the only such function we'll be highlighting during this workshop, but it is so helpful that it has to be included.

In the code below, after we've loaded the [janitor](https://cran.r-project.org/web/packages/janitor/vignettes/janitor.html) package, the first line uses the assignment operator `<-` to "save as" our `breed_traits` data set. We could give it another name and save it as something else, but for this purpose we're going to overwrite it. The second line applies the `clean_names()` function.

:::{.callout-note icon="false"}
## Note! :rotating_light:
A helpful tip on notation: once a package has been loaded with `library`, you can use the function by itself, like you see here with `clean_names()` and above with `read_csv()`. If you don't want to load the package, you can eliminate that line and instead refer to the function along with its package name, such as `janitor::clean_names()`. (The package still has to be installed, however.)
:::

```{r janitor}
library(janitor)

breed_traits <- breed_traits |> 
  clean_names()
```

The `clean_names()` function neatly converts all variable names to snake case, as shown below.

::: {style="font-size: 0.75em"}
```{r janitor-kable}
#| echo: false
knitr::kable(breed_traits |> head(2) |> select(1:3)) |> kableExtra::kable_styling(full_width = FALSE)
```
:::

### `count()`

If you want to *look* at your data, use `View(breed_traits)` and it will pop up in RStudio in a spreadsheet-like format. But there are ways to programatically look at your data, as well, and one of the most useful functions for doing so is `count()`. It allows you to count the unique values of a variable (or multiple). 

Note that this piece of code does not have an assignment operator! We are applying the `count()` function to the `breed_traits` data set, but the results would appear in your console and would not be saved anywhere. This is useful whenever you don't need to save the output.

```{r count-1}
#| eval: false
breed_traits |> 
  count(shedding_level)
```

This output tallies up our data by the `shedding_level` variable. 109 breeds have a value of 3, 41 breeds have a value of 2, etc. Unfortunately, we have a value of zero, and we know that these variables should have a value of 1, 2, 3, 4, or 5. This is likely an error in the data that should be removed.

::: {style="font-size: 0.75em"}
```{r count-1-kable}
#| echo: false
knitr::kable(breed_traits |> count(shedding_level)) |>
  kableExtra::row_spec(1, bold = T, color = "#6A395B", background = "#F8F2F3") |> kableExtra::kable_styling(full_width = FALSE)
```
:::

### `filter()`

To investigate this further, we can use the `filter()` function, which keeps or discards observations (rows). Here, the function argument says that we want to keep all records with a `shedding_level` value of zero (and we know from our `count()` output above that there should be only one such observation). I'm also adding the `select()` function (which keeps or discards columns) to keep only our columns of interest.

%hanks to the pipe `|>`, we start with our `breed_traits` data set, apply the `filter()` function with that input, apply the `select()` function to the given output, and then the results would be shown in the console.

```{r filter-1}
#| eval: false
breed_traits |> 
  filter(shedding_level == 0) |> 
  select(breed, shedding_level, coat_grooming_frequency, 
         drooling_level)
```

There is in fact only one observation matching this criteria---it looks like something is wrong with the record on Plott Hounds, so we should remove them from our data set (even though [they're quite cute](https://www.google.com/search?q=plott+hound&rlz=1C5GCEM_enUS1004US1004&source=lnms&tbm=isch&sa=X&ved=2ahUKEwi-vqSXsIH6AhVRrYkEHVlDD1EQ_AUoAXoECAIQAw&biw=1440&bih=712&dpr=2)).

::: {style="font-size: 0.75em"}
```{r filter-1-kable}
#| echo: false
knitr::kable(breed_traits |> 
  filter(shedding_level == 0) |> 
  select(breed, shedding_level, coat_grooming_frequency, drooling_level)) |> kableExtra::kable_styling(full_width = FALSE)
```
:::

We can use the `filter()` function once again to make this change. You'll notice in the code below that the assignment operator is back, which means we're making changes to our data set. We'll overwrite the `breed_traits` data frame once again and apply a `filter()` function to keep only the records in which `shedding_level` does *not* equal zero (i.e., all but the Plott Hounds observation). The `!=` operator means *not equal to*.

```{r filter-2}
breed_traits <- breed_traits |> 
  filter(shedding_level != 0)
```

If you were to once again run `breed_traits |> count(shedding_level)`, you'd see that zero is no longer a value of that variable.

### `mutate()`

I mentioned earlier that the focus of our analysis will be **un**tidy dogs and that we'll be focusing on three of the traits. Since those traits are classified from 1 to 5 for each breed, with a higher score denoting a higher level of untidiness, we can add up the scores for all three traits to create a new variable.

The code below uses the assignment operator `<-` again, but this time we're starting with the `breed_traits` data frame and creating a *new* data frame called `untidy_scores`. The `mutate()` function creates this new variable, `untidy_score`, that adds up our three traits of interest, and the `select()` function keeps only two columns from our original data set.

```{r mutate-1}
untidy_scores <- breed_traits |> 
  mutate(untidy_score = shedding_level + 
           coat_grooming_frequency + drooling_level) |> 
  select(breed, untidy_score)
```

Our entire data set now consists of the `untidy_score` for 194 breeds (we had 195 until we dropped the Plott Hounds).

::: {style="font-size: 0.75em"}
```{r mutate-1-kable}
#| echo: false
knitr::kable(untidy_scores |> head(5)) |> kableExtra::kable_styling(full_width = FALSE)
```
:::

### `arrange()`

The `arrange()` function can be useful to quickly sort your data set based on the value of any selected variable(s). `arrange()` defaults to ascending order, as you can see in the code and output on the left, but you can specify descending order by wrapping the variable name within `desc()`, as seen in the code on the right.

::: columns
::: {.column width="50%"}

```{r arrange-1}
#| eval: false
untidy_scores |> 
  arrange(untidy_score)
```

::: {style="font-size: 0.75em"}
::: {style="text-align: center"}
```{r arrange-1-kable}
#| echo: false
knitr::kable(untidy_scores |> arrange(untidy_score) |> head(6)) |> kableExtra::kable_styling(full_width = FALSE)
```
:::
:::
:::

::: {.column width="50%"}
```{r arrange-2}
#| eval: false
untidy_scores |> 
  arrange(desc(untidy_score))
```

::: {style="font-size: 0.75em; text-align: center"}
```{r arrange-2-kable}
#| echo: false
knitr::kable(untidy_scores |> arrange(desc(untidy_score)) |> head(6)) |> kableExtra::kable_styling(full_width = FALSE)
```
:::
:::
:::

## Bar chart

:::{.callout-note icon="false"}
## Note! :rotating_light:
For all of the plots in this workshop, I have included code for the most basic, out-of-the-box version as well as code to make a more custom plot (toggle to the custom tab to see it). The former can be useful whenever you're a) learning and/or b) visualizing data for personal data exploration purposes. But if you're looking to present your plots in any way, you'll likely want to make some customizations. Thankfully, `ggplot` makes that possible! I use a special theme for the custom plots, the code for which is available below.
:::

```{r theme}
#| code-fold: true
#| code-summary: "bonus: custom ggplot theme"
#| output: false
library(showtext) # helps with custom fonts
font_add_google("Prompt", "prompt") # load preferred Google fonts
showtext_auto()

theme_tidy_dog <- function () { 
  theme_linedraw(base_size=13, base_family="prompt") %+replace% 
    theme(
      # justify axis titles
      axis.title = element_text(hjust = 0),
      # backgrounds to match website
      panel.background = element_rect(fill='#F9E0D9', color = NA),
      plot.background = element_rect(fill='#F9E0D9', color=NA),
      legend.background = element_rect(fill="transparent", color=NA),
      legend.key = element_rect(fill="transparent", color=NA),
      # I hate axis ticks and lines :shrug:
      axis.ticks = element_blank(),
      panel.grid.major = element_line(color = "grey90", size = 0.3), 
      panel.grid.minor = element_blank(),
      # make tweaks to the title and subtitle
      plot.title = element_text(size = 15, hjust = 0, vjust = 0.5, face = "bold", 
                                margin = margin(b = 0.2, unit = "cm")),
      plot.subtitle = element_text(size = 10, hjust = 0, vjust = 0.5, 
                                   margin = margin(b = 0.2, unit = "cm")),
    )
}

```

Our previous output showed the six untidiest dogs, along with their untidy score. Let's make a simple bar chart to visualize that data. Our `untidy_scores` data set we created previously has all 194 breeds, but for this purpose we want to create a new data set, `untidy_dogs`, with only the six highest scores.

The `slice_max` function is quite handy for filtering and keeping only the highest values of a certain variable. This function has multiple arguments: the first specifies the variable to filter on (`untidy_score`), the second specifies the number of observations we want to keep, and the third clarifies our stance on ties (here we'll just ignore them).

```{r untidy-dogs}
untidy_dogs <- untidy_scores |> 
  slice_max(untidy_score, n = 6, with_ties = FALSE)
```

We are left with only six observations in this data set, which we can use to make our bar chart.

::: {style="font-size: 0.75em"}
```{r untidy-dogs-kable}
#| echo: false
knitr::kable(untidy_dogs) |> kableExtra::kable_styling(full_width = FALSE)
```
:::

`ggplot2` is the tidyverse package used for making plots---**gg** stands for the grammar of graphics. The base function is `ggplot()`, and there are many ([many, many](https://ggplot2.tidyverse.org/reference/index.html)) associated functions to help construct and customize your plots. 

:::{.callout-note icon="false"}
## Note! :rotating_light:
When you're working with `ggplot` and its related functions, you need to pipe lines together with `+` instead of `|>`. Sometimes you will forget (I still do, and I've been using `ggplot` for years :shrug:), but the error message you get is helpful.
:::

The code for the bar chart below starts with the name of our data frame: `untidy_dogs`. The `ggplot` function starts things off by mapping elements to the **aes**thetic attributes. Here we just have `x` and `y`, but you'll see more later. We only have two variables in this data set, so `untidy_score` will go on the x-axis and `breed` will go on the y. 

We specify the **geom**etric points with the following line: `geom_bar()` signifies a bar chart. The `stat = "identity"` argument is necessary here to communicate that we want the bar lengths to be determined by the variable that we specified. (The standard bar chart just takes a raw count of your data.)

::: {.panel-tabset}

### Basic

```{r bar-basic}
#| fig-height: 4
#| fig-align: center
untidy_dogs |> 
  ggplot(aes(x = untidy_score, y = breed)) +
  geom_bar(stat = "identity")
```

### Custom

```{r bar-custom}
#| code-fold: true
#| code-summary: "expand for full code"
#| fig-height: 4.5
#| fig-align: center
untidy_dogs |> 
  # can reorder the breeds based on the untidy score
  ggplot(aes(x = untidy_score, y = reorder(breed, untidy_score), 
             label = untidy_score)) +
  # fill colors the bars
  geom_bar(stat = "identity", fill = "#6A395B") +
  # add the data labels
  geom_label(family = "prompt") +
  # ensure the bars go all the way to the axis line
  scale_x_continuous(expand = expansion(mult = c(0, 0.1))) +
  labs(title = "The untidiest dogs",
       subtitle = "Based on drooling, shedding, and grooming frequency",
       x = "Untidiness score", y = NULL) +
  theme_tidy_dog()
```

:::

## Tidy data

From the previous section, we know that Bernese Mountain Dogs are among the untidiest of all: they have the highest score of 11. So far we've mostly been working with the `breed_traits` data, but we have a whole other data set, `breed_ranks`. Can we look at how the popularity ranking of Bernese Mountain Dogs has shifted over time?

::: {style="font-size: 0.75em"}
```{r bernese-kable-1}
#| echo: false
knitr::kable(breed_rank |> filter(str_detect(Breed, "Bernese"))) |>
  kableExtra::kable_styling(full_width = FALSE)
```
:::

I am imagining some kind of dot plot, with the year on the x-axis and the ranking on the y-axis. That seems easy enough based on the data that we have above, but if you were to start constructing that plot, you'd pretty quickly run into a road block. What would your first function look like? `ggplot(aes(x = ??, y = ??))`

We discussed earlier that the tidyverse is opinionated on the topic of **tidy data**, and this particular question is tricky because this data set does not meet the criteria for tidy data:

>There are three interrelated rules which make a dataset tidy:<br>
1. Each variable must have its own column.<br>
2. Each observation must have its own row.<br>
3. Each value must have its own cell.<br>
- [**R for Data Science**](https://r4ds.had.co.nz/tidy-data.html)

How does `breed_ranks` violate these rules? We have a year variable, but that variable does not actually exist as such, in its own column. Shown above is one observation, by dog breed. But that "one" observation is actually eight separate observations: the rank in 2013, the rank in 2014, etc. Each observation needs to have its own row.

### `pivot_longer()`

Thankfully, the tidyverse gives us a handy solution from the `tidyr` package! The current structure of `breed_ranks` is in a *wide* format, and we need it to be in a *long* format. `pivot_longer()` to the rescue.

The code below uses the assignment operator to create a new data frame called `ranks_pivoted` based on `breed_rank`. The `pivot_longer()` function here has three arguments: the columns we want to pivot (remember we need to use back ticks because these variable names have spaces in them), the name for the new column consisting of the previous column headers, and the name for the new column consisting of the previous column *values*.

```{r pivot-longer}
ranks_pivoted <- breed_rank |> 
  pivot_longer(`2013 Rank`:`2020 Rank`,
               names_to = "year",
               values_to = "rank")
```

The output is below. This data is now tidy, with each observation (e.g., the rank in 2013) in its own row and separate columns for each variable.

::: {style="font-size: 0.75em"}
```{r bernese-kable-3}
#| echo: false
knitr::kable(ranks_pivoted |> filter(str_detect(Breed, "Bernese"))) |>
  kableExtra::kable_styling(full_width = FALSE)
```
:::

We can do a bit more cleaning up of this data set. First, the `rename()` function makes it easy to change the names of variables (the new name comes first, followed by the original)---we can change `Breed` to `breed` to match our other data set. And the `parse_number()` function from `readr` allows us to pull out the integer from our `year` column. This is an example of how you can use `mutate()` to rewrite existing variables in addition to creating new ones.

```{r pivot-longer-2}
ranks_pivoted <- ranks_pivoted |> 
  rename(breed = Breed) |> 
  mutate(year = parse_number(year))
```

::: {style="font-size: 0.75em"}
```{r pivot-longer-2-kable}
#| echo: false
knitr::kable(ranks_pivoted |> filter(str_detect(breed, "Bernese"))) |>
  kableExtra::kable_styling(full_width = FALSE)
```
:::

:::{.callout-note icon="false"}
## Note! :rotating_light:
There are more advanced pivoting examples [here](https://r4ds.had.co.nz/tidy-data.html#pivoting) and [here](https://dcl-wrangle.stanford.edu/pivot-advanced.html).
:::

Now that our data is appropriately tidy, we can return to the question from the beginning of this section: how has the popularity ranking of Bernese Mountain Dogs has shifted over time? Since I'm only interested in Bernese Mountain Dogs, and our new `ranks_pivoted` data frame contains all the breeds, the code below starts with a `filter()` statement. 

I could use `filter(breed == "Bernese Mountain Dogs")`, but perhaps you don't want to type all that out. The `stringr` [package](https://stringr.tidyverse.org/) (part of the core tidyverse) has many functions for dealing with string (text) data, and `str_detect()` is among the most useful. It returns `TRUE` or `FALSE` as to whether the variable you select (the first argument of the function) contains the string you provide (the second argument of the function). This function within a function will `filter` to only the observations for which our `str_detect` expression is `TRUE`.

Our `ggplot` function assigns `year` to the x-axis and `rank` to the `y-axis`. I've also added `label = rank` to indicate that I also want my `rank` variable to determine the label. Instead of adding a `geom_bar` label like last time, here we've added a `geom_point` layer for dots as well as a `geom_text` layer to add a label. The `vjust` argument within that function specifies the desired vertical justification for the labels to place them below the dots.

## Dot plot

::: {.panel-tabset}

### Basic

```{r dot-basic}
#| fig-height: 3.5
#| fig-align: center
ranks_pivoted |> 
  filter(str_detect(breed, "Bernese")) |> 
  ggplot(aes(x = year, y = rank, label = rank)) +
  geom_point(size = 3) +
  geom_text(vjust = 2)
```

### Custom

```{r dot-custom}
#| code-fold: true
#| code-summary: "expand for full code"
#| fig-height: 4.5
#| fig-align: center
ranks_pivoted |> 
  filter(str_detect(breed, "Bernese")) |> 
  ggplot(aes(x = year, y = rank, label = rank)) +
  geom_point(size = 3) +
  geom_text(vjust = 2) +
  # flip the y-axis
  scale_y_reverse(limits = c(50, 1)) +
  # specify the breaks on the x-axis
  scale_x_continuous(breaks = seq(2013, 2020, 1)) +
  labs(x = NULL, y = "Popularity Rank",
       title = "Popularity of Bernese Mountain Dogs") +
  theme_tidy_dog()
```

:::

## Line graph

In the previous section we plotted the ranking over time for the Bernese Mountain Dogs. Can we do the same for all six of the untidiest dogs we identified previously in the `untidy_dogs` data set?

These are the two data sets we have:

::: columns
::: {.column width="50%"}
::: {style="text-align: center"}
`untidy_dogs`
:::

::: {style="font-size: 0.75em"}
```{r untidy-dogs-kable}
#| echo: false
```
:::
:::

::: {.column width="50%"}
::: {style="text-align: center"}
`ranks_pivoted`
:::

::: {style="font-size: 0.75em"}
```{r ranks-pivoted-kable}
#| echo: false
knitr::kable(ranks_pivoted |> head(6)) |>
  kableExtra::kable_styling(full_width = FALSE)
```
:::
:::
:::

The data we want to plot (the rankings by year) is in the `ranks_pivoted` data set, but that holds all the dog breeds and we only want to include the six breeds we have in the `untidy_dogs` data set.

Thankfully, you can also use the `filter()` function to filter *across* data sets! The code below creates a new data set `untidy_popularity` based on the `ranks_pivoted` data set that keeps only the observations for which the `breed` is included among the list of `breed`s within the `untidy_dogs` data set. There are two more common operators used here: `%in%` allows for multiple options within a list, and `$` refers to a specific variable within another data set.

```{r line-graph}
untidy_popularity <- ranks_pivoted |> 
  filter(breed %in% untidy_dogs$breed)
```

:::{.callout-note icon="false"}
## Note! :rotating_light:
You could get the same results with `filter(breed %in% c("Bernese Mountain Dogs", "Leonbergers", "Newfoundlands", "Bloodhounds", "St. Bernards", "Old English Sheepdogs"))`, but that is a lot of typing (i.e., many opportunities for typos).
:::

To check that this filter worked as we wanted it to, we can apply a `count()` function to our new `untidy_popularity` data set.

```{r count-check}
#| eval: false
untidy_popularity |> 
  count(breed)
```

There are six breeds included (correct), with eight separate observations each (one for each year; correct).

::: {style="font-size: 0.75em"}
```{r count-check-kable}
#| echo: false
knitr::kable(untidy_popularity |> count(breed)) |>
  kableExtra::kable_styling(full_width = FALSE)
```
:::

We can create a line graph off of this `untidy_popularity` data set with a `geom_line` layer. (I've also included a `geom_point` layer to add points to those lines.) The `year` variable is on the x-axis with `rank` on the y-axis, similar to what we've used before, but in order to get multiple lines for the multiple breeds, I've added the `breed` variable onto both the `group` and `color` aesthetics. This will group the data points by breed and also create a color legend.

::: {.panel-tabset}

### Basic

```{r line-basic}
#| fig-height: 4
#| fig-align: center
untidy_popularity |> 
  ggplot(aes(x = year, y = rank, group = breed, color = breed)) +
  geom_line() +
  geom_point(size = 3)
```

### Custom

```{r line-custom}
#| code-fold: true
#| code-summary: "expand for full code"
#| fig-height: 4.5
#| fig-align: center
untidy_popularity |> 
  # create a label that applies for the last point only
  mutate(label = ifelse(year == 2020, breed, NA)) |> 
  ggplot(aes(x = year, y = rank, group = breed, color = breed,
             label = label)) +
  geom_line() +
  geom_point(size = 3) +
  # apply the label so I don't have to use a legend
  geom_text(hjust = -0.1, family = "prompt") +
  # change the color scale
  scale_color_viridis_d(option = "A") +
  # control the width of the x-axis (so I have room for labels)
  # and specify breaks
  scale_x_continuous(expand = expansion(mult = c(0.025, 0.5)),
                     breaks = seq(2013, 2020, 1)) +
  # flip the y-axis so better rankings are "higher"
  scale_y_reverse() +
  labs(title = "Popularity over time of the untidiest dogs",
       subtitle = "Based on drooling, shedding, and grooming frequency",
       x = NULL,
       y = "Popularity Rank") +
  theme_tidy_dog() +
  # remove the legend since I used labels instead
  theme(legend.position = "none")
```

:::

## Relational data

We have data on the `untidy_scores` for each breed as well as their popularity over time---can we create a scatter plot of the *average* popularity ranking against the `untidy_scores` for all breeds?

We already have the `untidy_scores`, so let's focus on finding the average popularity ranking. Our `ranks_pivoted` data set holds that information in a tidy way, but there is one observation per year per breed and we need to find the overall average per breed.

The `group_by()`/`summarize()` combination of `dplyr` functions allows you to calculate aggregations by a specified group (or multiple!). Here, the code below groups by `breed` and creates a new variable, `avg_rank`, within a `summarize()` function by taking the mean of the existing `rank` variable.

```{r summ-1}
avg_ranks <- ranks_pivoted |> 
  group_by(breed) |> 
  summarize(avg_rank = mean(rank))
```

The first nine rows of our new `avg_ranks` data set are below. But we have an `NA` value! This is a quirk of the aggregation calculations available within `summarize()`. If there is any missing data among the values being summarized, `NA` will be returned. (From the raw data, you can find out that the American Hairless Terriers were only ranked in 2020.)

::: {style="font-size: 0.75em"}
```{r summ-1-kable}
#| echo: false
knitr::kable(avg_ranks |> head(9)) |> kableExtra::row_spec(9, bold = T, color = "#6A395B", background = "#F8F2F3") |>
  kableExtra::kable_styling(full_width = FALSE)
```
:::

Depending on your data, this could be very useful information! But if you are aware of the missing data and still want to calculate the aggregations irrespective of any possible missing values, add `na.rm = TRUE` as an additional argument to your aggregations within `summarize()`.

```{r summ-2}
avg_ranks <- ranks_pivoted |> 
  group_by(breed) |> 
  summarize(avg_rank = mean(rank, na.rm = TRUE))
```

::: {style="font-size: 0.75em"}
```{r summ-1-kable}
#| echo: false
```
:::

We now have all of the data that we need. Our new `avg_ranks` data set holds the average popularity ranking per breed, and the `untidy_scores` data set we created earlier holds each breed's untidy score.

How do we put them together?

::: columns
::: {.column width="50%"}
::: {style="text-align: center"}
`avg_ranks`
:::

::: {style="font-size: 0.75em"}
```{r avg-ranks-kable}
#| echo: false
knitr::kable(avg_ranks |> head(5)) |>
  kableExtra::kable_styling(full_width = FALSE)
```
:::
:::

::: {.column width="50%"}
::: {style="text-align: center"}
`untidy_scores`
:::

::: {style="font-size: 0.75em"}
```{r untidy_scores-kable}
#| echo: false
knitr::kable(untidy_scores |> head(5)) |>
  kableExtra::kable_styling(full_width = FALSE)
```
:::
:::
:::

Relational data---tables of data that share one or more common elements---is extremely common, and it's worth becoming familiar with the tidyverse functions that allow you to join data tables.

To solve this particular question, we'll use `left_join()`, which is a type of *mutating* join: it matches up observations across tables and moves variables from one to another.

:::{.callout-note icon="false"}
## Note! :rotating_light:
We're focusing on a simple `left_join()` because it is the most common type of join. Other join types are available, and you can read more [here](https://r4ds.had.co.nz/relational-data.html).
:::

The syntax of this example is fairly simple: we're starting with the `avg_ranks` data set we just created and making a new one titled `tidy_and_rank`. The two arguments to the `left_join()` function are the name of the data set we want to get information from (`untidy_scores`) as well as the *key*, or the variable that connects the two data frames. In this case, it's `breed`.

```{r left-join-1}
tidy_and_rank <- avg_ranks |> 
  left_join(untidy_scores, by = "breed")
```

:::{.callout-note icon="false"}
## Note! :rotating_light:
You can have multiple keys: `by = c("var1","var2")` and even keys that have different variable names: `by = c("var_left" = "var_right")`. If you *don't* specify a key, R will default to any variable(s) that have the same name across tables. It is good practice to specify a key no matter what, even in a simple example like this, to avoid mistakes and make it clear what you're intending to do.
:::

A quick peek at our new `tidy_and_rank` data set looks okay...

::: {style="font-size: 0.75em"}
```{r tidy-and-rank-kable}
#| echo: false
knitr::kable(tidy_and_rank |> head(5)) |>
  kableExtra::kable_styling(full_width = FALSE)
```
:::

...but we can use `count()` again just to make sure that this worked the way we wanted it to.

```{r count-check-2}
#| eval: false
tidy_and_rank |> 
  count(untidy_score)
```

And it looks like we have an `NA` value.

::: {style="font-size: 0.75em"}
```{r count-check-2-kable}
#| echo: false
knitr::kable(tidy_and_rank |> 
  count(untidy_score))|> kableExtra::row_spec(10, bold = T, color = "#6A395B", background = "#F8F2F3") |>
  kableExtra::kable_styling(full_width = FALSE)
```
:::

Let's further investigate with a simple `filter()` to see where the issue is. `is.na()` is a handy function to use within `filter()` that will return any observations with an NA value for the specified variable.

```{r na-1}
#| eval: false
tidy_and_rank |> 
  filter(is.na(untidy_score))
```

It's the darn Plott Hounds again. We removed them from the original `breed_traits` data set, the basis of `untidy_scores`, but they were still present in the original `breed_ranks` data set, which is what this is based on.

::: {style="font-size: 0.75em"}
```{r na-1-kable}
#| echo: false
knitr::kable(tidy_and_rank |> 
  filter(is.na(untidy_score))) |>
  kableExtra::kable_styling(full_width = FALSE)
```
:::

This is essentially the opposite of the operation we just did. We are filtering to keep any observations for which `untidy_score` is *not* (`!`) `NA`.

```{r na-2}
tidy_and_rank <- tidy_and_rank |> 
  filter(!is.na(untidy_score))
```

You can run the `count()` code again if you wish to double-check, but we are good to go.

## Jitter plot

You can create a simple scatter plot in `ggplot` with numeric variables on the x- and y-axes and a `geom_point` layer. But a case like this, where one of our variables is an interval like `untidy_score`, sometimes benefits from adding a bit of *jitter* to help showcase data points that might otherwise overlap in a scatter plot.

That's possible by using a `geom_jitter` layer, where the width of the strip can be controlled by the `width` argument.

::: {.panel-tabset}

### Basic

```{r jitter-basic}
#| fig-height: 4
#| fig-align: center
tidy_and_rank |> 
  ggplot(aes(x = untidy_score, y = avg_rank)) +
  geom_jitter(width = 0.1)
```

### Custom

```{r jitter-custom}
#| code-fold: true
#| code-summary: "expand for full code"
#| fig-height: 4.5
#| fig-align: center
tidy_and_rank |>  
  ggplot(aes(x = untidy_score, y = avg_rank)) +
  # specify the labels of the x-axis
  scale_x_continuous(breaks = seq(3, 11, 1)) +
  # control the size, transparency, and color of the points
  geom_jitter(size = 3, width = 0.1, alpha = 0.7, color = "#6A395B") +
  # flip the y-axis and control the labels
  scale_y_reverse(breaks = c(200, 150, 100, 50, 1)) +
  labs(title = "Dog breed popularity compared to their untidiness score",
       subtitle = "Ranking based on 2013-2020 data; tidy score based on drooling, shedding, grooming frequency",
       x = "Untidiness Score",
       y = "Average Popularity Rank") +
  theme_tidy_dog()
```

:::

## Dumbbell plot

The last question we'll answer from this data set is: which breeds have had the biggest jump in popularity from 2013 to 2020?

To answer that question, we're only interested in certain observations from `ranks_pivoted`, those with a `year` of 2013 or 2020.

::: {style="font-size: 0.75em"}
```{r ranks-pivoted-3-kable}
#| echo: false
knitr::kable(ranks_pivoted |> head(9)) |> kableExtra::row_spec(c(1,8,9), bold = T, color = "#6A395B", background = "#F8F2F3") |>
  kableExtra::kable_styling(full_width = FALSE)
```
:::

All three of the following functions would deliver the same results on this data set:

1. `filter(year %in% c(2013, 2020))`

2. `filter(year == 2013 | year == 2020)`

3. `filter(year == min(year) | year == max(year))` 

The last option :point_up: is the most *robust*, meaning that it would still deliver the intended result if, say, you got an updated version of this data set with more years of data.

### `pivot_wider()`

The code below creates a new data set `rank_change` from `ranks_pivoted` with five different functions applied. After the `filter()` we discussed above, we would be left with two observations per breed, one for 2013 and 2020. But the desired structure for the dumbbell plot would have *one* observation per breed so that the 2013 and 2020 ranks could be plotted separately. That means it's time to pivot again---but this time in the *opposite* direction with `pivot_wider()` instead of `pivot_longer()`.

This `pivot_wider()` function has two arguments to determine where the variable names come from and where the values are from (`year` and `rank`, respectively). `mutate()` will give us a new `change` variable that calculates the difference, and we'll `filter` to only those breeds that were ranked within the top 50 in 2020. And lastly, another application of `slice_max()` will result in the top six breeds with the highest values of `change`.

```{r pivot-wider}
rank_change <- ranks_pivoted |>  
  filter(year == min(year) | year == max(year)) |> 
  pivot_wider(names_from = "year",
              values_from = "rank") |>  
  mutate(change = `2013` - `2020`) |> 
  filter(`2020` <= 50) |>  
  slice_max(change, n = 6)
```

That leaves us with six observations, one per breed, with separate variables for the rank in 2013, the rank in 2020, and the change. 

::: {style="font-size: 0.75em"}
```{r rank-change-kable}
#| echo: false
knitr::kable(rank_change) |>
  kableExtra::kable_styling(full_width = FALSE)
```
:::

We can plot this data with a dumbbell plot, which are useful for showing the change between two points.

:::{.callout-note icon="false"}
## Note! :rotating_light:
To be honest, it's easier to make dumbbell plots with `ggalt::geom_dumbbell()`. But we can use `ggplot` to learn a) how to combine multiple geoms and b) how inherited `aes` works.
:::

We can build these in `ggplot` by adding multiple `geom_` layers. The first `ggplot()` function will contain only the mapped aesthetics that apply to *all* the `geom_` layers. Here, that's only `breed` on the y-axis. Individual `geom_` layers can also have their own aesthetics, in addition to what's inherited from the initial `ggplot(aes())`, as you can see below. One `geom_point` maps the `2013` variable, another maps the `2020` variable, and a `geom_segment` layer creates the horizontal line that connects the two dots.

::: {.panel-tabset}

### Basic

```{r dumbbell-basic}
#| fig-height: 3
#| fig-align: center
rank_change |>  
  ggplot(aes(y = breed)) +
  geom_segment(aes(yend = breed, x = `2013`, xend = `2020`)) +
  geom_point(aes(x = `2013`), color = "#c991b8", size = 3) +
  geom_point(aes(x = `2020`), color = "#6A395B", size = 3) +
  geom_label(aes(x = `2020`, label = `2020`), vjust = -0.5) +
  geom_label(aes(x = `2013`, label = `2013`), vjust = -0.5)
```

### Custom

```{r dumbbell-custom}
#| code-fold: true
#| code-summary: "expand for full code"
#| fig-height: 4.5
#| fig-align: center
rank_change |>  
  # create a new variable to find the value of the middle
  mutate(middle = `2020` + (change / 2)) |>  
  # order the breeds by their 2020 rank
  ggplot(aes(y = reorder(breed, -`2020`))) +
  geom_segment(aes(yend = reorder(breed, -`2020`), x = `2013`, xend = `2020`), 
               color = "grey20") +
  geom_point(aes(x = `2013`), color = "#c991b8", size = 3) +
  geom_point(aes(x = `2020`), color = "#6A395B", size = 3) +
  geom_label(aes(x = `2020`, label = `2020`), family = "prompt", vjust = -0.5) +
  geom_label(aes(x = `2013`, label = `2013`), family = "prompt", vjust = -0.5) +
  # label the value of the change at the middle of the line
  geom_text(aes(x = middle, label = str_c("+", change)), family = "prompt", 
            vjust = -0.75, size = 3.5) +
  # flip the x-axis
  scale_x_reverse() +
  labs(x = "Popularity Ranking",
       y = NULL,
       title = "Dog breeds with the biggest jump in popularity from 2013",
       subtitle = "Among the top 50 in 2020") +
  theme_tidy_dog() +
  theme(panel.grid.major.y = element_blank(),
        plot.title.position = "plot")
```

:::

## Resources

I hope this workshop has been a helpful introduction to the tidyverse. (If you found it useful and/or you think someone *else* would find it useful, I'd appreciate a Twitter share---I'm [@MeghanMHall](https://twitter.com/meghanmhall)!) But this intro just scratches the surface of the tidyverse, which is a very powerful set of packages that can help you solve lots of problems. These are some of my favorite resources to learn more:

- [**R for Data Science**](https://r4ds.had.co.nz/)

- [**R for Excel users**](https://rstudio-conf-2020.github.io/r-for-excel/)

- [**STAT 545**](https://stat545.com/)

- [**RStudio Education**](https://education.rstudio.com/)

- [**Learn `tidyverse`**](https://www.tidyverse.org/learn/)